DS Lab @ UConn Health - Publications

Preprint

Xiang, Yuan, Song, Wudil, Aliyu, Wester, and Shepherd, "Double machine learning to estimate the effects of multiple treatments and their interactions", arXiv (2025), DOI:10.48550/arXiv.2505.12617

Causal inference literature has extensively focused on binary treatments, with relatively fewer methods developed for multi-valued treatments. In particular, methods for multiple simultaneously assigned treatments remain understudied despite their practical importance. This paper introduces two settings: (1) estimating the effects of multiple treatments of different types (binary, categorical, and continuous) and the effects of treatment interactions, and (2) estimating the average treatment effect across categories of multi-valued regimens. To obtain robust estimates for both settings, we propose a class of methods based on the Double Machine Learning (DML) framework. Our methods are well-suited for complex settings of multiple treatments/regimens, using machine learning to model confounding relationships while overcoming regularization and overfitting biases through Neyman orthogonality and cross-fitting. To our knowledge, this work is the first to apply machine learning for robust estimation of interaction effects in the presence of multiple treatments. We further establish the asymptotic distribution of our estimators and derive variance estimators for statistical inference. Extensive simulations demonstrate the performance of our methods. Finally, we apply the methods to study the effect of three treatments on HIV-associated kidney disease in an adult HIV cohort of 2455 participants in Nigeria.
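The cross-fitting idea mentioned in the abstract above can be illustrated with a generic partialling-out sketch for a partially linear model. This is not the authors' estimator: the ordinary least squares nuisance learner is a stand-in for a machine learning model, and the function names `fit_predict_ols` and `dml_partialling_out`, the simulated data, and the fold count are all assumptions made for illustration.

```python
import numpy as np

def fit_predict_ols(X_train, y_train, X_test):
    """Stand-in nuisance learner (assumption): OLS with an intercept."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

def dml_partialling_out(y, D, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of treatment coefficients.

    y: outcome (n,), D: treatment columns (n, k), X: confounders (n, p).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    D_res = np.empty_like(D, dtype=float)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance models are fit on the other folds, then used to
        # residualize the held-out fold (cross-fitting).
        y_res[test_idx] = y[test_idx] - fit_predict_ols(X[train], y[train], X[test_idx])
        D_res[test_idx] = D[test_idx] - fit_predict_ols(X[train], D[train], X[test_idx])
    # Final stage: regress outcome residuals on treatment residuals.
    theta, *_ = np.linalg.lstsq(D_res, y_res, rcond=None)
    return theta

# Simulated example (hypothetical data): one continuous treatment, one
# binary treatment, and their interaction, all confounded by X.
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
d1 = X[:, 0] + rng.normal(size=n)
d2 = (X[:, 1] + rng.normal(size=n) > 0).astype(float)
D = np.column_stack([d1, d2, d1 * d2])
true_theta = np.array([1.0, 0.5, 0.25])
y = D @ true_theta + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
theta_hat = dml_partialling_out(y, D, X)
```

In this simulation the confounding is linear, so the OLS stand-in is adequate; with nonlinear confounding one would swap in a flexible learner while keeping the same fold structure.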

Link

Jiang, Yin, Robson, Li, Li, and Song, "spCorr: flexible and scalable inference of spatially varying correlation in spatial transcriptomics", bioRxiv (2025), DOI:10.1101/2025.09.30.679684

Spatial transcriptomics has transformed our ability to explore gene expression within its tissue context, enabling us to dissect subtle yet biologically significant variations in situ. While numerous computational methods have been proposed for detecting Spatially Varying Genes (SVGs) by modeling each individual gene separately, much less effort has been devoted to understanding how correlations between genes change across space. Such Spatially Varying Correlations (SVCs) are critical for understanding biological processes such as gene regulatory mechanisms shaped by local tissue environments, yet existing tools remain limited for this task. To address this gap, we present spCorr, a flexible and scalable regression framework for studying SVCs. spCorr provides interpretable, spot-level estimates of gene correlation and detects gene pairs whose correlations vary across locations or between tissue domains. Through extensive simulations and real-data analyses, we show that spCorr achieves high detection power, reliably controls the False Discovery Rate (FDR), and is computationally efficient. Importantly, spCorr reveals biologically meaningful correlation patterns that highlight fine-scale tissue structures, gene module functions, and region-specific interactions, offering new opportunities to study coordinated gene regulation in spatial transcriptomics. Competing Interest Statement: The authors have declared no competing interest. Funding: U.S. National Science Foundation, DBI-1846216, DMS- 211375; NIH Common Fund, https://ror.org/001d55x84, R01GM120507, R35GM14088

Link

Song, Chen, Lee, Li, Ge, and Li, "Synthetic control removes spurious discoveries from double dipping in single-cell and spatial transcriptomics data analyses", bioRxiv (2023), DOI:10.1101/2023.07.21.550107

Double dipping is a well-known pitfall in single-cell and spatial transcriptomics data analysis: after a clustering algorithm finds clusters as putative cell types or spatial domains, statistical tests are applied to the same data to identify differentially expressed (DE) genes as potential cell-type or spatial-domain markers. Because the genes that contribute to clustering are inherently likely to be identified as DE genes, double dipping can result in false-positive cell-type or spatial-domain markers, especially when clusters are spurious, leading to ambiguously defined cell types or spatial domains. To address this challenge, we propose ClusterDE, a statistical method designed to identify post-clustering DE genes as reliable markers of cell types and spatial domains, while controlling the false discovery rate (FDR) regardless of clustering quality. The core of ClusterDE involves generating synthetic null data as an in silico negative control that contains only one cell type or spatial domain, allowing for the detection and removal of spurious discoveries caused by double dipping. We demonstrate that ClusterDE controls the FDR and identifies canonical cell-type and spatial-domain markers as top DE genes, distinguishing them from housekeeping genes. ClusterDE’s ability to discover reliable markers, or the absence of such markers, can be used to determine whether two ambiguous clusters should be merged. Additionally, ClusterDE is compatible with state-of-the-art analysis pipelines like Seurat and Scanpy.

Link

Hemminger, Sanchez-Tam, De Ocampo, Wang, Underwood, Xie, Zhao, Song, Li, Dong, and Wollman, "Spatial Single-Cell Mapping of Transcriptional Differences Across Genetic Backgrounds in Mouse Brains", bioRxiv (2024), DOI:10.1101/2024.10.08.617260

Genetic variation can alter brain structure and, consequently, function. Comparative statistical analysis of mouse brains across genetic backgrounds requires spatial, single-cell, atlas-scale data, in replicates—a challenge for current technologies. We introduce Atlas-scale Transcriptome Localization using Aggregate Signatures (ATLAS), a scalable tissue mapping method. ATLAS learns transcriptional signatures from scRNAseq data, encodes them in situ with tens of thousands of oligonucleotide probes, and decodes them to infer cell types and imputed transcriptomes. We validated ATLAS by comparing its cell type inferences with direct MERFISH measurements of marker genes and quantitative comparisons to four other technologies. Using ATLAS, we mapped the central brains of five male and five female C57BL/6J (B6) mice and five male BTBR T+ tf/J (BTBR) mice, an idiopathic model of autism, collectively profiling over 40 million cells across over 400 coronal sections. Our analysis revealed over 40 significant differences in cell type distributions and identified 16 regional composition changes across male-female and B6-BTBR comparisons. ATLAS thus enables systematic comparative studies, facilitating organ-level structure-function analysis of disease models. Competing Interest Statement: ZH and RW are the co-founders of Biocartography Inc.

Link

Wang, Zhai, Song, and Li, "Review of computational methods for estimating cell potency from single-cell RNA-seq data, with a detailed analysis of discrepancies between method description and code implementation", arXiv (2023), DOI:10.48550/arXiv.2309.13518

In single-cell RNA sequencing (scRNA-seq) analysis, a key challenge is inferring hidden cellular dynamics from static cell snapshots. Various computational methods have been developed to address this, focusing on perspectives like pseudotime trajectories, RNA velocities, and estimating the differentiation potential of cells, often referred to as "cell potency." This review summarizes 14 methods for defining cell potency from scRNA-seq data, categorizing them into average-based, entropy-based, and correlation-based methods based on how they summarize gene expression levels into a potency measure. We highlight the key similarities and differences within and between these categories, offering a high-level intuition for each method. Additionally, we use unified mathematical notations to detail each method's methodology and summarize their usage complexities, including parameters, required inputs, and differences between published descriptions and software implementations. We conclude that cell potency estimation remains an open question without a consensus on the optimal approach, emphasizing the need for benchmark datasets and studies. This review aims to provide a foundation for future benchmark studies, while also addressing the broader challenge of comparing methods that infer cellular dynamics from scRNA-seq data through various perspectives, including pseudotime trajectories, RNA velocities, and cell potency.

Link

Peer reviewed

Lai, Song, Xia, and Li, "PseudotimeDE-fast: fast testing of differential gene expression along cell pseudotime", Bioinformatics (2025), DOI:10.1093/bioinformatics/btaf573

Summary: Identifying differentially expressed (DE) genes along cell pseudotime is crucial for understanding dynamic biological processes captured by single-cell RNA sequencing. However, existing DE methods either produce invalid P-values by ignoring the uncertainty in pseudotime inference or struggle to scale with the growing size of modern datasets. To address these limitations, we introduce PseudotimeDE-fast, a scalable method for detecting DE genes along pseudotime with well-calibrated P-values. Through comprehensive simulations and real-data analyses, we demonstrate that PseudotimeDE-fast delivers comparable or superior performance to existing approaches while offering substantial improvements in computational efficiency.

Link

Li, Patel, Song, Yasa, Cannoodt, Yan, Li, and Pinello, "Systematic benchmarking of computational methods to identify spatially variable genes", Genome Biology (2025), DOI:10.1186/s13059-025-03731-2

Background: Spatially resolved transcriptomics offers unprecedented insight by enabling the profiling of gene expression within the intact spatial context of cells, effectively adding a new and essential dimension to data interpretation. To efficiently detect spatial structure of interest, an essential step in analyzing such data involves identifying spatially variable genes (SVGs). Despite researchers having developed several computational methods to accomplish this task, the lack of a comprehensive benchmark evaluating their performance remains a considerable gap in the field. Results: Here, we systematically evaluate 14 methods using 96 spatial datasets and 6 metrics. We compare the methods regarding gene ranking and classification based on real spatial variation, statistical calibration, and computation scalability and investigate the impact of identified SVGs on downstream applications such as spatial domain detection. Finally, we explore the applicability of the methods to spatial ATAC-seq data to examine their effectiveness in identifying spatially variable peaks (SVPs). Overall, SPARK-X outperforms other benchmarked methods and Moran’s I achieves a competitive performance, representing a strong baseline for future method development. Moreover, our results reveal that most methods are poorly calibrated, and more specialized algorithms are needed to identify spatially variable peaks. Conclusions: Our benchmarking provides a detailed comparison of SVG detection methods and serves as a reference for both users and method developers.
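The abstract above names Moran's I as a strong baseline for SVG detection. As a minimal illustration of that statistic, the sketch below computes Moran's I for one gene on a regular grid. The rook-adjacency weights, the toy expression patterns, and the function names `rook_weights` and `morans_i` are assumptions for illustration; the benchmark's own weighting scheme and software are not specified here.

```python
import numpy as np

def rook_weights(coords):
    """Binary spatial weights: 1 for spots exactly one grid step apart."""
    coords = np.asarray(coords)
    manhattan = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=2)
    return (manhattan == 1).astype(float)

def morans_i(values, weights):
    """Moran's I = (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    x = np.asarray(values, dtype=float)
    z = x - x.mean()
    numerator = (weights * np.outer(z, z)).sum()
    denominator = (z ** 2).sum()
    return (len(x) / weights.sum()) * numerator / denominator

# Toy 6x6 grid of spots (hypothetical data).
coords = [(r, c) for r in range(6) for c in range(6)]
W = rook_weights(coords)

# A smooth left-to-right gradient: neighbours have similar values.
gradient = [c for r, c in coords]
# A checkerboard: every neighbour has the opposite value.
checker = [1.0 if (r + c) % 2 == 0 else -1.0 for r, c in coords]

i_gradient = morans_i(gradient, W)  # positive spatial autocorrelation
i_checker = morans_i(checker, W)    # negative spatial autocorrelation
```

Genes with large positive values of the statistic vary smoothly over space, which is the sense in which it can be used to rank candidate SVGs.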

Link

Zhang, Li, Song, Yukselen, Nanda, Kucukural, Li, Garber, and Walhout, "Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq", Nature Communications (2025), DOI:10.1038/s41467-025-60154-0

The transcriptome provides a highly informative molecular phenotype to connect genotype to phenotype and is most frequently measured by RNA-sequencing (RNA-seq). Therefore, an ultimate goal is to perturb every gene and measure changes in the transcriptome. However, this remains challenging, especially in intact organisms due to different experimental and computational challenges. Here, we present ‘Worm Perturb-Seq (WPS)’, which provides high-resolution RNA-seq profiles for hundreds of replicate perturbations at a time in a living animal. WPS introduces multiple experimental advances that combine strengths of bulk and single cell RNA-seq, and that further provides an analytical framework, EmpirDE, that leverages the unique power of the large WPS datasets. EmpirDE identifies differentially expressed genes (DEGs) by using gene-specific empirical null distributions, rather than control conditions alone, thereby systematically removing technical biases and improving statistical rigor. We applied WPS to 103 Caenorhabditis elegans nuclear hormone receptors (NHRs) to delineate a Gene Regulatory Network (GRN) and found that this GRN presents a striking ‘pairwise modularity’ where pairs of NHRs regulate shared target genes. We envision that the experimental and analytical advances of WPS should be useful not only for C. elegans, but will be broadly applicable to other models, including human cells.

Link

Wang, Ge, Song, and Li, "Comment on “Data Fission: Splitting a Single Data Point”: Data Fission for Unsupervised Learning: A Discussion on Post-Clustering Inference and the Challenges of Debiasing", Journal of the American Statistical Association (2025), DOI:10.1080/01621459.2024.2412191

Abstract not available.

Link

Song, Liu, Dai, McMyn, Wang, Yang, Krejci, Vasilyev, Untermoser, Loregger, Song, Williams, Rosen, Cheng, Chao, Kale, Zhang, Diao, Bürckstümmer, Siliciano, Li, Siliciano, Huangfu, and Li, "Decoding heterogeneous single-cell perturbation responses", Nature Cell Biology (2025), DOI:10.1038/s41556-025-01626-9

Abstract not available.

Link

Kock, Kimes, Gisselbrecht, Inukai, Phanor, Anderson, Ramakrishnan, Lipper, Song, Kurland, Rogers, Jeong, Blacklow, Irizarry, and Bulyk, "DNA binding analysis of rare variants in homeodomains reveals homeodomain specificity-determining residues", Nature Communications (2024), DOI:10.1038/s41467-024-47396-0

Homeodomains (HDs) are the second largest class of DNA binding domains (DBDs) among eukaryotic sequence-specific transcription factors (TFs) and are the TF structural class with the largest number of disease-associated mutations in the Human Gene Mutation Database (HGMD). Despite numerous structural studies and large-scale analyses of HD DNA binding specificity, HD-DNA recognition is still not fully understood. Here, we analyze 92 human HD mutants, including disease-associated variants and variants of uncertain significance (VUS), for their effects on DNA binding activity. Many of the variants alter DNA binding affinity and/or specificity. Detailed biochemical analysis and structural modeling identifies 14 previously unknown specificity-determining positions, 5 of which do not contact DNA. The same missense substitution at analogous positions within different HDs often exhibits different effects on DNA binding activity. Variant effect prediction tools perform moderately well in distinguishing variants with altered DNA binding affinity, but poorly in identifying those with altered binding specificity. Our results highlight the need for biochemical assays of TF coding variants and prioritize dozens of variants for further investigations into their pathogenicity and the development of clinical diagnostics and precision therapies.

Link

Song, Wang, Yan, Liu, Sun, and Li, "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics", Nature Biotechnology (2024), DOI:10.1038/s41587-023-01772-1

We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

Link

Yan, Song, and Li, "scReadSim: a single-cell RNA-seq and ATAC-seq read simulator", Nature Communications (2023), DOI:10.1038/s41467-023-43162-w

Benchmarking single-cell RNA-seq (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) computational tools demands simulators to generate realistic sequencing reads. However, none of the few read simulators aim to mimic real data. To fill this gap, we introduce scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads (in a FASTQ or BAM file) by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data. Moreover, scReadSim provides ground truths, including unique molecular identifier (UMI) counts for scRNA-seq and open chromatin regions for scATAC-seq. In particular, scReadSim allows users to design cell-type-specific ground-truth open chromatin regions for scATAC-seq data generation. In benchmark applications of scReadSim, we show that UMI-tools achieves the top accuracy in scRNA-seq UMI deduplication, and HMMRATAC and MACS3 achieve the top performance in scATAC-seq peak calling.

Link

Cui, Song, Wong, and Li, "Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime", Bioinformatics (2022), DOI:10.1093/bioinformatics/btac423

Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models. Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes. The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM. Supplementary data are available at Bioinformatics online.

Link

Jiang, Sun, Song, and Li, "Statistics or biology: the zero-inflation controversy about scRNA-seq data", Genome Biology (2022), DOI:10.1186/s13059-022-02601-5

Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.

Link

Song, Xi, Li, and Wang, "scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data", Bioinformatics (2022), DOI:10.1093/bioinformatics/btac271

The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. scSampler is implemented in Python and is published under the MIT source license. It can be installed by “pip install scsampler” and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. Supplementary data are available at Bioinformatics online.

Link

Sun, Song, Li, and Li, "Simulating Single-Cell Gene Expression Count Data with Preserved Gene Correlations by scDesign2", Journal of Computational Biology (2022), DOI:10.1089/cmb.2021.0440

scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.

Link

Ge, Chen, Song, McDermott, Woyshner, Manousopoulou, Wang, Li, Wang, and Li, "Clipper: p-value-free FDR control on high-throughput data from two conditions", Genome Biology (2021), DOI:10.1186/s13059-021-02506-9

High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
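One generic way to control the FDR without p-values is to threshold per-feature contrast scores, using the count of large negative scores to estimate how many large positive scores are false. The sketch below shows a Barber-Candès style threshold under that assumption. It is an illustration of the general idea only, not necessarily Clipper's exact procedure; the function name `bc_threshold` and the toy contrast scores are hypothetical.

```python
def bc_threshold(contrast, q):
    """Smallest cutoff t whose estimated false discovery proportion
    (1 + #{score <= -t}) / max(#{score >= t}, 1) is at most q."""
    candidates = sorted({abs(c) for c in contrast if c != 0})
    for t in candidates:
        n_negative = sum(1 for c in contrast if c <= -t)
        n_positive = sum(1 for c in contrast if c >= t)
        if (1 + n_negative) / max(n_positive, 1) <= q:
            return t
    return float("inf")  # no cutoff reaches the target level

# Toy contrast scores (hypothetical): e.g. a summary of condition A minus
# a summary of condition B for each feature. Scores of uninteresting
# features are assumed to be symmetric around zero.
signal = [5.0, 4.6, 4.2, 3.9, 3.5, 3.3, 3.1, 2.8, 2.6, 2.4, 2.2, 2.0]
noise = [0.4, -0.3, 0.2, -0.5, 0.1, -0.1, 0.3, -0.2]
contrast = signal + noise

threshold = bc_threshold(contrast, q=0.1)
discoveries = [i for i, c in enumerate(contrast) if c >= threshold]
```

The design choice worth noting is that the negative scores act as an internal negative control, so no distributional assumption is needed to convert scores into p-values.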

Link

Song, Li, Hemminger, Wollman, and Li, "scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling", Bioinformatics (2021), DOI:10.1093/bioinformatics/btab273

Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. Supplementary data are available at Bioinformatics online.

Link

Sun, Song, Li, and Li, "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured", Genome Biology (2021), DOI:10.1186/s13059-021-02367-2

A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.

Link

Song and Li, "PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data", Genome Biology (2021), DOI:10.1186/s13059-021-02341-y

To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.

Link

Miller, Hayashi, Song, and Wiens, "Explaining the ocean's richest biodiversity hotspot and global patterns of fish diversity", Proceedings of the Royal Society B: Biological Sciences (2018), DOI:10.1098/rspb.2018.1314

For most marine organisms, species richness peaks in the Central Indo-Pacific region and declines longitudinally, a striking pattern that remains poorly understood. Here, we used phylogenetic approaches to address the causes of richness patterns among global marine regions, comparing the relative importance of colonization time, number of colonization events, and diversification rates (speciation minus extinction). We estimated regional richness using distributional data for almost all percomorph fishes (17 435 species total, including approximately 72% of all marine fishes and approximately 33% of all freshwater fishes). The high diversity of the Central Indo-Pacific was explained by its colonization by many lineages 5.3–34 million years ago. These relatively old colonizations allowed more time for richness to build up through in situ diversification compared to other warm-marine regions. Surprisingly, diversification rates were decoupled from marine richness patterns, with clades in low-richness cold-marine habitats having the highest rates. Unlike marine richness, freshwater diversity was largely derived from a few ancient colonizations, coupled with high diversification rates. Our results are congruent with the geological history of the marine tropics, and thus may apply to many other organisms. Beyond marine biogeography, we add to the growing number of cases where colonization and time-for-speciation explain large-scale richness patterns instead of diversification rates.

Link

Song, Wang, Song, Zhou, Xu, Yang, Yang, and Lu, "Increased novel single nucleotide polymorphisms in weedy rice populations associated with the change of farming styles: Implications in adaptive mutation and evolution", Journal of Systematics and Evolution (2017), DOI:10.1111/jse.12230

Substantial genetic variation is found in weedy rice (Oryza sativa f. spontanea Roshev.) populations from different rice-planting regions with the change of farming styles. To determine the association of such genetic variation with rice farming changes is critical for understanding the adaptive evolution of weedy rice. We studied weedy-rice specific novel single nucleotide polymorphisms (SNPs) by genome-wide comparison between DNA sequences of weedy and cultivated rice, in addition to polymerase chain reaction fingerprinting at 22 selected novel SNP loci in weedy rice populations. A great number of novel SNPs were identified across the weedy rice genome. High frequencies of the novel SNPs were determined at the 22 selected loci, although with considerable variation among weedy rice populations in different rice-planting regions. The highest frequency (~57%) of novel SNPs was identified in weedy rice populations from Jiangsu that experienced the most dramatic changes in rice farming styles, including the shift from transplanting to direct seeding, and from indica to japonica varieties. The lowest frequency (~29%) was detected in weedy rice populations from Northeast China, where rice farming has undergone relatively less change. The association between frequencies of novel SNPs in weedy rice populations and the extent of changes in rice farming styles suggests the critical role of adaptive mutation and accumulation of the mutation influenced by human activities in the rapid evolution of weedy rice.

Link