Genetic variation can alter brain structure and, consequently, function. Comparative statistical analysis of mouse brains across genetic backgrounds requires spatial, single-cell, atlas-scale data in replicates, which remains a challenge for current technologies. We introduce Atlas-scale Transcriptome Localization using Aggregate Signatures (ATLAS), a scalable tissue mapping method. ATLAS learns transcriptional signatures from scRNAseq data, encodes them in situ with tens of thousands of oligonucleotide probes, and decodes them to infer cell types and imputed transcriptomes. We validated ATLAS by comparing its cell type inferences with direct MERFISH measurements of marker genes and through quantitative comparisons with four other technologies. Using ATLAS, we mapped the central brains of five male and five female C57BL/6J (B6) mice and five male BTBR T+ tf/J (BTBR) mice, an idiopathic model of autism, collectively profiling over 40 million cells across over 400 coronal sections. Our analysis revealed over 40 significant differences in cell type distributions and identified 16 regional composition changes across male-female and B6-BTBR comparisons. ATLAS thus enables systematic comparative studies, facilitating organ-level structure-function analysis of disease models. Competing Interest Statement: ZH and RW are the co-founders of Biocartography Inc.
Spatially resolved transcriptomics offers unprecedented insight by enabling the profiling of gene expression within the intact spatial context of cells, effectively adding a new and essential dimension to data interpretation. An essential step in analyzing such data is identifying spatially variable genes, which allows spatial structures of interest to be detected efficiently. Although researchers have developed several computational methods for this task, the lack of a comprehensive benchmark evaluating their performance remains a considerable gap in the field. Here, we present a systematic evaluation of 14 methods using 60 simulated datasets generated by four different simulation strategies, 12 real-world transcriptomics datasets, and three spatial ATAC-seq datasets. We find that spatialDE2 consistently outperforms the other benchmarked methods, and Moran’s I achieves competitive performance in different experimental settings. Moreover, our results reveal that more specialized algorithms are needed to identify spatially variable peaks. Competing Interest Statement: The authors have declared no competing interest.
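As an illustration of the Moran’s I statistic mentioned above, the following is a minimal Python sketch, not the implementation used in the benchmark. It assumes binary k-nearest-neighbour spatial weights; the function name `morans_i`, the parameter `k`, and the toy grid are hypothetical choices made for this example.

```python
import math


def morans_i(values, coords, k=4):
    """Moran's I for one gene, using binary k-nearest-neighbour weights.

    The weighting scheme (w_ij = 1 if j is among the k nearest spots to i)
    is an assumption of this sketch; other weight matrices are common.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant expression: return 0 by convention in this sketch
    num = 0.0
    total_weight = 0
    for i in range(n):
        # The k spots closest to spot i, excluding spot i itself.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )[:k]
        for j in neighbours:
            num += dev[i] * dev[j]
            total_weight += 1
    return (n / total_weight) * num / denom


# Toy 6 x 6 grid of spots with two synthetic "genes".
grid = [(x, y) for x in range(6) for y in range(6)]
gradient = [float(x) for x, _ in grid]                 # smooth spatial trend
checkerboard = [float((x + y) % 2) for x, y in grid]   # alternating pattern

print(morans_i(gradient, grid))      # positive: neighbouring spots are similar
print(morans_i(checkerboard, grid))  # negative: neighbouring spots differ
```

A gene with a smooth spatial trend yields a clearly positive value, whereas an alternating pattern yields a negative one; values near zero suggest no spatial autocorrelation under the chosen weights.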
Understanding diverse responses of individual cells to the same perturbation is central to many biological and biomedical problems. Current methods, however, neither precisely quantify the strength of perturbation responses nor, more importantly, reveal new biological insights from heterogeneity in responses. Here we introduce the perturbation-response score (PS), based on constrained quadratic optimization, to quantify diverse perturbation responses at a single-cell level. Applied to single-cell transcriptomes of large-scale genetic perturbation datasets (e.g., Perturb-seq), PS outperforms existing methods for quantifying partial gene perturbation responses. In addition, PS presents two major advances. First, PS enables large-scale, single-cell-resolution dosage analysis of perturbation, without the need to titrate perturbation strength. By analyzing the dose-response patterns of over 2,000 essential genes in Perturb-seq, we identify two distinct patterns, depending on whether a moderate reduction in their expression induces strong downstream expression alterations. Second, PS identifies intrinsic and extrinsic biological determinants of perturbation responses. We demonstrate the application of PS in contexts such as T cell stimulation, latent HIV-1 expression, and pancreatic cell differentiation. Notably, PS unveiled a previously unrecognized, cell-type-specific role of coiled-coil domain containing 6 (CCDC6) in guiding liver and pancreatic lineage decisions, where CCDC6 knockouts drive endoderm cell differentiation towards the liver lineage rather than the pancreatic lineage. The PS approach provides an innovative method for dose-to-function analysis and will enable new biological discoveries from single-cell perturbation datasets. One-sentence summary: We present a method to quantify diverse perturbation responses and discover novel biological insights in single-cell perturbation datasets.
In single-cell RNA sequencing (scRNA-seq) analysis, a key challenge is inferring hidden cellular dynamics from static cell snapshots. Various computational methods have been developed to address this, focusing on perspectives such as pseudotime trajectories, RNA velocities, and the differentiation potential of cells, often referred to as "cell potency." This review summarizes 14 methods for defining cell potency from scRNA-seq data, categorizing them into average-based, entropy-based, and correlation-based methods based on how they summarize gene expression levels into a potency measure. We highlight the key similarities and differences within and between these categories, offering a high-level intuition for each method. Additionally, we use unified mathematical notations to detail each method's methodology and summarize their usage complexities, including parameters, required inputs, and differences between published descriptions and software implementations. We conclude that cell potency estimation remains an open question without a consensus on the optimal approach, emphasizing the need for benchmark datasets and studies. This review aims to provide a foundation for future benchmark studies, while also addressing the broader challenge of comparing methods that infer cellular dynamics from scRNA-seq data through various perspectives, including pseudotime trajectories, RNA velocities, and cell potency.
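To give a rough intuition for the entropy-based category, the sketch below computes the Shannon entropy of a single cell's normalized expression profile. It is a generic illustration and does not reproduce any of the 14 reviewed methods; the function name `expression_entropy` and the toy count vectors are assumptions made for this example.

```python
import math


def expression_entropy(counts):
    """Shannon entropy (in nats) of one cell's normalized expression profile.

    Generic illustration only: published entropy-based potency methods add
    further steps (e.g. gene selection or reference information) that are
    not modeled here.
    """
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty cell carries no expression profile
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


# Expression spread evenly over four genes versus concentrated on one gene.
broad_cell = [5, 5, 5, 5]
focused_cell = [20, 0, 0, 0]

print(expression_entropy(broad_cell))    # equals log(4), the maximum for 4 genes
print(expression_entropy(focused_cell))  # 0.0
```

Under this simple view, a cell whose expression is spread across many genes scores higher than one dominated by a few genes, which is the intuition that entropy-based potency measures build on.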
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as “double dipping”: the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis. Competing Interest Statement: The authors have declared no competing interest.
Homeodomains (HDs) are the second largest class of DNA binding domains (DBDs) among eukaryotic sequence-specific transcription factors (TFs) and are the TF structural class with the largest number of disease-associated mutations in the Human Gene Mutation Database (HGMD). Despite numerous structural studies and large-scale analyses of HD DNA binding specificity, HD-DNA recognition is still not fully understood. Here, we analyze 92 human HD mutants, including disease-associated variants and variants of uncertain significance (VUS), for their effects on DNA binding activity. Many of the variants alter DNA binding affinity and/or specificity. Detailed biochemical analysis and structural modeling identify 14 previously unknown specificity-determining positions, 5 of which do not contact DNA. The same missense substitution at analogous positions within different HDs often exhibits different effects on DNA binding activity. Variant effect prediction tools perform moderately well in distinguishing variants with altered DNA binding affinity, but poorly in identifying those with altered binding specificity. Our results highlight the need for biochemical assays of TF coding variants and prioritize dozens of variants for further investigations into their pathogenicity and the development of clinical diagnostics and precision therapies.
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Benchmarking single-cell RNA-seq (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) computational tools demands simulators to generate realistic sequencing reads. However, none of the few read simulators aim to mimic real data. To fill this gap, we introduce scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads (in a FASTQ or BAM file) by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data. Moreover, scReadSim provides ground truths, including unique molecular identifier (UMI) counts for scRNA-seq and open chromatin regions for scATAC-seq. In particular, scReadSim allows users to design cell-type-specific ground-truth open chromatin regions for scATAC-seq data generation. In benchmark applications of scReadSim, we show that UMI-tools achieves the top accuracy in scRNA-seq UMI deduplication, and HMMRATAC and MACS3 achieve the top performance in scATAC-seq peak calling.
Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models. Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes. The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM. Supplementary data are available at Bioinformatics online.
Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.
The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. scSampler is implemented in Python and is published under the MIT license. It can be installed by “pip install scsampler” and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. Supplementary data are available at Bioinformatics online.
scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
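For context on the conventional p-value-based FDR control that the abstract contrasts Clipper with, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. It is not Clipper, which avoids p-values altogether; the function name `benjamini_hochberg` and the example p-values are made up for this illustration.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of features rejected by the Benjamini-Hochberg procedure.

    This is the standard p-value-based route to FDR control, shown only as
    a point of comparison; it is not the Clipper method.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value passes.
        if pvalues[i] <= q * rank / m:
            cutoff_rank = rank
    # Reject every feature ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five features.
example_pvalues = [0.039, 0.001, 0.216, 0.008, 0.5]
print(benjamini_hochberg(example_pvalues, q=0.05))  # [1, 3]
```

The validity of this procedure depends on the p-values themselves being valid, which is the requirement the abstract identifies as hard to meet with few replicates or uncertain distributional assumptions.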
Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. Supplementary data are available at Bioinformatics online.
A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.
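To illustrate how a copula can carry gene correlations into simulated counts, the following is a minimal two-gene sketch using a Gaussian copula with Poisson marginals. It is not the scDesign2 implementation: the Poisson marginals, the two-gene setup, and all names (`poisson_quantile`, `simulate_two_genes`, `pearson`) are simplifying assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 10_000:  # cap guards against u rounding to 1.0
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_two_genes(n_cells, lam1, lam2, rho, seed=0):
    """Simulate counts for two genes coupled by a Gaussian copula.

    Poisson marginals are an assumption of this sketch, not a statement
    about which count distributions a real simulator fits.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    cells = []
    for _ in range(n_cells):
        # Step 1: draw a pair of standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: map each normal to a uniform via the normal CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Step 3: map each uniform to a count via the marginal quantile function.
        cells.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return cells


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


correlated = simulate_two_genes(2000, lam1=5.0, lam2=8.0, rho=0.8, seed=1)
independent = simulate_two_genes(2000, lam1=5.0, lam2=8.0, rho=0.0, seed=1)
print(pearson(correlated), pearson(independent))
```

The design point is that the copula separates the dependence structure (the correlated normals) from the marginal count distributions, so each gene keeps its own distribution while the simulated counts remain correlated.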
To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.
For most marine organisms, species richness peaks in the Central Indo-Pacific region and declines longitudinally, a striking pattern that remains poorly understood. Here, we used phylogenetic approaches to address the causes of richness patterns among global marine regions, comparing the relative importance of colonization time, number of colonization events, and diversification rates (speciation minus extinction). We estimated regional richness using distributional data for almost all percomorph fishes (17 435 species total, including approximately 72% of all marine fishes and approximately 33% of all freshwater fishes). The high diversity of the Central Indo-Pacific was explained by its colonization by many lineages 5.3–34 million years ago. These relatively old colonizations allowed more time for richness to build up through in situ diversification compared to other warm-marine regions. Surprisingly, diversification rates were decoupled from marine richness patterns, with clades in low-richness cold-marine habitats having the highest rates. Unlike marine richness, freshwater diversity was largely derived from a few ancient colonizations, coupled with high diversification rates. Our results are congruent with the geological history of the marine tropics, and thus may apply to many other organisms. Beyond marine biogeography, we add to the growing number of cases where colonization and time-for-speciation explain large-scale richness patterns instead of diversification rates.
Substantial genetic variation is found in weedy rice (Oryza sativa f. spontanea Roshev.) populations from different rice-planting regions with the change of farming styles. Determining the association of such genetic variation with rice farming changes is critical for understanding the adaptive evolution of weedy rice. We studied weedy-rice specific novel single nucleotide polymorphisms (SNPs) by genome-wide comparison between DNA sequences of weedy and cultivated rice, in addition to polymerase chain reaction fingerprinting at 22 selected novel SNP loci in weedy rice populations. A great number of novel SNPs were identified across the weedy rice genome. High frequencies of the novel SNPs were determined at the 22 selected loci, although with considerable variation among weedy rice populations in different rice-planting regions. The highest frequency (~57%) of novel SNPs was identified in weedy rice populations from Jiangsu that experienced the most dramatic changes in rice farming styles, including the shift from transplanting to direct seeding, and from indica to japonica varieties. The lowest frequency (~29%) was detected in weedy rice populations from Northeast China, where rice farming has undergone relatively less change. The association between frequencies of novel SNPs in weedy rice populations and the extent of changes in rice farming styles suggests the critical role of adaptive mutation and accumulation of the mutation influenced by human activities in the rapid evolution of weedy rice.