The Dilemma with Gene Differential Expression Analysis in Single Cell/Nuclei Data
- Shahzaib Ali
- Nov 24, 2024
Differential Gene Expression (DGE) Analysis is a technique used to identify genes that are differentially expressed between two or more conditions, such as knockout vs. wild-type in bulk RNA-seq or comparing different cell types or clusters in single-cell RNA-seq. While the general principle of identifying genes that show significant differences in expression remains the same, the methods used in bulk RNA-seq and single-cell RNA-seq differ considerably due to the nature of the data in each case.
Bulk RNA-seq vs. Single-cell RNA-seq
In bulk RNA-seq, the RNA from a large number of cells (often millions) is pooled together to generate an average expression profile for the tissue or sample of interest. This approach assumes that the gene expression levels are representative of the entire cell population, and the focus is on detecting differences in gene expression between conditions, such as a knockout mutant versus a wild-type sample.
The most commonly used tools for Differential Expression (DE) analysis in bulk RNA-seq are DESeq2 and edgeR. Both are based on parametric models that assume a specific distribution for the data, namely a negative binomial distribution for RNA-seq counts. Here’s a breakdown of how they work:
DESeq2: This tool uses a negative binomial distribution to model count data and is designed to handle overdispersion (where the variance exceeds the mean, which is typical in RNA-seq data). DESeq2 uses a log2 fold change to measure the magnitude of expression differences and a Wald test or likelihood ratio test to assess statistical significance. DESeq2 normalizes the data by estimating size factors to account for differences in sequencing depth across samples.
edgeR: Similar to DESeq2, edgeR uses a negative binomial distribution to model count data and tests for differential expression using either an exact test (for simple two-group comparisons) or, within its generalized linear model framework, a likelihood ratio test (LRT). edgeR normalizes count data by estimating a normalization factor for each library (TMM by default) and uses empirical Bayes methods to shrink dispersion estimates, improving statistical power for small datasets or lowly expressed genes.
These methods perform well when the data are relatively well-behaved (e.g., the RNA-seq counts are not overly sparse). They also assume that expression differences between groups are not driven by confounding factors such as batch effects or technical variation, unless those factors are explicitly included as covariates in the model design.
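The size-factor idea mentioned for DESeq2 can be sketched in a few lines. This is a minimal illustration of the median-of-ratios calculation on a made-up count matrix, not DESeq2's actual code; the function name and the toy counts are assumptions introduced here.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes detected in every sample so that the logarithm is defined.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # The per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # A sample's size factor is its median ratio to that reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 10],   # dropped from the estimate because of the zero
])
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # depth-corrected counts
```

In this toy case the second size factor is twice the first, so dividing by the size factors puts the two samples on the same scale for the genes used in the estimate.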
In contrast, single-cell RNA-seq generates expression profiles for individual cells, often producing data for tens of thousands to millions of single cells. This high level of granularity allows for the identification of rare cell types and the study of cell-to-cell heterogeneity, which is particularly valuable in fields like developmental biology, cancer research, and immunology. However, this high-dimensional data also presents unique challenges.
Single-cell RNA-seq data are sparse, meaning that most cells will have zero counts for many genes. This is due to the limited amount of RNA captured per cell and to dropout events, in which transcripts that are present in the cell are not detected. Additionally, single-cell data are highly variable, with some genes showing extreme fluctuations in expression from one cell to another. Therefore, traditional parametric methods like those used in bulk RNA-seq (e.g., DESeq2 and edgeR) are often not well-suited when applied directly to the counts of individual cells.
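To get a feel for what sparse and highly variable means in practice, the sketch below simulates counts from a negative binomial distribution with a low mean and high dispersion. The mean, dispersion, and matrix size are arbitrary values chosen for illustration, not estimates from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-gene parameters: low mean, strong overdispersion.
mu, alpha = 0.5, 2.0      # variance of the model is mu + alpha * mu**2
n = 1.0 / alpha           # negative binomial "size" parameter
p = n / (n + mu)          # corresponding success probability

# Simulated cells x genes count matrix (illustrative only).
counts = rng.negative_binomial(n, p, size=(5000, 200))

zero_fraction = (counts == 0).mean()   # theoretical value is p**n, about 0.71
gene_mean = counts.mean(axis=0)
gene_var = counts.var(axis=0)          # exceeds the mean under overdispersion
```

Even with no technical dropout modelled at all, roughly 70% of the simulated entries are zero and the per-gene variance is about twice the mean, which is the kind of matrix a single-cell test has to cope with.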
Instead, non-parametric tests and specialized models are often used to analyze single-cell RNA-seq data. One of the most popular approaches is the Wilcoxon rank-sum test, which is a non-parametric test that compares the distribution of expression values between two groups (e.g., normal vs. cancer cells, or one cluster vs. another). Since single-cell RNA-seq data tend to be zero-inflated (many genes have zero expression in a large portion of cells), the Wilcoxon test is particularly useful because it does not assume any specific distribution and is robust to outliers and sparse data.
Negative Binomial and Related Models: Some tools, such as Seurat and Monocle, offer tests based on the negative binomial distribution, which accounts for overdispersion (high variance relative to the mean). MAST takes a different route and fits a two-part hurdle model that treats whether a gene is detected in a cell separately from its expression level in the cells where it is detected, which is one way of handling dropout events (when a gene is not detected in a cell even though it may be expressed). These models can provide more accurate estimates of gene expression differences, especially for lowly expressed genes.
Wilcoxon Rank-Sum Test: As mentioned earlier, the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) is widely used in single-cell RNA-seq differential expression analysis. For each gene, it pools the expression values from both groups, ranks them, and compares the sum of the ranks in one group with what would be expected if the groups did not differ. This test is particularly effective in scenarios where the data are sparse and non-normally distributed, which is a common feature of single-cell RNA-seq.
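The ranking step described above can be written out by hand and checked against SciPy's `mannwhitneyu`. The expression values below are simulated for a single hypothetical gene; the cluster sizes and zero rates are assumptions made for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(0)

# Simulated normalized expression of one gene (illustrative only):
# cluster A is mostly zeros, cluster B detects the gene more often.
cluster_a = np.where(rng.random(200) < 0.7, 0.0, rng.gamma(2.0, 1.0, 200))
cluster_b = np.where(rng.random(300) < 0.3, 0.0, rng.gamma(2.0, 1.0, 300))

# Rank the pooled values (ties share their average rank), then sum cluster A's ranks.
ranks = rankdata(np.concatenate([cluster_a, cluster_b]))
rank_sum_a = ranks[:cluster_a.size].sum()

# The U statistic is the rank sum minus its smallest possible value.
n_a, n_b = cluster_a.size, cluster_b.size
u_a = rank_sum_a - n_a * (n_a + 1) / 2

# SciPy performs the same ranking and adds a p-value.
result = mannwhitneyu(cluster_a, cluster_b, alternative="two-sided")
```

Because the test only looks at ranks, the many tied zeros are handled through average ranks rather than through any distributional assumption, which is why it copes with sparse data.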
Challenges of Differential Expression in Imbalanced Groups
One major challenge in single-cell RNA-seq differential expression analysis arises when comparing groups of very unequal sizes. For example, if one group contains only 100 cells (e.g., a rare cancer cell type) and the other group contains 20,000 cells (e.g., a more common cell type), the larger group may dominate the differential expression analysis, leading to false positives and biased results.
In such cases, the Wilcoxon rank-sum test (or other statistical methods) can be influenced by the group size imbalance: the larger group contributes most of the ranks, and with thousands of cells even very small shifts in expression can reach statistical significance. This can lead to the identification of genes whose differences are statistically significant but not biologically meaningful.
Addressing the Group Size Imbalance: Bootstrapping
To address this issue, a method called bootstrapping can be applied. Here, bootstrapping means repeatedly drawing random samples from the larger group to match the size of the smaller group and performing differential expression analysis on each sample. (Strictly speaking, a bootstrap resamples with replacement; drawing a smaller set of cells from the larger group is often called subsampling or downsampling.) For example, if the smaller group has 100 cells and the larger group has 20,000 cells, you would randomly sample 100 cells from the larger group and compare them with the 100 cells from the smaller group. This process is repeated many times (e.g., 1,000 iterations), and the log fold changes for each gene are averaged across all iterations.
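The resampling loop described above might look roughly like the following. This is a sketch under several assumptions: the function name, the pseudocount of 1, the use of mean expression for the fold change, and the simulated Poisson data are all choices made for this example rather than part of any particular package.

```python
import numpy as np

def mean_subsampled_log2fc(small, large, n_iter=1000, pseudocount=1.0, seed=0):
    """Average per-gene log2 fold change (small vs. large) over size-matched draws.

    small, large: cells x genes arrays of normalized expression.
    """
    rng = np.random.default_rng(seed)
    n_small = small.shape[0]
    small_mean = small.mean(axis=0)
    log2fc = np.empty((n_iter, small.shape[1]))
    for i in range(n_iter):
        # Draw as many cells from the large group as the small group contains.
        idx = rng.choice(large.shape[0], size=n_small, replace=False)
        large_mean = large[idx].mean(axis=0)
        log2fc[i] = np.log2(small_mean + pseudocount) - np.log2(large_mean + pseudocount)
    # One averaged log2 fold change per gene.
    return log2fc.mean(axis=0)

# Simulated data (illustrative only): 100 rare cells vs. 20,000 common cells, 3 genes.
rng = np.random.default_rng(1)
rare = rng.poisson([4.0, 1.0, 1.0], size=(100, 3)).astype(float)      # gene 0 is up
common = rng.poisson([1.0, 1.0, 1.0], size=(20000, 3)).astype(float)

avg_log2fc = mean_subsampled_log2fc(rare, common, n_iter=200)
```

Here only the first gene is simulated with a real difference, so its averaged log2 fold change should be clearly positive while the other two stay near zero. Setting `replace=True` would turn the draw into a bootstrap in the strict sense.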
The advantage of bootstrapping is that it accounts for the size disparity between groups and helps to reduce false positives by averaging the expression differences over many random samples. This approach allows for the identification of rare cell types and biomarkers that are unique to the smaller group, providing a more reliable estimate of differential expression.
However, one limitation of bootstrapping is that only genes that appear in every iteration are reported. If a gene is differentially expressed in some iterations but not in others, it will not be included in the final list, so the result is restricted to genes that are consistently significant across all iterations.
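The "every iteration" rule can be made concrete with a small sketch. The function below is hypothetical: it runs a per-gene Wilcoxon rank-sum test on each size-matched draw and keeps a gene only if it passes the threshold every time. It assumes SciPy 1.7 or later (for the `axis` argument) and, for brevity, omits multiple-testing correction, which a real analysis would need.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def consistently_significant(small, large, n_iter=50, alpha=0.01, seed=0):
    """Return (keep, frequency): genes significant in every draw, and how often each was."""
    rng = np.random.default_rng(seed)
    n_small, n_genes = small.shape
    hits = np.zeros(n_genes, dtype=int)
    for _ in range(n_iter):
        idx = rng.choice(large.shape[0], size=n_small, replace=False)
        # axis=0 tests each gene (column) separately; the default test is two-sided.
        pvals = mannwhitneyu(small, large[idx], axis=0).pvalue
        hits += pvals < alpha
    return hits == n_iter, hits / n_iter

# Simulated data (illustrative only): gene 0 differs between groups, genes 1 and 2 do not.
rng = np.random.default_rng(1)
rare = rng.poisson([4.0, 1.0, 1.0], size=(100, 3)).astype(float)
common = rng.poisson([1.0, 1.0, 1.0], size=(5000, 3)).astype(float)

keep, frequency = consistently_significant(rare, common)
```

Returning the per-gene frequency alongside the strict filter makes the trade-off visible: a gene that is significant in, say, 95% of draws is dropped by the strict rule even though it may still be worth a closer look.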


Conclusion: The Need for Better Tools
In summary, while bulk RNA-seq and single-cell RNA-seq both aim to identify differential gene expression, they involve different statistical methods due to the nature of the data. Bulk RNA-seq relies on parametric tools like DESeq2 and edgeR, which assume a negative binomial distribution and work well when the data are not overly sparse. On the other hand, single-cell RNA-seq data are sparse, highly variable, and often zero-inflated, making non-parametric tests like the Wilcoxon rank-sum test a common choice for identifying differential expression.
When dealing with imbalanced group sizes in single-cell RNA-seq (e.g., rare vs. abundant cell types), bootstrapping can be a useful technique to reduce false positives and provide more reliable results. However, this method also has its limitations, such as the requirement that only genes appearing in every bootstrap iteration are reported, potentially missing out on genes that are only differentially expressed in a subset of the samples.
As the field of single-cell RNA-seq continues to evolve, the development of more sophisticated tools to address these challenges will be crucial for improving the accuracy and reproducibility of differential expression analysis in highly imbalanced datasets.


