
5 scRNA-seq Analysis Mistakes I See in Almost Every Lab (And How to Fix Them)

By Cory Henn, MSCS · March 2026

[Figure: side-by-side UMAP plots of intestinal epithelial scRNA-seq data. Left panel: 31 Leiden clusters by number. Right panel: the same cells annotated by cell type, including enterocytes, goblet cells, Paneth cells, T cells, and cancer stem cells.]

Real intestinal epithelial scRNA-seq data: 31 Leiden clusters (left) resolved into named cell types after marker-based annotation (right). Even a clean, well-separated UMAP like this one does not make distances between clusters meaningful. The visual separation tells you populations are distinct; it does not tell you how distinct, or in what biological sense.

I've spent the last several years sitting at the intersection of wet lab immunology and computational analysis. I've prepped scRNA-seq libraries by hand and then written the code to analyze the results. That dual perspective has shown me a pattern: the same handful of analytical mistakes show up in lab after lab, across institutions, across career stages. None of these are career-ending errors, but each one quietly degrades your results or leads you to a conclusion your data does not actually support.

Mistake 1: Applying a One-Size-Fits-All QC Threshold

QC · Seurat · Scanpy · Thresholds

The default move: filter out cells with fewer than 200 genes, more than 5,000 genes, or more than 10% mitochondrial reads. These thresholds show up in tutorials, Seurat vignettes, and that one lab protocol everyone passes around. They are often wrong for your specific experiment.

QC thresholds are dataset-dependent. A 10% mitochondrial cutoff might be reasonable for PBMCs but far too aggressive for metabolically active tissue-resident macrophages or intestinal epithelial cells. If you are studying immune cells in an inflammatory microenvironment, you might be throwing away your most biologically interesting cells.

The Fix

Always plot the distributions first. Look at violin plots of nFeature, nCount, and percent.mt for your specific dataset. Set thresholds based on what you see, not what a tutorial told you. Consider adaptive thresholds that adjust to the actual distribution of your data, such as the median-absolute-deviation (MAD) rule behind scater's isOutlier(); the same rule can be applied to the QC metrics Scanpy computes. Document your rationale.
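
To make the adaptive-threshold idea concrete, here is a minimal sketch of a MAD-based outlier rule in plain Python. It is similar in spirit to what isOutlier() does, but it is not that function's implementation. The percent.mt values are made up, and the choice of 3 MADs is an assumption you should revisit for your own data.

```python
from statistics import median

def mad_outliers(values, nmads=3.0, side="higher"):
    """Flag values more than `nmads` scaled MADs from the median.

    side: "higher", "lower", or "both".
    """
    med = median(values)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the bulk of the data is roughly normal.
    mad = 1.4826 * median(abs(v - med) for v in values)
    upper = med + nmads * mad
    lower = med - nmads * mad
    flags = []
    for v in values:
        too_high = side in ("higher", "both") and v > upper
        too_low = side in ("lower", "both") and v < lower
        flags.append(too_high or too_low)
    return flags

# Hypothetical per-cell percent mitochondrial reads (illustrative only)
pct_mt = [4.1, 5.0, 6.2, 5.5, 4.8, 7.0, 5.9, 6.5, 38.0, 5.2]
flags = mad_outliers(pct_mt, nmads=3, side="higher")
kept = [v for v, bad in zip(pct_mt, flags) if not bad]
```

The cutoff here comes from the data itself, so a tissue with a naturally higher mitochondrial fraction gets a higher cutoff. For heavily skewed metrics such as total counts, it is common to apply the rule on a log scale.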

Mistake 2: Ignoring Batch Effects Until the End

Batch Effects · Harmony · scVI · Integration

You run your first sample through the pipeline. Beautiful clusters. Clear cell types. Then you add the second sample, and suddenly your UMAP looks like two islands with a canyon between them. Your biological signal is now confounded with technical variation.

Batch effects, whether from library prep day, sequencing lane, reagent lot, or even the person running the protocol, can be as large as or larger than the biological differences you are trying to detect. If you do not account for them early, everything downstream is compromised.

The Fix

Build batch correction into your pipeline from the start. For straightforward cases, Harmony integration is fast and effective. For more complex experimental designs, consider scVI or scANVI, which learn batch-corrected latent representations while preserving biological variation. Always compare your integrated embedding to an uncorrected version to verify you are removing technical noise, not biological signal.
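
One rough way to compare an integrated embedding against the uncorrected one is to ask how often a cell's nearest neighbours come from a different batch. The sketch below is a deliberately simple proxy written in plain Python. It is not kBET, LISI, or any published metric; the coordinates and the function name batch_mixing are hypothetical.

```python
from math import dist

def batch_mixing(coords, batches, k=3):
    """Mean fraction of each cell's k nearest neighbours that come from a different batch."""
    scores = []
    for i, point in enumerate(coords):
        # Brute-force distances to every other cell: fine for a sketch, too slow for real data
        others = sorted(
            (dist(point, coords[j]), j) for j in range(len(coords)) if j != i
        )
        neighbours = [j for _, j in others[:k]]
        scores.append(sum(batches[j] != batches[i] for j in neighbours) / k)
    return sum(scores) / len(scores)

batches = ["A"] * 4 + ["B"] * 4

# Hypothetical uncorrected embedding: two islands, one per batch
uncorrected = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Hypothetical integrated embedding: batch B now overlaps batch A
corrected = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.4, 0.6), (0.6, 0.4), (0.5, 0.6)]

before = batch_mixing(uncorrected, batches)  # 0.0: every neighbour shares the cell's batch
after = batch_mixing(corrected, batches)
```

A higher score after integration suggests batches are mixing, but that alone is not proof of success. If known cell types also stop separating, or if batches genuinely differ in composition, high mixing can mean biology was erased, so check marker expression alongside any score like this.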

Mistake 3: Over-Trusting Automated Cell Type Annotation

Annotation · SingleR · CellTypist · Validation

Automated annotation tools (SingleR, CellTypist, scType, or LLM-based approaches) are genuinely useful for a first-pass sense of what is in your data. The mistake is treating their output as ground truth without validation.

These tools are reference-dependent. If your cells do not match the training reference well, because they are from a disease state, a non-standard tissue, or a species with limited reference data, the annotations can be confidently wrong. I have seen effector memory T cells labeled as naive, and activated macrophages called dendritic cells, because the automated tool did not have the right reference context.

The Fix

Use automated tools as a starting point, then validate with known marker genes. Check that your CD8+ T cell cluster actually expresses CD8A and CD8B and not CD4. Flag low-confidence annotations for manual review. For immunology datasets, curate a marker gene panel from the literature that reflects the biology of your system, not just generic PBMC markers.
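
The marker check can be sketched as a small function that reports what fraction of cells in a cluster express each expected and each excluded marker. This is an illustration in plain Python, not a feature of any annotation tool. The cells, the function names, and the 50% and 10% thresholds are all assumptions chosen for the example.

```python
def fraction_expressing(cells, gene):
    """Fraction of cells with at least one count for `gene`."""
    return sum(1 for c in cells if c.get(gene, 0) > 0) / len(cells)

def check_annotation(cells, expected, excluded, min_frac=0.5, max_frac=0.1):
    """Return a list of problems; an empty list means the markers agree with the label."""
    problems = []
    for gene in expected:
        frac = fraction_expressing(cells, gene)
        if frac < min_frac:
            problems.append(f"{gene} expressed in only {frac:.0%} of cells")
    for gene in excluded:
        frac = fraction_expressing(cells, gene)
        if frac > max_frac:
            problems.append(f"{gene} unexpectedly expressed in {frac:.0%} of cells")
    return problems

# Hypothetical clusters, both labelled "CD8+ T cell" by an automated tool.
# Each cell is a dict of gene -> raw count.
plausible = [
    {"CD8A": 3, "CD8B": 2},
    {"CD8A": 1, "CD8B": 1},
    {"CD8A": 2},
    {"CD8A": 4, "CD8B": 1},
]
suspicious = [{"CD4": 5}, {"CD4": 2, "CD8A": 1}, {"CD4": 3}, {"CD4": 1}]

ok = check_annotation(plausible, expected=["CD8A", "CD8B"], excluded=["CD4"])
flagged = check_annotation(suspicious, expected=["CD8A", "CD8B"], excluded=["CD4"])
```

Because dropout makes many genes look absent in individual cells, the fraction expressing a marker is often well below 100% even in a correctly labelled cluster. Treat the thresholds as a trigger for manual review, not a verdict.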

Mistake 4: Running Differential Expression on Pseudoreplicates

Pseudobulk · DESeq2 · Statistics · Replication

This is the most statistically consequential mistake on this list, and it is alarmingly common. You have scRNA-seq data from 3 treated mice and 3 control mice and want to find differentially expressed genes. You run FindMarkers() or scanpy.tl.rank_genes_groups() comparing all treated cells vs. all control cells.

The problem: you are treating individual cells as independent observations when they are not. Cells from the same animal share genetics, environment, and technical processing. The real sample size in this design is 3 animals per group, not thousands of cells. By treating every cell as an independent replicate, you massively inflate your effective sample size, which shrinks p-values artificially and can produce false discovery rates above 50%.
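
A back-of-the-envelope way to see the inflation is the design effect from survey statistics, which discounts clustered observations by how correlated they are within a cluster. The sketch below assumes equal numbers of cells per animal and a single intraclass correlation (ICC). The ICC values are hypothetical, and this is an intuition aid, not what any differential expression tool computes.

```python
def effective_sample_size(n_animals, cells_per_animal, icc):
    """Effective number of independent observations under a simple design-effect model."""
    # Design effect: 1 + (m - 1) * ICC, where m is the cluster (animal) size
    design_effect = 1 + (cells_per_animal - 1) * icc
    return n_animals * cells_per_animal / design_effect

naive_n = 6 * 1000  # what a per-cell test implicitly assumes

# Hypothetical modest within-animal correlation
n_eff_modest = effective_sample_size(6, 1000, icc=0.05)

# Extreme case: cells from one animal are perfectly correlated
n_eff_full = effective_sample_size(6, 1000, icc=1.0)  # 6.0, one observation per animal
```

Even with a small assumed correlation, 6,000 cells carry far less independent information than a per-cell test credits them with, which is where the overly small p-values come from.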

The Fix

Use pseudobulk approaches for differential expression between conditions. Aggregate counts at the biological replicate level (per sample, per cell type), then use established bulk RNA-seq tools (DESeq2, edgeR, limma-voom) that properly model biological variability. Tools like Libra, muscat, or scran's pseudoBulkDGE make this straightforward.
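
The aggregation step itself is simple. Here is a minimal plain-Python sketch that sums raw counts per (sample, cell type), so that each biological replicate contributes one profile per cell type. The cell records, sample names, and the pseudobulk function are hypothetical; real pipelines would do this on a sparse count matrix.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts per (sample, cell_type) so each replicate becomes one profile."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}

# Hypothetical cells with raw (un-normalized) counts
cells = [
    {"sample": "treated_1", "cell_type": "T cell", "counts": {"IFNG": 3, "GZMB": 1}},
    {"sample": "treated_1", "cell_type": "T cell", "counts": {"IFNG": 2}},
    {"sample": "control_1", "cell_type": "T cell", "counts": {"IFNG": 0, "GZMB": 1}},
    {"sample": "control_1", "cell_type": "T cell", "counts": {"GZMB": 2}},
    {"sample": "control_1", "cell_type": "Macrophage", "counts": {"IFNG": 1}},
]
profiles = pseudobulk(cells)
```

Aggregate raw counts, not normalized or log-transformed values, because DESeq2, edgeR, and limma-voom expect counts and handle normalization themselves. With 3 treated and 3 control animals, the test for each cell type then runs on six profiles, which matches the true replication of the experiment.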

Mistake 5: Treating the UMAP as a Quantitative Result

UMAP · Visualization · Trajectory · Interpretation

UMAPs are beautiful. They are also one of the most misinterpreted visualizations in modern biology. The mistake: drawing quantitative conclusions from UMAP geometry, inferring that clusters are "close" or "far" based on their visual distance, or concluding that a trajectory exists because cells form a gradient on the plot.

UMAP is a dimensionality reduction technique optimized for preserving local neighborhood structure. Global distances on a UMAP are not reliably meaningful. Two clusters sitting next to each other may not be more similar than clusters on opposite ends of the plot. The shape, density, and spacing of clusters depend heavily on the algorithm's parameters (such as n_neighbors and min_dist) and on its random initialization, so they can change between runs on the same data.

The Fix

Use UMAPs for what they are good at: visualization and qualitative exploration. For quantitative relationships between cell states, use the underlying high-dimensional data or the graph structure directly. Report marker gene expression or distances in PCA space, and use formal trajectory and velocity tools (Monocle3, scVelo, CellRank) rather than pointing at UMAP shapes. Always report your UMAP parameters.
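
As one example of a distance metric in PCA space, the sketch below computes cluster centroids from per-cell principal component coordinates and then the Euclidean distance between every pair of centroids. The coordinates, labels, and function names are hypothetical. In a Scanpy workflow the PCs would typically come from adata.obsm["X_pca"], and centroid distance is only one of several reasonable choices.

```python
from itertools import combinations
from math import dist

def centroids(pcs, labels):
    """Mean PC coordinates for each cluster label."""
    groups = {}
    for point, label in zip(pcs, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in groups.items()
    }

def centroid_distances(pcs, labels):
    """Euclidean distance between every pair of cluster centroids."""
    cents = centroids(pcs, labels)
    return {
        (a, b): dist(cents[a], cents[b])
        for a, b in combinations(sorted(cents), 2)
    }

# Hypothetical cells described by their first three principal components
pcs = [(0, 0, 0), (2, 0, 0), (1, 4, 0), (1, 6, 0), (10, 0, 0), (12, 0, 0)]
labels = ["A", "A", "B", "B", "C", "C"]

distances = centroid_distances(pcs, labels)
# distances[("A", "B")] is 5.0 and distances[("A", "C")] is 10.0
```

Unlike positions on the UMAP, these numbers do not change when you change n_neighbors or min_dist, which makes them something you can actually report and compare.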

The Bottom Line

None of these mistakes mean your science is bad. They mean the field moves fast, the tools are complex, and most researchers learned computational biology by necessity rather than training. The good news: every one of these is fixable, usually without re-running your experiment.

If any of this sounds familiar, or if you have a dataset that has been sitting in a folder waiting for analysis, I can help.

Book a free 30-minute call and we will figure out the best path forward for your data.
