Which Method Do You Actually Use?

By Cory Henn · April 2026 · No linear algebra required

This is Part 4 of a 4-part series on dimensionality reduction for biological data. Part 1 · Part 2 · Part 3

If you've made it through Parts 1–3, you've got a solid conceptual handle on what dimensionality reduction is doing and why it matters. You know why your 20,000-gene dataset needs compression, how PCA finds the axes of greatest variance, and why t-SNE and UMAP exist to catch the nonlinear structure that PCA flattens out.

Now comes the part everyone actually wants: how do you choose?

The one question that drives everything

Before you pick a method, answer this: What are you trying to do with the output?

Not “what is my data?” That matters too, but it's downstream. The primary question is what job the output needs to do, because dimensionality reduction is always a tool serving a purpose, and the right tool depends on that purpose.

There are really only four things you're ever trying to do:

Explore — You have a new dataset and you're figuring out what's in it
Communicate — You need a figure for a paper, a talk, or a collaborator
Quantify — You need reduced dimensions as input to a downstream analysis (clustering, trajectory inference, a classifier)
Diagnose — You're checking whether your data has problems (batch effects, technical artifacts, outliers)

If you're exploring a new dataset

Start with PCA. Always.

I know UMAP is more exciting and the clusters look cleaner. Do PCA first anyway. PCA gives you information that UMAP hides. A PCA plot where your samples all pile into a single blob tells you something important: that there is no strong linear structure in your data, or that your variation is dominated by something you did not expect (a technical effect, a covariate you forgot about, a batch issue). A UMAP of that same data might look beautiful, with gorgeous clusters, and lead you completely astray.

PCA is your sanity check. Run it. Look at the variance explained per component. If the first two PCs explain 50% of the variance, your dataset has strong, concentrated structure. If they explain 8%, your signal is spread thin across many weak axes, and that is worth knowing before you project into 2D.
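To make the variance check concrete, here is a minimal sketch in plain NumPy. The matrix `X` is simulated stand-in data (cells by genes), and `explained_variance_ratio` is a hypothetical helper name, not a library call. In practice your PCA implementation will usually report the same quantity for you.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # center each gene (column)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, in descending order
    var = s ** 2                               # proportional to per-PC variance
    return var / var.sum()

# Simulated stand-in data: 300 cells x 40 genes with one strong axis of variation
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 1)) * 4.0
loadings = rng.normal(size=(1, 40))
X = signal @ loadings + rng.normal(size=(300, 40))

ratio = explained_variance_ratio(X)
print("PC1 + PC2:", round(float(ratio[:2].sum()), 3))
print("Cumulative, first 10 PCs:", np.round(np.cumsum(ratio[:10]), 3))
```

The cumulative line is the numeric version of an elbow plot: you are looking for where adding another component stops buying much.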

The practical workflow for exploration

Run PCA. Check the elbow plot. Check PC1 vs PC2, colored by everything you have (batch, condition, cell type, sample ID).
If PCA shows structure: great, you already have something. Now run UMAP for a better-looking version.
If PCA shows a blob: investigate before moving on. You might have a batch problem, or your feature selection needs work.

If you're making a figure for a paper or talk

Use UMAP. Full stop.

UMAP produces the clearest visual separation of distinct cell populations, it's fast enough to run on datasets of any reasonable size, and it is what reviewers and readers expect to see in single-cell papers right now. t-SNE is fine and sometimes still preferred in certain subfields, but if you do not have a specific reason to use it, UMAP is the default.

A few things to nail before you finalize that figure:

Run UMAP with at least two or three different n_neighbors values (try 10, 30, 50) and look at how the output changes. You should see that the core structure is stable even as the parameter changes. If your clusters completely rearrange when you change n_neighbors, that is a red flag worth investigating before the figure goes in a paper.
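If you want a number to go with the eyeball comparison, one option is to ask how many of each cell's nearest neighbors stay the same between two embeddings. The sketch below is a heuristic, not a standard metric: `knn_indices`, `neighbor_overlap`, and `umap_stability` are hypothetical helper names, `pcs` is assumed to be your cells-by-PCs matrix, and the UMAP step assumes the umap-learn package is installed. Some drift is expected as n_neighbors changes, so treat the score as relative and pick no hard threshold from it.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of the k nearest neighbors of each point (self excluded)."""
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                          # never pick yourself
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_overlap(emb_a, emb_b, k=15):
    """Mean Jaccard overlap of each point's k-NN set across two embeddings."""
    na, nb = knn_indices(emb_a, k), knn_indices(emb_b, k)
    scores = []
    for a, b in zip(na, nb):
        inter = len(set(a) & set(b))
        scores.append(inter / (2 * k - inter))
    return float(np.mean(scores))

def umap_stability(pcs, n_neighbors_values=(10, 30, 50), seed=0):
    """Embed the same PCs at several n_neighbors values and compare pairwise."""
    import umap  # umap-learn; assumed installed
    embs = {n: umap.UMAP(n_neighbors=n, random_state=seed).fit_transform(pcs)
            for n in n_neighbors_values}
    keys = list(embs)
    return {(a, b): neighbor_overlap(embs[a], embs[b])
            for i, a in enumerate(keys) for b in keys[i + 1:]}

# Sanity check without UMAP: a rotation keeps every neighborhood intact,
# while an unrelated random layout keeps almost none.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 2))
rotated = emb @ np.array([[0.0, -1.0], [1.0, 0.0]])
unrelated = rng.normal(size=(100, 2))
print("rotated:", neighbor_overlap(emb, rotated))
print("unrelated:", neighbor_overlap(emb, unrelated))
```

Calling `umap_stability(pcs)` returns one overlap score per pair of n_neighbors values. A pair that scores far lower than the others points to the parameter range where your structure stops being stable.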

Label your clusters with what they actually are, not “cluster 0” through “cluster 11.” And do not annotate distances between clusters as if they mean something. UMAP primarily preserves local neighborhoods, not global distances. Two clusters sitting far apart on a UMAP plot is not reliable evidence of how different those populations are at the transcript level.

If you need dimensions for a downstream analysis

This is where people most often reach for the wrong tool.

If you are planning to cluster, run trajectory inference (Monocle, Slingshot, PAGA), or train a classifier on your reduced-dimension data: use PCA components, not UMAP coordinates.

This surprises people, because UMAP looks so much more organized. But UMAP compresses your data aggressively into 2D. You are throwing away an enormous amount of information to make something look pretty. That is fine for visualization. It is a problem when you are using the coordinates as quantitative input.

PCA components are better-behaved for downstream analysis because they are linear and continuous, and they do not have the topology-preservation artifacts that make UMAP coordinates unreliable for measuring distance. The standard workflows in Seurat and Scanpy do this correctly by default: they cluster in PCA space (the neighbors graph is built from PCs), then visualize the clusters on UMAP. If you are building something manually, follow that same pattern.
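If you are building the pattern by hand, the core of it is small. The NumPy sketch below is an illustration under stated assumptions, not what Seurat or Scanpy run internally: `pca_scores` and `knn_graph` are hypothetical helpers, the data is simulated, and the brute-force neighbor search only suits small matrices. In Scanpy the analogous steps are `sc.pp.pca`, `sc.pp.neighbors`, `sc.tl.leiden`, and `sc.tl.umap`.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project centered data onto its top n_pcs principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * s[:n_pcs]            # cells x n_pcs

def knn_graph(coords, k):
    """k nearest neighbors of each cell, by Euclidean distance in coords."""
    sq = (coords ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * coords @ coords.T
    np.fill_diagonal(d2, np.inf)               # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]

# Simulated stand-in for a normalized expression matrix with two populations
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
X[:100, :20] += 3.0
population = np.array([0] * 100 + [1] * 100)

pcs = pca_scores(X, n_pcs=20)                  # quantitative input: many PCs
neighbors = knn_graph(pcs, k=15)               # graph built on PCs, not on a 2D embedding

# How often does a cell's neighbor come from its own population?
same = (population[neighbors] == population[:, None]).mean()
print("same-population neighbor fraction:", round(float(same), 3))
```

A clustering or trajectory method would consume `neighbors` (or the PCs directly). The 2D UMAP is computed separately and used for display only.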

If you're diagnosing data problems

This is the PCA-only zone.

Batch effects, technical artifacts, sample swaps, contamination: PCA will usually surface these faster than any other method, because it is linear and interpretable. When something weird is driving your data, it shows up as structure in the early principal components.

The diagnostic workflow

Run PCA.
Color your PCA plot by everything that is not biology: sequencing depth, mitochondrial gene fraction, batch ID, sample collection date, library prep technician, ambient RNA level.
If any of those covariates explain structure in PC1–PC5, you have a problem to fix before you do any biology.
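Coloring plots works, but you can also put a number on each covariate. The sketch below assumes simulated data with a deliberate batch shift and uses hypothetical helper names (`pc_scores`, `r2_continuous`, `r2_categorical`). It reports how much of each early PC's variance a covariate accounts for: squared correlation for numeric covariates, and the between-group share of variance for categorical ones.

```python
import numpy as np

def pc_scores(X, n_pcs=5):
    """Scores of each cell on the first n_pcs principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * s[:n_pcs]

def r2_continuous(pc, covariate):
    """Squared Pearson correlation between one PC and a numeric covariate."""
    return float(np.corrcoef(pc, covariate)[0, 1] ** 2)

def r2_categorical(pc, labels):
    """Share of a PC's variance explained by group means (one-way ANOVA R^2)."""
    total = ((pc - pc.mean()) ** 2).sum()
    between = sum(
        (labels == g).sum() * (pc[labels == g].mean() - pc.mean()) ** 2
        for g in np.unique(labels)
    )
    return float(between / total)

# Simulated stand-in: 200 cells, 50 genes, batch B shifted in 10 genes
rng = np.random.default_rng(2)
batch = np.array(["A"] * 100 + ["B"] * 100)
depth = rng.normal(size=200)                   # a technical covariate unrelated to the data
X = rng.normal(size=(200, 50))
X[batch == "B", :10] += 5.0

pcs = pc_scores(X)
for i in range(pcs.shape[1]):
    print(f"PC{i + 1}: batch R2={r2_categorical(pcs[:, i], batch):.2f}, "
          f"depth R2={r2_continuous(pcs[:, i], depth):.2f}")
```

In this simulation the batch shift should dominate PC1 while the unrelated covariate explains almost nothing. On real data, any non-biological covariate with a large share of an early PC is the thing to fix first.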

UMAP will hide these problems. The manifold learning will pick up on whatever structure exists in the data, including technical structure, and present it as beautiful clusters. You might not realize until much later that “Cluster 4” is actually “samples from the Tuesday batch.” PCA does not hide anything. That is what makes it the right tool for diagnosis.

Two real PBMC datasets merged without batch correction. PCA immediately flags batch as the dominant source of variation — PC1 alone explains 11% of variance and completely separates the two datasets. UMAP embeds them on the same canvas, leaving an ambiguous picture where you cannot tell which structure is biology and which is batch artifact.

Reproduce this figure on GitHub

The quick-reference decision table

What you need	Use this	Avoid this
First look at a new dataset	PCA	UMAP (too early)
Paper or talk figure	UMAP	t-SNE (slower, less global structure)
Input to clustering	PCA components	UMAP coordinates
Input to trajectory inference	Let the pipeline decide	Raw UMAP coordinates
Diagnosing batch effects	PCA	UMAP (gives ambiguous picture)
Small dataset (<500 cells)	PCA or t-SNE	UMAP (needs more data to work well)
Checking if two conditions differ	PCA	(UMAP is fine too, just less interpretable)

One thing that unifies all of this

Every dimensionality reduction method is making a trade-off. PCA trades flexibility for interpretability. UMAP trades interpretability for the ability to capture complex structure. t-SNE trades speed for fine-grained local structure.

None of them are showing you your data. They are all showing you a projection of your data: a shadow cast from a high-dimensional object onto a lower-dimensional surface. The shadow is informative. The shadow is not the object.

The best analysts do not pick one method and commit to it. They run multiple methods, compare what they agree on, and get skeptical about anything that only shows up in one projection. When PCA and UMAP both agree that there are three major populations in your data, that is a robust finding. When a cluster only appears in UMAP and vanishes in PCA, that warrants investigation before it goes in the paper. Use the tools. Understand what they are showing you. Do not let the pretty picture do your thinking for you.

Series Complete: What We Covered

Part 1 — Why your data has too many dimensions and what goes wrong in high-dimensional space
Part 2 — How PCA works and the decisions that actually matter
Part 3 — Why t-SNE and UMAP exist and the mistakes people make with nonlinear methods
Part 4 — The practical decision framework for choosing the right method

Coming Up Next

The Reproducible Immunology series continues. Next up: flow cytometry compensation — what it is actually doing mathematically, why it goes wrong, and how to catch it when it does.

I'm Cory Henn, an immunologist and data scientist who helps biotech teams and academic PIs make sense of complex biological data. If you have a dataset that needs answers, I offer free 30-minute discovery calls.

Book a free 30-minute call and we will figure out the best path forward for your data.
