Building a Single-Cell RNA-seq Analysis Pipeline with Scanpy: From PBMC Clustering to Trajectory Discovery

How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

The Scanpy-based PBMC-3k analysis pipeline processes single-cell transcriptomics data from raw counts through to trajectory inference. It uses Leiden clustering and PAGA connectivity modeling to turn raw gene counts into annotated immune cell clusters and the trajectories that connect them.

Why This Matters

Single-cell pipelines often struggle with technical noise, such as a high mitochondrial read fraction and cell doublets, which can mask true biological variance. This Scanpy-based workflow addresses these issues by integrating Scrublet for doublet removal and regression techniques to separate biological signal from technical artifacts. These filtering and normalization steps matter because downstream clustering and pseudotime analysis should reflect actual cellular differentiation rather than experimental bias. With PAGA and diffusion maps, researchers can move beyond static clusters and examine the connectivity between immune cell states.

Key Insights

Quality Control Metrics: Identifying mitochondrial (MT-) and ribosomal (RPS/RPL) gene signals is essential for filtering low-quality cells and debris.
Doublet Mitigation: Running Scrublet through Scanpy helps prevent artificial clusters caused by cell multiplets.
Feature Selection: Selecting highly variable genes (HVG) by dispersion-based ranking reduces noise before PCA.
Leiden Clustering: Graph-based clustering on neighborhood graphs allows for high-resolution partitioning of immune lineages.
Trajectory Modeling: Utilizing PAGA and Diffusion Pseudotime enables the visualization of continuous progression patterns across cell states.
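
The dispersion-based ranking named under Feature Selection can be illustrated without any dependencies. The sketch below is a simplification, not Scanpy's implementation: it computes dispersion as variance divided by mean for each gene and sorts genes by that value, while `sc.pp.highly_variable_genes` additionally normalizes dispersion within bins of mean expression. The gene names and expression values are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical expression values for three genes across four cells.
expression = {
    "GENE_FLAT": [5.0, 5.0, 5.0, 5.0],
    "GENE_VARIABLE": [0.0, 0.0, 10.0, 10.0],
    "GENE_MILD": [4.0, 6.0, 4.0, 6.0],
}


def dispersion(values):
    """Population variance divided by mean; 0.0 for genes with zero mean."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


# Rank genes from most to least dispersed and keep the top two.
ranked = sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)
top_genes = ranked[:2]
print(top_genes)
```

All three genes share the same mean, so only their spread across cells separates them; the flat gene carries no information for clustering and is ranked last.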

Working Examples

Initial setup, mitochondrial gene identification, and basic cell/gene filtering.

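import scanpy as sc
# Load the PBMC-3k dataset (raw counts)
adata = sc.datasets.pbmc3k()
# Flag mitochondrial genes, then compute per-cell QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Drop cells with fewer than 200 detected genes and genes seen in fewer than 3 cells
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)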
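
The snippet above computes QC metrics but does not show how they might be used for filtering. As a dependency-free sketch of the underlying arithmetic, the block below computes the mitochondrial and ribosomal percentage for each cell from toy counts and keeps cells under a cutoff. The counts and the 5% threshold are assumptions for illustration, not values from the article. In Scanpy, the corresponding per-cell value is stored in `adata.obs['pct_counts_mt']` after `calculate_qc_metrics` is run with `qc_vars=['mt']`.

```python
# Toy raw counts: one dict per cell mapping gene symbol -> UMI count.
# The numbers are made up for illustration only.
cells = {
    "cell_a": {"MT-CO1": 2, "RPS6": 10, "CD79A": 48, "NKG7": 40},
    "cell_b": {"MT-CO1": 30, "MT-ND1": 20, "RPL13": 10, "CD79A": 40},
    "cell_c": {"RPS6": 5, "NKG7": 95},
}


def pct_counts(counts, prefixes):
    """Percentage of a cell's total counts from genes starting with any prefix."""
    total = sum(counts.values())
    hits = sum(n for gene, n in counts.items() if gene.startswith(prefixes))
    return 100.0 * hits / total if total else 0.0


MAX_PCT_MT = 5.0  # assumed cutoff, not taken from the article

# Per-cell QC metrics for mitochondrial (MT-) and ribosomal (RPS/RPL) genes.
qc = {
    name: {
        "pct_mt": pct_counts(counts, ("MT-",)),
        "pct_ribo": pct_counts(counts, ("RPS", "RPL")),
    }
    for name, counts in cells.items()
}

# Keep only cells whose mitochondrial fraction is below the cutoff.
kept = [name for name, metrics in qc.items() if metrics["pct_mt"] < MAX_PCT_MT]
print(qc)
print(kept)
```

Here `cell_b` is dropped because half of its counts come from mitochondrial genes, a pattern commonly associated with damaged cells.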

Doublet detection with Scrublet, normalization, and highly variable gene selection.

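# Detect doublets on raw counts, then keep only predicted singlets
sc.pp.scrublet(adata)
adata = adata[~adata.obs['predicted_doublet'], :].copy()
# Normalize each cell to 10,000 counts and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Flag highly variable genes by mean and dispersion thresholds
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)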
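
To make the normalization step concrete, the following sketch reproduces the arithmetic behind total-count normalization with `target_sum=1e4` followed by a log1p transform, using plain Python lists. The helper functions and the toy counts are hypothetical stand-ins written for this illustration; they are not Scanpy code.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale one cell's counts so that they sum to target_sum."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c * target_sum / total for c in counts]


def log1p_transform(values):
    """Apply log(1 + x) to each value."""
    return [math.log1p(v) for v in values]


# Two toy cells with the same composition but a 10x difference in depth.
shallow = [1, 3, 0, 6]
deep = [10, 30, 0, 60]

norm_shallow = normalize_total(shallow)
norm_deep = normalize_total(deep)
log_shallow = log1p_transform(norm_shallow)

print(norm_shallow)
print(norm_deep)
print(log_shallow)
```

After scaling, the two cells have identical profiles, which is the point of the step: differences in sequencing depth no longer look like differences in expression.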

Dimensionality reduction, clustering, and PAGA trajectory initialization.

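# PCA, then a k-nearest-neighbor graph on the first 40 components
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# UMAP embedding and Leiden clustering on the neighbor graph
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
# Cluster-level connectivity (PAGA) and diffusion map
sc.tl.paga(adata, groups='leiden')
sc.tl.diffmap(adata)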
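
PAGA summarizes how strongly clusters are connected on the neighbor graph. The sketch below shows only the raw ingredient of that idea, under simplifying assumptions: it builds a small k-nearest-neighbor graph from hypothetical 2-D coordinates and counts the edges that cross cluster boundaries. It is not the PAGA statistic itself, which also compares such counts against what a random graph would produce.

```python
import math
from collections import Counter

# Hypothetical 2-D coordinates (for example, two principal components).
points = {
    "c1": (0.0, 0.0), "c2": (0.0, 1.0), "c3": (1.0, 0.0),
    "c4": (2.0, 0.0), "c5": (3.0, 0.0), "c6": (3.0, 1.0),
    "c7": (10.0, 10.0), "c8": (10.0, 11.0), "c9": (11.0, 10.0),
}
# Hypothetical cluster labels, as a clustering step might assign them.
labels = {
    "c1": "0", "c2": "0", "c3": "0",
    "c4": "1", "c5": "1", "c6": "1",
    "c7": "2", "c8": "2", "c9": "2",
}


def knn_edges(coords, k):
    """Undirected edges linking each point to its k nearest neighbors."""
    edges = set()
    for name, xy in coords.items():
        # Sort by distance, breaking ties by name for determinism.
        others = sorted(
            (math.dist(xy, other_xy), other)
            for other, other_xy in coords.items()
            if other != name
        )
        for _, neighbor in others[:k]:
            edges.add(frozenset((name, neighbor)))
    return edges


edges = knn_edges(points, k=2)

# Count edges whose endpoints fall in different clusters.
cluster_links = Counter()
for edge in edges:
    a, b = sorted(labels[cell] for cell in edge)
    if a != b:
        cluster_links[(a, b)] += 1

print(dict(cluster_links))
```

In this toy layout, clusters "0" and "1" share one edge while cluster "2" is isolated, which is the kind of coarse-grained connectivity that a cluster-level graph is meant to expose.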

Practical Applications

Diagnostic Immune Profiling: Annotating immune populations with canonical markers such as CD79A for B cells and NKG7 for NK cells, after filtering out cells with a high mitochondrial fraction so that low-quality cells are not mistaken for distinct populations.
Pharmacogenomics: Calculating interferon-response scores (e.g., using ISG15, IFIT1) to measure cellular reaction to treatments across different clusters.
Developmental Modeling: Applying diffusion pseudotime to map cell state transitions; a common anti-pattern is relying solely on UMAP, which can obscure global lineage connectivity.
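
The marker-based annotation and interferon-response scoring described above both reduce to scoring a gene set against a background. The sketch below uses a deliberately simplified score, the mean of the signature genes minus the mean of all remaining genes, applied to hypothetical per-cluster mean expression values. Scanpy's `sc.tl.score_genes` follows a similar idea but subtracts the average of a sampled reference gene set, so the numbers here would not match its output.

```python
from statistics import mean

# Hypothetical log-normalized mean expression per Leiden cluster.
cluster_means = {
    "0": {"CD79A": 2.5, "NKG7": 0.1, "ISG15": 0.4, "IFIT1": 0.2, "ACTB": 3.0},
    "1": {"CD79A": 0.0, "NKG7": 3.1, "ISG15": 1.8, "IFIT1": 1.4, "ACTB": 3.0},
}

# Canonical markers named in the text; a real panel would use several genes per type.
markers = {"B cells": ["CD79A"], "NK cells": ["NKG7"]}
interferon_genes = ["ISG15", "IFIT1"]


def signature_score(profile, genes):
    """Mean of the signature genes minus the mean of all other genes."""
    inside = [profile[g] for g in genes if g in profile]
    outside = [value for g, value in profile.items() if g not in genes]
    return mean(inside) - mean(outside)


# Label each cluster with the cell type whose markers score highest.
annotation = {
    cluster: max(markers, key=lambda label: signature_score(profile, markers[label]))
    for cluster, profile in cluster_means.items()
}

# Interferon-response score per cluster.
ifn_scores = {
    cluster: signature_score(profile, interferon_genes)
    for cluster, profile in cluster_means.items()
}

print(annotation)
print(ifn_scores)
```

With these toy values, cluster "0" is labeled as B cells and cluster "1" as NK cells, and cluster "1" has the higher interferon-response score.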

References:

https://www.marktechpost.com/2026/05/08/how-to-build-a-single-cell-rna-seq-analysis-pipeline-with-scanpy-for-pbmc-clustering-annotation-and-trajectory-discovery/
