Our study follows a standard benchmark design, consisting of test datasets, feature selection methods to be evaluated and metrics for measuring performance (Extended Data Fig. 1). The complete benchmarking pipeline is implemented as a Nextflow50 workflow (Extended Data Fig. 2) available from GitHub51 and archived on Zenodo52. Summaries of the specific methods, metrics, datasets and processing steps are provided in the following sections. Please refer to the supplementary methods, pipeline code, original publications and package documentation for further information.
Evaluated methods
We selected a range of feature selection methods covering approaches from standard analysis workflows and alternative methods proposed for scRNA-seq data. To be considered, a method must be implemented in a publicly available package that we could reliably install and run. Some methods can automatically determine the number of features to select, but for most others this must be specified. A few methods can consider batch labels during selection, but for most, this requires manually splitting the data, computing feature sets on each batch and combining the results. For most methods we used the default settings or those recommended in the accompanying documentation, but for a subset of widely used methods we evaluated variants. Any preprocessing steps required before feature selection are considered part of the method. We used the steps suggested in the documentation for each method as they are recommended by the authors and represent the most likely real-world usage.
Simple control methods
We include all features and random feature sets in the evaluation as control methods. We expect that using feature sets selected by real methods improves performance over using all features and any randomly selected sets. To control for variability in selecting random features, we always include five random feature sets selected with different seeds and average metric scores over the five sets.
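As an illustration, a minimal Python sketch of this control (assuming a list of feature names and a generic, hypothetical scoring function) might look as follows:

```python
import numpy as np

def random_control_score(features, n_features, score_fn, seeds=(0, 1, 2, 3, 4)):
    """Select several random feature sets with different seeds and average the
    metric scores, as in the random control. `score_fn` is a placeholder for
    any metric evaluated on a selected feature set."""
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        selected = rng.choice(features, size=n_features, replace=False)
        scores.append(score_fn(selected))
    return float(np.mean(scores))
```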
Excess variability methods
The most common approach to feature selection in scRNA-seq analysis toolboxes such as scanpy10 and Seurat11 is to select highly variable features: those that show excess variability beyond what is expected. This approach assumes that the extra variability results from differences in gene expression between cell populations or states and that selecting these features will identify those important to the cells in the sample.
We benchmark the following excess variability methods: features with the highest variance, the fitting method from Brennecke et al.38 (implemented in scran53 v.1.26.0), variants from Seurat11 (v.4.3.0) (Seurat-dispersion, Seurat-MVP11 and Seurat-VST25), variants from scanpy10 (v.1.9.1) (scanpy-Seurat, scanpy-SeuratV3 and scanpy-Cell Ranger) and the approach from ‘Orchestrating Single-Cell Analysis with Bioconductor’35 using batchelor54 (v.1.14.0) and scran. For scanpy methods we used both standard and batch-aware variants.
Methods based on other statistical features
Other feature statistics can also be used for feature selection. These include selecting features with the highest mean expression; Anticor27 (v.0.1.8), which selects features with excess negative correlations; NBumi (M3Drop v.1.24.0)29, which selects features with excess zeros; and DUBStepR (commit 76aa3948), which uses stepwise regression of a binned correlation matrix28.
For Anticor, we disabled the filtering of predefined gene pathways as it requires gene identifiers, which are not available for all datasets. For NBumi, we select features with an adjusted P value <0.01 unless this results in fewer than 500 features, in which case the 500 features with the lowest P values are used.
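A minimal sketch of this selection rule (illustrating only the thresholding logic on a vector of adjusted P values, not the M3Drop/NBumi call itself) is:

```python
import numpy as np

def select_by_adjusted_p(feature_names, adj_p, alpha=0.01, min_features=500):
    """Keep features with adjusted P < alpha, falling back to the
    min_features smallest P values when too few pass the threshold."""
    feature_names = np.asarray(feature_names)
    adj_p = np.asarray(adj_p)
    passing = adj_p < alpha
    if passing.sum() >= min_features:
        return feature_names[passing]
    return feature_names[np.argsort(adj_p)[:min_features]]
```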
Model-based methods
Model-based methods fit an appropriate distributional model to the dataset. Features are then selected by looking for those significantly different from the fitted model. These include scTransform55 (v.0.3.5, accessed via Seurat), analytic Pearson residuals56 (implemented in scanpy) and scry (v.1.10.0)57.
Embedding-based methods
Dimensionality reduction is a commonly used preprocessing step in scRNA-seq analysis. Some feature selection methods either use sophisticated embedding methods or look for features that vary across an embedding. scPNMF (commit 47d5b10c) performs a modified PNMF with an alternative initialization and selects features associated with informative bases31, and singleCellHaystack (v.0.3.4) uses Kullback-Leibler divergence to find features that are expressed in subsets of nonrandomly positioned cells36. For singleCellHaystack, we first select features using Seurat-VST and provide a 50-dimensional PCA as input.
Graph-based methods
Another common step in scRNA-seq analysis is to build a nearest-neighbor graph of cells, typically using positions in an embedded space. Some methods operate on these graphs. Hotspot (v.1.0.0) looks for features with a high local auto-correlation within a graph58 and triku (v.2.1.4) uses a neighborhood graph to distinguish features that are expressed in a few cells randomly across a dataset from those that are expressed in a few related cells30. For both, we use a graph based on a PCA of all features as input.
Supervised methods
We focus on evaluating unsupervised feature selection methods, as cell labels are typically not available before the integration process; however, at least some level of cell labels may be available, particularly for atlas-building projects that combine previously annotated public datasets. As an example supervised method, we include marker genes selected using the Wilcoxon rank-sum test (as implemented in scanpy) followed by a filtering procedure to remove features expressed in less than 10% of cells within a label, expressed in more than 80% of cells outside the label or with a P value >0.1. The remaining features are sorted by estimated log fold change and the top 200 features are selected per label. The final feature set is the union of the features selected for each label.
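A hedged sketch of this procedure using scanpy is shown below; the pts-based column names (`pct_nz_group`, `pct_nz_reference`) are assumptions based on current scanpy output and may differ from the pipeline implementation.

```python
import scanpy as sc

def select_marker_features(adata, label_key="label", n_per_label=200):
    """Wilcoxon marker selection with the filtering described above."""
    sc.tl.rank_genes_groups(adata, groupby=label_key, method="wilcoxon", pts=True)
    selected = set()
    for label in adata.obs[label_key].unique():
        df = sc.get.rank_genes_groups_df(adata, group=str(label))
        # Keep features expressed in >=10% of cells in the label, <=80% of
        # cells outside the label and with a P value <=0.1
        df = df[(df["pct_nz_group"] >= 0.10)
                & (df["pct_nz_reference"] <= 0.80)
                & (df["pvals"] <= 0.1)]
        top = df.sort_values("logfoldchanges", ascending=False).head(n_per_label)
        selected |= set(top["names"])  # combine per-label sets
    return sorted(selected)
```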
We also included known transcription factors downloaded from The Human Transcription Factors59 website (https://humantfs.ccbr.utoronto.ca/index.php), selecting 1,639 genes where the ‘Is TF?’ field was equal to ‘Yes’. The intersection of this list with the genes in each dataset was used. This method cannot be applied to the splat dataset as it does not contain real gene names.
Stable expression methods
The opposite of highly variable features are those stably expressed or varying less than expected. The scSEGIndex method in the scMerge package (v.1.1.4.0) fits a gamma-Gaussian mixture model to each feature23. The parameters of this model and other features, such as the proportion of zero counts, are used to rank features and calculate a stability index. We used these stable features as a negative control; they should perform poorly for integration as they are expected to capture neither technical noise nor biological signal.
Evaluation metrics
We implemented a wide array of metrics designed to evaluate different aspects of creating and using an integrated scRNA-seq reference. Some metrics require a ground truth cell label, while others are unsupervised and measure whether the structure in a single sample is maintained. All metrics are designed so that a raw score of 0 represents the worst possible performance and a raw score of 1 the best possible performance.
Integration (Batch)
Integration (Batch) metrics measure the mixing between batches in the reference. Cells of the same cell type should be thoroughly mixed and neighborhoods should be equally likely to contain cells from any batch. The batch ASW3, batch PCR3, graph connectivity3 and graph-based iLISI3,13 metrics are implemented in scIB3 (v.1.1.4) using scikit-learn60 (v.1.1.2). The kBET metric17 is accessed from the kBET R package (commit a10ffeaa) via scIB. To calculate an overall score for the Seurat mixing metric14, we divided the cell scores by the maximum neighborhood size, took the mean across cells and subtracted this from 1 so that higher scores are better. For the CMS metric18 in the CellMixS package (v.1.14.0), we use 1 minus the proportion of cells with a P value <0.1.
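For example, the rescaling of the per-cell scores for these two metrics can be summarized in a short sketch (assuming the per-cell scores and P values have already been computed by the respective tools):

```python
import numpy as np

def mixing_metric_score(cell_scores, max_neighborhood_size):
    """Overall Seurat mixing metric: rescale per-cell scores so higher is better."""
    scaled = np.asarray(cell_scores) / max_neighborhood_size
    return 1.0 - float(np.mean(scaled))

def cms_score(cell_pvalues, alpha=0.1):
    """Overall CMS score: 1 minus the proportion of cells with P < alpha."""
    return 1.0 - float(np.mean(np.asarray(cell_pvalues) < alpha))
```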
Integration (Bio)
Integration (Bio) metrics measure whether biological signals (primarily cell labels) are conserved after integration. Unlike batch correction metrics, where perfect scores can be obtained by mapping cells to a single point, biological conservation metrics require that cell labels remain separated after integration. The label ASW3, graph-based cLISI3,13, cell cycle conservation3, ARI3, NMI3, Isolated labels ASW3 and Isolated labels F1 (ref. 3) metrics are implemented in scIB using scikit-learn. The bARI16 and bNMI metrics are available from balanced_clustering (commit a2ae3a4d). For the Seurat local structure metric14, we used the average over all cells as the final score; for ldfDiff18, we took the absolute distance, applied an upper bound to obtain a cell score and used 1 minus the mean cell score as the overall score. The cell cycle metric3 scores cells11 using genes from Tirosh et al.61 with ENSEMBL IDs obtained from Biomart62 using the biomaRt package63. It cannot be calculated for the splat dataset as it does not contain cell cycle effects. For metrics that require clusters (ARI, NMI, bARI and bNMI), we performed Leiden clustering with the resolution parameter set to values between 0.1 and 2 in steps of 0.1 using scanpy via scIB and selected the resolution with the best metric score.
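For the cluster-based metrics, the resolution sweep follows this pattern (a sketch using NMI as an example; the benchmark itself performs the sweep via scIB, and the embedding key is a placeholder):

```python
import numpy as np
import scanpy as sc
from sklearn.metrics import normalized_mutual_info_score

def best_leiden_nmi(adata, label_key="label", embedding_key="X_emb"):
    """Sweep Leiden resolutions from 0.1 to 2 and keep the best NMI."""
    sc.pp.neighbors(adata, use_rep=embedding_key)
    best = 0.0
    for resolution in np.arange(0.1, 2.01, 0.1):
        key = f"leiden_{resolution:.1f}"
        sc.tl.leiden(adata, resolution=float(resolution), key_added=key)
        nmi = normalized_mutual_info_score(adata.obs[label_key], adata.obs[key])
        best = max(best, nmi)
    return best
```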
Mapping quality
Mapping quality metrics assess how well the reference represents the query and how well the query can be merged into the same space. For perfect mapping, cell types present in both the reference and query should be mixed, as should batches within the query. At the same time, biology within the query should be preserved. The cell distance metric calculates the Mahalanobis distance between each mapped query cell and the distribution of the corresponding label in the reference12. To create a bound for the distance, we calculate the distance for every cell in the reference for a label and take the 90th quantile. The final score is 1 minus the proportion of mapped cells outside the boundary. The label distance considers labels as a whole rather than individual cells12. The Mahalanobis distance is calculated between the centroid of the label in the query and the matching label in the reference. Labels are skipped if they have fewer than 20 cells in the query or are not in the reference. We used the maximum distance of query cells to their label centroid as a boundary. Distances to the matching reference label are then scaled using this value and set to 1 if they exceed the maximum distance. The final score is the mean across cell types.
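A simplified sketch of the cell distance metric (assuming embedding matrices and label vectors for the reference and mapped query) is given below; the exact covariance handling in the pipeline may differ.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def cell_distance_score(ref_emb, ref_labels, query_emb, query_labels):
    """1 minus the proportion of mapped query cells outside the per-label bound."""
    inside = []
    for label in np.unique(query_labels):
        ref = ref_emb[ref_labels == label]
        if ref.shape[0] == 0:
            continue  # label not present in the reference
        centroid = ref.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(ref, rowvar=False))
        ref_dists = [mahalanobis(cell, centroid, cov_inv) for cell in ref]
        bound = np.quantile(ref_dists, 0.9)  # 90th quantile of reference distances
        query = query_emb[query_labels == label]
        query_dists = np.array([mahalanobis(cell, centroid, cov_inv) for cell in query])
        inside.append(query_dists <= bound)
    return float(np.mean(np.concatenate(inside)))
```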
mLISI is the same as iLISI but measures mixing between the query and reference (also known as ref_query LISI12) and qLISI measures mixing between query batches after mapping (also known as query_donors LISI12).
kNN correlation measures how well cell neighborhoods are maintained12. For each query batch, a PCA is performed and the Euclidean distances to the 100 nearest neighbors of each cell are calculated. The distances to the same neighbors in the joint integrated embedding are also calculated and the Spearman correlation between the two sets of distances is computed. After adjusting the correlations to the range 0 to 1, the mean over cells in each batch is calculated and the final score is the mean across batches. For particularly bad integrations (that is, small random feature sets), a cell may be equally distant from all neighbors, in which case the correlation cannot be calculated and the cell is assigned a score of 0.
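A sketch of the per-batch calculation (assuming a PCA of the batch and the corresponding rows of the integrated embedding) follows; it mirrors the description above rather than the exact pipeline code.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

def knn_correlation_batch(batch_pca, batch_integrated, n_neighbors=100):
    """Mean adjusted Spearman correlation of distances to the same neighbors."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(batch_pca)
    pca_dists, idx = nn.kneighbors(batch_pca)
    pca_dists, idx = pca_dists[:, 1:], idx[:, 1:]  # drop each cell itself
    scores = []
    for i in range(batch_pca.shape[0]):
        int_dists = np.linalg.norm(batch_integrated[idx[i]] - batch_integrated[i], axis=1)
        if np.allclose(pca_dists[i], pca_dists[i][0]) or np.allclose(int_dists, int_dists[0]):
            scores.append(0.0)  # correlation undefined; assign a score of 0
            continue
        rho = spearmanr(pca_dists[i], int_dists).correlation
        scores.append((rho + 1) / 2)  # adjust to the range 0 to 1
    return float(np.mean(scores))
```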
The reconstruction metric assesses a generative model’s ability to represent query cells by sampling from the posterior distribution and measuring the cosine distance between the mean posterior expression profile and the true cell expression profile64. We adjusted the distances to be in the range 0 to 1 and took 1 minus the mean distance as the final score. This metric cannot be calculated for Symphony integrations as it is not a generative method.
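A sketch of the scoring step (given the true expression profiles and the mean posterior reconstructions; the exact adjustment to the range 0 to 1 is an assumption) is:

```python
import numpy as np
from scipy.spatial.distance import cosine

def reconstruction_score(true_expression, posterior_mean):
    """1 minus the mean adjusted cosine distance between true and
    reconstructed expression profiles."""
    dists = np.array([cosine(t, p) for t, p in zip(true_expression, posterior_mean)])
    dists = dists / 2.0  # assume adjustment divides by 2 so distances lie in [0, 1]
    return 1.0 - float(np.mean(dists))
```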
Classification
The classification (or label transfer) metrics measure how well a classifier trained on the reference can correctly predict labels for query cells. We use standard classification metrics: accuracy, F1 score, Jaccard index, Matthews correlation coefficient (adjusted to [0, 1]) and macro-averaged area under the precision-recall curve as implemented by scikit-learn. For F1 and the Jaccard index we use micro, macro and rarity-weighted19 averages over labels.
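These can be computed directly with scikit-learn, as in the following sketch (the rarity-weighted averages19 are omitted; `y_prob` is the classifier probability matrix with columns ordered as in `labels`):

```python
from sklearn import metrics
from sklearn.preprocessing import label_binarize

def classification_scores(y_true, y_pred, y_prob, labels):
    """Standard label transfer metrics via scikit-learn."""
    y_bin = label_binarize(y_true, classes=labels)
    mcc = metrics.matthews_corrcoef(y_true, y_pred)
    return {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "f1_micro": metrics.f1_score(y_true, y_pred, average="micro"),
        "f1_macro": metrics.f1_score(y_true, y_pred, average="macro"),
        "jaccard_micro": metrics.jaccard_score(y_true, y_pred, average="micro"),
        "jaccard_macro": metrics.jaccard_score(y_true, y_pred, average="macro"),
        "mcc": (mcc + 1) / 2,  # adjust MCC from [-1, 1] to [0, 1]
        "auprc_macro": metrics.average_precision_score(y_bin, y_prob, average="macro"),
    }
```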
Unseen population prediction
Unseen population metrics focus on novel biology in the query by measuring how mapping has affected cell labels present in the query but deliberately left out of the reference. These should be maintained as separate populations but an integration that does not properly capture variation may merge them with other labels.
The unseen uncertainty metric uses the output of the label transfer classifier and measures poor classification of unseen cells by calculating 1 minus the mean probability of the assigned class for query cells from unseen populations. Unseen cell distance is based on the cell distance metric but is calculated only for unseen query populations. As the label does not exist in the reference, we calculate distances to each cell’s nearest reference population and subtract the final score from 1 so that higher distances (greater separation from the reference) give higher scores. Unseen label distance applies similar changes to the label distance metric by calculating distances to the nearest reference label centroid.
We use the milopy65 (commit be1a6cc8) implementation of the Milo differential abundance method15 as a metric to detect unseen populations by taking query or reference as the covariate of interest64. A neighborhood graph is calculated in the integrated embedding using a number of neighbors equal to five times the number of batches (up to a maximum of 200). Milo is then applied to a subset of cells (up to 20,000 cells or 10% of the dataset, whichever is higher). The score for each label is the proportion of cell neighborhoods significantly associated with the query (false discovery rate-adjusted P value <0.1). The overall score is the average of these proportions across all unseen labels. In rare cases, for poor integrations where Milo cannot select cells from an unseen label, that label is assigned a score of 0.
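The final scoring step can be sketched as follows, assuming a hypothetical table of Milo neighborhood results with a 'SpatialFDR' column and a dominant cell label per neighborhood (the milopy calls themselves are omitted):

```python
import numpy as np

def unseen_milo_score(nhood_results, unseen_labels, fdr=0.1):
    """Average over unseen labels of the proportion of neighborhoods
    significantly associated with the query.

    nhood_results: hypothetical DataFrame with one row per neighborhood and
    columns 'label' (dominant cell label) and 'SpatialFDR'.
    """
    label_scores = []
    for label in unseen_labels:
        nhoods = nhood_results[nhood_results["label"] == label]
        if len(nhoods) == 0:
            label_scores.append(0.0)  # Milo selected no cells from this label
            continue
        label_scores.append(float(np.mean(nhoods["SpatialFDR"] < fdr)))
    return float(np.mean(label_scores))
```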
Benchmarking datasets
We selected datasets representing different scenarios (tissues, technologies and developmental stages) where integration is a critical analysis step, including smaller-scale datasets and larger atlases. We chose query batches by selecting batches with shared characteristics different from the remaining samples, such as technology, time point or location. The unseen populations removed from the reference were chosen by looking for labels enriched in the query batches and selecting labels presenting different challenges, such as rare or perturbed cells. For each dataset, we use the cell labels assigned by the original authors.
scIB Pancreas
We downloaded the scIB pancreas dataset3 from figshare66. Cell labels were taken from the ‘celltype’ cell annotation column (12 reference labels) and batches from the ‘tech’ column. For the query, we used batches representing the CEL-seq and CEL-seq2 technologies with the ‘activated_stellate’ label treated as an unseen population. The prepared dataset contained 18,319 features, 12,731 reference cells (seven batches) and 3,243 query cells (two batches).
NeurIPS 2021
We downloaded the NeurIPS 2021 CITE-seq dataset67,68 from the Gene Expression Omnibus (GEO)69 (GSE194122) and used only the gene expression features. Cell labels were taken from the ‘cell_type’ annotation and batch labels from the ‘batch’ annotation. We considered samples from Site 4 as the query with the ‘CD8+ T naive’ and ‘Proerythroblast’ labels treated as unseen query populations. After preparation, the dataset contained 13,953 features, 70,061 reference cells (nine batches) with 42 reference labels and 16,715 query cells (three batches).
Fetal liver hematopoiesis
We downloaded the fetal liver hematopoiesis70 dataset from CellAtlas.io71 using batch labels from the ‘fetal.ids’ annotation and cell labels from the ‘cell.ids’ annotation. Three samples from different developmental stages were treated as the query with ‘Kupffer Cell’, ‘NK’, ‘ILC precursor’ and ‘Early lymphoid_T lymphocyte’ as unseen populations. The prepared dataset contains 26,686 features, 62,384 reference cells (11 batches and 23 reference labels) and 26,449 query cells (three batches).
Reed breast
We downloaded the version of the Reed breast dataset32 released with the preprint72 from the Chan Zuckerberg CELLxGENE: Discover Census (https://cellxgene.cziscience.com/)73 (dataset ID 0ba636a1-4754-4786-a8be-7ab3cf760fd6, Census version 2023-07-05) using the cellxgene-census package (v.1.0.1) and subsetted to cells with a BRCA status of either wild-type (‘WT’ or ‘assumed_WT’) or ‘BRCA1’. Donor ID was used as the batch label, with cell labels taken from the ‘level2’ annotation. We excluded a subset of cells labeled as doublets, as it is not clear how they should be treated by metrics. Wild-type cells were used to create the reference and BRCA1 cells were used as the query. The ‘BSL2’, ‘CD8T 1’, ‘CD8T 2’, ‘CD8T 3’, ‘FB5’, ‘LEC1’ and ‘LEC2’ labels were used as unseen labels. After preparation, the dataset contained 33,691 features, 337,339 reference cells (24 batches and 32 reference labels) and 197,649 query cells (17 batches).
Single-cell Eye in a Disk
We downloaded the single-cell Eye in a Disk (scEiaD) dataset74 from the plae: PLatform for Analysis of scEiaD website (https://plae.nei.nih.gov/) and selected the human cells derived from tissue samples where the organ was specified as ‘Eye’. We removed cells that did not have a cell label or were labeled as doublets, as well as batches with fewer than 500 cells remaining, as these caused some metrics to produce unreliable results. Cell labels were taken from the ‘CellType_predict’ annotation (harmonized labels from a classifier) and the ‘batch’ annotation was used for batches. We split batches using cell capture technology, with 10x v.2 taken as the reference and 10x v.3 and Drop-seq batches as the query. The ‘B-Cell’, ‘Blood Vessel’, ‘Macrophage’, ‘Pericyte’, ‘Smooth Muscle Cell’ and ‘T/NK-Cell’ labels are unseen populations. After preparation, the dataset contained 19,560 features, 360,270 reference cells (69 batches and 41 reference labels) and 48,496 query cells (18 batches).
Human endoderm
We downloaded the Human endoderm dataset34 from Mendeley Data75. Individuals were treated as batches with labels obtained from the ‘Cell_type’ annotation. A small number of cells labeled as ‘Undefined’ were removed. Samples from weeks 12–15 were selected as the query with ‘Basal like’, ‘Ciliated’, ‘Hepatocyte’, ‘Mesenchyme subtype 4’ and ‘T cell/NK cell 1’ labels treated as query-specific. The prepared dataset consisted of 27,855 features, 100,580 reference cells (ten batches and 21 reference labels) and 44,784 query cells (four batches).
Human Lung Cell Atlas
We downloaded the core Human Lung Cell Atlas dataset33 from the Chan Zuckerberg CELLxGENE: Discover Census (dataset ID 066943a2-fdac-4b29-b348-40cede398e4e, Census version 2023-07-25) and used the ‘dataset’ annotation as defined by the authors as batch labels with the ‘ann_finest_level’ annotation as labels. Datasets from organ donors were treated as the reference and healthy and diseased samples from living donors made up the query. ‘Multiciliated (nasal)’, ‘Club (nasal)’, ‘Goblet (subsegmental)’, ‘SMG serous (nasal)’, ‘SMG serous (bronchial)’, ‘SMG mucous’, ‘EC aerocyte capillary’, ‘Peribronchial fibroblasts’, ‘Smooth muscle’, ‘Smooth muscle FAM83D+’, ‘B cells’, ‘DC2’, ‘Alveolar Mph CCL3+’ and ‘Mast cells’ labels are unseen populations. After preparation, the dataset included 27,987 features, 314,573 reference cells (nine batches and 47 reference labels) and 251,400 query cells (five batches).
HLCA (immune)
The HLCA (immune) dataset takes the full HLCA dataset and uses the coarsest level of annotation to select cells in the immune compartment. The batches and labels are the same as the full HLCA dataset, but after subsetting, only ‘B cells’, ‘DC2’, ‘Alveolar Mph CCL3+’ and ‘Mast cells’ remain as unseen labels. We also removed some batches with insufficient cells. The prepared dataset has 26,618 features, 155,385 reference cells (seven batches and 16 reference labels) and 52,795 query cells (two batches).
HLCA (epithelial)
The HLCA (epithelial) dataset is a second subset of the HLCA dataset focusing on the epithelial compartment. This subset consists of 27,673 features, 118,374 reference cells (eight batches and 17 reference labels) and 162,875 query cells (five batches) with ‘Multiciliated (nasal)’, ‘Club (nasal)’, ‘Goblet (subsegmental)’, ‘SMG serous (nasal)’, ‘SMG serous (bronchial)’ and ‘SMG mucous’ remaining as unseen labels.
splat
Simulations address some limitations of real data by providing a definite ground truth. We generated a dataset using a modified version of the splat simulation in the Splatter package26 designed to represent a scenario where a tissue is measured using three different technologies (two batches each) in two conditions. These ‘technologies’ measure a medium number of cells at medium depth (Batch1 and Batch2), a low number of cells at high depth (Batch3 and Batch4) and a high number of cells at low depth (Batch5 and Batch6), with the low-depth samples used as the query. The simulation contains ten cell labels, including a progenitor differentiating along two trajectories (one with an ‘Intermediate’ cell type only present in the query) and six discrete cell types that differ in number of cells, number of differentially expressed genes and number of detected features. The discrete groups include a ‘Rare’ population and a ‘Perturbed’ state, which are only present in the query. To increase the variability in the simulation, we added additional label-specific noise factors to the model, which were applied before generating counts. The splat dataset contains 9,984 features, 30,041 reference cells (four batches and seven reference labels) and 69,936 query cells (two batches).
Benchmarking pipeline
To improve reproducibility, ensure that results stay up-to-date as code is updated and easily take advantage of computing resources, we built a pipeline using Nextflow50 (Extended Data Fig. 2). The pipeline takes a dataset, applies standard preprocessing and splits it into reference and query samples. The feature selection methods are applied to the reference, and the selected features are used for integration. After integration, the query is mapped to the reference, and a cell label classifier is trained. The reference and query, ground truth cell labels and transferred labels are provided to metrics. The metric scores are then scaled, aggregated and ranked. Pipeline stages use both Python (v.3.9.13) and R76 (v.4.2.2), including packages from Bioconductor77. The Python anndata package78 (v.0.8.0) was used to store data and save it as H5AD files between pipeline stages. The zellkonverter package (v.1.8.0) was used to load data into R via the reticulate (v.1.26) interface, where it was stored as SingleCellExperiment35 (v.1.20.0) or SeuratObject (v.4.1.3) objects.
Dataset preprocessing
The preprocessing step includes basic quality control filtering of cells using scanpy and storing information (such as batch and label) in standard locations. We removed cells with fewer than 100 total counts or expressing fewer than 100 features. The dataset is split into a reference and query based on the batch labels. Labels with fewer than 20 cells are removed from both the reference and query, as some metrics can behave unpredictably with small cell numbers. Labels defined as unseen populations are also removed from the reference. The final preprocessing step removes any features not expressed in the reference.
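The cell filtering step corresponds to a simple scanpy call, sketched here (the label and unseen-population filtering are applied separately in the pipeline):

```python
import scanpy as sc

def basic_cell_filtering(adata):
    """Remove cells with fewer than 100 total counts or fewer than 100
    expressed features (scanpy allows one threshold per call)."""
    sc.pp.filter_cells(adata, min_counts=100)
    sc.pp.filter_cells(adata, min_genes=100)
    return adata
```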
Integration and query mapping
The base model we use for integration is scVI24, available in scvi-tools79 (v.0.17.1). This model uses a conditional variational autoencoder and allows the mapping of query samples using architecture surgery80. We also train a scANVI model37, a semi-supervised extension of scVI in which cell labels are used to fine-tune the network. These models take raw count data as input, so we did not consider the interaction between feature selection and normalization methods.
As an alternative approach based on correcting a PCA space, we included integration with Harmony13 followed by query mapping using Symphony12. We provide Harmony with normalized expression values rather than raw counts, as suggested by the documentation. Counts are first normalized to counts per 10,000 and then log-transformed. The dataset is subset to the selected features and scaled with a maximum value of 10 (per feature), and 30 principal components are provided to Harmony. For Symphony, log-transformed normalized query data are provided (scaling is performed during mapping). Data preprocessing steps are performed using functions in scanpy, and integration and query mapping are performed using harmonypy81 (v.0.0.9) and symphonypy82 (v.0.2.1).
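A sketch of the Harmony preprocessing and integration with scanpy and harmonypy is shown below; argument names follow the current harmonypy interface and may differ slightly from the pipeline code.

```python
import scanpy as sc
import harmonypy

def integrate_harmony(adata, selected_features, batch_key="batch"):
    """Normalize, subset to selected features, scale, compute a PCA and
    correct the embedding with Harmony."""
    sc.pp.normalize_total(adata, target_sum=1e4)  # counts per 10,000
    sc.pp.log1p(adata)
    adata = adata[:, selected_features].copy()
    sc.pp.scale(adata, max_value=10)  # per-feature scaling, capped at 10
    sc.tl.pca(adata, n_comps=30)
    harmony = harmonypy.run_harmony(adata.obsm["X_pca"], adata.obs, [batch_key])
    adata.obsm["X_harmony"] = harmony.Z_corr.T  # cells x corrected components
    return adata
```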
Label transfer
We trained a multinomial logistic regression classifier on the integrated reference using scikit-learn, taking the position of each cell in the integrated embedding space as input and the ground truth cell labels as the output. Labels are transferred to the query by providing the mapped embedding coordinates to the trained classifier, predicting the probability for each reference label and recording the label with the highest probability.
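This corresponds to a straightforward scikit-learn sketch (embedding matrices and label vectors are assumed inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_labels(reference_embedding, reference_labels, query_embedding):
    """Train a multinomial logistic regression on the reference embedding and
    predict the most probable reference label for each mapped query cell."""
    classifier = LogisticRegression(multi_class="multinomial", max_iter=1000)
    classifier.fit(reference_embedding, reference_labels)
    probabilities = classifier.predict_proba(query_embedding)
    predicted = classifier.classes_[np.argmax(probabilities, axis=1)]
    return predicted, probabilities
```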
Metric selection
For metric selection, we used different numbers of randomly selected features across all test datasets. Because random gene sets have no inherent ordering (the first features selected are no more informative than the last), we also included feature sets of different sizes from the scanpy-Seurat method to evaluate the relationship with the number of features. We evaluated the behavior of individual metric scores and the relationships between them. Metrics were removed if they could not distinguish between feature sets (that is, had an insufficient dynamic range), were overly correlated (Pearson correlation) with the number of features, were associated with technical dataset features or showed undesirable correlation patterns.
Selecting a number of features
We evaluated different numbers of features for methods in Seurat and scanpy as well as for selection by high variance or high mean expression. We calculated z-scores across methods and datasets to see how performance changed with the number of features. To reduce the computational cost, we limited this part of the analysis by the number of methods rather than the number of datasets, as this allowed us to see the effect of the number of features across datasets. The number of features used for the benchmark (2,000) was chosen by considering trends over methods, datasets and metric types.
Analysis of results
The relative, rather than absolute, performance of methods and the aggregation across metrics are most informative. All metrics produce scores in the range 0 to 1 (with higher being better), but they have different real dynamic ranges. To scale each metric for each dataset, we used a set of baseline methods to establish the effective range of each metric: all features, randomly selected features, stably expressed features from scSEGIndex and batch-aware features from scanpy-Cell Ranger as an example of current standard practice3,22. Depending on the metric, using all features performs either well or poorly, while random and stably expressed features result in high batch-correction scores but poor biological conservation. The baseline methods were used to establish a range for each metric (for a dataset), and then all scores were scaled relative to that range. Scaling using baseline methods provides ranges that are more interpretable and are not affected by adding or removing methods.
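The scaling step amounts to rescaling each raw score by the range spanned by the baseline methods on that dataset, as in this sketch:

```python
import numpy as np

def scale_by_baselines(score, baseline_scores):
    """Scale a raw metric score relative to the range of the baseline methods
    (all features, random features, scSEGIndex and scanpy-Cell Ranger)."""
    low, high = float(np.min(baseline_scores)), float(np.max(baseline_scores))
    if high == low:
        return 0.0  # baselines provide no dynamic range for this metric
    return (score - low) / (high - low)
```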
The scaled metric scores were aggregated by taking the mean for each category. This level of aggregation gives a summarized performance for each of the methods for each task. An overall score for each dataset is obtained using a weighted mean of the task scores.
$$\mathrm{Overall}=\frac{1}{2}\times\left(\frac{\mathrm{Int.\,Batch}}{2}+\frac{\mathrm{Int.\,Bio}}{2}\right)+\frac{1}{2}\times\left(\frac{\mathrm{Mapping}}{3}+\frac{\mathrm{Class.}}{3}+\frac{\mathrm{Unseen}}{3}\right)$$
Methods were ranked at the level of metric categories, datasets and over the whole benchmark. These rankings let us evaluate which methods perform better at different tasks or scenarios. We also checked for consistency between integration approaches and variants of feature selection methods.
Further analysis examined the similarity between methods by considering the overlap in selected feature sets, calculated using the Jaccard index. We also compared results between the full HLCA dataset and the subsets representing the immune and epithelial compartments.
Final figures were produced using the ggplot2 package83 (v.3.5.0) and assembled using patchwork (v.1.2.0). Data processing was performed using tidyverse84 (v.2.0.0) packages.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.