Working with >1M cells#
Scirpy scales to millions of cells on a single workstation. This page is a work-in-progress collection of advice on how to work with large datasets.
Use an up-to-date version!
Scalability has been a major focus of recent developments in Scirpy. Make sure you use the latest version when working with large datasets to take advantage of all speedups.
Distance metrics#
Computing pairwise sequence distances is the major bottleneck for large datasets in the Scirpy workflow. Here is some advice on how to maximize the speed of this step:
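To see why this step dominates, note that the number of pairwise comparisons grows roughly quadratically with the number of unique receptor sequences. A back-of-the-envelope sketch (illustrative only, not part of the Scirpy API):

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n unique sequences."""
    return n * (n - 1) // 2

# Ten thousand unique sequences already require ~50 million comparisons.
print(n_pairs(10_000))  # 49995000
# A million unique sequences require ~5e11 comparisons.
print(n_pairs(1_000_000))
```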
Choose an appropriate distance metric for pp.ir_dist#
Some distance metrics are significantly faster than others. Here are the distance metrics, roughly ordered from fastest to slowest:

identity > gpu_hamming > hamming = normalized_hamming > tcrdist > levenshtein > fastalignment > alignment
tcrdist, fastalignment, and alignment produce very similar distance matrices, but tcrdist is by far the fastest. For this reason, we recommend tcrdist whenever you need a metric that takes a substitution matrix into account.
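The speed gap is easy to understand from what the metrics compute. Hamming-style metrics compare sequences position by position and only consider equal-length pairs, whereas alignment-based metrics must solve an alignment problem per pair. A minimal sketch of the Hamming case (illustrative only, not Scirpy's implementation):

```python
def hamming(a: str, b: str) -> int:
    # Hamming distance is only defined for equal-length sequences;
    # this cheap per-position comparison is why the metric is fast.
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("CASSLGTDTQYF", "CASSLGADTQYF"))  # 1
```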
Multi-machine parallelization with dask#
The hamming, normalized_hamming, tcrdist, levenshtein, fastalignment, and alignment metrics are parallelized using joblib. This makes it very easy to switch the backend to dask and distribute jobs across a multi-machine cluster. Note that this comes with considerable overhead for communication between the workers; it is only worthwhile when processing on a single machine becomes infeasible.
```python
import joblib
from dask.distributed import Client, LocalCluster

import scirpy as ir

# substitute this with a multi-machine cluster...
cluster = LocalCluster(n_workers=16)
client = Client(cluster)

with joblib.parallel_config(backend="dask", n_jobs=200, verbose=10):
    ir.pp.ir_dist(
        mdata,
        metric="tcrdist",
        n_jobs=1,  # jobs per worker
        n_blocks=20,  # number of blocks sent to dask
    )
```
Using GPU acceleration for hamming distance#
The Hamming distance metric supports GPU acceleration via cupy.
First, install the optional cupy dependency:

```
!pip install scirpy[cupy]
```
Then simply run

```python
ir.pp.ir_dist(mdata, metric="gpu_hamming")
```

to take advantage of GPU acceleration.
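Since cupy is an optional dependency, a small guard can keep the same script working on machines without a GPU stack. This fallback pattern is an illustrative sketch, not part of Scirpy's API:

```python
# Use the GPU metric only when cupy is importable; otherwise fall back
# to the equivalent CPU hamming metric.
try:
    import cupy  # noqa: F401
    metric = "gpu_hamming"
except ImportError:
    metric = "hamming"

print(metric)
# ir.pp.ir_dist(mdata, metric=metric)
```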