scirpy.tl.define_clonotype_clusters

scirpy.tl.define_clonotype_clusters(adata, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, same_j_gene=False, within_group='receptor_type', key_added=None, partitions='connected', resolution=1, n_iterations=5, distance_key=None, inplace=True, n_jobs=-1, chunksize=2000, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices')

Define clonotype clusters.

As opposed to define_clonotypes(), which employs a more stringent definition of clonotypes, this function flexibly defines clonotype clusters based on amino acid or nucleic acid sequence identity or similarity.

Requires running ir_dist() with the same sequence and metric values first.
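
A minimal sketch of this call order follows. The wrapper name `cluster_by_sequence_similarity` and the `cutoff` value are illustrative assumptions, not part of scirpy; only `scirpy.pp.ir_dist()` and `scirpy.tl.define_clonotype_clusters()` are library functions.

```python
def cluster_by_sequence_similarity(adata, sequence="aa", metric="levenshtein", cutoff=2):
    """Hypothetical helper: compute sequence distances, then define clonotype clusters."""
    # Imported lazily so the helper can be defined without scirpy installed.
    import scirpy as ir

    # Step 1: pre-compute the sequence distance matrices.
    # The cutoff of 2 is an illustrative choice, not a recommendation.
    ir.pp.ir_dist(adata, sequence=sequence, metric=metric, cutoff=cutoff)

    # Step 2: define clusters with the *same* sequence and metric values,
    # so that the matching distance matrices are found.
    ir.tl.define_clonotype_clusters(
        adata,
        sequence=sequence,
        metric=metric,
        receptor_arms="all",
        dual_ir="any",
    )
    return adata
```

Passing `sequence` and `metric` through a single wrapper is one way to keep the two calls from drifting apart.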

Clonotype clusters are defined roughly as follows:

1. Create a list of unique receptor configurations. This is useful to collapse heavily expanded clonotypes, leading to many cells with identical CDR3 sequences, to a single entry.
2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
3. Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype cluster.
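
The procedure above can be illustrated with a toy, pure-Python sketch. It is not scirpy's implementation: each receptor is reduced to a single hypothetical CDR3 string, a simple Hamming distance stands in for the pre-computed ir_dist() matrices, and the cutoff of 1 is an arbitrary choice.

```python
def connected_components(n_nodes, edges):
    """Label each node with the id of its connected component (union-find)."""
    parent = list(range(n_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Relabel component roots as consecutive ids 0, 1, 2, ...
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n_nodes)]


def hamming(a, b):
    """Toy distance; sequences of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical input: one CDR3 amino acid sequence per cell.
cells = ["CASSLGT", "CASSLGT", "CASSLGT", "CASSLGA", "CAWSVQG", "CASSLGT"]

# Step 1: collapse identical receptor configurations to a single entry.
unique_configs = sorted(set(cells))

# Step 2: pairwise distances between unique configurations; keep pairs within the cutoff.
cutoff = 1
edges = [
    (i, j)
    for i in range(len(unique_configs))
    for j in range(i + 1, len(unique_configs))
    if hamming(unique_configs[i], unique_configs[j]) <= cutoff
]

# Step 3: each connected module of the resulting graph is one clonotype cluster.
labels = connected_components(len(unique_configs), edges)
config_to_cluster = dict(zip(unique_configs, labels))
cell_clusters = [config_to_cluster[c] for c in cells]
print(cell_clusters)  # [0, 0, 0, 0, 1, 0]
```

Because the four identical cells collapse to one entry in step 1, the distance computation in step 2 runs on three configurations instead of six cells.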

Parameters:

adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.
sequence (Literal['aa', 'nt'] (default: 'aa')) – The sequence parameter used when running scirpy.pp.ir_dist()
metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming', 'gpu_hamming', 'normalized_hamming', 'tcrdist'], DistanceCalculator] (default: 'identity')) – The metric parameter used when running scirpy.pp.ir_dist()
receptor_arms (Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')) –
One of the following options:
- "VJ" - only consider VJ sequences
- "VDJ" - only consider VDJ sequences
- "all" - both VJ and VDJ need to match
- "any" - either VJ or VDJ needs to match
If "any", the two distances are combined by taking their minimum. If "all", the two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
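
The min/max rule can be sketched as follows. The function name `combine_arm_distances`, the example distances and the cutoff are hypothetical; the sketch ignores details such as missing chains that a real implementation has to handle.

```python
def combine_arm_distances(d_vj, d_vdj, receptor_arms="all"):
    """Hypothetical helper: combine VJ and VDJ distances as described above."""
    if receptor_arms == "VJ":
        return d_vj
    if receptor_arms == "VDJ":
        return d_vdj
    if receptor_arms == "any":
        return min(d_vj, d_vdj)  # one matching arm is enough
    if receptor_arms == "all":
        return max(d_vj, d_vdj)  # both arms must be within the cutoff
    raise ValueError(f"unknown receptor_arms value: {receptor_arms!r}")


# Example: the VJ chains are close, the VDJ chains are far apart.
cutoff = 2
d_vj, d_vdj = 1, 5
matches = {
    mode: combine_arm_distances(d_vj, d_vdj, mode) <= cutoff
    for mode in ("VJ", "VDJ", "all", "any")
}
print(matches)  # {'VJ': True, 'VDJ': False, 'all': False, 'any': True}
```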
dual_ir (Literal['primary_only', 'all', 'any'] (default: 'any')) –
One of the following options:
- "primary_only" - only consider the most abundant pair of VJ/VDJ chains
- "any" - consider both pairs of VJ/VDJ sequences. Distance must be below cutoff for any of the chains.
- "all" - consider both pairs of VJ/VDJ sequences. Distance must be below cutoff for all of the chains.
Distances are combined as for receptor_arms.

See also Dual IR.
same_v_gene (bool (default: False)) –
Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.

V genes are matched based on the behaviour defined with receptor_arms and dual_ir.

within_group

Enforces clonotypes to have the same group defined by one or multiple grouping variables. By default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don’t want clonotypes to be shared across e.g. gamma-delta and alpha-beta T cells. You can also set this to any other column in adata.obs that contains a grouping, or to None, if you want no constraints.

key_added

The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.

Defaults to cc_{sequence}_{metric}, e.g. cc_aa_levenshtein, where cc stands for “clonotype cluster”.

The clonotype sizes will be stored in {key_added}_size, e.g. cc_aa_levenshtein_size.

The clonotype x clonotype network will be stored in {key_added}_dist, e.g. cc_aa_levenshtein_dist.
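
The naming scheme above can be summarized in a small sketch. The helper `default_keys` and the dictionary labels are hypothetical and only restate the documented defaults.

```python
def default_keys(sequence="aa", metric="identity", key_added=None):
    """Hypothetical helper: derive the storage keys described above."""
    if key_added is None:
        key_added = f"cc_{sequence}_{metric}"
    return {
        "clusters": key_added,            # clonotype cluster per cell
        "sizes": f"{key_added}_size",     # cluster size per cell
        "network": f"{key_added}_dist",   # clonotype x clonotype network
    }


print(default_keys("aa", "levenshtein"))
# {'clusters': 'cc_aa_levenshtein', 'sizes': 'cc_aa_levenshtein_size', 'network': 'cc_aa_levenshtein_dist'}
```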

partitions

How to find graph partitions that define a clonotype. Possible values are leiden, for using the “Leiden” algorithm, fastgreedy for using the “Fastgreedy” algorithm and connected to find connected subgraphs.

The difference is that the Leiden and Fastgreedy algorithms further divide connected subgraphs into highly connected modules.

“Leiden” finds the community structure of the graph using the Leiden algorithm of Traag, van Eck & Waltman.

“Fastgreedy” finds the community structure of the graph according to the algorithm of Clauset et al., based on the greedy optimization of modularity.
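
To illustrate why a modularity-based method may split a subgraph that connected keeps whole, the sketch below computes the modularity of two partitions of a toy graph: two triangles joined by a single edge. It does not run Leiden or Fastgreedy; the graph and the helper `modularity` are assumptions made for illustration.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    membership = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)  # edges with both ends in the community
    degree = [0] * len(communities)    # sum of node degrees per community
    for a, b in edges:
        degree[membership[a]] += 1
        degree[membership[b]] += 1
        if membership[a] == membership[b]:
            internal[membership[a]] += 1
    return sum(
        internal[c] / m - (degree[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Two triangles (0-1-2 and 3-4-5) joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# "connected" would treat all six nodes as one cluster.
q_connected = modularity(edges, [{0, 1, 2, 3, 4, 5}])

# A modularity-based method would favour the split into two triangles.
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])

print(q_connected < q_split)  # True
```

In this toy graph the split scores 5/14 against 0 for the single cluster, which is the kind of structure the community-detection options are meant to pick up.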

resolution

The resolution parameter for the Leiden algorithm.

n_iterations

The n_iterations parameter for the Leiden algorithm.

distance_key

Key in adata.uns where the sequence distances are stored. This defaults to ir_dist_{sequence}_{metric}.

inplace

If True, adds the results to adata; otherwise returns them.

n_jobs

Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread is used to avoid overhead.

chunksize

Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
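
The interplay of n_jobs and chunksize described above might be sketched like this. Both helpers are hypothetical and only mirror the documented rule; they are not scirpy's actual scheduling code.

```python
import os


def plan_workers(n_cells, n_jobs=-1, chunksize=2000):
    """Hypothetical helper: number of workers under the documented rule."""
    if n_cells < 2 * chunksize:
        return 1  # too little work to justify parallel overhead
    if n_jobs == -1:
        return os.cpu_count() or 1  # use all available cores
    return n_jobs


def iter_chunks(items, chunksize):
    """Yield consecutive slices of at most `chunksize` items."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


print(plan_workers(3000))             # 1, because 3000 < 2 * 2000
print(plan_workers(10000, n_jobs=4))  # 4
print([len(c) for c in iter_chunks(list(range(5000)), 2000)])  # [2000, 2000, 1000]
```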

airr_mod

Name of the modality in the MuData object in which the AIRR information is stored. If an AnnData object is passed to the function, this parameter is ignored.

airr_key

Key under which the AIRR information is stored in adata.obsm as an awkward array.

chain_idx_key

Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.

Return type:

tuple[Series, Series, dict] | None

Returns:

clonotype

A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.

clonotype_size

A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.

distance_result

A dictionary containing

distances: A sparse, pairwise distance matrix between unique receptor configurations
cell_indices: A dict of lists, containing the adata.obs_names (cell indices) for each row in the distance matrix.

If inplace is True, this is added to adata.uns[key_added].
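
The relationship between clonotype and clonotype_size can be shown with a toy example. Plain dictionaries stand in for the returned Series, and the cell names and cluster ids are made up.

```python
from collections import Counter

# Hypothetical clonotype cluster id per cell (stands in for the `clonotype` Series).
clonotype = {"cell1": "0", "cell2": "0", "cell3": "1", "cell4": "0"}

# Number of cells per cluster.
cells_per_cluster = Counter(clonotype.values())

# Each cell is annotated with the size of the cluster it belongs to
# (stands in for the `clonotype_size` Series).
clonotype_size = {cell: cells_per_cluster[cid] for cell, cid in clonotype.items()}
print(clonotype_size)  # {'cell1': 3, 'cell2': 3, 'cell3': 1, 'cell4': 3}
```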
