scirpy.tl.define_clonotype_clusters
- scirpy.tl.define_clonotype_clusters(adata, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, same_j_gene=False, within_group='receptor_type', key_added=None, partitions='connected', resolution=1, n_iterations=5, distance_key=None, inplace=True, n_jobs=-1, chunksize=2000, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices')
Define clonotype clusters.
As opposed to define_clonotypes(), which employs a more stringent definition of clonotypes, this function flexibly defines clonotype clusters based on amino acid or nucleic acid sequence identity or similarity.
Requires running ir_dist() with the same sequence and metric values first.
The definition of clonotype clusters roughly follows this procedure:
- Create a list of unique receptor configurations. This collapses heavily expanded clonotypes, in which many cells share identical CDR3 sequences, into a single entry.
- Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters receptor_arms, dual_ir, same_v_gene and within_group.
- Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype cluster.
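The three steps above can be illustrated with a small, standard-library-only sketch. It is a toy model and not scirpy's implementation: it assumes a single CDR3 sequence per cell, uses a Hamming distance with a hypothetical cutoff in place of the pre-computed VJ and VDJ distance matrices, and the names `hamming` and `toy_clonotype_clusters` are made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> float:
    """Hamming distance; sequences of unequal length are treated as infinitely far apart."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def toy_clonotype_clusters(cell_to_cdr3: dict, cutoff: int = 1) -> dict:
    """Assign a cluster id to every cell (toy version of the procedure)."""
    # Step 1: collapse cells with identical sequences into unique configurations.
    cells_by_seq = defaultdict(list)
    for cell, seq in cell_to_cdr3.items():
        cells_by_seq[seq].append(cell)
    unique = list(cells_by_seq)

    # Step 2: pairwise distances between unique configurations;
    # two configurations are connected if their distance is within the cutoff.
    neighbors = {seq: set() for seq in unique}
    for a, b in combinations(unique, 2):
        if hamming(a, b) <= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Step 3: find connected components with an iterative depth-first search.
    cluster_of = {}
    next_id = 0
    for seq in unique:
        if seq in cluster_of:
            continue
        stack = [seq]
        while stack:
            cur = stack.pop()
            if cur in cluster_of:
                continue
            cluster_of[cur] = next_id
            stack.extend(n for n in neighbors[cur] if n not in cluster_of)
        next_id += 1

    # Map the cluster of each unique configuration back to its cells.
    return {cell: cluster_of[seq] for cell, seq in cell_to_cdr3.items()}


cells = {
    "cell1": "CASSLGTDTQYF",
    "cell2": "CASSLGTDTQYF",   # identical to cell1
    "cell3": "CASSLGADTQYF",   # one mismatch to cell1
    "cell4": "CAWSVGQGNEQFF",  # different length, treated as unrelated
}
clusters = toy_clonotype_clusters(cells, cutoff=1)
```

With a cutoff of 1, the first three cells end up in one cluster and the fourth in its own; with a cutoff of 0 the sketch only groups cells with identical sequences.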
- Parameters:
adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.
sequence (Literal['aa', 'nt'] (default: 'aa')) – The sequence parameter used when running scirpy.pp.ir_dist().
metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming', 'gpu_haming', 'normalized_hamming', 'tcrdist'], DistanceCalculator] (default: 'identity')) – The metric parameter used when running scirpy.pp.ir_dist().
receptor_arms (Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')) – One of the following options:
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any") the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
dual_ir (Literal['primary_only', 'all', 'any'] (default: 'any')) – One of the following options:
Distances are combined as for receptor_arms.
See also Dual IR.
same_v_gene (bool (default: False)) – Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- within_group
Enforces clonotypes to have the same group defined by one or multiple grouping variables. Per default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don’t want clonotypes to be shared across e.g. gamma-delta and alpha-beta T-cells. You can also set this to any other column in adata.obs that contains a grouping, or to None, if you want no constraints.
- key_added
The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.
Defaults to cc_{sequence}_{metric}, e.g. cc_aa_levenshtein, where cc stands for “clonotype cluster”.
The clonotype sizes will be stored in {key_added}_size, e.g. cc_aa_levenshtein_size.
The clonotype x clonotype network will be stored in {key_added}_dist, e.g. cc_aa_levenshtein_dist.
- partitions
How to find graph partitions that define a clonotype. Possible values are leiden, for using the “Leiden” algorithm, fastgreedy for using the “Fastgreedy” algorithm and connected to find connected sub-graphs (connected components).
The difference is that the Leiden and Fastgreedy algorithms further divide connected subgraphs into highly-connected modules.
“Leiden” finds the community structure of the graph using the Leiden algorithm of Traag, van Eck & Waltman.
“Fastgreedy” finds the community structure of the graph according to the algorithm of Clauset et al based on the greedy optimization of modularity.
- resolution
resolution parameter for the Leiden algorithm.
- n_iterations
n_iterations parameter for the Leiden algorithm.
- distance_key
Key in adata.uns where the sequence distances are stored. This defaults to ir_dist_{sequence}_{metric}.
- inplace
If True, adds the results to anndata, otherwise returns them.
- n_jobs
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
- airr_mod
Name of the modality in which the AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.
- airr_key
Key under which the AIRR information is stored in adata.obsm as an awkward array.
- chain_idx_key
Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.
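The minimum/maximum rule described for receptor_arms (and reused for dual_ir) may be easier to follow with concrete numbers. The sketch below is only an illustration of that rule: `combine_distances` and `is_match` are hypothetical helpers, the distances and the cutoff are made-up values, and only the "any" and "all" options described in the text are covered.

```python
def combine_distances(d_first: float, d_second: float, mode: str) -> float:
    """Combine two distances as described for receptor_arms / dual_ir.

    Only "any" and "all" are handled; the other options are outside this sketch.
    """
    if mode == "any":
        # One matching receptor is enough, so the smaller distance decides.
        return min(d_first, d_second)
    if mode == "all":
        # Both receptors must match, so the larger distance decides.
        return max(d_first, d_second)
    raise ValueError(f"mode {mode!r} is not covered by this sketch")


def is_match(d_first: float, d_second: float, mode: str, cutoff: float) -> bool:
    """Two receptor configurations match if the combined distance is within the cutoff."""
    return combine_distances(d_first, d_second, mode) <= cutoff


# Hypothetical example: identical VJ chains (distance 0), VDJ chains 3 edits apart.
d_vj, d_vdj, cutoff = 0, 3, 2
match_any = is_match(d_vj, d_vdj, "any", cutoff)  # min(0, 3) = 0 is within the cutoff
match_all = is_match(d_vj, d_vdj, "all", cutoff)  # max(0, 3) = 3 exceeds the cutoff
```

In this example the two configurations would be linked under "any" but not under "all", which is why "all" is the stricter setting.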
- Return type:
- Returns:
- clonotype
A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.
- clonotype_size
A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.
- distance_result
A dictionary containing:
- distances: A sparse, pairwise distance matrix between unique receptor configurations.
- cell_indices: A dict of lists, containing the adata.obs_names (cell indices) for each row in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
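A possible end-to-end call is sketched below. It assumes an AnnData object (not MuData) that already holds AIRR data, and it passes the same sequence and metric values to ir_dist() and define_clonotype_clusters(), as required above. The helper names `result_keys` and `cluster_by_levenshtein`, as well as the cutoff value, are hypothetical choices for this example; the key names follow the defaults documented for key_added and distance_key.

```python
from typing import Optional


def result_keys(sequence: str = "aa", metric: str = "identity",
                key_added: Optional[str] = None) -> dict:
    """Names under which results are expected, following the documented defaults."""
    key = key_added if key_added is not None else f"cc_{sequence}_{metric}"
    return {
        "clonotype": key,                                 # column in adata.obs
        "clonotype_size": f"{key}_size",                  # column in adata.obs
        "distance_key": f"ir_dist_{sequence}_{metric}",   # entry in adata.uns, read by default
    }


def cluster_by_levenshtein(adata, cutoff: int = 2):
    """Hypothetical wrapper: compute distances, then define clonotype clusters in place."""
    # Imported lazily so that result_keys() can be used without scirpy installed.
    import scirpy as ir

    # sequence and metric must be identical in both calls.
    ir.pp.ir_dist(adata, sequence="aa", metric="levenshtein", cutoff=cutoff)
    ir.tl.define_clonotype_clusters(
        adata,
        sequence="aa",
        metric="levenshtein",
        receptor_arms="all",
        dual_ir="any",
    )
    keys = result_keys("aa", "levenshtein")
    # Per-cell cluster id and cluster size, as written with inplace=True.
    return adata.obs[[keys["clonotype"], keys["clonotype_size"]]]


keys = result_keys("aa", "levenshtein")
```

Passing sequence and metric explicitly to both functions avoids relying on their respective defaults matching.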