scirpy.tl.define_clonotype_clusters
- scirpy.tl.define_clonotype_clusters(adata, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, same_j_gene=False, within_group='receptor_type', key_added=None, partitions='connected', resolution=1, n_iterations=5, distance_key=None, inplace=True, n_jobs=-1, chunksize=2000, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices')
Define clonotype clusters.
As opposed to define_clonotypes(), which employs a more stringent definition of clonotypes, this function flexibly defines clonotype clusters based on amino acid or nucleic acid sequence identity or similarity.
Requires running ir_dist() with the same sequence and metric values first.
- The definition of clonotype clusters roughly follows this procedure:
1. Create a list of unique receptor configurations. This is useful to collapse heavily expanded clonotypes, leading to many cells with identical CDR3 sequences, to a single entry.
2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
3. Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype-cluster.
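The three steps above can be sketched in plain Python. This is a toy illustration only, not scirpy's implementation: the function names, the Hamming distance, the cutoff, and the rule that both arms must match are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Toy distance: number of mismatches; sequences of unequal length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def toy_clonotype_clusters(cells, cutoff=1):
    """cells maps a cell id to a (VJ CDR3, VDJ CDR3) tuple (hypothetical input format)."""
    # Step 1: collapse cells with identical receptors into unique configurations.
    config_to_cells = defaultdict(list)
    for cell_id, config in cells.items():
        config_to_cells[config].append(cell_id)
    configs = sorted(config_to_cells)

    # Step 2: pairwise distances between unique configurations.
    # Both arms must be within the cutoff, so the larger distance decides.
    neighbors = {config: set() for config in configs}
    for a, b in combinations(configs, 2):
        dist = max(hamming(a[0], b[0]), hamming(a[1], b[1]))
        if dist <= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Step 3: every connected module of the graph becomes one cluster.
    cluster_of = {}
    next_id = 0
    for start in configs:
        if start in cluster_of:
            continue
        cluster_of[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbors[node]:
                if other not in cluster_of:
                    cluster_of[other] = next_id
                    stack.append(other)
        next_id += 1

    return {cell_id: cluster_of[config] for cell_id, config in cells.items()}


cells = {
    "cell1": ("CAVRDF", "CASSLG"),
    "cell2": ("CAVRDF", "CASSLG"),  # identical to cell1
    "cell3": ("CAVRDF", "CASSLA"),  # one mismatch in the VDJ sequence
    "cell4": ("CALSEW", "CSARDT"),  # unrelated receptor
}
clusters = toy_clonotype_clusters(cells, cutoff=1)
```

In this toy input, cell1 to cell3 end up in one cluster and cell4 in its own, because only the first three receptors are linked within the cutoff.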
- Parameters:
adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.
sequence (Literal['aa', 'nt'] (default: 'aa')) – The sequence parameter used when running scirpy.pp.ir_dist()
metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming', 'normalized_hamming', 'tcrdist'], DistanceCalculator] (default: 'identity')) – The metric parameter used when running scirpy.pp.ir_dist()
receptor_arms (Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')) – One of 'VJ', 'VDJ', 'all' or 'any'.
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
dual_ir (Literal['primary_only', 'all', 'any'] (default: 'any')) – One of 'primary_only', 'all' or 'any'.
Distances are combined as for receptor_arms.
See also Dual IR.
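The minimum/maximum rule for "any" and "all" can be illustrated with a small sketch. The helper name `combine` and the cutoff value are assumptions made for this example and are not part of the scirpy API.

```python
def combine(d1, d2, mode):
    """Combine two distances: 'any' keeps the smaller one, 'all' the larger one."""
    if mode == "any":
        return min(d1, d2)
    if mode == "all":
        return max(d1, d2)
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical distances for the two arms of a receptor pair.
vj_dist, vdj_dist = 0, 3
cutoff = 2  # assumed cutoff, chosen only for illustration

# One arm within the cutoff is enough for "any" ...
any_match = combine(vj_dist, vdj_dist, "any") <= cutoff
# ... whereas "all" requires both arms to be within the cutoff.
all_match = combine(vj_dist, vdj_dist, "all") <= cutoff
```

With these numbers the pair matches under "any" (the VJ distance is below the cutoff) but not under "all" (the VDJ distance exceeds it).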
same_v_gene (bool (default: False)) – Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- within_group
Enforces clonotypes to have the same group defined by one or multiple grouping variables. By default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don’t want clonotypes to be shared across e.g. gamma-delta and alpha-beta T-cells. You can also set this to any other column in adata.obs that contains a grouping, or to None if you want no constraints.
- key_added
The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.
Defaults to cc_{sequence}_{metric}, e.g. cc_aa_levenshtein, where cc stands for “clonotype cluster”.
The clonotype sizes will be stored in {key_added}_size, e.g. cc_aa_levenshtein_size.
The clonotype x clonotype network will be stored in {key_added}_dist, e.g. cc_aa_levenshtein_dist.
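The naming scheme described for key_added can be spelled out in a short sketch. Both `default_keys` and `cluster_with_scirpy` are hypothetical helpers written for this example. The wrapper assumes that scirpy is installed and that ir_dist() accepts the same sequence and metric keywords, as the description above requires; it imports scirpy lazily so that the naming helper works on its own.

```python
def default_keys(sequence="aa", metric="identity"):
    """Return the default key names derived from sequence and metric."""
    key_added = f"cc_{sequence}_{metric}"
    return {
        "clusters": key_added,            # clonotype cluster per cell
        "sizes": f"{key_added}_size",     # cluster size per cell
        "network": f"{key_added}_dist",   # clonotype x clonotype network
    }


def cluster_with_scirpy(adata, sequence="aa", metric="levenshtein"):
    """Hypothetical wrapper: compute distances, then define clonotype clusters."""
    import scirpy as ir  # imported here so the helper above has no dependency

    # The distances must be computed with the same sequence and metric first.
    ir.pp.ir_dist(adata, sequence=sequence, metric=metric)
    ir.tl.define_clonotype_clusters(adata, sequence=sequence, metric=metric)
    return default_keys(sequence, metric)["clusters"]


keys = default_keys("aa", "levenshtein")
```

Calling `default_keys("aa", "levenshtein")` reproduces the example names given above, such as cc_aa_levenshtein and cc_aa_levenshtein_size.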
- partitions
How to find graph partitions that define a clonotype. Possible values are leiden, for using the “Leiden” algorithm, fastgreedy, for using the “Fastgreedy” algorithm, and connected, to find connected sub-graphs (connected components).
The difference is that the Leiden and Fastgreedy algorithms further divide connected sub-graphs into highly-connected modules.
“Leiden” finds the community structure of the graph using the Leiden algorithm of Traag, van Eck & Waltman.
“Fastgreedy” finds the community structure of the graph according to the algorithm of Clauset et al., based on the greedy optimization of modularity.
- resolution
resolution parameter for the leiden algorithm.
- n_iterations
n_iterations parameter for the leiden algorithm.
- distance_key
Key in adata.uns where the sequence distances are stored. This defaults to ir_dist_{sequence}_{metric}.
- inplace
If True, adds the results to anndata, otherwise returns them.
- n_jobs
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
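How n_jobs and chunksize interact can be sketched as follows. The function `plan_chunks` is a hypothetical helper that only mirrors the rules stated above (all cores for -1, a single worker below 2 * chunksize). It treats the number of cells as the number of objects to chunk and does not reproduce scirpy's actual scheduling.

```python
import os


def plan_chunks(n_cells, n_jobs=-1, chunksize=2000):
    """Return the number of workers and the (start, stop) ranges of each chunk."""
    # -1 means "use all cores"; fall back to 1 if the core count is unknown.
    n_workers = (os.cpu_count() or 1) if n_jobs == -1 else n_jobs

    # Small inputs are handled by a single worker to avoid overhead.
    if n_cells < 2 * chunksize:
        n_workers = 1

    # Split the objects into consecutive chunks of at most `chunksize` items.
    chunks = [
        (start, min(start + chunksize, n_cells))
        for start in range(0, n_cells, chunksize)
    ]
    return n_workers, chunks


small_workers, small_chunks = plan_chunks(3000, n_jobs=4, chunksize=2000)
large_workers, large_chunks = plan_chunks(10000, n_jobs=4, chunksize=2000)
```

In this sketch, 3000 cells fall below 2 * 2000 and are processed by one worker, while 10000 cells are split into five chunks shared among the four requested workers.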
- airr_mod
Name of the modality in which the AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.
- airr_key
Key under which the AIRR information is stored in adata.obsm as an awkward array.
- chain_idx_key
Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.
- Return type:
- Returns:
- clonotype
A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.
- clonotype_size
A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.
- distance_result
A dictionary containing
distances: A sparse, pairwise distance matrix between unique receptor configurations
cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
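The relationship between the per-row cell_indices and the per-cell return values can be sketched as follows. The helper `expand_to_cells`, the string row keys, and the `row_cluster` mapping are assumptions made for illustration; they do not describe the exact structure scirpy stores.

```python
from collections import Counter


def expand_to_cells(cell_indices, row_cluster):
    """Map per-row cluster labels back to individual cells.

    cell_indices: row of the distance matrix -> list of cell names (assumed layout)
    row_cluster:  row of the distance matrix -> cluster id (hypothetical input)
    """
    # Every cell inherits the cluster of its unique receptor configuration.
    clonotype = {}
    for row, cell_names in cell_indices.items():
        for cell in cell_names:
            clonotype[cell] = row_cluster[row]

    # The size of a cluster is the number of cells assigned to it.
    sizes = Counter(clonotype.values())
    clonotype_size = {cell: sizes[cluster] for cell, cluster in clonotype.items()}
    return clonotype, clonotype_size


cell_indices = {"0": ["c1", "c2"], "1": ["c3"], "2": ["c4"]}
row_cluster = {"0": "0", "1": "0", "2": "1"}
clonotype, clonotype_size = expand_to_cells(cell_indices, row_cluster)
```

Here rows 0 and 1 belong to the same cluster, so c1, c2 and c3 share a clonotype id and each reports a cluster size of three, while c4 forms a cluster of size one.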