scirpy.tl.define_clonotypes
- scirpy.tl.define_clonotypes(adata, *, key_added='clone_id', distance_key=None, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', **kwargs)
Define clonotypes based on CDR3 nucleic acid sequence identity.
As opposed to define_clonotype_clusters(), which employs a more flexible definition of clonotype clusters, this function stringently defines clonotypes based on nucleic acid sequence identity. Technically, this function is an alias to define_clonotype_clusters() with different default parameters.

The definition of clonotype(-clusters) roughly follows this procedure:
1. Create a list of unique receptor configurations. This collapses heavily expanded clonotypes, in which many cells share identical CDR3 sequences, into a single entry.
2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters receptor_arms, dual_ir, same_v_gene and within_group.
3. Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype-cluster.
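The procedure above can be illustrated with a small, pure-Python sketch. It is a toy model and not scirpy's implementation: the function name `toy_clonotypes`, the representation of a receptor as a `(VJ CDR3, VDJ CDR3)` tuple, and the union-find helper are all assumptions made for illustration. Only sequence identity is modelled, and `dual_ir`, `same_v_gene` and `within_group` are ignored.

```python
from collections import defaultdict
from itertools import combinations


def toy_clonotypes(cells, receptor_arms="all"):
    """Toy version of the three steps (hypothetical helper, not scirpy code).

    `cells` maps a cell id to a (VJ CDR3, VDJ CDR3) tuple of nucleotide strings.
    Returns a dict mapping each cell id to an integer clonotype label.
    """
    # Step 1: collapse cells into unique receptor configurations.
    cells_by_config = defaultdict(list)
    for cell_id, config in cells.items():
        cells_by_config[config].append(cell_id)
    configs = list(cells_by_config)

    # Step 2: compare all pairs of unique configurations. With sequence
    # identity, an arm either matches or it does not; "all" requires both
    # arms to match, "any" requires at least one.
    combine = all if receptor_arms == "all" else any
    edges = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(configs), 2)
        if combine(x == y for x, y in zip(a, b))
    ]

    # Step 3: find connected modules of the resulting graph (union-find).
    parent = list(range(len(configs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)

    # Expand the per-configuration labels back to individual cells.
    labels = {}
    clone_id = {}
    for idx, config in enumerate(configs):
        label = labels.setdefault(find(idx), len(labels))
        for cell_id in cells_by_config[config]:
            clone_id[cell_id] = label
    return clone_id


cells = {
    "c1": ("AAA", "CCC"),
    "c2": ("AAA", "CCC"),
    "c3": ("AAA", "GGG"),
    "c4": ("TTT", "GGG"),
}
strict = toy_clonotypes(cells, receptor_arms="all")
relaxed = toy_clonotypes(cells, receptor_arms="any")
```

In this toy data, `c1` and `c2` always share a clonotype because they collapse into one configuration in step 1. With `"any"`, configurations that share a single arm are linked, so connected modules can chain several configurations together.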
- Parameters:
adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.
receptor_arms –
- One of the following options:
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
dual_ir –
- One of the following options:
Distances are combined as for receptor_arms.
See also Dual IR.
same_v_gene –
Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
within_group – Enforces clonotypes to have the same group defined by one or multiple grouping variables. By default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don’t want clonotypes to be shared across e.g. gamma-delta and alpha-beta T-cells. You can also set this to any other column in adata.obs that contains a grouping, or to None if you want no constraints.
key_added (str (default: 'clone_id')) – The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.
inplace – If True, adds the results to anndata, otherwise returns them.
n_jobs – Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
chunksize – Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
airr_mod (default: 'airr') – Name of the modality in which AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.
airr_key (default: 'airr') – Key under which the AIRR information is stored in adata.obsm as an awkward array.
chain_idx_key (default: 'chain_indices') – Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.
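The min/max rule described for receptor_arms can be written out as a short sketch. The helper names `combine_arm_distances` and `is_match` and the example numbers are hypothetical and not part of scirpy. Only the "any" and "all" cases discussed above are covered.

```python
def combine_arm_distances(d_vj, d_vdj, receptor_arms):
    """Combine the VJ and VDJ distances of two receptors into one value."""
    if receptor_arms == "any":
        # One matching arm suffices, so the smaller distance decides.
        return min(d_vj, d_vdj)
    if receptor_arms == "all":
        # Both arms must match, so the larger distance decides.
        return max(d_vj, d_vdj)
    raise ValueError(f"unsupported receptor_arms value: {receptor_arms!r}")


def is_match(d_vj, d_vdj, receptor_arms, cutoff):
    """Return True if the combined distance is smaller than the cutoff."""
    return combine_arm_distances(d_vj, d_vdj, receptor_arms) < cutoff


# Illustrative numbers: the VJ arms are close, the VDJ arms are far apart.
any_match = is_match(3, 10, "any", cutoff=5)
all_match = is_match(3, 10, "all", cutoff=5)
```

With these numbers, the pair counts as a match under "any" (the smaller distance, 3, is below the cutoff) but not under "all" (the larger distance, 10, is not).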
- Return type:
- Returns:
- clonotype
A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.
- clonotype_size
A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.
- distance_result
- A dictionary containing
distances: A sparse, pairwise distance matrix between unique receptor configurations.
cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
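A typical call might look like the following sketch. It assumes that the pre-computed distance matrices mentioned above are produced by `scirpy.pp.ir_dist` with nucleotide sequence identity, and that `"primary_only"` is an accepted value for dual_ir. Neither is stated in this section, so check both against the documentation of your scirpy version. The wrapper `annotate_clonotypes` is a hypothetical helper.

```python
def annotate_clonotypes(adata):
    """Hypothetical helper: define clonotypes on an object holding AIRR data."""
    # Imported lazily so that this sketch can be loaded without scirpy installed.
    import scirpy as ir

    # Assumption: pre-compute the VJ and VDJ distance matrices based on
    # nucleotide sequence identity before defining clonotypes.
    ir.pp.ir_dist(adata, metric="identity", sequence="nt")

    # Both receptor arms must match. dual_ir="primary_only" is an assumed
    # option value that is not listed in this section.
    ir.tl.define_clonotypes(adata, receptor_arms="all", dual_ir="primary_only")

    # With the default key_added and inplace=True, the results are stored in
    # adata.obs["clone_id"], adata.obs["clone_id_size"] and adata.uns["clone_id"].
    return adata
```

Other arguments, such as same_v_gene or within_group, can be passed to define_clonotypes in the same way as keyword arguments.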