scirpy.pp.ir_dist#
- scirpy.pp.ir_dist(adata, reference=None, *, metric='identity', cutoff=None, sequence='nt', key_added=None, inplace=True, n_jobs=-1, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices', **kwargs)#
Computes a sequence-distance metric between all unique VJ CDR3 sequences and between all unique VDJ CDR3 sequences.
This is a required proprocessing step for clonotype definition and clonotype networks and for querying reference databases.
Calculates the full pairwise distance matrix.
Important
Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).
That means, a
distance > cutoff
is represented as0
, adistance == 0
is represented as1
, adistance == 1
is represented as2
and so on.Only returns distances
<= cutoff
. Larger distances are eliminated from the sparse matrix.Distances are non-negative.
- Parameters:
adata (
Union
[AnnData
,MuData
,DataHandler
]) – AnnData or MuData object that contains AIRR information.reference (
Union
[AnnData
,MuData
,DataHandler
,None
] (default:None
)) – AnotherAnnData
object, can be either a second dataset with IR information or a epitope database. If specified, will compute distances between the sequences inadata
and the sequences inreference
. Otherwise computes pairwise distances of the sequences inadata
.metric (
Union
[Literal
['alignment'
,'fastalignment'
,'identity'
,'levenshtein'
,'hamming'
,'gpu_haming'
,'normalized_hamming'
,'tcrdist'
],DistanceCalculator
] (default:'identity'
)) –- You can choose one of the following metrics:
identity
– 1 for identical sequences, 0 otherwise. SeeIdentityDistanceCalculator
. This metric implies a cutoff of 0.levenshtein
– Levenshtein edit distance. SeeLevenshteinDistanceCalculator
.tcrdist
– Distance based on pairwise sequence alignments between TCR CDR3 sequences based on the tcrdist metric. SeeTCRdistDistanceCalculator
.hamming
– Hamming distance for CDR3 sequences of equal length. SeeHammingDistanceCalculator
.gpu_hamming
– Hamming distance for CDR3 sequences of equal length calculated with a GPU. SeeGPUHammingDistanceCalculator
.normalized_hamming
– Normalized Hamming distance (in percent) for CDR3 sequences of equal length. SeeHammingDistanceCalculator
.alignment
– Distance based on pairwise sequence alignments using the BLOSUM62 matrix. This option is incompatible with nucleotide sequences. SeeFastAlignmentDistanceCalculator
.fastalignment
– Distance based on pairwise sequence alignments using the BLOSUM62 matrix. Faster implementation ofalignment
with some loss. This option is incompatible with nucleotide sequences. SeeFastAlignmentDistanceCalculator
.any instance of
DistanceCalculator
.
cutoff (
Optional
[int
] (default:None
)) – All distances> cutoff
will be replaced by0
and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric, you can find information in the corresponding docs. If set toNone
, the cutoff will be10
for thealignment
andfastalignment
metric, and2
forlevenshtein
andhamming
. For the identity metric, the cutoff is ignored and always set to0
.sequence (
Literal
['aa'
,'nt'
] (default:'nt'
)) – Compute distances based on amino acid (aa
) or nucleotide (nt
) sequences.key_added (
Optional
[str
] (default:None
)) – Dictionary key under which the results will be stored inadata.uns
ifinplace=True
. Defaults toir_dist_{sequence}_{metric}
orir_dist_{name}_{sequence}_{metric}
ifreference
is specified. Ifmetric
is an instance ofscirpy.ir_dist.metrics.DistanceCalculator
,{metric}
defaults tocustom
.{name}
is taken fromreference.uns["DB"]["name"]
. Ifreference
does not have a"DB"
entry,key_added
needs to be specified manually.inplace (
bool
(default:True
)) – If true, store the result inadata.uns
. Otherwise return a dictionary with the results.n_jobs (
int
(default:-1
)) – Number of cores to use for distance calculation.joblib.Parallel
is used internally. Via thejoblib.parallel_config
context manager, you can set another backend (e.g.dask
) and adjust other configuration options. The metricshamming
,normalized_hamming
, andtcrdist
utilizenumba
for parallelization with multithreading instead.airr_mod (
str
(default:'airr'
)) – Name of the modality with AIRR information is stored in theMuData
object. if anAnnData
object is passed to the function, this parameter is ignored.airr_key (
str
(default:'airr'
)) – Key under which the AIRR information is stored in adata.obsm as an awkward array.chain_idx_key (
str
(default:'chain_indices'
)) – Key under which the chain indices are stored in adata.obsm. If chain indices are not present,index_chains()
is run with default parameters.airr_mod_ref (
str
(default:'airr'
)) – Likeairr_mod
, but forreference
.airr_key_ref (
str
(default:'airr'
)) – Likeairr_key
, but forreference
.chain_idx_key_ref (
str
(default:'chain_indices'
)) – Likechain_idx_key
, but forreference
.**kwargs – Arguments are passed to the respective
DistanceCalculator
class. Check out the distance calculator for the respective metric to see parameters specific to individual distance calculators that can be passed via kwargs.
- Return type:
- Returns:
Depending on the value of
inplace
either returns nothing or a dictionary with sparse, pairwise distance matrices for allVJ
andVDJ
sequences.