scirpy.pp.ir_dist#
- scirpy.pp.ir_dist(adata, reference=None, *, metric='identity', cutoff=None, sequence='nt', key_added=None, inplace=True, n_jobs=-1, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices')#
Computes a sequence-distance metric between all unique VJ CDR3 sequences and between all unique VDJ CDR3 sequences.
This is a required proprocessing step for clonotype definition and clonotype networks and for querying reference databases.
Calculates the full pairwise distance matrix.
Important
Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).
That means, a
distance > cutoff
is represented as0
, adistance == 0
is represented as1
, adistance == 1
is represented as2
and so on.Only returns distances
<= cutoff
. Larger distances are eliminated from the sparse matrix.Distances are non-negative.
- Parameters:
adata (
Union
[AnnData
,MuData
,DataHandler
]) – AnnData or MuData object that contains AIRR information.reference (
Union
[AnnData
,MuData
,DataHandler
,None
] (default:None
)) – AnotherAnnData
object, can be either a second dataset with IR information or a epitope database. If specified, will compute distances between the sequences inadata
and the sequences inreference
. Otherwise computes pairwise distances of the sequences inadata
.metric (
Union
[Literal
['alignment'
,'fastalignment'
,'identity'
,'levenshtein'
,'hamming'
],DistanceCalculator
] (default:'identity'
)) –- You can choose one of the following metrics:
identity
– 1 for identical sequences, 0 otherwise. SeeIdentityDistanceCalculator
. This metric implies a cutoff of 0.levenshtein
– Levenshtein edit distance. SeeLevenshteinDistanceCalculator
.hamming
– Hamming distance for CDR3 sequences of equal length. SeeHammingDistanceCalculator
.alignment
– Distance based on pairwise sequence alignments using the BLOSUM62 matrix. This option is incompatible with nucleotide sequences. SeeFastAlignmentDistanceCalculator
.fastalignment
– Distance based on pairwise sequence alignments using the BLOSUM62 matrix. Faster implementation ofalignment
with some loss. This option is incompatible with nucleotide sequences. SeeFastAlignmentDistanceCalculator
.any instance of
DistanceCalculator
.
cutoff (
Optional
[int
] (default:None
)) – All distances> cutoff
will be replaced by0
and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric, you can find information in the corresponding docs. If set toNone
, the cutoff will be10
for thealignment
andfastalignment
metric, and2
forlevenshtein
andhamming
. For the identity metric, the cutoff is ignored and always set to0
.sequence (
Literal
['aa'
,'nt'
] (default:'nt'
)) – Compute distances based on amino acid (aa
) or nucleotide (nt
) sequences.key_added (
Optional
[str
] (default:None
)) – Dictionary key under which the results will be stored inadata.uns
ifinplace=True
. Defaults toir_dist_{sequence}_{metric}
orir_dist_{name}_{sequence}_{metric}
ifreference
is specified. Ifmetric
is an instance ofscirpy.ir_dist.metrics.DistanceCalculator
,{metric}
defaults tocustom
.{name}
is taken fromreference.uns["DB"]["name"]
. Ifreference
does not have a"DB"
entry,key_added
needs to be specified manually.inplace (
bool
(default:True
)) – If true, store the result inadata.uns
. Otherwise return a dictionary with the results.n_jobs (
int
(default:-1
)) – Number of cores to use for distance calculation. Passed on toscirpy.ir_dist.metrics.DistanceCalculator
.joblib.Parallel
is used internally. Via thejoblib.parallel_config
context manager, you can set another backend (e.g.dask
) and adjust other configuration options.airr_mod (
str
(default:'airr'
)) – Name of the modality with AIRR information is stored in theMuData
object. if anAnnData
object is passed to the function, this parameter is ignored.airr_key (
str
(default:'airr'
)) – Key under which the AIRR information is stored in adata.obsm as an awkward array.chain_idx_key (
str
(default:'chain_indices'
)) – Key under which the chain indices are stored in adata.obsm. If chain indices are not present,index_chains()
is run with default parameters.airr_mod_ref (
str
(default:'airr'
)) – Likeairr_mod
, but forreference
.airr_key_ref (
str
(default:'airr'
)) – Likeairr_key
, but forreference
.chain_idx_key_ref (
str
(default:'chain_indices'
)) – Likechain_idx_key
, but forreference
.
- Return type:
- Returns:
Depending on the value of
inplace
either returns nothing or a dictionary with sparse, pairwise distance matrices for allVJ
andVDJ
sequences.