scirpy.ir_dist.sequence_dist
- scirpy.ir_dist.sequence_dist(seqs, seqs2=None, *, metric='identity', cutoff=None, n_jobs=-1, **kwargs)
Calculate a sequence x sequence distance matrix.
Calculates the full pairwise distance matrix.
Important
Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)). That means a distance > cutoff is represented as 0, a distance == 0 is represented as 1, a distance == 1 is represented as 2, and so on.
Only returns distances <= cutoff. Larger distances are eliminated from the sparse matrix.
Distances are non-negative.
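The offset convention can be illustrated with a small pure-Python sketch. The helpers `encode_distance` and `decode_entry` are hypothetical names used only for illustration; they are not part of scirpy, and the sketch only mirrors the rules stated above.

```python
from typing import Optional


def encode_distance(d: int, cutoff: int) -> int:
    """Map a raw distance to its stored value (d' = d + 1), or 0 if above the cutoff."""
    if d < 0:
        raise ValueError("distances are non-negative")
    return d + 1 if d <= cutoff else 0


def decode_entry(stored: int) -> Optional[int]:
    """Recover the raw distance from a stored value; None means 'greater than cutoff'."""
    return None if stored == 0 else stored - 1


# With cutoff=2, distances 0, 1, 2 are stored as 1, 2, 3; distance 3 is dropped (0).
stored = [encode_distance(d, cutoff=2) for d in (0, 1, 2, 3)]
decoded = [decode_entry(s) for s in stored]
```

When reading values out of the returned matrix, subtracting 1 from each stored (non-zero) entry gives back the original distance; a missing entry only says that the distance exceeded the cutoff.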
When seqs or seqs2 includes non-unique values, the function internally uses only unique sequences to calculate the distances. Note that, if the input arrays contain large numbers of duplicated values (i.e. hundreds each), this will lead to large “dense” blocks in the sparse matrix. This will result in slow processing and high memory usage.
- Parameters:
seqs (Sequence[str]) – Numpy array of nucleotide or amino acid sequences. Note that not all distance metrics support nucleotide sequences.
seqs2 (Optional[Sequence[str]] (default: None)) – Second array of sequences. When omitted, sequence_dist computes the square matrix of seqs against itself.
metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming', 'normalized_hamming', 'tcrdist'], DistanceCalculator] (default: 'identity')) – You can choose one of the following metrics:
  - identity – 1 for identical sequences, 0 otherwise. See IdentityDistanceCalculator. This metric implies a cutoff of 0.
  - levenshtein – Levenshtein edit distance. See LevenshteinDistanceCalculator.
  - tcrdist – Distance based on pairwise sequence alignments between TCR CDR3 sequences based on the tcrdist metric. See TCRdistDistanceCalculator.
  - hamming – Hamming distance for CDR3 sequences of equal length. See HammingDistanceCalculator.
  - normalized_hamming – Normalized Hamming distance (in percent) for CDR3 sequences of equal length. See HammingDistanceCalculator.
  - alignment – Distance based on pairwise sequence alignments using the BLOSUM62 matrix. This option is incompatible with nucleotide sequences. See FastAlignmentDistanceCalculator.
  - fastalignment – Distance based on pairwise sequence alignments using the BLOSUM62 matrix. Faster implementation of alignment with some loss. This option is incompatible with nucleotide sequences. See FastAlignmentDistanceCalculator.
  - any instance of DistanceCalculator.
cutoff (Optional[int] (default: None)) – All distances > cutoff will be replaced by 0 and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric; you can find information in the corresponding docs. If set to None, the cutoff will be 10 for the alignment and fastalignment metrics, and 2 for levenshtein and hamming. For the identity metric, the cutoff is ignored and always set to 0. A cutoff of 0 implies the identity metric.
n_jobs (int (default: -1)) – Number of CPU cores to use when running a DistanceCalculator that supports parallelization.
kwargs – Additional parameters passed to the DistanceCalculator.
- Return type:
- Returns:
Symmetrical, sparse pairwise distance matrix.
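To make the shape of the result concrete, the following pure-Python sketch mimics the documented behaviour on a tiny input. It computes Levenshtein distances once per pair of unique sequences, applies the default cutoff of 2 described for levenshtein, and stores the offset values in a dictionary that stands in for the sparse matrix. All names here (`DEFAULT_CUTOFFS`, `levenshtein`, `toy_sequence_dist`) are assumptions made for illustration. This is not scirpy's implementation, and the real function returns a sparse matrix rather than a dict.

```python
from itertools import product

# Default cutoffs as described in the parameter docs (used when cutoff is None).
DEFAULT_CUTOFFS = {
    "alignment": 10,
    "fastalignment": 10,
    "levenshtein": 2,
    "hamming": 2,
    "identity": 0,
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def toy_sequence_dist(seqs, cutoff=None):
    """Return {(row, col): d + 1} for all pairs with d <= cutoff (Levenshtein only)."""
    if cutoff is None:
        cutoff = DEFAULT_CUTOFFS["levenshtein"]
    unique = sorted(set(seqs))
    # Distances are computed once per pair of unique sequences ...
    unique_dist = {(u, v): levenshtein(u, v) for u, v in product(unique, repeat=2)}
    result = {}
    # ... and then expanded back to the positions of the original input.
    for (i, s), (j, t) in product(enumerate(seqs), repeat=2):
        d = unique_dist[(s, t)]
        if d <= cutoff:
            result[(i, j)] = d + 1  # offset by 1; pairs above the cutoff are left out
    return result


seqs = ["CASSF", "CASSF", "CASSL", "CAWWWWF"]
mat = toy_sequence_dist(seqs)
```

In this toy input the duplicated sequence at positions 0 and 1 produces a fully populated 2 x 2 block, which is a small-scale version of the dense blocks mentioned above for heavily duplicated inputs.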