scirpy.ir_dist.metrics.FastAlignmentDistanceCalculator#
- class scirpy.ir_dist.metrics.FastAlignmentDistanceCalculator(cutoff=10, *, n_jobs=-1, block_size=None, subst_mat='blosum62', gap_open=11, gap_extend=11, estimated_penalty=None)#
Calculates distance between sequences based on pairwise sequence alignment.
The distance between two sequences is defined as \(S_{1,2}^{max} - S_{1,2}\), where \(S_{1,2}\) is the alignment score of sequences 1 and 2 and \(S_{1,2}^{max}\) is the max. achievable alignment score of sequences 1 and 2. \(S_{1,2}^{max}\) is defined as \(\min(S_{1,1}, S_{2,2})\).
The use of alignment-based distances is heavily inspired by [DFGH+17].
High-performance sequence alignments are calculated leveraging the parasail library ([Dai16]).
- To speed up the computation, we pre-filter sequence pairs based on
differences in sequence length
the number of different characters, based on an estimate of the mismatch penalty (
estimated_penalty
).
The filtering based on
estimated_penalty
is a heuristic and may lead to false negatives, i.e. sequence pairs that are actually below the distance cutoff, but are removed during pre-filtering. Sensible values forestimated_penalty
are depending on the substitution matrix, but higher values lead to a higher false negative rate.We provide default values for BLOSUM and PAM matrices. The provided default values were obtained by testing different alues on the Wu dataset ([WMdA+20]) and selecting those that provided a reasonable balance between speedup and loss. Loss stayed well under 10% in all our test cases with the default values, and speedup increases with the number of cells.
While the length-based filter is always active, the filter based on different characters can be disabled by setting the estimated penalty to zero. Using length-based filtering only, there won’t be any false negatives unless with a substitution matrix in which a substitution results in a higher score than the corresponding match. Using the length-based filter only results in a substancially reduced speedup compared to combining it with the
estimated_penalty
heuristic.- Choosing a cutoff:
Alignment distances need to be viewed in the light of the substitution matrix. The alignment distance is the difference between the actual alignment score and the max. achievable alignment score. For instance, a mutation from Leucine (
L
) to Isoleucine (I
) results in a BLOSUM62 score of2
. AnL
aligned withL
achieves a score of4
. The distance is, therefore,2
.On the other hand, a single Tryptophane (
W
) mutating into, e.g. Proline (P
) already results in a distance of15
.We are still lacking empirical data up to which distance a CDR3 sequence still is likely to recognize the same antigen, but reasonable cutoffs are
<15
.- Choosing an expected penalty:
The choice of an expected penalty is likely influenced by similar considerations as the other parameters. Essentially, this can be thought of as a superficial (dis)similarity measure. A higher value more strongly penalizes mismatching characters and is more in line with looking for closely related sequence pairs, while a lower value is more forgiving and better suited when looking for more distantly related sequence pairs.
- Parameters:
cutoff (
int
(default:10
)) – Will eleminate distances > cutoff to make efficient use of sparse matrices. The default cutoff is10
.n_jobs (
int
(default:-1
)) – Number of jobs to use for the pairwise distance calculation, passed tojoblib.Parallel
. If -1, use all CPUs (only for ParallelDistanceCalculators). Via thejoblib.parallel_config
context manager, another backend (e.g.dask
) can be selected.block_size (
Optional
[int
] (default:None
)) – Deprecated. This is now set incalc_dist_mat
.subst_mat (
str
(default:'blosum62'
)) – Name of parasail substitution matrixgap_open (
int
(default:11
)) – Gap open penaltygap_extend (
int
(default:11
)) – Gap extend penatlyestimated_penalty (
Optional
[float
] (default:None
)) – Estimate of the average mismatch penalty
Attributes table#
The sparse matrix dtype. |
Methods table#
|
Calculate the distance matrix. |
|
Mirror a triangular matrix at the diagonal to make it a square matrix. |
Attributes#
- FastAlignmentDistanceCalculator.DTYPE = 'uint8'#
The sparse matrix dtype. Defaults to uint8, constraining the max distance to 255.
Methods#
- FastAlignmentDistanceCalculator.calc_dist_mat(seqs, seqs2=None, *, block_size=None)#
Calculate the distance matrix.
See
DistanceCalculator.calc_dist_mat()
.- Parameters:
seqs (
Sequence
[str
]) – array containing CDR3 sequences. Must not contain duplicates.seqs2 (
Optional
[Sequence
[str
]] (default:None
)) – second array containing CDR3 sequences. Must not contain duplicates either.block_size (
Optional
[int
] (default:None
)) – The width of a block that’s sent to a worker. A block containsblock_size ** 2
elements. IfNone
the block size is determined automatically based on the problem size.
- Return type:
- Returns:
Sparse pairwise distance matrix.
- static FastAlignmentDistanceCalculator.squarify(triangular_matrix)#
Mirror a triangular matrix at the diagonal to make it a square matrix.
The input matrix must be upper triangular to begin with, otherwise the results will be incorrect. No guard rails!
- Return type: