scirpy.ir_dist.metrics.AlignmentDistanceCalculator

class scirpy.ir_dist.metrics.AlignmentDistanceCalculator(**kwargs)

Calculates distance between sequences based on pairwise sequence alignment.

The distance between two sequences is defined as \(S_{1,2}^{max} - S_{1,2}\), where \(S_{1,2}\) is the alignment score of sequences 1 and 2 and \(S_{1,2}^{max}\) is the maximum achievable alignment score of sequences 1 and 2. \(S_{1,2}^{max}\) is defined as \(\min(S_{1,1}, S_{2,2})\), i.e. the smaller of the two self-alignment scores.

The use of alignment-based distances is heavily inspired by [DFGH+17].

High-performance sequence alignments are calculated leveraging the parasail library ([Dai16]).

Choosing a cutoff:

Alignment distances need to be interpreted in light of the substitution matrix. The alignment distance is the difference between the actual alignment score and the maximum achievable alignment score. For instance, a mutation from Leucine (L) to Isoleucine (I) results in a BLOSUM62 score of 2, whereas an L aligned with an L achieves a score of 4. The distance is, therefore, 4 - 2 = 2.

On the other hand, a single Tryptophan (W) mutating into, e.g., Proline (P) already results in a distance of 15.

Empirical data on the distance up to which a CDR3 sequence is still likely to recognize the same antigen are still lacking, but reasonable cutoffs are <15.
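The Leucine/Isoleucine arithmetic above can be reproduced with a minimal sketch. It is not the scirpy or parasail implementation: it assumes equal-length sequences, scores them position by position without gaps, and uses a small hand-copied subset of BLOSUM62. The helper names (`subst_score`, `ungapped_score`, `alignment_distance`) are hypothetical and exist only for this illustration.

```python
# Hand-copied subset of BLOSUM62; only the pairs needed for this example.
BLOSUM62_SUBSET = {
    ("C", "C"): 9,
    ("A", "A"): 4,
    ("S", "S"): 4,
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("L", "I"): 2,
}


def subst_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(seq1, seq2):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(subst_score(a, b) for a, b in zip(seq1, seq2))


def alignment_distance(seq1, seq2):
    """Distance = min(S_11, S_22) - S_12, following the definition above."""
    s_max = min(ungapped_score(seq1, seq1), ungapped_score(seq2, seq2))
    return s_max - ungapped_score(seq1, seq2)


# A single L -> I substitution: both self-scores are 21, the pair scores 19.
print(alignment_distance("CASL", "CASI"))  # 2
```

A real alignment additionally considers gaps (see gap_open and gap_extend below), so distances between sequences of different lengths cannot be derived from this sketch.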

Parameters:
  • cutoff – Eliminates distances > cutoff to make efficient use of sparse matrices. The default cutoff is 10.

  • n_jobs – Number of jobs to use for the pairwise distance calculation, passed to joblib.Parallel. If -1, use all CPUs (only for ParallelDistanceCalculators). Via the joblib.parallel_config context manager, another backend (e.g. dask) can be selected.

  • block_size – Deprecated. This is now set in calc_dist_mat.

  • subst_mat – Name of parasail substitution matrix

  • gap_open – Gap open penalty

  • gap_extend – Gap extend penalty
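To illustrate why the cutoff parameter helps with sparsity, here is a minimal sketch that drops every pair whose distance exceeds the cutoff. It uses a plain dictionary keyed by index pairs as a stand-in for a sparse matrix and does not reflect how scirpy encodes entries internally. The function `apply_cutoff` and the example distances are hypothetical.

```python
def apply_cutoff(distances, cutoff):
    """Keep only the pairs whose distance does not exceed `cutoff`.

    `distances` maps (i, j) index pairs to integer distances. Pairs above
    the cutoff are dropped, so they take no space in the sparse result.
    """
    return {pair: d for pair, d in distances.items() if d <= cutoff}


# Hypothetical distances between three sequences (upper triangle only).
all_pairs = {(0, 1): 2, (0, 2): 15, (1, 2): 11}

print(apply_cutoff(all_pairs, cutoff=10))  # {(0, 1): 2}
```

The lower the cutoff, the fewer entries need to be stored, which matters when comparing many thousands of unique sequences.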

Attributes table

DTYPE

The sparse matrix dtype.

Methods table

calc_dist_mat(seqs[, seqs2, block_size])

Calculate the distance matrix.

squarify(triangular_matrix)

Mirror a triangular matrix at the diagonal to make it a square matrix.

Attributes

AlignmentDistanceCalculator.DTYPE = 'uint8'

The sparse matrix dtype. Defaults to uint8, constraining the max distance to 255.

Methods

AlignmentDistanceCalculator.calc_dist_mat(seqs, seqs2=None, *, block_size=None)

Calculate the distance matrix.

See DistanceCalculator.calc_dist_mat().

Parameters:
  • seqs (Sequence[str]) – array containing CDR3 sequences. Must not contain duplicates.

  • seqs2 (Optional[Sequence[str]] (default: None)) – second array containing CDR3 sequences. Must not contain duplicates either.

  • block_size (Optional[int] (default: None)) – The width of a block that’s sent to a worker. A block contains block_size ** 2 elements. If None, the block size is determined automatically based on the problem size.

Return type:

csr_matrix

Returns:

Sparse pairwise distance matrix.
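Because seqs must not contain duplicates, it can help to deduplicate the input before calling calc_dist_mat. The sketch below shows one way to do this. The helper names (`unique_sequences`, `distances_for`) are hypothetical, and the commented-out usage assumes scirpy is installed and that the constructor accepts the keyword parameters documented above.

```python
def unique_sequences(seqs):
    """Drop duplicate sequences while keeping first-occurrence order."""
    return list(dict.fromkeys(seqs))


def distances_for(calculator, seqs):
    """Deduplicate `seqs`, then compute their pairwise distance matrix.

    `calculator` is any object with a `calc_dist_mat(seqs)` method.
    Returns the unique sequences together with the matrix, so that rows
    and columns can be mapped back to sequences.
    """
    unique = unique_sequences(seqs)
    return unique, calculator.calc_dist_mat(unique)


print(unique_sequences(["CASSLGTDTQYF", "CASSIGTDTQYF", "CASSLGTDTQYF"]))
# ['CASSLGTDTQYF', 'CASSIGTDTQYF']

# Intended use (requires scirpy; not executed here):
#   from scirpy.ir_dist.metrics import AlignmentDistanceCalculator
#   calc = AlignmentDistanceCalculator(cutoff=10, n_jobs=1)
#   unique, dist = distances_for(calc, cdr3_sequences)
```

Returning the deduplicated list alongside the matrix keeps the row and column order explicit, since row i of the result refers to unique[i] and not to the i-th input sequence.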

static AlignmentDistanceCalculator.squarify(triangular_matrix)

Mirror a triangular matrix at the diagonal to make it a square matrix.

The input matrix must be upper triangular to begin with, otherwise the results will be incorrect. No guard rails!

Return type:

csr_matrix
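The following sketch illustrates what mirroring at the diagonal means and why a non-triangular input gives incorrect results. It is not the scirpy implementation: it assumes the mirror is formed by adding the transposed off-diagonal entries, and it uses a dictionary keyed by (row, col) in place of a csr_matrix. The function name `squarify_sketch` is hypothetical.

```python
def squarify_sketch(upper):
    """Mirror an upper-triangular sparse matrix at the diagonal.

    `upper` maps (row, col) -> value. Off-diagonal entries are added at
    the mirrored position; diagonal entries are kept once.
    """
    square = dict(upper)
    for (i, j), value in upper.items():
        if i != j:
            # Add the mirrored entry; this sums if (j, i) already exists.
            square[(j, i)] = square.get((j, i), 0) + value
    return square


upper = {(0, 0): 1, (0, 1): 3, (1, 1): 1}
print(squarify_sketch(upper))
# {(0, 0): 1, (0, 1): 3, (1, 1): 1, (1, 0): 3}

# A non-triangular input is not rejected; both off-diagonal entries
# silently become 3 + 5 = 8.
not_triangular = {(0, 1): 3, (1, 0): 5}
print(squarify_sketch(not_triangular))
# {(0, 1): 8, (1, 0): 8}
```

Under this assumption, entries below the diagonal are summed with their mirrored counterparts instead of being overwritten, which is one way the "incorrect results" mentioned above can arise.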