scirpy.ir_dist.sequence_dist


scirpy.ir_dist.sequence_dist(seqs, seqs2=None, *, metric='identity', cutoff=None, n_jobs=-1, **kwargs)

Calculate a sequence x sequence distance matrix.

Calculates the full pairwise distance matrix.

Important

  • Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).

  • That means, a distance > cutoff is represented as 0, a distance == 0 is represented as 1, a distance == 1 is represented as 2 and so on.

  • Only returns distances <= cutoff. Larger distances are eliminated from the sparse matrix.

  • Distances are non-negative.
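The offset convention above can be illustrated with a minimal, pure-Python sketch. It is not scirpy's implementation: the helpers `hamming`, `encode_offset` and `decode_offset` are hypothetical names, and a plain dictionary stands in for the sparse matrix (a missing key plays the role of an implicit 0).

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def encode_offset(seqs, cutoff):
    """Store d + 1 for every pair with d <= cutoff; omit all other pairs."""
    stored = {}
    for (i, a), (j, b) in product(enumerate(seqs), repeat=2):
        d = hamming(a, b)
        if d <= cutoff:
            stored[(i, j)] = d + 1
    return stored


def decode_offset(stored, i, j):
    """Return the true distance, or None if the pair exceeded the cutoff."""
    value = stored.get((i, j), 0)  # missing entry == implicit 0
    return None if value == 0 else value - 1


seqs = ["AAA", "AAC", "TTT"]  # toy sequences, for illustration only
stored = encode_offset(seqs, cutoff=1)

print(decode_offset(stored, 0, 0))  # 0    (stored as 1)
print(decode_offset(stored, 0, 1))  # 1    (stored as 2)
print(decode_offset(stored, 0, 2))  # None (distance 3 > cutoff, not stored)
```

The same rule applies when reading a real result: subtract 1 from every stored value, and treat an absent entry as "further apart than the cutoff" rather than as a distance of 0.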

When seqs or seqs2 includes non-unique values, the function internally uses only unique sequences to calculate the distances. Note that, if the input arrays contain large numbers of duplicated values (i.e. hundreds each), this will lead to large “dense” blocks in the sparse matrix. This will result in slow processing and high memory usage.
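To see why duplicates are costly, the following sketch counts how many entries would have to be stored for the identity metric. It assumes that the result contains one row and one column per input entry, as the remark about dense blocks suggests. The helper `stored_entries_identity` and the example sequences are made up for illustration.

```python
from collections import Counter


def stored_entries_identity(seqs):
    """Count stored (nonzero) entries when only identical sequences match."""
    counts = Counter(seqs)
    # Every pair of entries sharing a sequence yields one stored value,
    # so a sequence seen n times contributes an n x n dense block.
    return sum(n * n for n in counts.values())


unique_only = ["CASSLGTDTQYF", "CASSPGQGNYGYTF", "CAWSVGGETQYF"]
with_duplicates = ["CASSLGTDTQYF"] * 300 + ["CASSPGQGNYGYTF", "CAWSVGGETQYF"]

print(stored_entries_identity(unique_only))      # 3
print(stored_entries_identity(with_duplicates))  # 90002
```

Under that assumption, passing only the unique sequences and mapping results back to the original entries afterwards keeps the matrix small.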

Parameters:
  • seqs (Sequence[str]) – Numpy array of nucleotide or amino acid sequences. Note that not all distance metrics support nucleotide sequences.

  • seqs2 (Optional[Sequence[str]] (default: None)) – Second array of sequences. When omitted, sequence_dist computes the square matrix of the unique sequences in seqs.

  • metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')) –

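    You can choose one of the following metrics: 'alignment', 'fastalignment', 'identity', 'levenshtein' or 'hamming'. Alternatively, pass a DistanceCalculator instance.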

  • cutoff (Optional[int] (default: None)) – All distances > cutoff will be replaced by 0 and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric; see the documentation of the corresponding metric for guidance. If set to None, the cutoff will be 10 for the alignment and fastalignment metrics, and 2 for levenshtein and hamming. For the identity metric, the cutoff is ignored and always set to 0. A cutoff of 0 implies the identity metric.

  • n_jobs (int (default: -1)) –

    Number of CPU cores to use when running a DistanceCalculator that supports parallelization.

  • kwargs – Additional parameters passed to the DistanceCalculator.
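The default-cutoff rules described for the cutoff parameter can be summarised in a few lines. This is a sketch of the documented behaviour only, not the library's code: `resolve_cutoff` and `DEFAULT_CUTOFFS` are hypothetical names, and custom DistanceCalculator instances are not covered.

```python
from typing import Optional

# Defaults as stated in the parameter description above.
DEFAULT_CUTOFFS = {
    "alignment": 10,
    "fastalignment": 10,
    "levenshtein": 2,
    "hamming": 2,
}


def resolve_cutoff(metric: str, cutoff: Optional[int] = None) -> int:
    """Return the cutoff that would apply for a named metric."""
    if metric == "identity":
        return 0  # the cutoff is ignored for the identity metric
    if cutoff is None:
        return DEFAULT_CUTOFFS[metric]
    return cutoff


print(resolve_cutoff("alignment"))               # 10
print(resolve_cutoff("hamming"))                 # 2
print(resolve_cutoff("levenshtein", cutoff=1))   # 1
print(resolve_cutoff("identity", cutoff=5))      # 0
```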

Return type:

csr_matrix

Returns:

Symmetrical, sparse pairwise distance matrix.