scirpy.pp.ir_dist


scirpy.pp.ir_dist(adata, reference=None, *, metric='identity', cutoff=None, sequence='nt', key_added=None, inplace=True, n_jobs=None, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices')

Computes a sequence-distance metric between all unique VJ CDR3 sequences and between all unique VDJ CDR3 sequences.

This is a required preprocessing step for clonotype definition, for clonotype networks, and for querying reference databases.

Calculates the full pairwise distance matrix.

Important

  • Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).

  • That means a distance > cutoff is represented as 0, a distance == 0 is represented as 1, a distance == 1 is represented as 2, and so on.

  • Only returns distances <= cutoff. Larger distances are eliminated from the sparse matrix.

  • Distances are non-negative.
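
The offset encoding can be illustrated with a small, self-contained sketch. The helper names `encode_distance` and `decode_distance` are hypothetical and not part of scirpy; they only mirror the rules listed above.

```python
from typing import Optional


def encode_distance(d: int, cutoff: int) -> int:
    """Map a raw distance to its stored value (hypothetical helper, not scirpy API)."""
    if d < 0:
        raise ValueError("distances are non-negative")
    # Distances above the cutoff become 0, i.e. they are absent from the sparse matrix.
    return d + 1 if d <= cutoff else 0


def decode_distance(stored: int) -> Optional[int]:
    """Recover the raw distance; None means 'larger than the cutoff'."""
    return stored - 1 if stored > 0 else None


# With cutoff=2, raw distances 0, 1, 2 are stored as 1, 2, 3; a raw distance of 3 is dropped.
for raw in (0, 1, 2, 3):
    print(raw, "->", encode_distance(raw, cutoff=2))
```

When reading a stored matrix, an entry of 1 therefore means identical sequences, while a missing (zero) entry means the pair is further apart than the cutoff.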

Parameters:
  • adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.

  • reference (Union[AnnData, MuData, DataHandler, None] (default: None)) – Another AnnData object, either a second dataset with IR information or an epitope database. If specified, distances are computed between the sequences in adata and the sequences in reference. Otherwise, pairwise distances among the sequences in adata are computed.

  • metric (Union[Literal['alignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')) –

    You can choose one of the following metrics:

  • cutoff (Optional[int] (default: None)) – All distances > cutoff will be replaced by 0 and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric; see the documentation of the corresponding metric for guidance. If set to None, the cutoff will be 10 for the alignment metric, and 2 for levenshtein and hamming. For the identity metric, the cutoff is ignored and always set to 0.

  • sequence (Literal['aa', 'nt'] (default: 'nt')) – Compute distances based on amino acid (aa) or nucleotide (nt) sequences.

  • key_added (Optional[str] (default: None)) – Dictionary key under which the results will be stored in adata.uns if inplace=True. Defaults to ir_dist_{sequence}_{metric} or ir_dist_{name}_{sequence}_{metric} if reference is specified. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.

  • inplace (bool (default: True)) – If True, store the result in adata.uns. Otherwise, return a dictionary with the results.

  • n_jobs (Optional[int] (default: None)) – Number of cores to use for distance calculation. Passed on to scirpy.ir_dist.metrics.DistanceCalculator.

  • airr_mod (str (default: 'airr')) – Name of the modality under which AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.

  • airr_key (str (default: 'airr')) – Key under which the AIRR information is stored in adata.obsm as an awkward array.

  • chain_idx_key (str (default: 'chain_indices')) – Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.

  • airr_mod_ref (str (default: 'airr')) – Like airr_mod, but for reference.

  • airr_key_ref (str (default: 'airr')) – Like airr_key, but for reference.

  • chain_idx_key_ref (str (default: 'chain_indices')) – Like chain_idx_key, but for reference.
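
The default rules described for cutoff and key_added can be summarized in a short sketch. The functions `default_cutoff` and `default_key` are hypothetical illustrations of the documented behavior, not scirpy functions, and the database name used in the example is a placeholder.

```python
from typing import Optional


def default_cutoff(metric: str, cutoff: Optional[int] = None) -> int:
    """Resolve the cutoff as described for the `cutoff` parameter (illustrative only)."""
    if metric == "identity":
        return 0  # the cutoff is ignored for the identity metric
    if cutoff is not None:
        return cutoff
    if metric == "alignment":
        return 10
    if metric in ("levenshtein", "hamming"):
        return 2
    raise ValueError(f"no documented default cutoff for metric {metric!r}")


def default_key(sequence: str, metric: str, name: Optional[str] = None) -> str:
    """Build the documented default for `key_added` (illustrative only)."""
    if name is None:
        # No reference given: ir_dist_{sequence}_{metric}
        return f"ir_dist_{sequence}_{metric}"
    # Reference given: {name} comes from reference.uns["DB"]["name"]
    return f"ir_dist_{name}_{sequence}_{metric}"


print(default_cutoff("alignment"))                # 10
print(default_key("nt", "identity"))              # ir_dist_nt_identity
print(default_key("aa", "alignment", "some_db"))  # ir_dist_some_db_aa_alignment
```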

Return type:

Optional[dict]

Returns:

Depending on the value of inplace, returns either nothing or a dictionary with sparse pairwise distance matrices for all VJ and VDJ sequences.
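
A minimal usage sketch, assuming scirpy is installed and `adata` already contains AIRR information. The wrapper name `aa_levenshtein_distances` is hypothetical; it uses only keyword arguments that appear in the signature above. The layout of the returned dictionary is not detailed on this page, so the sketch simply returns it for inspection.

```python
def aa_levenshtein_distances(adata):
    """Return CDR3 amino-acid distances without modifying `adata` (hypothetical wrapper)."""
    import scirpy as ir  # imported lazily; requires scirpy to be installed

    # inplace=False returns the result dictionary instead of writing to adata.uns
    return ir.pp.ir_dist(
        adata,
        metric="levenshtein",
        sequence="aa",
        cutoff=2,
        inplace=False,
    )
```

Calling the function with `inplace=True` (the default) instead would store the same result in `adata.uns` under the key described for key_added.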