scirpy.pp.ir_dist

Contents

scirpy.pp.ir_dist#

scirpy.pp.ir_dist(adata, reference=None, *, metric='identity', cutoff=None, sequence='nt', key_added=None, inplace=True, n_jobs=-1, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices')#

Computes a sequence-distance metric between all unique VJ CDR3 sequences and between all unique VDJ CDR3 sequences.

This is a required proprocessing step for clonotype definition and clonotype networks and for querying reference databases.

Calculates the full pairwise distance matrix.

Important

  • Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).

  • That means, a distance > cutoff is represented as 0, a distance == 0 is represented as 1, a distance == 1 is represented as 2 and so on.

  • Only returns distances <= cutoff. Larger distances are eliminated from the sparse matrix.

  • Distances are non-negative.

Parameters:
  • adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.

  • reference (Union[AnnData, MuData, DataHandler, None] (default: None)) – Another AnnData object, can be either a second dataset with IR information or a epitope database. If specified, will compute distances between the sequences in adata and the sequences in reference. Otherwise computes pairwise distances of the sequences in adata.

  • metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')) –

    You can choose one of the following metrics:

  • cutoff (Optional[int] (default: None)) – All distances > cutoff will be replaced by 0 and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric, you can find information in the corresponding docs. If set to None, the cutoff will be 10 for the alignment and fastalignment metric, and 2 for levenshtein and hamming. For the identity metric, the cutoff is ignored and always set to 0.

  • sequence (Literal['aa', 'nt'] (default: 'nt')) – Compute distances based on amino acid (aa) or nucleotide (nt) sequences.

  • key_added (Optional[str] (default: None)) – Dictionary key under which the results will be stored in adata.uns if inplace=True. Defaults to ir_dist_{sequence}_{metric} or ir_dist_{name}_{sequence}_{metric} if reference is specified. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.

  • inplace (bool (default: True)) – If true, store the result in adata.uns. Otherwise return a dictionary with the results.

  • n_jobs (Optional[int] (default: -1)) – Number of cores to use for distance calculation. Passed on to scirpy.ir_dist.metrics.DistanceCalculator. joblib.Parallel is used internally. Via the joblib.parallel_config context manager, you can set another backend (e.g. dask) and adjust other configuration options.

  • airr_mod (str (default: 'airr')) – Name of the modality with AIRR information is stored in the MuData object. if an AnnData object is passed to the function, this parameter is ignored.

  • airr_key (str (default: 'airr')) – Key under which the AIRR information is stored in adata.obsm as an awkward array.

  • chain_idx_key (str (default: 'chain_indices')) – Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.

  • airr_mod_ref (str (default: 'airr')) – Like airr_mod, but for reference.

  • airr_key_ref (str (default: 'airr')) – Like airr_key, but for reference.

  • chain_idx_key_ref (str (default: 'chain_indices')) – Like chain_idx_key, but for reference.

Return type:

Optional[dict]

Returns:

Depending on the value of inplace either returns nothing or a dictionary with sparse, pairwise distance matrices for all VJ and VDJ sequences.