scirpy.tl.ir_query


scirpy.tl.ir_query(adata, reference, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, match_columns=None, key_added=None, distance_key=None, inplace=True, n_jobs=-1, chunksize=2000, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices')

Query a reference database for matching immune cell receptors.

Warning

This is an experimental function that may change in the future.

The reference database can either be an immune cell receptor database, or simply another scRNA-seq dataset with some annotations in .obs. This function maps all cells to all matching entries from the reference.

Requires running ir_dist() with the same values for reference, sequence and metric first.

This function is essentially an extension of define_clonotype_clusters() to two AnnData objects and follows the same logic:

Clonotypes (or clonotype clusters) are defined roughly as follows:
  1. Create a list of unique receptor configurations. This collapses heavily expanded clonotypes, which produce many cells with identical CDR3 sequences, into a single entry.

  2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
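
The first step above can be illustrated with a minimal sketch. It is not scirpy's implementation: the function name `collapse_receptor_configurations`, the representation of a receptor configuration as a `(VJ CDR3, VDJ CDR3)` tuple, and the example sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse_receptor_configurations(cells):
    """Group cell names by receptor configuration (hypothetical helper).

    `cells` maps a cell name to a hashable receptor configuration. Here a
    configuration is a (VJ CDR3, VDJ CDR3) tuple, which is an assumption
    made for this sketch only.
    """
    groups = defaultdict(list)
    for cell_name, configuration in cells.items():
        groups[configuration].append(cell_name)
    return dict(groups)


# Three cells from one expanded clonotype plus one unrelated cell.
cells = {
    "cell_1": ("CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    "cell_2": ("CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    "cell_3": ("CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    "cell_4": ("CAASGGSYIPTF", "CASSPTGGYEQYF"),
}
unique_configurations = collapse_receptor_configurations(cells)
```

Distances then only need to be computed between the two unique configurations rather than between all four cells.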

Parameters:
  • adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.

  • reference (Union[AnnData, MuData, DataHandler]) – Another AnnData object, either a second dataset with IR information or an epitope database. Must be the same object used for running scirpy.pp.ir_dist().

  • sequence (Literal['aa', 'nt'] (default: 'aa')) – The sequence parameter used when running scirpy.pp.ir_dist()

  • metric (Union[Literal['alignment', 'fastalignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')) – The metric parameter used when running scirpy.pp.ir_dist()

  • receptor_arms (Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')) –

    One of the following options:
    • "VJ" - only consider VJ sequences

    • "VDJ" - only consider VDJ sequences

    • "all" - both VJ and VDJ need to match

    • "any" - either VJ or VDJ needs to match

    If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any") the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.

  • dual_ir (Literal['any', 'primary_only', 'all'] (default: 'any')) –

    One of the following options:
    • "primary_only" - only consider most abundant pair of VJ/VDJ chains

    • "any" - consider both pairs of VJ/VDJ sequences. Distance must be below cutoff for any of the chains.

    • "all" - consider both pairs of VJ/VDJ sequences. Distance must be below cutoff for all of the chains.

    Distances are combined as for receptor_arms.

    See also Dual IR.

  • same_v_gene (bool (default: False)) –

    Requires clonotypes to have the same V genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.

    V genes are matched based on the behaviour defined with receptor_arms and dual_ir.

  • match_columns (Union[Sequence[str], str, None] (default: None)) – One or multiple columns in adata.obs that must match between query and reference. Use this to e.g. enforce matching cell-types or HLA-types.

  • key_added (Optional[str] (default: None)) – Dictionary key under which the resulting distance matrix will be stored in adata.uns if inplace=True. Defaults to ir_query_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.

  • distance_key (Optional[str] (default: None)) – Key in adata.uns where the results of ir_dist() are stored. Defaults to ir_dist_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, distance_key needs to be specified manually.

  • inplace (bool (default: True)) – If True, store the result in adata.uns. Otherwise return a dictionary with the results.

  • n_jobs (Optional[int] (default: -1)) – Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize a single worker thread will be used to avoid overhead.

  • chunksize (int (default: 2000)) – Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progressbar, but more overhead.

  • airr_mod (str (default: 'airr')) – Name of the modality in which the AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.

  • airr_key (str (default: 'airr')) – Key under which the AIRR information is stored in adata.obsm as an awkward array.

  • chain_idx_key (str (default: 'chain_indices')) – Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.

  • airr_mod_ref (str (default: 'airr')) – Like airr_mod, but for reference.

  • airr_key_ref (str (default: 'airr')) – Like airr_key, but for reference.

  • chain_idx_key_ref (str (default: 'chain_indices')) – Like chain_idx_key, but for reference.
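
The minimum/maximum rule described for receptor_arms (and reused for dual_ir) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `combine_arm_distances` and `is_match` are hypothetical helpers, the distances are plain numbers, and the strict `<` comparison follows the "smaller than a certain cutoff" wording above (whether the boundary is inclusive is not specified here).

```python
def combine_arm_distances(vj_distance, vdj_distance, receptor_arms="all"):
    """Combine the VJ and VDJ distances of one receptor pair (hypothetical helper)."""
    if receptor_arms == "VJ":
        return vj_distance
    if receptor_arms == "VDJ":
        return vdj_distance
    if receptor_arms == "any":
        # One matching arm is enough, so the smaller distance decides.
        return min(vj_distance, vdj_distance)
    if receptor_arms == "all":
        # Both arms must match, so the larger distance decides.
        return max(vj_distance, vdj_distance)
    raise ValueError(f"Unknown value for receptor_arms: {receptor_arms!r}")


def is_match(distance, cutoff):
    """Assumption for this sketch: a match requires distance < cutoff."""
    return distance < cutoff


# Example values are made up: the VJ arm is close, the VDJ arm is far.
vj_distance, vdj_distance, cutoff = 2, 12, 10

match_any = is_match(combine_arm_distances(vj_distance, vdj_distance, "any"), cutoff)
match_all = is_match(combine_arm_distances(vj_distance, vdj_distance, "all"), cutoff)
```

With these values the pair counts as a match under "any" (the combined distance is 2) but not under "all" (the combined distance is 12).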

Return type:

Optional[dict]

Returns:

A dictionary containing
  • distances: A sparse distance matrix between unique receptor configurations in adata and unique receptor configurations in reference.

  • cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.

  • cell_indices_reference: A dict of arrays, containing the reference.obs_names for each column in the distance matrix.

If inplace is True, this is added to adata.uns[key_added].
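
To show how the three entries relate, the sketch below expands matches between unique receptor configurations into matches between individual cells. It uses a toy stand-in for the result: the sparse matrix is replaced by a plain `{(row, column): distance}` dict, the dict keys of cell_indices are assumed to be row numbers stored as strings, and `expand_to_cell_pairs` is a hypothetical helper rather than part of scirpy.

```python
def expand_to_cell_pairs(result):
    """Turn configuration-level matches into (query cell, reference cell, distance) tuples."""
    pairs = []
    for (row, column), distance in sorted(result["distances"].items()):
        # Every cell behind a row matches every cell behind the column.
        for query_cell in result["cell_indices"][str(row)]:
            for reference_cell in result["cell_indices_reference"][str(column)]:
                pairs.append((query_cell, reference_cell, distance))
    return pairs


# Toy result with made-up cell names and distances.
result = {
    # Stand-in for the sparse matrix: only stored (matching) entries are listed.
    "distances": {(0, 0): 1, (1, 1): 2},
    # Row 0 represents two query cells sharing one receptor configuration.
    "cell_indices": {"0": ["query_a", "query_b"], "1": ["query_c"]},
    # Column 1 represents two reference entries sharing one configuration.
    "cell_indices_reference": {"0": ["ref_x"], "1": ["ref_y", "ref_z"]},
}
cell_pairs = expand_to_cell_pairs(result)
```

In this toy example two stored matrix entries expand to four cell-level matches, which reflects the statement above that all cells are mapped to all matching reference entries.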