scirpy.tl.mutational_load

scirpy.tl.mutational_load(adata, *, regions=('full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3'), airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', sequence_key='sequence_alignment', germline_key='germline_alignment_d_mask', junction_key='junction', ignore_chars=('.', 'N'))

Calculates absolute and relative mutational load of receptor sequences based on germline alignment.

Receptor sequences MUST be IMGT-aligned and the corresponding germline sequence MUST be available (see the sequence_key and germline_key parameters).

IMGT alignments can be obtained through the interoperability with Dandelion.

Region boundaries are implemented as described in the shazam documentation, which follows the IMGT unique numbering scheme.

Parameters:
  • adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.

  • regions (Sequence[Literal['full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3']] (default: ('full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3'))) –

    Specify for which regions to calculate the mutational load. By default, calculate it for all regions. The segments follow the definition described in the shazam documentation.

    • full: The full sequence, without division into sub-regions.

    • v: The V segment only (positions 1 to 312).

    • fwr1: Positions 1 to 78.

    • cdr1: Positions 79 to 114.

    • fwr2: Positions 115 to 165.

    • cdr2: Positions 166 to 195.

    • fwr3: Positions 196 to 312.

    • cdr3: Positions 313 to (313 + juncLength - 6) since the junction sequence includes (on the left) the last codon from FWR3 and (on the right) the first codon from FWR4.

    • fwr4: Positions (313 + juncLength - 6 + 1) to the end of the sequence.

  • airr_mod (default: 'airr') – Name of the modality in the MuData object under which the AIRR information is stored. If an AnnData object is passed to the function, this parameter is ignored.

  • airr_key (default: 'airr') – Key under which the AIRR information is stored in adata.obsm as an awkward array.

  • chain_idx_key (default: 'chain_indices') – Key to select chain indices.

  • sequence_key (str (default: 'sequence_alignment')) – Awkward array key to access sequence alignment information. The sequence must be IMGT-aligned.

  • germline_key (str (default: 'germline_alignment_d_mask')) – Awkward array key to access germline alignment information. This must be the IMGT germline reference. It is recommended to mask the D-segment with N's (see Yaari et al. (2015)).

  • junction_key (str (default: 'junction')) – Awkward array key to access the nucleotide junction sequence. This information is required to obtain the junction length required to calculate the coordinates of the cdr3 and fwr4 regions.

  • ignore_chars (Sequence[str] (default: ('.', 'N'))) –

    A list of characters to ignore while calculating differences. The default is to ignore the following:

    • "N": masked or degraded nucleotide. For instance, it is recommended to mask the D-segment because of its lower sequence quality.

    • ".": “IMGT gaps”, distinct from “normal” gaps ("-"). It is beneficial to ignore these because sequence alignments are sometimes “clipped” at the beginning, which would otherwise inflate the mutation count.
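The region coordinates and the role of ignore_chars described above can be illustrated with a small pure-Python sketch. This is not scirpy's implementation: the helper names region_bounds, region_slice and mutation_load are hypothetical, the coordinates are copied from the list above as 1-based inclusive positions, and computing the frequency as mismatches divided by the number of compared (non-ignored) positions is an assumption made for illustration.

```python
def region_bounds(junction_length, sequence_length):
    """Return 1-based, inclusive (start, end) positions per region.

    Hypothetical helper; coordinates follow the parameter description above.
    """
    cdr3_end = 313 + junction_length - 6
    return {
        "full": (1, sequence_length),
        "v": (1, 312),
        "fwr1": (1, 78),
        "cdr1": (79, 114),
        "fwr2": (115, 165),
        "cdr2": (166, 195),
        "fwr3": (196, 312),
        "cdr3": (313, cdr3_end),
        "fwr4": (cdr3_end + 1, sequence_length),
    }


def region_slice(sequence, bounds):
    """Cut a region out of an aligned sequence using 1-based inclusive bounds."""
    start, end = bounds
    return sequence[start - 1:end]


def mutation_load(sequence, germline, ignore_chars=(".", "N")):
    """Count mismatches between two aligned strings of equal length.

    Positions where either string holds an ignored character are skipped.
    Assumption: frequency = mismatches / number of compared positions.
    """
    compared = 0
    mismatches = 0
    for seq_char, germ_char in zip(sequence, germline):
        if seq_char in ignore_chars or germ_char in ignore_chars:
            continue
        compared += 1
        if seq_char != germ_char:
            mismatches += 1
    frequency = mismatches / compared if compared else float("nan")
    return mismatches, frequency


# Toy example: the first two positions are ignored ("." and "N"),
# leaving four compared positions with one mismatch (G vs. T).
count, freq = mutation_load("..ACGT", "NNACTT")
```

In this toy case, skipping the ignored positions keeps a clipped alignment start or a masked D-segment from being counted as mutations.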

Return type:

None

Returns:

A value for each chain is stored in the awkward array used as input (typically adata.obsm["airr"]) under the keys "{region}_mutation_count" and "{region}_mutation_freq" for each region specified in the regions parameter. The mutational load for the "full" region is stored in mutation_count and mutation_freq, respectively (i.e. without the {region} prefix). Use scirpy.get.airr() to retrieve the values as a DataFrame.
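The key naming rule above can be spelled out with a short sketch. The helper output_keys is hypothetical and not part of scirpy; it only reproduces the naming convention stated in this section, where "full" is stored without a region prefix.

```python
def output_keys(regions):
    """List the awkward-array keys written for the given regions.

    Hypothetical helper that mirrors the naming rule described above.
    """
    keys = []
    for region in regions:
        # The "full" region is stored without a prefix.
        prefix = "" if region == "full" else f"{region}_"
        keys.append(f"{prefix}mutation_count")
        keys.append(f"{prefix}mutation_freq")
    return keys


keys = output_keys(["full", "cdr3"])
```

Names built this way are the ones to request from scirpy.get.airr() when retrieving the values as a DataFrame.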