scirpy.tl.mutational_load
scirpy.tl.mutational_load(adata, *, regions=('full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3'), airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', sequence_key='sequence_alignment', germline_key='germline_alignment_d_mask', junction_key='junction', ignore_chars=('.', 'N'))
Calculates absolute and relative mutational load of receptor sequences based on germline alignment.
Receptor sequences MUST be IMGT-aligned and the corresponding germline sequence MUST be available (see the sequence_key and germline_key parameters). IMGT alignments can be obtained by using the interoperability with Dandelion.
Region boundaries are implemented as described in the shazam documentation, which follows the IMGT unique numbering scheme.
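The region boundaries listed under the regions parameter can be sketched as a small lookup. This is a minimal illustration, not scirpy's implementation: the helper name `region_bounds` and its `sequence_length` argument are hypothetical, and the positions are read as 1-based and inclusive, which is an assumption.

```python
def region_bounds(junction_length, sequence_length):
    """Return (start, end) positions per region for one IMGT-aligned sequence.

    Hypothetical helper for illustration only. Positions are treated as
    1-based and inclusive (an assumption), following the boundaries
    listed for the `regions` parameter.
    """
    # The junction includes the last codon of FWR3 and the first codon of
    # FWR4, hence the "- 6" when deriving the end of CDR3.
    cdr3_end = 313 + junction_length - 6
    return {
        "full": (1, sequence_length),
        "v": (1, 312),
        "fwr1": (1, 78),
        "cdr1": (79, 114),
        "fwr2": (115, 165),
        "cdr2": (166, 195),
        "fwr3": (196, 312),
        "cdr3": (313, cdr3_end),
        "fwr4": (cdr3_end + 1, sequence_length),
    }


# Example: a 48-nucleotide junction in a 400-position alignment.
bounds = region_bounds(junction_length=48, sequence_length=400)
print(bounds["cdr3"], bounds["fwr4"])
```

Only the cdr3 and fwr4 coordinates depend on the junction length, which is why junction_key is needed for those two regions.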
- Parameters:
adata (Union[AnnData, MuData, DataHandler]) – AnnData or MuData object that contains AIRR information.
regions (Sequence[Literal['full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3']] (default: ('full', 'v', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3'))) – Specify for which regions to calculate the mutational load. By default, calculate it for all regions. The segments follow the definition described in the shazam documentation.
  full: the full sequence without any sub-regions/divisions
  v: Only V_segment (Nucleotides 1 to 312)
  fwr1: Positions 1 to 78.
  cdr1: Positions 79 to 114.
  fwr2: Positions 115 to 165.
  cdr2: Positions 166 to 195.
  fwr3: Positions 196 to 312.
  cdr3: Positions 313 to (313 + juncLength - 6) since the junction sequence includes (on the left) the last codon from FWR3 and (on the right) the first codon from FWR4.
  fwr4: Positions (313 + juncLength - 6 + 1) to the end of the sequence.
airr_mod (default: 'airr') – Name of the modality in which AIRR information is stored in the MuData object. If an AnnData object is passed to the function, this parameter is ignored.
airr_key (default: 'airr') – Key under which the AIRR information is stored in adata.obsm as an awkward array.
chain_idx_key (default: 'chain_indices') – Key to select chain indices.
sequence_key (str (default: 'sequence_alignment')) – Awkward array key to access sequence alignment information. The sequence must be IMGT-aligned.
germline_key (str (default: 'germline_alignment_d_mask')) – Awkward array key to access germline alignment information. This must be the IMGT germline reference. It is recommended to mask the D-segment with Ns (see Yaari et al. (2015)).
junction_key (str (default: 'junction')) – Awkward array key to access the nucleotide junction sequence. The junction length is needed to calculate the coordinates of the cdr3 and fwr4 regions.
ignore_chars (Sequence[str] (default: ('.', 'N'))) – A list of characters to ignore while calculating differences. The default is to ignore the following:
  "N": masked or degraded nucleotide. For instance, it is recommended to mask the D-segment because of its lower sequence quality.
  ".": "IMGT gaps", distinct from "normal" gaps ("-"). It is beneficial to ignore these because sequence alignments are sometimes "clipped" at the beginning, which would otherwise inflate the mutation count.
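The effect of ignore_chars can be illustrated with a position-by-position comparison of an aligned sequence against its germline. This is a simplified sketch rather than scirpy's actual code: the function name `count_mutations` is hypothetical, and using the number of compared positions as the denominator of the frequency is an assumption.

```python
def count_mutations(sequence, germline, ignore_chars=(".", "N")):
    """Count mismatches between two aligned strings, skipping ignored characters.

    Hypothetical helper for illustration. Returns (count, frequency), where
    the frequency is mismatches divided by the number of positions actually
    compared (an assumption about the denominator).
    """
    compared = 0
    mutations = 0
    for seq_char, germ_char in zip(sequence, germline):
        # Skip positions where either side is an ignored character,
        # e.g. an IMGT gap "." or a masked nucleotide "N".
        if seq_char in ignore_chars or germ_char in ignore_chars:
            continue
        compared += 1
        if seq_char != germ_char:
            mutations += 1
    frequency = mutations / compared if compared else 0.0
    return mutations, frequency


germline = "CAGGTGCAGNNN"  # trailing Ns mimic a masked D-segment
sequence = "...GTGAAGCTG"  # leading dots mimic a clipped alignment

count, freq = count_mutations(sequence, germline)
naive_count, _ = count_mutations(sequence, germline, ignore_chars=())
print(count, naive_count)
```

With the default ignore_chars, only the single C-to-A substitution is counted. Without them, the clipped start and the masked positions are counted as mismatches too, which inflates the result.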
- Return type:
- Returns:
A value for each chain is stored in the awkward array used as input (typically adata.obsm["airr"]) under the keys "{region}_mutation_count" and "{region}_mutation_freq" for each region specified in the regions parameter. The mutational load for the "full" region is stored in mutation_count and mutation_freq, respectively (i.e. without the {region} prefix). Use scirpy.get.airr() to retrieve the values as a DataFrame.
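The naming rule for the stored keys can be written out as a short helper. The function `output_keys` is hypothetical and only mirrors the convention described above; it does not call scirpy.

```python
def output_keys(regions):
    """List the awkward-array keys expected for the given regions.

    Hypothetical helper for illustration: the "full" region has no prefix,
    every other region is prefixed with "{region}_".
    """
    keys = []
    for region in regions:
        prefix = "" if region == "full" else f"{region}_"
        keys.append(f"{prefix}mutation_count")
        keys.append(f"{prefix}mutation_freq")
    return keys


keys = output_keys(["full", "v", "cdr3"])
print(keys)
```

Each of these names is a variable that could then be requested through scirpy.get.airr().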