BCR analysis in Scirpy#
# not part of the tutorial, but temporarily surpresses distracting warnings
import warnings
from numba import NumbaDeprecationWarning
warnings.simplefilter(action="ignore", category=(FutureWarning, NumbaDeprecationWarning))
# import all necessary python packages for this tutorial
import muon as mu
import scanpy as sc
import scirpy as ir
import numpy as np
import matplotlib.pyplot as plt
sc.set_figure_params(
figsize=(5, 5),
)
sc.settings.verbosity = 2 # verbosity: errors (0), warnings (1), info (2), hints (3)
In this tutorial, we guide the user through single cell BCR analysis with Scirpy
. The functionality in this notebook requires Scirpy v0.19 or later. Additionally, we leverage the interoperability with Dandelion [SPD+23] for certain preprocessing steps.
This tutorial uses a dataset from Stephenson et al. [SRB+21]. The original dataset consists of 143 samples and > 60k B cells from patients with COVID-19 in different degrees of severity and three control groups. To simplify and speed up the computation, we included cells from the five most abundant COVID-19 positive samples per status category and randomly subsampled down to 5k.
sc.logging.print_header()
scanpy==1.10.3 anndata==0.10.8 umap==0.5.6 numpy==2.0.2 scipy==1.14.1 pandas==2.2.3 scikit-learn==1.5.2 statsmodels==0.14.4 igraph==0.11.6 pynndescent==0.5.13
Pre-processing/Re-annotation#
Unlike TCR data, it is not recommended to directly use the output of Cell Ranger while analysing scBCR-data. Best practice is to reannotate Cell Ranger gene annotation with IgBlast or IMGT/HighV-QUEST, because Cell Ranger currently does not support the IMGT unique numbering scheme [LPommieR+03], which is heavily used by many downstream analysis tools e.g. phylogenetic analysis and mutation inference.
Our data used in this notebook is already reannotated and pre-processed and cells have already passed transcriptome quality control as described by the original authors. Reannotation of BCR data is outside the scope of Scirpy. Instead we refer the reader to one of the following workflows which all rely on the Immcantation suite.
Dandelion [SPD+23] is a Python library for single-cell BCR analysis that also supports re-annotation as showcased in their tutorial. Dandelion has some overlap with Scirpy functionality and is interoperable with Scirpy by using the
scirpy.io.from_dandelion()
andscirpy.io.to_dandelion()
functions that allow to easily switch to Scirpy after the reannotation step.nf-core/airrflow [GMB+24] is a Nextflow pipeline that can process both bulk and single-cell TCR and BCR-seq data that generates a wide range of outputs, including QC metrics and analysis reports. It may be convenient to first check the outputs of airrflow to get an overview of the data, and then follow up with Scirpy for more custom analyses. The pipeline produces reannotated BCR data in AIRR format that can be imported into Scirpy using
scirpy.io.read_airr()
.Alternatively, the Immcantation suite can be used directly as showcased in their tutorial. Similar to airrflow, reannotation will result in an AIRR-compilant
.tsv
file, which can be loaded into Scirpy usingscirpy.io.read_airr()
.
Germline reconstruction#
Germline sequences, primarily required for calculating mutations, can enrich your pre-processed data. Although Scirpy
does not currently support germline sequence inference, it’s recommended to leverage its interoperability with Dandelion
for this purpose, as detailed in this tutorial on their website. This tutorial will later cover mutation calculation. Note that germline sequences can also be useful for other applications, like phylogenetic analysis, which are beyond the current scope of Scirpy
.
Importing data#
Data import into Scirpy
from various formats is straightforward, as described in the dedicated tutorial on data loading. In this case, we directly load the demo dataset:
mdata = ir.datasets.stephenson2021_5k()
mdata
is a MuData
object, which is a framework for multimodal data analysis with a strong focus on multi-omics. It essentially builds on top of AnnData
and its general usage is outlined in Scirpy’s data structure.
mdata
MuData object with n_obs × n_vars = 5000 × 24929 2 modalities gex: 5000 x 24929 obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'full_clustering', 'initial_clustering', 'Resample', 'Collection_Day', 'Sex', 'Age_interval', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'patient_id' var: 'feature_types' obsm: 'X_pca', 'X_pca_harmony', 'X_umap' layers: 'raw' airr: 5000 x 0 obs: 'receptor_type', 'receptor_subtype', 'chain_pairing' uns: 'chain_indices', 'scirpy_version' obsm: 'airr', 'chain_indices'
BCR Quality Control#
Although many AIRR analysis tools recommend to only include single paired AIR sequences, there is increasing evidence that allelically and isotypically included B cells are quite common [ZPWY23]. During the QC process, the user can decide to filter or keep these cases as appropriate for their analysis.
The demo dataset used here is already pre-filtered, and filtering steps are shown for the sake of completeness below. Typically, we suggest to remove cells flagged as orphan
, because they cannot be utilized for downstream analyses that require full AIR sequence information. Similarly, cells flagged as multichain
are likely undetected doublets and should be excluded.
Note
Standard Scirpy preprocessing workflow
scirpy.pp.index_chains()
: create chain indices for the receptor datascirpy.tl.chain_qc()
: add a summary about the immune cell-receptor compositions to.obs
filter as desired
For more information regarding preprocessing and filtering, especially for transcriptomics data, please refer to the single cell best practice book. If you want to gain a better understanding regarding the function calls shown below, please have a look on our dedicated T cell tutorial
ir.pp.index_chains(mdata, filter=["productive", "require_junction_aa"])
ir.tl.chain_qc(mdata)
mu.pp.filter_obs(
mdata, "airr:chain_pairing", lambda x: ~np.isin(x, ["orphan VDJ", "orphan VJ", "ambiguous", "multichain"])
)
Filtering chains...
Indexing VJ chains...
Indexing VDJ chains...
build result array
Stored result in `mdata.obs["airr:receptor_type"]`.
Stored result in `mdata.obs["airr:receptor_subtype"]`.
Stored result in `mdata.obs["airr:chain_pairing"]`.
Preprocessing Transcriptomics Data#
Certain analyses can combine transcriptomics and AIRR information. Gene expression (GEX) data needs to be filtered and preprocessed as with any other single-cell dataset. Please refer for a real dataset to the scverse tutorials and the single-cell best practice book for advice. The transcriptomics data in this demo dataset has already been prepared appropriately.
Define Clonotype Cluster for BCR#
The general workflow of defining BCR/TCR clonotype clusters is depicted below:
scirpy function |
objective |
---|---|
Compute sequence-based distance matrices for all VJ and VDJ sequences. |
|
Cluster cells by the similarity of their CDR3-sequences. |
|
Compute layout of the clonotype network. |
|
Plot clonotype network colored by different parameters. |
In Scirpy
we generally distinguish between clonotypes (group of cells that have the same anscestor and identical receptors) and clonotype clusters (group of cells that have similar receotors). Because of somatic hypermutation the above definition of clonotypes wouldn’t correctly represent clonal relationship in the case of BCR, and therefore we have to define certain criteria to cluster similar sequences into clonotype clusters.
Currently, there is no commonly agreed best practice on how to define clonotype cluster for BCR sequences, and it remains an active research field [YK15]. In Scirpy
, we implemented a simple, yet effective, network-based approach to define BCR clonotype clusters [BvanSchaikS+24] leveraging the same underlying functionality we had already in place for TCRs.
With scirpy.pp.ir_dist()
we can calculate the full pairwise-distance matrix between each junction sequence for each light and heavy chain in the dataset. For BCRs, we recommend the normalized_hamming
distance which calculates the hamming distance between junction sequences normalized to their respective length applied to nucleotides (sequence = nt
) rather than amino acids (sequence = aa
), because SHM acts on the nucleotide level [YK15].
To choose a distance cutoff, we can either use a generic threshold of e.g. 85% junction similarity (cutoff = 15
), or choose one based on the empirical distribution of distances.
To this end, setting min_dist_histogram = True
will display the distance-to-nearest distribution of the junction sequences between VJ-chains (light) and VDJ-chains (heavy), respectively. A cutoff is commonly selected such that it separates the bimodality between sequences with closely related sequences from sequences without closely related sequences [GAB+17].
However, as it is apparent by the following histogram this process can be subjective, and sometimes no clear bimodality can be observed. It is generally true that distance-to-nearest plots are bimodal, but it can be difficult to identify one distinct threshold, especially for small datasets[GAB+17]. In such cases, we suggest to use conservative fixed thresholds to minimize human bias.
ir.pp.ir_dist(mdata, metric="normalized_hamming", cutoff=15, sequence="nt", histogram=True)
Next up, calling scirpy.tl.define_clonotype_clusters()
will actually define the clonotype clusters based on the distances of the VJ and VDJ CDR3-sequences (calculated by scirpy.pp.ir_dist()
) and further parameters. To identify cells that originate from the same ancestor, we require them to have the same V-gene (same_v_gene = True
) and J-gene (same_j_gene = True
) segments [YK15]. Using receptor_arms = "all"
and dual_ir = "any"
ensures that both heavy and light chains need to match [DID+13], and is lenient about some cells having a secondary chains where others might have none. Setting within_group
to an .obs
column will ensure cells with different values do not cluster together (e.g. use patient_id to ensure there are no “public” clonotypes). Dealing with allelically/isotypically included B cells in sc-data is still in its upcoming may require some experimentation with the dual_ir
parameter.
Note
Re-annotation with IgBlast/IMGT-HighV-QUEST causes at the moment some issues in Scirpy as it commonly integrates allele information as part of the gene annotation (after the asterisk(*)) and will include ambiguous gene annotations (more than one annotation per gene). At the time, manually manipulating both v_call and j_call columns of the respective files prior to import into Scirpy seems to be an acceptable workaround.
These restrictions are used to construct a network where each node represents a unique junction sequence (clonotypes) and the size of the node represents how many identical sequences this node contains. Nodes that pass all specifications of scirpy.tl.define_clonotype_clusters()
will be connected by an edge.
Lastly, partitions
will specify the algorithm that is used for community detection in igraph. partitions = "fastgreedy"
will initiate a greedy hierarchical clustering algorithm, which might lead to some smaller subclusters, especially if cluster are weakly connected [CNM04]. In practice, this makes little difference in scirpy, since edges representing a distance exceeding the cutoff
threshold will automatically be elimitated to improve computational efficiency, already leading to disconnected partitions.
scirpy.tl.define_clonotype_clusters()
will add a clone_id
and a clone_id_size
column to .obs
, which can be customized with key_added
.
ir.tl.define_clonotype_clusters(
mdata,
sequence="nt",
metric="normalized_hamming",
receptor_arms="all",
dual_ir="any",
same_v_gene=True,
same_j_gene=True,
# within_group = "gex:patient_id", # disallow "public" clonotypes
partitions="fastgreedy",
key_added="clone_id_85_similarity",
)
Initializing lookup tables.
Computing clonotype x clonotype distances.
Stored result in `mdata.obs["airr:clone_id_85_similarity"]`.
Stored result in `mdata.obs["airr:clone_id_85_similarity_size"]`.
Calling scirpy.tl.clonotype_network()
will compute the layout for the clonotype network needed for visualization. Setting min_cells
to an appropriate value is especially important for big datasets to avoid overcrowding the plot.
ir.tl.clonotype_network(
mdata, sequence="nt", metric="normalized_hamming", min_cells=3, clonotype_key="clone_id_85_similarity"
)
scirpy.pl.clonotype_network()
will display a customizable clonotype network. color
can be set to any .obs
column. Here we see that most expanded clonotype clusters belong to either Moderate
and Severe
patients. IWe also observe that no clonotype cluster is shared across patients, which we could show systematically by checking the patient IDs stored inside mdata.obs["gex:patient_id]
Repertoire Analysis#
Clonotype Expansion#
Upon infection, some precursor lymphocytes that recognize an invading antigen proliferate into cells with various phenotypes that establish an appropriate immune response in a process called clonal expansion (also referred to as clonotype expansion) [AGS20]. Clonotype expansion is one way to characterize the immune response.
scirpy.tl.clonal_expansion()
adds a label (inside .obs
) showing if there are other cells with same clonotype id or not. We can identify different “levels of expansion” by adjusting the breakpoints
parameter. This new column is useful to access information about clonal expansion among different groups.
ir.tl.clonal_expansion(mdata, target_col="clone_id_85_similarity", breakpoints=(1, 2))
expanded = np.sum(mdata.obs["airr:clonal_expansion"] != "<= 1") / len(mdata)
print(f"{expanded:.2%}% of cells belong to an expanded (size >=2) clonotype cluster in this dataset.")
Stored result in `mdata.obs["airr:clonal_expansion"]`.
6.04%% of cells belong to an expanded (size >=2) clonotype cluster in this dataset.
This newly generated column can be visualized using scirpy.pl.group_abundance()
, which is designed to quickly visualize categorical data. Here it appears that expanded clonotypes are preferentially found in cells labelled as plasma cells and plasmablast. This makes sense as these cell types are primarily involved in the humoral immune response upon infection and proliferate rapidly [NHTC15].
_ = ir.pl.group_abundance(
mdata,
target_col="gex:full_clustering",
groupby="airr:clonal_expansion",
sort=["<= 1", "<= 2", "> 2"],
normalize=True,
)
Alternatively, clonal expansion can be visualized using scirpy.pl.clonal_expansion()
. Each bar will represent one group defined groupby
and each colour will identify one clonal expansion breakpoint defined by breakpoints
.
Clonotype Diversity#
TODO
Clonotype Abundance#
Sometimes it might be desired to investigate the biggest clonotype cluster, which can be accomplished with calling scirpy.pl.group_abundance()
. It is worth mentioning that bars may be normalized to any variable from .obs
by setting the normalize
parameter accordingly.
Here, we can see that each of the 10 largest clonotype cluster span across at least two cell type annotations.
Gene usage#
Gene annotation can be utilized to investigate gene usage for various categories like patient related information (e.g. status, disease, treatment) or cell type annotation. By using scirpy.get.airr_context()
as a context manager any AIRR variable stored inside the .obsm["airr"]
awkward array can be temporarily added to mdata.obs
.
Below, this was applied to visualize gene usage per patient-status. We observe that the majority of the most used genes belong to gene family IGHV3 and IGHV4.
with ir.get.airr_context(mdata, "v_call"):
ax_Status = ir.pl.group_abundance(
mdata,
groupby="VDJ_1_v_call",
target_col="gex:Status_on_day_collection_summary",
normalize="gex:patient_id",
max_cols=20,
figsize=(4, 4),
)
_ = ax_Status.set_xticklabels(ax_Status.get_xticklabels(), rotation=50, ha="right")
To quickly get an overview of the exact combination of V(D)J genes scirpy.pl.vdj_usage()
can be used. It will plot all gene combinations of the dataset as a Sankey-plot.
ir.pl.vdj_usage(
mdata,
vdj_cols=("VJ_1_v_call", "VJ_1_j_call", "VDJ_1_v_call", "VDJ_1_d_call", "VDJ_1_j_call"),
full_combination=False,
max_segments=None,
max_ribbons=30,
fig_kws={"figsize": (12, 8)},
)
A lot of this tutorial was focused on the primary heavy chain, as it is considered to be more diverse compared to the light chain and thus be potentially more interesting for analysis. Either way, analysing the light chain works similarly and may also yield insightful information. Further, it is possible to analyse Alellically included B-cells with a secondary heavy and/or light chain contig as well.
Spectratype analysis#
Spectratype analysis is another way to describe heterogeneity of the V(D)J sequences. The V(D)J-recombination mechanism brings together different gene segments and might introduce insertions and/or deletions of nucleotides between genes segments, which leads to a broad spectrum of lengths of the resulting transcript [MW17]. Scirpy offers scirpy.pl.spectratype()
that plots a distribution by summarizing the length of the junction (CDR3) region for all cells and chains.
Using scirpy.pl.spectratype()
we can show that while the spectratype distribution of the whole dataset is almost a normal distribution with a peak at 16 amino acids, there appears to be a preferential usage of spectratypes with length 20 for expanded clonotype clusters. At the same time, spectratypes of length 16 in expanded cells sppear to be far less abundant than expected from the overall spectratype distribution.
with ir.get.airr_context(mdata, "v_call"):
ir.pl.spectratype(mdata, chain=["VDJ_1"], color="VDJ_1_v_call", viztype="bar", fig_kws={"dpi": 120}, figsize=[6, 6])
with ir.get.airr_context(mdata, "v_call"):
ir.pl.spectratype(
mdata[
mdata.obs["airr:clonal_expansion"].isin(["<= 2", "> 2"]),
:,
],
chain=["VDJ_1"],
cdr3_col="junction_aa",
color="VDJ_1_v_call",
viztype="bar",
fig_kws={"dpi": 120},
figsize=[6, 6],
)
To visualize different spectratype distributions across different categories (like cells/patients) we can set viztype = "curve"
and the curve_layout = "shifted"
to “shifted”. This will visualize all distinct categories as a “ridge-plot”. The plot below suggests that plasmablasts could contribute towards the observed peak at 20 aa for expanded clonotype clusters as the junction sequence distribution of this particular cell type seems to be skewed towards longer spectratypes and plasmablasts were preferentially expanded.
ir.pl.spectratype(
mdata,
cdr3_col="junction_aa",
chain="VDJ_1",
color="gex:full_clustering",
viztype="curve",
curve_layout="shifted",
fig_kws={"figsize": [6, 6]},
kde_kws={"kde_norm": False},
)
Thanks to the seamless integration of scverse tools, we can also leverage Scanpy funtionctionalities for plotting. So instead of analysing/visualizing categorical data via “ridge”-plots as shown above we could also, e.g. create a violin plot using scanpy.pl.violin()
.
The context manager scirpy.get.obs_context()
allows to temporary add AIRR data into the .obs
column for plotting.
Below, we use a violin plot to visualize potential differences between patient-status groups.
with ir.get.obs_context(
mdata["gex"], {"junction_len": [len(a) for a in ir.get.airr(mdata, "junction_aa", "VDJ_1")]}
) as m:
sc.pl.violin(
m,
"junction_len",
groupby="Status_on_day_collection_summary",
stripplot=False,
inner="box",
rotation=45,
order=["Asymptomatic", "Mild", "Moderate", "Severe", "Critical"],
)
Logoplots#
Motif sequence analysis is a crucial analysis step and is often executed through generation of logoplots with respective tools. Here, we represent a Scirpy native wrapper function scirpy.pl.logoplot_cdr3_motif
to a well established python package called logomaker. With to_type
we can specify, how the logoplot is calculated and plot_default
offers a convenient “pre-styling” that can be easily adapted. As logomaker offers a lot of further customization opportunities and scirpy.pl.logoplot_cdr3_motif
, will allow any input that is accepted by logomaker we suggest to head to their documentation to maximize the use of this amazing package.
Note
The input to
scirpy.pl.logoplot_cdr3_motif
is expected to include only sequences that have the same junction length, because it is not possible to perform multiple sequence alignment with Scirpy. Therefore, the user has to manually filter the MuData object prior or while the function call.
The logoplot below shows the likelihood of any amino acid to occur among expanded clonotype clusters with a spectratype of 20.
with ir.get.obs_context(
mdata["airr"], {"junction_len": [len(a) for a in ir.get.airr(mdata, "junction_aa", "VDJ_1")]}
) as m:
ir.pl.logoplot_cdr3_motif(
m[(m.obs["junction_len"] == 20) & (m.obs["clonal_expansion"].isin(["<= 2", "> 2"]))],
chains=["VDJ_1"],
to_type="information",
)
The next logoplot is similar to the previous one, but should demonstrate that it is very easy to generate logoplots of certain clonotype clusters, as they always have the same junction length due to the nature of the used hamming distance. As expected there is only small variation inside the same clonotype cluster, but interestingly position 10 and 12 of the VDJ chain differs the most. It is easy
to show multiple chains next to each other using plt.subplots()
.
Analysing somatic hypermutation#
TODO