Analysis Tools for TIRTLseq Data • TIRTLtools

Package is in active development and may change frequently.

Overview

TIRTLtools is a suite of tools for analyzing T-cell receptor (TCR) repertoires created using TIRTL-seq (Throughput-Intensive Rapid TCR Library sequencing) (Pogorelyy and Kirk et al., bioRxiv 2024). We provide functions for analysis of paired TCR repertoires as well as single-chain bulk data.

In addition to various analysis and plotting functions, we provide an efficient batched GPU implementation of TCRdist (Dash et al., Nature 2017) that works with both NVIDIA and Apple Silicon GPUs. In testing, we were able to calculate pairwise TCRdist for a repertoire of ~1 million TCRs in a few hours using a MacBook Pro (16-core GPU, M4 Pro).

Input data for package

The package requires paired-chain and single-chain “pseudo-bulk” data files created by the TIRTLseq pipeline.

The pipeline reads MiXCR output files from TCR repertoire sequencing for individual wells in a multiwell plate (e.g. 96-well or 384-well plate). It uses two complementary algorithms, MAD-HYPE and T-SHELL, to pair alpha and beta chains. It outputs three files: a .tsv file containing alpha and beta chain pairs predicted by each algorithm (file ending in “TIRTLoutput.tsv”), and two .tsv files containing aggregated single-chain “pseudo-bulk” data for alpha or beta chains from all wells on a plate (files ending in “pseudobulk_TRA.tsv” and “pseudobulk_TRB.tsv”).

MAD-HYPE (Multicell Analytical Deconvolution for High Yield Paired-chain Evaluation, Holec and Berleant et al., 2019) is a Bayesian method that computes the likelihood of alpha and beta chain matches based on their co-occurence patterns in wells on a plate. We re-implemented the MAD-HYPE algorithm for GPU computation, increasing efficiency and speed by two orders of magnitude.

T-SHELL (TCR ɑβ Sequence Highly Efficient Linkage Learning, Pogorelyy and Kirk et al., bioRxiv 2024) is an algorithm we developed that uses the correlation between TCRalpha and TCRbeta clonotype relative frequencies across wells in a plate (rather than their presence or absence) to pair them. It works well in the case of the most abundant clones, which are found in all or almost all wells, making it impossible for the MAD-HYPE algorithm to pair them using their co-occurrence patterns.

Example input data for the TIRTLseq pipeline can be found on Zenodo. For example, the aph17pre_exp_clones.tar.gz (172 MB) and aph17post_exp_clones.tar.gz (1.1 GB) files contain well-level TCR clonotype data (MiXCR output) for pre- and post-antigen-independent T cell expansion experiments.

Package functions

Data loading and filtering:

load_tirtlseq() – load paired TCR and single-chain data from TIRTL-seq experiments
filter_dataset() – subset a TIRTLseq dataset to only the requested samples

Repertoire analysis:

TCRdist() – calculate TCRdist via batched GPU calculation
cluster_tcrs() – cluster TCRs (with themselves and with TCRs from VDJ-db) based on pairwise TCRdist
diversity() – calculate repertoire diversity metrics

Plotting:

sample_overlap() – plot overlap/agreement between samples (in terms of most frequent clones)
plot_sample_vs_sample() – scatterplot of readFraction between two samples
rank_plot() – a line plot of clonotype rank vs. readFraction for different samples
plot_clusters() – plot top “n” clusters with the most observed TCRs – currently outputs a network graph, a umap, and a heatmap
plot_gene_usage() – bar plots of V-gene segment usage (or J-gene segment usage)
plot_num_partners() – plot number of partners for alpha and beta chains in paired data
plot_clonotype_indices() – stacked bar chart with fractions of reads for most frequent clonotypes.
plot_diversity() – bar plots of different diversity metrics

Helper functions:

add_single_chain_data() – add alpha- and beta-chain read counts/fractions to the paired TCR data
identify_non_functional_seqs() – mark cdr3 sequences that have stop codons (*) or frame shifts (_)
identify_paired() – identify which alpha and beta chains in the pseudo-bulk data are found in pairs
get_all_div_metrics() – returns all options for diversity metrics (currently: simpson, gini, gini.simpson, inv.simpson, shannon, berger.parker, richness, d50, dXX, renyi, hill)
get_all_tcrs() – aggregate all of the TCRs from all samples in a dataset.

Installation

You can install the package directly from GitHub:

remotes::install_github("NicholasClark/TIRTLtools")

Alternatively, you may clone the repository to a local folder

git clone https://github.com/NicholasClark/TIRTLtools.git

and load it using devtools:

devtools::load_all("./TIRTLtools") ### location of cloned repo on your machine

TCRdist GPU implementation

This function computes the similarity of two T-cell receptors (TCRs) using TCRdist (Dash et al., Nature 2017). The function can efficiently calculate pairwise similarities of thousands or millions of TCRs using a GPU if it is available. It relies on python code using cupy (Nvidia GPUs) or mlx (Apple Silicon GPUs), which is called through the reticulate R package.

The following example computes TCRdist values for all pairs of annotated TCRs in the VDJdb database. It outputs all values with TCRdist <= 90.

#library(TIRTLtools)
devtools::load_all("./TIRTLtools") ### location of cloned repo on your machine
library(dplyr)

### Load an example dataset (annotated TCRs from VDJ-db)
tcr1 = TIRTLtools::vdj_db
### note: You may replace this with a file of your choice - needs to have columns "va", "vb", "cdr3a", and "cdr3b"

as_tibble(tcr1)
# # A tibble: 26,213 × 15
#       complex.id cdr3a         va       ja    cdr3b vb    jb    Species MHC.A MHC.B MHC.class Epitope Epitope.gene
#         <int>   <chr>         <chr>    <chr> <chr> <chr> <chr> <chr>   <chr> <chr> <chr>     <chr>   <chr>
# 1          5 CAVRDGGTGFQKLVF   TRAV3*01 TRAJ… CASR… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      HPVGEA… EBNA1
# 2          6 CAARGIGSGTYKYIF   TRAV13-… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      HPVGEA… EBNA1
# 3          8 CAVRDGTANNLFF     TRAV3*01 TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 4          9 CAFMKDAGGTSYGKLTF TRAV38-… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 5         10 CAGGGSQGNLIF      TRAV27*… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 6         11 CALAGSQGNLIF      TRAV6*01 TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 7         12 CAGADGGSQGNLIF    TRAV27*… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 8         13 CAAGGSQGNLIF      TRAV27*… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 9         14 CAAPPSNTGKLIF     TRAV3*01 TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 10         15 CAVRDLLTNAGKSTF   TRAV3*01 TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# # ℹ 26,203 more rows
# # ℹ 2 more variables: Epitope.species <chr>, Reference <chr>
# # ℹ Use `print(n = ...)` to see more rows
result = TCRdist(tcr1 = tcr1, tcrdist_cutoff = 90, chunk_size = 5000)
# Checking for available GPU...
#
# Apple Silicon GPU detected:
#   Apple Silicon GPU (M1/M2/M3)
# Checking for GPU-related Python modules...
#
# 'mlx' is installed (for Apple Silicon GPUs).
# Loading mlx to perform TCRdist
# total number of chunks (rows): 5
# total number of chunks (cols): 5
# Processing chunk (rows) 0
# Processing chunk (rows) 5000
# Processing chunk (rows) 10000
# Processing chunk (rows) 15000
# Processing chunk (rows) 20000
# Processing chunk (rows) 25000
# Time taken: 2.970693 seconds

edge_df = result[['TCRdist_df']] ### table of TCRdist values <= cutoff
meta_df = result[['tcr1']] ### table of input data with indices

as_tibble(edge_df)
# # A tibble: 101,797 × 3
#       edge1_0index edge2_0index TCRdist
#           <int>        <int>   <int>
# 1            5            4      87
# 2            7            4       3
# 3            7            5      90
# 4           12            4      15
# 5           12            5      78
# 6           12            7      12
# 7           13            4      78
# 8           13            6      72
# 9           13            7      81
# 10           14            6      45
# # ℹ 101,787 more rows
# # ℹ Use `print(n = ...)` to see more rows

as_tibble(meta_df)
# # A tibble: 26,213 × 16
#   complex.id cdr3a      va    ja    cdr3b vb    jb    Species MHC.A MHC.B MHC.class Epitope Epitope.gene
#       <int>   <chr>     <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr> <chr>     <chr>   <chr>
# 1          5 CAVRDGGTG… TRAV… TRAJ… CASR… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      HPVGEA… EBNA1
# 2          6 CAARGIGSG… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      HPVGEA… EBNA1
# 3          8 CAVRDGTAN… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 4          9 CAFMKDAGG… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 5         10 CAGGGSQGN… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 6         11 CALAGSQGN… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 7         12 CAGADGGSQ… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 8         13 CAAGGSQGN… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 9         14 CAAPPSNTG… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# 10         15 CAVRDLLTN… TRAV… TRAJ… CASS… TRBV… TRBJ… HomoSa… HLA-… B2M   MHCI      GILGFV… M
# # ℹ 26,203 more rows
# # ℹ 3 more variables: Epitope.species <chr>, Reference <chr>, tcr_index <dbl>
# # ℹ Use `print(n = ...)` to see more rows

#' TCRdist function output:
#' A list with two items:
#'    'TCRdist_df' - A dataframe containing all TCRdist values less than the cutoff (default = 90).
#'                  Contains the indices of the two TCRs and the TCRdist value.
#'    'tcr1' - A dataframe with the input data, along with a column "tcr_index" with the index of each TCR

References

For details on our pairing pipeline, see the TIRTLseq preprint (Pogorelyy and Kirk et al.) and our github repository.

For details on the MAD-HYPE algorithm, see Holec and Berleant et al., 2019.

For details on MiXCR, see their website and publications in Nature Methods (2015) and Nature Biotechnology (2017).

For details on TCRdist, see Dash et al., Nature 2017

Written by Mikhail Pogorelyy and Nicholas Clark, Thomas Lab, St. Jude Children’s Research Hospital

TIRTLtools R package