GPU implementation of TCRdist, a distance/similarity metric for pairs of TCRs

An efficient, batched version of TCRdist that is compatible with both NVIDIA and Apple Silicon GPUs.

Usage

TCRdist(
  tcr1 = NULL,
  tcr2 = NULL,
  params = NULL,
  submat = NULL,
  tcrdist_cutoff = 90,
  chunk_size = 1000,
  print_chunk_size = 10,
  print_res = TRUE,
  only_lower_tri = TRUE,
  return_data = TRUE,
  write_to_tsv = FALSE
)

Arguments

tcr1: a data frame with one TCR per row. It must contain the columns "va", "vb", "cdr3a", and "cdr3b".
tcr2: (optional) another data frame of TCRs. If supplied, TCRdist will be calculated for every combination of one TCR from tcr1 and one TCR from tcr2. Otherwise, it will calculate TCRdist for every pair of TCRs in tcr1.
params: (optional) a table of valid parameters for amino acids and va/vb segments (default is NULL, which uses TIRTLtools::params).
submat: (optional) a substitution matrix with mismatch penalties for each combination of amino acids or va/vb segments (default is NULL, which uses TIRTLtools::submat).
tcrdist_cutoff: (optional) discard all TCRdist values above this cutoff (default is 90).
chunk_size: (optional) the chunk size used when calculating TCRdist (default is 1000). With chunk_size = n, pairwise TCRdist is calculated for n x n TCRs at once. It can be set as high as GPU memory allows; in our testing, a chunk_size of 1000 to 5000 gave the fastest runtime, and a chunk_size above 7500 caused memory errors on some GPUs.
print_chunk_size: (optional) print a line of output for every n TCRs processed (default is 10).
print_res: (optional) print summary of results (default is TRUE)
only_lower_tri: (optional) return one TCRdist value for each pair (like the lower triangle of a symmetric matrix). Default is TRUE.
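
To give a feel for what chunk_size controls, the arithmetic below counts the n x n blocks needed to cover all pairs in a hypothetical set of 25,000 TCRs and estimates the size of one block of distances. This is only a rough sketch: the 4 bytes per value (32-bit floats) is an assumption, and the real GPU memory use of the Python backend is likely higher because of intermediate arrays.

```r
n_tcrs <- 25000      # hypothetical number of TCRs in tcr1
chunk_size <- 1000   # the documented default

n_chunks <- ceiling(n_tcrs / chunk_size)     # chunks along one axis
n_blocks <- n_chunks^2                       # n x n blocks covering all pairs
values_per_block <- chunk_size^2             # distances computed per block
mb_per_block <- values_per_block * 4 / 1e6   # assumes 4 bytes per distance

cat(n_chunks, "chunks per axis,", n_blocks, "blocks,",
    mb_per_block, "MB of distances per block\n")
```

Doubling chunk_size quarters the number of blocks but quadruples the values held per block, which is why very large chunks can exhaust GPU memory.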

Value

A list with entries:

$TCRdist_df - a data frame with three columns: "edge1_0index", "edge2_0index", and "TCRdist". The first two columns contain the indices (0-indexed) of the two TCRs in each pair. The last column contains the TCRdist of that pair. The output is sparse: it contains only pairs with TCRdist <= cutoff.

$tcr1 - a data frame of the TCRs supplied to the function. It contains an additional column "tcr_index" with the (0-indexed) index of each TCR.

$tcr2 - a similar data frame for tcr2, if it was supplied.
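
Because the indices are 0-indexed while R is 1-indexed, it is safer to look pairs up through the "tcr_index" column than to use the indices as row numbers. The sketch below does this on a mock list that imitates the structure described above; the gene names, CDR3 sequences, and distances are made up for illustration and are not real output. It covers the case where only tcr1 was supplied, so both indices refer to $tcr1.

```r
# Mock result imitating the documented structure (illustrative values only)
out <- list(
  TCRdist_df = data.frame(
    edge1_0index = c(1, 2, 2),
    edge2_0index = c(0, 0, 1),
    TCRdist = c(12, 45, 90)
  ),
  tcr1 = data.frame(
    va = c("TRAV1-2*01", "TRAV12-1*01", "TRAV1-2*01"),
    vb = c("TRBV6-4*01", "TRBV19*01", "TRBV6-4*01"),
    cdr3a = c("CAVMDSNYQLIW", "CAVNQAGTALIF", "CAVRDSNYQLIW"),
    cdr3b = c("CASSLAPGATNEKLFF", "CASSIRSSYEQYF", "CASSLGQAYEQYF"),
    tcr_index = 0:2,
    stringsAsFactors = FALSE
  )
)

edges <- out$TCRdist_df

# Translate 0-indexed TCR indices into R row positions via tcr_index
row1 <- match(edges$edge1_0index, out$tcr1$tcr_index)
row2 <- match(edges$edge2_0index, out$tcr1$tcr_index)

# One row per retained pair, annotated with the CDR3 beta of each TCR
pairs <- data.frame(
  cdr3b_1 = out$tcr1$cdr3b[row1],
  cdr3b_2 = out$tcr1$cdr3b[row2],
  TCRdist = edges$TCRdist,
  stringsAsFactors = FALSE
)
print(pairs)
```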

Details

This function calculates pairwise TCRdist (Dash et al., Nature 2017) for a set of TCRs (or between two sets of TCRs) and returns a sparse output with the TCRdist and indices of all pairs that have TCRdist less than or equal to a desired cutoff (default cutoff is 90).

The function uses the `reticulate` package to call a Python script that calculates TCRdist efficiently with `cupy` (NVIDIA GPUs), `mlx` (Apple Silicon GPUs), or `numpy` (no GPU).
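
Before running on a large data set, it can help to check which of these Python modules `reticulate` can import. The helper below is a hypothetical convenience function, not part of TIRTLtools; it relies on `reticulate::py_module_available()` and assumes the modules are importable under the names `cupy`, `mlx`, and `numpy`. How TCRdist itself selects a backend is not shown here.

```r
# Hypothetical helper (not part of TIRTLtools): report which of the Python
# modules named above can be imported in the active reticulate environment.
# The checking function is an argument so it can be swapped out for testing.
check_backends <- function(modules = c("cupy", "mlx", "numpy"),
                           is_available = reticulate::py_module_available) {
  vapply(modules, is_available, logical(1))
}

# Usage (requires reticulate and a configured Python installation):
# check_backends()
```

The result is a named logical vector with one entry per module.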

Examples

# example code
# data = load_tirtlseq("your_directory/")
# df = get_all_tcrs(data, chain="paired", remove_duplicates = TRUE)
# out = TCRdist(df, tcrdist_cutoff = 90)
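
As an alternative to loading data from disk, an input can be built by hand. The sketch below creates a small data frame with the four required columns and checks that they are present before the call. The gene names and CDR3 sequences are illustrative values; whether they are accepted depends on the entries in `params`. The call to TCRdist is left commented out because it needs TIRTLtools and a working Python backend.

```r
# Hand-built input with the four required columns (illustrative values only)
tcrs <- data.frame(
  va = c("TRAV1-2*01", "TRAV12-1*01"),
  vb = c("TRBV6-4*01", "TRBV19*01"),
  cdr3a = c("CAVMDSNYQLIW", "CAVNQAGTALIF"),
  cdr3b = c("CASSLAPGATNEKLFF", "CASSIRSSYEQYF"),
  stringsAsFactors = FALSE
)

# Confirm the columns required by tcr1 are all present
required <- c("va", "vb", "cdr3a", "cdr3b")
missing_cols <- setdiff(required, colnames(tcrs))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# out <- TCRdist(tcr1 = tcrs, tcrdist_cutoff = 90)
```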