Skip to content
/ ALICE Public

Detecting TCR involved in immune responses from single RepSeq datasets

License

Notifications You must be signed in to change notification settings

pogorely/ALICE

Repository files navigation

ALICE

Detecting TCR involved in immune responses from single RepSeq datasets.

About

Here we provide an R implementation of ALICE approach, see paper for details.

Software requirements

Any OS where R is available (Linux, OS X, Windows), however parallel computing is currently not available on Windows.

Installation

  1. Install R distribution of choice (i.e. from R Core team, or Microsoft R Open )
  2. Install BioStrings package from bioconductor: open R console and execute following commands:
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
  1. Install data.table, stringdist, igraph packages:
install.packages("igraph")
install.packages("data.table")
install.packages("stringdist")
  1. Download this github repository.

Quick start

Lets load some data first and organize it to the list.

This is one VJ combination (TRBV9-TRBJ2-7) from S1 donor from link on day 0 and day 15 after yellow fever immunization.

library(data.table)
S1d15<-fread("sample/S1_d15_V9_J2_7.tsv")
S1d0<-fread("sample/S1_d0_V9_J2_7.tsv")
S1<-list(d0=S1d0,d15=S1d15)

Now lets run ALICE pipeline. Note, that algorithm will create a folder for files produced, if it does not exist:

source("ALICE.R")
S1_alice<-ALICE_pipeline(DTlist=S1,folder="S1_res",cores=1,iter=10,nrec=5e5) #this takes few minutes to run
sapply(S1_alice,nrow)

For this VJ-combination we have no significant results in day 0 timepoint, and 34 significant hits in day 15 timepoint.

Note, that for demo purposes we run it on 1 core with 10 iterations (0.5 mln sequences in each iteration) for generative probability estimation simulation. Total number of simulated TCR sequences (both inframe and out-of-frame) in this case is 5 million.

In paper we used 100 mln simulated sequences for each VJ-class, and this takes a lot of time. Datasets from the paper are available here.

Additional parameters

Read_count_filter(default 0) and Read_count_neighbour(default 1) parameters are two conceptually different count threshold for clones considered by the algorithm.

Algorithm discards all clones with count Read_count_filter or less prior to analysis, and it does not consider as neighbours clone with count Read_count_neighbour or less (but such clones are not discarded, so if they have a lot of high count neighbours, it could be significant hit).

qL(default is FALSE) uses different selection factor for different lengths instead of universal selection factor.

cor_method (default "BH") specifies multiple testing correction method supplied to p.adjust function from stats package. P_thres (default 0.001) determines p-value threshold after correction.

Input file format

Algorithm operates on R data.table with following mandatory columns:

CDR3.amino.acid.sequence, bestVGene, bestJGene, Read.count.

See sample datasets in sample folder.

There is a pipeline to import data in immunoSEQ format. It imports all files in given folder with counts larger than Read_thres, and also identifies CDR3nt and collapses clones with unique CDR3nt+V+J. immunoSEQ data with short reads do not contain complete CDR3nt sequence. For such data specify trim parameter (for reads of length 60 typically trim=3), in this case algorithm will identified CDR3nt trimmed from J side by certain number of amino acids.

source("import.R")
dtlist<-import_immunoseq_pipeline(folder="path",trim=3,Read_thres=0)
results<-ALICE_pipeline(dtlist)

(Experimental) Using pipeline with OLGA for Pgen estimation

Install OLGA first (see OLGA github for details).

ALICE will call OLGA in background. Multi-core usage is not available on Windows.

source("ALICE.R")

#import data

S1d15<-fread("sample/S1_d15_V9_J2_7.tsv")
S1d0<-fread("sample/S1_d0_V9_J2_7.tsv")
S1<-list(d0=S1d0,d15=S1d15)

S1_alice<-ALICE_pipeline_OLGA(DTlist=S1,cores=1)
sapply(S1_alice,nrow)

For this VJ-combination we have 0 significant results at the day 0 timepoint, and 32 significant hits at the day 15 timepoint.

Note: On Rstudio server environments with limited write access you may need to install OLGA using the --user command:

git clone <OLGA>
cd OLGA
python setup.py install --user

This installs olga-compute_pgen and olga-generate_sequences to ~/Username/.local/bin, so this must be added to you PATH within your .Rprofile file: Sys.setenv(PATH="/Username/.local/bin:usr/bin")

Remember to then reload your .Rprofile using 'source(".Rprofile")'

About

Detecting TCR involved in immune responses from single RepSeq datasets

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages