CHORD is a random forest model that uses the relative counts of somatic mutation contexts to predict homologous recombination deficiency (HRD). The primary contexts used by CHORD are deletions with flanking microhomology and 1-100kb structural duplications. Additionally, 1-100kb structural duplications are used to distinguish BRCA1-type HRD from BRCA2-type HRD.
CHORD has been updated to v2.0. In this version, the del.mh
feature
(deletions with microhomology) is split into del.mh.bimh.1
and
del.mh.bimh.2.5
(bases in microhomology: 1bp; 2 to >=5bp). This was
done to reduce false positive predictions. del.mh.bimh.2.5
was found
to be predictive of HRD to a greater extent than del.mh.bimh.1
.
Furthermore, radiation therapy was found to be associated with
del.mh.bimh.1
, which could result in false positive predictions for
radiotherapy treated patients.
Pan-cancer landscape of homologous recombination deficiency
Luan Nguyen, John Martens, Arne Van Hoeck, Edwin Cuppen. Nat Commun 11,
5584 (2020). https://www.nature.com/articles/s41467-020-19406-4
CHORD is itself an R package. Before using CHORD, some other R packages will also need to be installed, with the main one being mutSigExtractor, which is required for extracting the mutation contexts that CHORD uses. The below code can be used to install CHORD and mutSigExtractor, as well as the dependencies for both packages.
## Bioconductor packages required by mutSigExtractor
install.packages('BiocManager')
BiocManager::install('BSgenome') ## Install genome parser
BiocManager::install('BSgenome.Hsapiens.UCSC.hg19') ## Install the default genome
## randomForest is required by CHORD
install.packages('randomForest')
## Install CHORD and mutSigExtractor directly from github using devtools
install.packages("devtools")
devtools::install_github('/~https://github.com/UMCUGenetics/mutSigExtractor/')
devtools::install_github('/~https://github.com/UMCUGenetics/CHORD/')
CHORD can be supplied with vcf files containing somatic (no germline) SNVs, indels, and SVs per sample. The first step is to extract the mutation contexts required by CHORD.
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs/',
vcf.indel = '/path/to/vcf/with/indels/',
vcf.sv = '/path/to/vcf/with/svs/',
sv.caller = 'gridss' ## Can be 'gridss' (default) or 'manta'
)
Note that there is no real standard for SV vcfs. However, parsing vcfs produced by Manta or GRIDSS has been built into CHORD.
Alternatively, dataframes can be used as input (handy if you want to parse the vcfs yourself and supply this preprocessed data to CHORD).
contexts <- extractSigsChord(
df.snv = bed_file_like_dataframe_with_snvs,
df.indel = bed_file_like_dataframe_with_indels,
df.sv = dataframe_with_svtype_and_svlen
)
For SNVs and indels, a dataframe with the columns: chrom, pos, ref, alt.
## CHROM POS REF ALT
## 1 1 16145827 A C
## 2 1 16492085 G C
## 3 1 17890303 C G
## 4 1 18877885 G A
## 5 1 18919776 T C
For SVs, a dataframe with the columns: sv_type, sv_len. For sv_type, values must be DEL, DUP, INV, TRA (deletions, duplications, inversions, translocations). For translocations, sv_len information is discarded.
## sv_type sv_len
## 1 TRA NA
## 2 DEL 1696
## 3 DEL 22644
## 4 DUP 1703
## 5 DEL 1789
## 6 DEL 49256
The column names themselves do not matter, as long as the columns are in the aforementioned order.
The vcf.* and df.* arguments can be mixed. This is especially useful
in the case of SV vcfs, since it is possible that you will need to parse
these yourself (in this case the sv.caller
argument can be left out).
Make sure to only use PASS variants! If your vcfs contain unfiltered
variants, you can provide the desired FILTER values to keep to
vcf.filters
.
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs/',
vcf.indel = '/path/to/vcf/with/indels/',
df.sv = dataframe_with_svtype_and_svlen,
## This is optional (for when your vcfs are not filtered for PASS variants)
vcf.filters=list(snv="PASS",indel="PASS",sv="PASS")
)
Often, SNVs and indels are reported in the same vcfs. In such cases, the
vcf path can be specified to the vcf.snv
argument (vcf.indel
can be
left out). Alternatively, SNVs and indels can be provided as a
dataframe, which can be specified to df.snv
(with df.indel
left
out).
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs_and_indels/', ## or: df.snv = bed_file_like_dataframe_with_snvs_and_indels
df.sv = dataframe_with_svtype_and_svlen
)
A different reference genome than the default
(BSgenome.Hsapiens.UCSC.hg19) can be used. Genomes should be BSgenomes.
The variable name (i.e. no quotes) of the BSgenome object is
specified to ref.genome
.
## Make sure to install and load the desired ref genome first
install.packages('BiocManager')
BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')
## Non-default genomes need to be explicitly loaded. The default (BSgenome.Hsapiens.UCSC.hg19)
## is automatically loaded.
library(BSgenome.Hsapiens.UCSC.hg38)
## Specify the name of the BSgenome object to the ref.genome argument
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs_and_indels/',
df.sv = dataframe_with_svtype_and_svlen
ref.genome=BSgenome.Hsapiens.UCSC.hg38
)
Once we have the mutation contexts, a prediction can be made.
chordPredict(contexts)
This tutorial will demonstrate how HRD prediction can be performed locally. For large vcfs and/or many samples, it is advised to run CHORD on a high-performance cluster (HPC).
Somatic vcfs from a few primary tumors from the 560 breast cancer cohort
(BRCA-EU, ICGC) will be used as example input. These are located at
example/vcf/
in the CHORD git repository. These vcfs have been
generated from tsv files which are publically available at
https://dcc.icgc.org/repositories.
In part 1, we will use the vcfs directly as input for CHORD.
Part 2 of this tutorial will provide a demonstration of how we can parse the vcfs into dataframes that CHORD accepts. This is particularly useful if you have non-standard vcfs, as is often the case with SV vcfs. Of course, it is possible to produce the required input dataframes from any data source, and not just vcfs; it only requires a bit of data wrangling :).
To run the example, clone/download the git repository to your own
computer (e.g. to /Users/lnguyen/Desktop/), and set the working
directory to the example/
directory in the CHORD package.
setwd('/Users/lnguyen/Desktop/CHORD/example/')
We will first create a dataframe to store the vcf paths as well as assign them sample names.
library(CHORD)
options(stringsAsFactors=F)
## Get file paths from the vcf/ subdirectory
vcf_files <- data.frame(
snv_indel=list.files('vcf', pattern='*snv_indel.vcf.gz', full.names=T),
sv=list.files('vcf', pattern='*sv.vcf.gz', full.names=T)
)
## Assign sample names to files
vcf_files$sample <- sapply(strsplit(basename(vcf_files$snv_indel),'_'),`[`,1)
vcf_files
## snv_indel sv sample
## 1 vcf/PD10010_snv_indel.vcf.gz vcf/PD10010_sv.vcf.gz PD10010
## 2 vcf/PD11352_snv_indel.vcf.gz vcf/PD11352_sv.vcf.gz PD11352
## 3 vcf/PD3905_snv_indel.vcf.gz vcf/PD3905_sv.vcf.gz PD3905
## 4 vcf/PD4116_snv_indel.vcf.gz vcf/PD4116_sv.vcf.gz PD4116
## 5 vcf/PD7344_snv_indel.vcf.gz vcf/PD7344_sv.vcf.gz PD7344
Vcfs can either be in plain text format (.vcf) or compressed (.gz).
SNVs and indels are often stored in the same vcf (as is the case with
the example vcfs). CHORD therefore accepts such vcfs as input; SNVs and
indels are automatically split up. However, it is also possible to
specify the SNV and indel vcfs separately (by specifying the paths
separately to vcf.snv
and vcf.indel
to extractSigsChord()
; see
below).
Create a directory to write all subsequent output. Note: the output
from this tutorial exists in the output/
directory by default. Delete
this directory to create new output files.
dir.create('output/')
Extract the features that are used as input for CHORD. With the example data, it should take about 30s to run.
## Make dir to output contexts
contexts_dir <- 'output/contexts/'
dir.create(contexts_dir, recursive=T)
## Extract contexts for all samples
for(i in 1:nrow(vcf_files)){
params <- as.list(vcf_files[i,])
out_path <- paste0(contexts_dir,'/',params$sample,'_contexts.txt')
if(!file.exists(out_path)){
extractSigsChord(
vcf.snv=params$snv_indel,
vcf.sv=params$sv, sv.caller='manta',
sample.name=params$sample,
output.path=out_path, verbose=F
)
}
}
The sv.caller
parameter has been included in extractSigsChord()
since different SV callers report SVs in different ways.
extractSigsChord()
includes parsing for vcfs produced by Manta and
GRIDSS (sv.caller='manta'
or sv.caller='gridss'
). If you have vcfs
from other SV callers, a solution is to provide the relevant data as a
dataframe (see section 2 of the tutorial).
We can then merge the contexts into a matrix.
## Get the paths of the contexts txt files
context_files <- list.files(contexts_dir, full.names=T)
## Read the txt files into R
l_contexts <- lapply(context_files, function(i){
read.delim(i, check.names=F)
})
## Merged the contexts into a matrix
merged_contexts <- do.call(rbind, l_contexts)
## Write to output directory
write.table(merged_contexts, 'output/merged_contexts.txt', sep='\t', quote=F)
Below we can see what a part of the context matrix looks like.
merged_contexts[,1:5]
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A
## PD10010 37 25 8 24 34
## PD11352 29 40 3 21 30
## PD3905 89 61 14 61 91
## PD4116 123 136 15 142 164
## PD7344 10 9 1 6 11
Once we have the context matrix ready, we can use it for predicting HRD.
chord_output <- chordPredict(merged_contexts, do.bootstrap=T, verbose=F)
write.table(chord_output, 'output/chord_pred.txt', sep='\t', quote=F)
To extract the HRD probabilities:
chord_output[,1:8]
## sample p_BRCA1 p_BRCA2 p_hrd hr_status hrd_type
## 1 PD10010 0.000 0.000 0.000 HR_proficient none
## 2 PD11352 0.000 0.000 0.000 HR_proficient none
## 3 PD3905 0.798 0.056 0.854 HR_deficient BRCA1_type
## 4 PD4116 0.062 0.680 0.742 HR_deficient BRCA2_type
## 5 PD7344 0.000 0.002 0.002 cannot_be_determined none
## remarks_hr_status remarks_hrd_type
## 1
## 2
## 3
## 4
## 5 <50 indels
CHORD outputs the probability of:
- HRD:
p_hrd
- BRCA1-type HRD:
p_BRCA1
- BRCA2-type HRD:
p_BRCA2
hr_status
tells us if a sample is HR deficient (p_hrd
>= 0.5) or
proficient. If a sample is HRD, then hrd_type
will tell us if the
sample has BRCA1-type HRD or BRCA2-type HRD (=max(p_BRCA1
,p_BRCA2
)).
CHORD requires >=50 indels to accurately determine whether a sample
is HRD. If this criterion is not met, hr_status
will be
cannot_be_determined
and remarks_hr_status
will be <50 indels
.
CHORD cannot be applied to MSI samples. If an MSI sample is
detected, hr_status
will be cannot_be_determined
and
remarks_hr_status
will be Has MSI (>14000 indel.rep)
If a sample is HRD, CHORD requires >=30 SVs to accurately determine
HRD subtype. If this criterion is not met, hrd_type
will be
cannot_be_determined
, and remarks_hr_status
will be <30 SVs
.
The user may of course ignore these remarks and proceed with using the
raw probabilities outputted by CHORD (p_hrd
and/or
p_BRCA1
/pBRCA2
).
To extract the bootstrap predictions:
chord_output[,9:ncol(chord_output)]
## p_BRCA1.5% p_BRCA1.50% p_BRCA1.95% p_BRCA2.5% p_BRCA2.50% p_BRCA2.95%
## 1 0.0000 0.000 0.0043 0.0000 0.000 0.0040
## 2 0.0000 0.000 0.0000 0.0000 0.000 0.0000
## 3 0.4983 0.638 0.8644 0.0156 0.079 0.1704
## 4 0.0477 0.069 0.1046 0.6393 0.675 0.7403
## 5 0.0000 0.000 0.1810 0.0000 0.002 0.4714
## p_hrd.5% p_hrd.50% p_hrd.95%
## 1 0.0000 0.000 0.0062
## 2 0.0000 0.000 0.0000
## 3 0.5960 0.785 0.8947
## 4 0.7276 0.742 0.8058
## 5 0.0000 0.003 0.6444
To assess the stability of prediction for each sample, bootstrapping is performed by resampling the feature vector 20 times and calculating HRD probabilities for each iteration. The probabilities at the 5%, 50% 95% quantiles are then calculated. In other words, the bootstraped predictions provide a lower (5% quantile) and upper (95% quantile) error from the median probability (50% quantile) for each prediction class.
The user may choose to use the median probabilities in place of the
single probabilities (as in chord_output$pred
). This can give more
accurate HRD probabilities, especially for samples with low mutational
load.
Note that performing the bootstrapping is computationally expensive.
Bootstrapping can be turned off by specifying do.bootstrap=F
to
chordPredict()
.
To run CHORD on non-standard vcfs or from other sources, we can create
dataframes that extractSigsChord()
accepts. The output can then be
passed to chordPredict()
.
In this part of the tutorial, we will create the required dataframes from the example vcfs. This will thus also serve as a short demonstration on how one could parse vcfs in R. Furthermore, we will focus on one sample for simplicity’s sake.
For SNVs and indels, a dataframe containing CHROM, POS, REF and ALT information (i.e. a bed file like format) is required. The column names of this dataframe can be anything; as long as the columns are in the aforementioned order.
We can use readVcfFields()
from the mutSigExtractor
package to read
a vcf into R as a dataframe.
library(mutSigExtractor)
df_snv_indel <- readVcfFields('vcf/PD3905_snv_indel.vcf.gz', fields=c('CHROM','POS','REF','ALT'))
Alternatively, we can specify the vcf columns we want by index.
df_snv_indel <- readVcfFields('vcf/PD3905_snv_indel.vcf.gz', fields=c(1,2,4,5))
The dataframe shown below is in the correct format as required by CHORD.
This dataframe contains both SNVs and indels. This is not a problem as
we can specify the argumentdf.snv
(and leaving out df.indel
) in
extractSigsChord()
as is done later in the tutorial.
df_snv_indel[25:30,]
## CHROM POS REF ALT
## 25 1 16145827 A C
## 26 1 16333830 GTGTGAAGGTGTC G
## 27 1 16492085 G C
## 28 1 17890303 C G
## 29 1 18877885 G A
## 30 1 18919776 T C
For SVs, a dataframe is required with SV type (values must be one of the following: DEL, DUP, INV, TRA) and SV length information. Again, the column names of this dataframe can be anything; as long as the columns are in the aforementioned order.
Again, readVcfFields()
can be used to read the vcf. The example vcfs
have been produced by Manta, which reports SVTYPE and SVLEN in the INFO
field, so only this field needs to be extracted from the vcf.
vcf_sv <- readVcfFields('vcf/PD3905_sv.vcf.gz', fields='INFO')
print(vcf_sv[1:5,,drop=F], right=F)
## INFO
## 1 SVLEN=-13737754;SVTYPE=TRA;CHR2=3;END=10206978;Donor_ID=DO224557;project=BRCA-EU;icgc_sample_ID=PD3905a;icgc_specimen_ID=SA569900;assembly_version=NA
## 2 SVLEN=1696;SVTYPE=DEL;CHR2=1;END=54899207;Donor_ID=DO224557;project=BRCA-EU;icgc_sample_ID=PD3905a;icgc_specimen_ID=SA569900;assembly_version=NA
## 3 SVLEN=22644;SVTYPE=DEL;CHR2=1;END=72263583;Donor_ID=DO224557;project=BRCA-EU;icgc_sample_ID=PD3905a;icgc_specimen_ID=SA569900;assembly_version=NA
## 4 SVLEN=1703;SVTYPE=DUP;CHR2=1;END=75618395;Donor_ID=DO224557;project=BRCA-EU;icgc_sample_ID=PD3905a;icgc_specimen_ID=SA569900;assembly_version=NA
## 5 SVLEN=1789;SVTYPE=DEL;CHR2=1;END=75619297;Donor_ID=DO224557;project=BRCA-EU;icgc_sample_ID=PD3905a;icgc_specimen_ID=SA569900;assembly_version=NA
Then, split the data each INFO entry, which are separated by ;
.
vcf_sv_info <- strsplit(vcf_sv$INFO,';')
vcf_sv_info[1:2]
## [[1]]
## [1] "SVLEN=-13737754" "SVTYPE=TRA"
## [3] "CHR2=3" "END=10206978"
## [5] "Donor_ID=DO224557" "project=BRCA-EU"
## [7] "icgc_sample_ID=PD3905a" "icgc_specimen_ID=SA569900"
## [9] "assembly_version=NA"
##
## [[2]]
## [1] "SVLEN=1696" "SVTYPE=DEL"
## [3] "CHR2=1" "END=54899207"
## [5] "Donor_ID=DO224557" "project=BRCA-EU"
## [7] "icgc_sample_ID=PD3905a" "icgc_specimen_ID=SA569900"
## [9] "assembly_version=NA"
Extract the SVTYPE and SVLEN data, and put it in the correct dataframe format.
sv_type <- sapply(vcf_sv_info,`[`,2) ## Select the 2nd object from each INFO entry
sv_type <- gsub('SVTYPE=','',sv_type) ## Remove the 'SVTYPE=' prefix
sv_len <- sapply(vcf_sv_info,`[`,1) ## Select the 1st object from each INFO entry
sv_len <- gsub('SVLEN=','',sv_len) ## Remove the 'SVLEN=' prefix
sv_len <- as.integer(sv_len) ## Convert the character vector to an integer vector
df_sv <- data.frame(sv_type, sv_len)
head(df_sv)
## sv_type sv_len
## 1 TRA -13737754
## 2 DEL 1696
## 3 DEL 22644
## 4 DUP 1703
## 5 DEL 1789
## 6 DEL 49256
You may notice that translocations (TRA) have an SV length, which doesn’t really make sense. This is no problem since SV length info is discarded for translocations.
Extract the relevant mutation contexts from the variant data.
contexts <- extractSigsChord(
df.snv = df_snv_indel,
df.sv = df_sv,
sample.name='PD3905'
)
Below we can see what a part of the context matrix looks like.
contexts[,c(1:4, 97:104, 127:131),drop=F]
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T del.rep.len.1 del.rep.len.2
## PD3905 89 61 14 61 24 6
## del.rep.len.3 del.rep.len.4 del.rep.len.5 ins.rep.len.1 ins.rep.len.2
## PD3905 4 1 0 19 1
## ins.rep.len.3 DEL_0e00_1e03_bp DEL_1e03_1e04_bp DEL_1e04_1e05_bp
## PD3905 0 0 8 8
## DEL_1e05_1e06_bp DEL_1e06_1e07_bp
## PD3905 1 2
Lastly, make the HRD prediction.
chord_output <- chordPredict(contexts, verbose=F)
chord_output
## sample p_BRCA1 p_BRCA2 p_hrd hr_status hrd_type remarks_hr_status
## 1 PD3905 0.798 0.056 0.854 HR_deficient BRCA1_type
## remarks_hrd_type
## 1
Please refer back to section 1 of the tutorial for interpreting CHORD’s output.