Skip to content

Commit

Permalink
Updating docs, fixing bugs
Browse files Browse the repository at this point in the history
  • Loading branch information
suvakov committed Feb 27, 2020
1 parent 1f50ad7 commit 54f9fe9
Show file tree
Hide file tree
Showing 3 changed files with 208 additions and 34 deletions.
211 changes: 189 additions & 22 deletions GettingStarted.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# Getting Started with CNVpytor

## Installing from GitHub
## Install

### Cloning GitHub repository

```
> git clone /~https://github.com/abyzovlab/CNVpytor.git
Expand All @@ -12,73 +14,239 @@ For single user (without admin privileges) use:
> python setup.py install --user
```

## Import read depth signal
### Using pip

```
> pip install cnvpytor
> cnvpytor -download
```

Make sure that you have indexed SAM, BAM or CRAM file.
Second line will download resource files from GitHub.

## Import read depth signal

Make sure that you have indexed alignment (SAM, BAM or CRAM) file.

Initialize your CNVpytor project by running:
```
> cnvpytor -root file.pytor -rd file.bam [-chrom name1 ...] [-T ref.fa.gz]
```
where:

* file.pytor -- specifies output cnvpytor file (HDF5 file)
* name1 ... -- specifies chromosome name(s).
* file.bam -- specifies bam/sam/cram file name.
* -T ref.fa.gz -- specifies reference genome file (only for cram file without reference genome).


Chromosome names must be specified the same way as they are described in the sam/bam/cram header,
e.g., chrX or X. One can specify multiple chromosomes separated by space. If no chromosome is specified,
read mapping is extracted for all chromosomes in the sam/bam file.
Note that the pytor file is not being overwritten.

**Examples:**

```
> cnvpytor -root NA12878.pytor -chrom 1 2 3 -rd NA12878_ali.bam
```
for bam files with a header like this:
```
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
```

or
```
> cnvpytor -root NA12878.pytor -chrom chr1 chr2 chr3 -rd NA12878_ali.bam
```
for bam files with a header like this:
```
> cnvpytor -root file.pytor -rd file.bam
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
```

File file.pytor will be created and read depth data binned to 100 base pair bins will be stored

After -rd step file file.pytor is created and read depth data binned to 100 base pair bins will be stored
in _pytor_ file.

CNVpytor will detect reference genome and use internal database for GC content and 1000 genome strict mask.
Chromosome names and lengths are parsed from the input file header and used to
detect reference genome.

After instalation this works for hg19 and hg38 genomes. For other species or reference genomes you have to
[describe reference genome](examples/AddReferenceGenome.md).
Reference genome is important for for GC correction and 1000 genome strict mask filtering.
CNVpytor installation includes human genomes hg19 and hg38 with data files that provide information
about GC content and strict mask. For other species or reference genomes you have to
configure reference genome (see
[example](examples/AddReferenceGenome.md)).

To check is reference genome detected use:

```
> cnvpytor -root file.pytor -ls
```

CNVpytor will print out details about file.pytor including line that specify which reference genome is
used and are there available GC and mask data.
used and are there available GC and mask data:
```
Using reference genome: hg19 [ GC: yes, mask: yes ]
```

## Calling CNV/CNA events
Command -ls is useful if you want to check content of _pytor_ file but also date and version of cnvpytor
that created it.

## Predicting CNV regions

First we have to chose bin size. By cnvpytor design it have to be divisible by 100.
Here we will use 10 kbp and 100 kbp bins.


First hose bin size. It has to be divisible by 100. Here we will use 10 kbp and 100 kbp bins.

To calculate binned, GC corrected RD signal type:
To calculate a read depth histograms, GC correction and statistics type:
```
> cnvpytor -root file.pytor -his 10000 100000
```
CNVpytor stores


Next step is partitioning using mean-shift method:
```
> cnvpytor -root file.pytor -partition 10000 100000
```

Finally we can call CNV regions using commands:
```
> cnvpytor -root file.pytor -call 10000 > calls.10000.tsv
> cnvpytor -root file.pytor -call 100000 > calls.100000.tsv
```

Result is stored in tab separated files with following columns:
* CNV type: "deletion" or "duplication",
* chromosome name,
* start position,
* end position,
* CNV size,
* CNV level - read depth normalized to 1,
* e-val1 -- p value calculated using t-test statistics between RD statistics in the region and global,
* e-val2 -- p value from the probability of RD values within the region to be in the tails of a gaussian distribution of binned RD,
* e-val3 -- same as e-val1 but for the middle of CNV,
* e-val4 -- same as e-val2 but for the middle of CNV,
* q0 -- fraction of reads mapped with q0 quality.


## Import SNP data

### From single variant file
### From variant file

To import variant data from VCF file use following command:
```
> cnvpytor -root file.root -snp file.vcf.gz [-chrom name1 ...] [-ad AD_TAG] [-gt GT_TAG] [-noAD]
```
where:

* file.pytor -- specifies cnvpytor file,
* name1 ... -- specifies chromosome name(s),
* file.vcf -- specifies variant file name.
* -ad AD_TAG -- specifies AD tag used in vcf file (default AD)
* -gt GT_TAG -- specifies GT tag used in vcf file (default GT)
* -noAD -- ref and alt read counts will not be readed (see next section)

Chromosome names must be specified the same way as they are described in the vcf header,
e.g., chrX or X. One can specify multiple chromosomes separated by space. If no chromosome is specified,
read mapping is extracted for all chromosomes in the vcf file.

If chromosome names in variant and alignment file are different in prefix chr (e.g. in "1" and "chr1")
cnvpytor will detect it and match the names using first imported name for both signals.

### Using SNP positions from variant file and counts from alignment file

## Calculating RD histograms
In some cases it is useful to read positions of SNPs from vcf file and
extract read counts from bam file. For example if we have two samples, normal tissue and
cancer, normal can be used to call germline SNPs, while samtools mpileup procedure can be used
to calculate read counts in cancer sample at the positions of SNPs. CNVpytor have
implemented this procedure. After reading SNP positions (previous step) type:

## Calculating BAF histograms
```
> cnvpytor -root file.pytor -pileup file.bam [-T ref.fa.gz]
```
where
* file.pytor -- specifies cnvpytor file,
* file.bam -- specifies bam/sam/cram file,
* -T ref.fa.gz -- specifies reference genome file (only for cram file without reference genome).


### Calculating BAF histograms

```
> cnvpytor -root file.pytor -baf 10000 100000 [-nomask] [-useid] [-usephase]
```

## Genotyping genomic regions

Using -genotype option followed by bin_sizes you can enter region and genotype calculation
for each bin size will be performed:

```
> cnvpytor -root file.pytor -genotype 10000 100000
12:11396601-11436500
12:11396601-11436500 1.933261 1.937531
22:20999401-21300400
22:20999401-21300400 1.949186 1.957068
```

**Example:**

Genotype all called CNVs:

```
> awk '{ print $2 } END { print "exit" }' calls.10000.tsv | cnvpytor -root file.pytor -genotype 10000 100000
```

## Visualization

### Plot from command line

Chromosome wide plots:
```
> cnvpytor -root file.pytor -plot [rd BIN_SIZE] [likelihood BIN_SIZE] [baf BIN_SIZE] [snp] [-o IMAGE_FILENAME]
```
where
* rd BIN_SIZE -- plots RD signal for all chromosomes
* likelihood BIN_SIZE -- plots baf likelihood for all chromosomes
* baf BIN_SIZE -- plots baf/maf/likelihood peak position for all chromosomes
* snp -- plots baf for each snp for all chromosomes
* -o IMAGE_FILENAME -- if specified, saves plot in file instead to show on the screen

Manhattan plot:
```
> cnvpytor -root file.pytor -plot manhattan BIN_SIZE [-chrom name1 ...] [-o IMAGE_FILENAME]
```

Circular plot:
```
> cnvpytor -root file.pytor -plot circular BIN_SIZE [-chrom name1 ...] [-o IMAGE_FILENAME]
```

Plot genomic regions:
```
> cnvpytor -root file.pytor -plot regions [reg1[,| ]...] BIN_SIZE [-panels [rd] [likelihood] [baf] [snp] ...] [-o IMAGE_FILENAME]
```
where
* reg1 -- comma or space separated regions in form CHR[:START-STOP], e.g. 1:1M-20M 2 3:200k-80000010
* if regions are comma separated they will be plotted in the same subplot
* space will split regions in different subplots
* -panels -- specify which panels to plot: rd likelihood baf snp
* -o IMAGE_FILENAME -- if specified, saves plot in file instead to show on the screen


### Plot from interactive mode

The best way to visualize cnvpytor results is interactive mode. Enter interactive mode by typing:
```
cnvpytor -root file.pytor -view BIN_SIZE
```

## Plotting
There is tab completion and help similar to man pages. Type double tab or help to start.

## Visualize CNVpytor data inside [JBrowse](/~https://github.com/GMOD/jbrowse)
### Visualize CNVpytor data inside [JBrowse](/~https://github.com/GMOD/jbrowse)
#### JBrowse version and plugins
JBrowse version: /~https://github.com/GMOD/jbrowse/archive/1.16.6-release.tar.gz

Expand Down Expand Up @@ -139,4 +307,3 @@ For SNP data, it exports Binned BAF, Likelihood of the binned BAF signals. These

cnvpytor does the segmentation for all of the above data based on the user provided bin size. The multiscalebigwig provides the option to show the data based on the visualized area on the reference genome, which means if a user visualizes a small region of the genome it shows small bin data and vice versa.

## Other
21 changes: 13 additions & 8 deletions cnvpytor/root.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ class Root: main CNVpytor class
from .viewer import anim_plot_likelihood, anim_plot_rd
import numpy as np
import logging
import os.path

_logger = logging.getLogger("cnvpytor.root")

Expand Down Expand Up @@ -233,6 +234,8 @@ def _read_vcf(self, vcf_file, chroms, sample='', use_index=False, no_counts=Fals
vcff = Vcf(vcf_file)
chrs = [c for c in vcff.get_chromosomes() if len(chroms) == 0 or c in chroms]

use_index = use_index and os.path.exists(vcf_file + ".tbi")

def save_data(chr, pos, ref, alt, nref, nalt, gt, flag, qual):
if (len(chroms) == 0 or chr in chroms) and (not pos is None) and (len(pos) > 0):
self.io.save_snp(chr, pos, ref, alt, nref, nalt, gt, flag, qual, callset=callset,
Expand Down Expand Up @@ -293,7 +296,8 @@ def rd(self, bamfiles, chroms=[], reference_filename=False):
self.io_gc = IO(Genome.reference_genomes[rg_name]["gc_file"])
self.rd_stat()

def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_tag="GT", callset=None, use_index=False):
def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_tag="GT", callset=None,
use_index=True):
""" Read SNP data from variant file(s) and store in .cnvpytor file
Parameters
Expand All @@ -319,7 +323,7 @@ def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_
"""
for vcf_file in vcf_files:
self._read_vcf(vcf_file, chroms, sample, no_counts=no_counts, ad_tag=ad_tag, gt_tag=gt_tag,
callset=callset, use_index=use_index)
callset=callset, use_index=use_index)
self.io.add_meta_attribute("VCF", vcf_file)

def rd_from_vcf(self, vcf_file, chroms=[], sample='', ad_tag="AD", dp_tag="DP", use_index=False):
Expand Down Expand Up @@ -1046,7 +1050,8 @@ def calc_grad_par(ks):

self.io.create_signal(c, bin_size, "RD partition", levels, flags=flag_rd)

def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_mask=False, genome_size=2.9e9, genome_cnv_fraction=0.01):
def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_mask=False, genome_size=2.9e9,
genome_cnv_fraction=0.01):
"""
CNV caller based on the segmented RD signal.
Expand Down Expand Up @@ -1090,7 +1095,7 @@ def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_ma
rd_mask_chromosomes[rd_name] = c

for bin_size in bin_sizes:
ret[bin_size]=[]
ret[bin_size] = []
for c in self.io.rd_chromosomes():
if (c in rd_gc_chromosomes or not use_gc_corr) and (c in rd_mask_chromosomes or not use_mask) and (
len(chroms) == 0 or (c in chroms)):
Expand Down Expand Up @@ -1856,7 +1861,7 @@ def call_2d(self, bin_sizes, chroms=[], use_gc_corr=True, rd_use_mask=False, snp
if (segments[j][0] - segments[i][-1]) < max_distance * (
len(segments[i]) + len(segments[j])) and \
normal_overlap(level[i], error[i], level[j], error[j]) * likelihood_overlap(
likelihood[i], likelihood[j]) == maxo:
likelihood[i], likelihood[j]) == maxo:
nl, ne = normal_merge(level[i], error[i], level[j], error[j])
nlh = likelihood[i] * likelihood[j]
level[i] = nl
Expand Down Expand Up @@ -1898,9 +1903,9 @@ def call_2d(self, bin_sizes, chroms=[], use_gc_corr=True, rd_use_mask=False, snp
rd_p = t_test_1_sample(mean, level[i], error[i], len(segments[i]))
baf_mean, baf_p = likelihood_baf_pval(likelihood[i])
print(c + ":" + str(segments[i][0] * bin_size + 1) + "-" + str(
segments[i][-1] * bin_size + bin_size),
(segments[i][-1] - segments[i][0] + 1) * bin_size, bin_size, len(segments[i]),
rd_mean, rd_p, baf_mean, baf_p)
segments[i][-1] * bin_size + bin_size),
(segments[i][-1] - segments[i][0] + 1) * bin_size, bin_size, len(segments[i]),
rd_mean, rd_p, baf_mean, baf_p)

self.io.create_signal(c, bin_size, "RD mosaic segments",
data=segments_code(segments), flags=flag_rd)
Expand Down
10 changes: 6 additions & 4 deletions examples/AddReferenceGenome.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ For GC correction and 1000 genome strict mask filtering CNVpytor uses informatio
related to the reference genome. With installation two reference genomes are
available: hg19 (GRCh37) and hg28 (GRCh38).

If you want to use other reference genome for human or other species first we have
If we want to use CNVpytor for other species or different reference genome for human first we have
to create GC and mask file (optional).

In this example we will configure mouse reference genome MGSCv37.
Expand Down Expand Up @@ -34,7 +34,7 @@ apart from the filtering step.
You may also generate your own mask file by creating fasta file that contains character "P" if corresponding
base pair passes the filter and any character different than "P" if not.

Now, we will create example_ref_genome_conf.py file containing following:
Now, we will create _example_ref_genome_conf.py_ file containing list of chromosomes and chromosome lengths:

```
import_reference_genomes = {
Expand All @@ -56,7 +56,9 @@ import_reference_genomes = {
}
```

Last line can be skipped, if there is no mask file.
Last line can be skipped, if there is no mask file. Letter next to chromosome length denote type of a chromosome:
A - autosome, S - sex chromosome, M - mitochondria.


To use CNVpytor with new reference genome us -conf option in each cnvpytor command, e.g.
```
Expand All @@ -71,7 +73,7 @@ cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9
```

To avoid typing "-conf REL_PATH/example_ref_genome_conf.py" each time you run cnvpytor,
you can create an alias. However, we would like to encourage you to send us configuration,
you can create an bash alias. However, we would like to encourage you to send us configuration,
gc and mask file and we would be glad to include it into the CNVpytor code. Or, even better,
fork the repository on GitHub, add configuration in cnvpytor/genome.py, data files in cnvpytor/data
and create a pull request.
Expand Down

0 comments on commit 54f9fe9

Please sign in to comment.