##############
CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping
CNV-PG is an open-source application written in Python. It has two parts: CNV predicting (CNV-P) and CNV genotyping (CNV-G). For CNV-P, we trained a separate classifier for each CNV caller on a subset of validated CNVs from that caller; the classifier is used to identify true CNVs. CNV-G is a genotyper that is compatible with existing CNV callers and generates a uniform set of high-confidence genotypes.
#############
Prerequisites:
python3(https://www.python.org/)
sklearn(https://pypi.org/project/sklearn/)
matplotlib(https://pypi.org/project/matplotlib/)
pysam(https://pypi.org/project/pysam/)
pandas(https://pypi.org/project/pandas/)
numpy(https://pypi.org/project/numpy/)
Getting started
1. CNV-P
Running:
Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh -h" to see the usage information. The following options are required:
-i BAMFILE, the path to the BAM file (commonly generated by bwa)
-b BASFILE, the path to the BAS file (provided by the user)
-v VCFFILE, the path to the VCF file (the output of a CNV caller [breakdancer, Delly, Lumpy, Manta or Pindel])
-p PYTHON, the path to the python executable
-o OUTDIR, the output directory for the results
-n SAMPLENAME, the prefix of the output files
-c CODE_PATH, the path to the CNV-P code ($HOME/CNV-PG/CNV-P/)
-s CNVCALLER, the name of the CNV caller [breakdancer, Delly, Lumpy, Manta or Pindel]
The "BASFILE" in the above command must be created separately by the user, in the following format.
For example:
bam_filename md5 study sample platform library readgroup #_total_bases #_mapped_bases #_total_reads #_mapped_reads #_mapped_reads_paired_in_sequencing #_mapped_reads_properly_paired %_of_mismatched_bases average_quality_of_mapped_bases mean_insert_size insert_size_sd median_insert_size insert_size_median_absolute_deviation #_duplicate_reads coverage
HG002 - HG002 HG002 ILLUMINA HG002 HG002 - - - - - - - - 569 95 568.177944 163.819637 - 35.41
The columns "sample", "mean_insert_size", "insert_size_sd" and "coverage" are required for the feature extraction step.
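As a minimal sketch (not part of CNV-PG), the Python below builds a one-row BAS file with the header shown above. It fills in only the four required columns and puts "-" everywhere else. It assumes the file is tab-delimited and that "-" is accepted as a placeholder for unused columns, as in the example row; the function name bas_text and the constant BAS_COLUMNS are hypothetical.

```python
import csv
import io

# Column order copied from the BAS header shown above.
BAS_COLUMNS = [
    "bam_filename", "md5", "study", "sample", "platform", "library",
    "readgroup", "#_total_bases", "#_mapped_bases", "#_total_reads",
    "#_mapped_reads", "#_mapped_reads_paired_in_sequencing",
    "#_mapped_reads_properly_paired", "%_of_mismatched_bases",
    "average_quality_of_mapped_bases", "mean_insert_size", "insert_size_sd",
    "median_insert_size", "insert_size_median_absolute_deviation",
    "#_duplicate_reads", "coverage",
]


def bas_text(sample, mean_insert_size, insert_size_sd, coverage):
    """Return header + one data row; unused columns are set to "-"."""
    row = {name: "-" for name in BAS_COLUMNS}
    row["sample"] = str(sample)
    row["mean_insert_size"] = str(mean_insert_size)
    row["insert_size_sd"] = str(insert_size_sd)
    row["coverage"] = str(coverage)

    buffer = io.StringIO()
    # Assumption: tab-delimited with Unix line endings.
    writer = csv.DictWriter(
        buffer, fieldnames=BAS_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Values taken from the HG002 example row above.
    print(bas_text("HG002", 569, 95, 35.41), end="")
```

Redirecting the printed text to a file gives something that can be passed as BASFILE, provided the assumptions above hold for your copy of CNV-P.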
Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh" to classify candidate CNVs:
$HOME/CNV-PG/CNV-P/CNV-P_predict.sh \
-i $HOME/CNV-PG/test_data/BAMFILE \
-b $HOME/CNV-PG/test_data/BASFILE \
-e $HOME/CNV-PG/test_data/VCFFILE \
-p $HOME/python/bin/python3 \
-o $HOME/OUTDIR \
-n SAMPLENAME \
-c $HOME/CNV-PG/CNV-P \
-s CNVCALLER
Outputs:
CNVCALLER.SAMPLENAME.fil.mer.bed # the candidate CNVs extracted from the VCF file
CNVCALLER.SAMPLENAME.feature.txt # the feature matrix
CNVCALLER.SAMPLENAME.pre.prop.txt # the prediction results, giving the category and probability for each CNV
2. CNV-G
Running:
As with CNV-P, run "$HOME/CNV-PG/CNV-P/CNV-G_predict.sh -h" to see the usage information.
$HOME/CNV-PG/CNV-P/CNV-G_predict.sh \
-i $HOME/CNV-PG/test_data/BAMFILE \
-b $HOME/CNV-PG/test_data/BASFILE \
-e $HOME/CNV-PG/test_data/BEDFILE \
-p $HOME/python/bin/python3 \
-o $HOME/OUTDIR \
-n SAMPLENAME \
-c $HOME/CNV-PG/CNV-P
The "BEDFILE" should have 5 columns: chromosome, start, end, size of CNV, type of CNV (DUP: 1, DEL: 0). It can also be generated by CNV-P (for example, CNVCALLER.SAMPLENAME.fil.mer.bed).
For example:
chr1 10482480 10483779 1300 0
chr1 16151940 16155439 3500 1
chr1 35101421 35111976 10556 0
chr1 39998214 40001244 3031 1
chr1 58743909 58744822 914 0
chr1 60048636 60049661 1026 0
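A quick check of a BEDFILE before running CNV-G could look like the sketch below (not part of CNV-PG). It parses one whitespace-separated line into the five columns described above and maps the type code to DEL or DUP. The relation size = end - start + 1 is inferred from the example rows, not stated by CNV-PG, so treat size_matches as an assumption; the function names are hypothetical.

```python
# Type codes as described above: DUP is 1, DEL is 0.
CNV_TYPES = {0: "DEL", 1: "DUP"}


def parse_bed_line(line):
    """Parse one 5-column BED line into a dict, raising ValueError if malformed."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom, start, end, size, type_code = fields
    start, end, size, type_code = int(start), int(end), int(size), int(type_code)
    if type_code not in CNV_TYPES:
        raise ValueError(f"type must be 0 (DEL) or 1 (DUP), got {type_code}")
    if end < start:
        raise ValueError(f"end {end} is smaller than start {start}")
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "size": size,
        "type": CNV_TYPES[type_code],
    }


def size_matches(record):
    """Assumption inferred from the example rows: size == end - start + 1."""
    return record["size"] == record["end"] - record["start"] + 1


if __name__ == "__main__":
    record = parse_bed_line("chr1 10482480 10483779 1300 0")
    print(record["type"], size_matches(record))
```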
Outputs:
SAMPLENAME.feature.txt # the feature matrix
SAMPLENAME.pre.prop.txt # the genotype and probability for each CNV
Please help us improve CNV-PG by reporting bugs or ideas on how to make things better.