readme.txt

##############
CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping
CNV-PG is an open-source application written in Python. It consists of two parts: CNV predicting (CNV-P) and CNV genotyping (CNV-G). For CNV-P, we trained a separate classifier for each CNV caller on a subset of validated CNVs from that caller; the classifier is used to identify true CNVs. CNV-G is a genotyper that is compatible with existing CNV callers and generates a uniform set of high-confidence genotypes.

#############
Prerequisites:
python3(https://www.python.org/)
scikit-learn(https://pypi.org/project/scikit-learn/)
matplotlib(https://pypi.org/project/matplotlib/)
pysam(https://pypi.org/project/pysam/)
pandas(https://pypi.org/project/pandas/)
numpy(https://pypi.org/project/numpy/)
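
A quick way to confirm that the prerequisites above are installed is to check whether each module can be found by the chosen Python interpreter. The following is a minimal sketch and not part of CNV-PG: the function name missing_modules is hypothetical, and the list assumes the usual import names (scikit-learn is imported as "sklearn").

```python
import importlib.util

# Import names of the prerequisites listed above
# (assumption: scikit-learn is imported as "sklearn").
REQUIRED = ["sklearn", "matplotlib", "pysam", "pandas", "numpy"]


def missing_modules(names=REQUIRED):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

Run it with the same interpreter that will later be passed to the -p option, so that the check reflects the environment CNV-PG will actually use.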


Getting started
1. CNV-P
Running:
Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh -h" to see the usage information. The following options are required:
	-i BAMFILE, path to the BAM file (commonly generated by bwa)
	-b BASFILE, path to the BAS file (provided by the user)
	-v VCFFILE, path to the VCF file (output of a CNV caller: breakdancer, Delly, Lumpy, Manta or Pindel)
	-p PYTHON, path to the python executable
	-o OUTDIR, output directory for the results
	-n SAMPLENAME, prefix of the output files
	-c CODE_PATH, path to the CNV-P code ($HOME/CNV-PG/CNV-P/)
	-s CNVCALLER, name of the CNV caller (breakdancer, Delly, Lumpy, Manta or Pindel)
	
In the above command, "BASFILE" must be created separately by the user. An example of its format:
bam_filename    md5     study   sample  platform        library readgroup       #_total_bases   #_mapped_bases  #_total_reads   #_mapped_reads  #_mapped_reads_paired_in_sequencing     #_mapped_reads_properly_paired  %_of_mismatched_bases   average_quality_of_mapped_bases mean_insert_size        insert_size_sd median_insert_size       insert_size_median_absolute_deviation   #_duplicate_reads       coverage
HG002   -       HG002   HG002   ILLUMINA        HG002   HG002   -       -       -       -       -       -       -       -       569     95      568.177944      163.819637      -       35.41

The columns "sample", "mean_insert_size", "insert_size_sd" and "coverage" are required for the feature extraction step.
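
A BAS file with this layout can be written by hand or generated with a few lines of Python. The sketch below is an illustration only and not part of CNV-PG: the function name write_bas is hypothetical, the column names are copied from the example header, and it assumes that the file is tab-separated and that columns not needed for feature extraction may be filled with "-", as most of them are in the example row.

```python
import csv

# Column names copied from the example header above.
BAS_COLUMNS = [
    "bam_filename", "md5", "study", "sample", "platform", "library", "readgroup",
    "#_total_bases", "#_mapped_bases", "#_total_reads", "#_mapped_reads",
    "#_mapped_reads_paired_in_sequencing", "#_mapped_reads_properly_paired",
    "%_of_mismatched_bases", "average_quality_of_mapped_bases",
    "mean_insert_size", "insert_size_sd", "median_insert_size",
    "insert_size_median_absolute_deviation", "#_duplicate_reads", "coverage",
]


def write_bas(path, sample, mean_insert_size, insert_size_sd, coverage):
    """Write a one-sample BAS file; every other column is set to "-"."""
    # Assumption: "-" is an acceptable placeholder for unused columns.
    row = dict.fromkeys(BAS_COLUMNS, "-")
    row.update({
        "sample": sample,
        "mean_insert_size": mean_insert_size,
        "insert_size_sd": insert_size_sd,
        "coverage": coverage,
    })
    with open(path, "w", newline="") as handle:
        # Assumption: columns are separated by tabs.
        writer = csv.DictWriter(handle, fieldnames=BAS_COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    # Values taken from the HG002 example row above.
    write_bas("HG002.bas", "HG002", 569, 95, 35.41)
```

The insert size statistics and coverage have to be computed from the BAM file beforehand; this sketch only writes them out.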

Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh" to classify candidate CNVs:
$HOME/CNV-PG/CNV-P/CNV-P_predict.sh \
        -i $HOME/CNV-PG/test_data/BAMFILE \
        -b $HOME/CNV-PG/test_data/BASFILE \
        -e $HOME/CNV-PG/test_data/VCFFILE \
        -p $HOME/python/bin/python3 \
        -o $HOME/OUTDIR \
        -n SAMPLENAME \
        -c $HOME/CNV-PG/CNV-P \
        -s CNVCALLER 

Outputs:
CNVCALLER.SAMPLENAME.fil.mer.bed # candidate CNVs extracted from the VCF file
CNVCALLER.SAMPLENAME.feature.txt # the feature matrix
CNVCALLER.SAMPLENAME.pre.prop.txt # the prediction results, giving the category and probability for each CNV


2. CNV-G
Running:
Similar to CNV-P, run "$HOME/CNV-PG/CNV-P/CNV-G_predict.sh -h" to see the usage information.

$HOME/CNV-PG/CNV-P/CNV-G_predict.sh \
        -i $HOME/CNV-PG/test_data/BAMFILE \
        -b $HOME/CNV-PG/test_data/BASFILE \
        -e $HOME/CNV-PG/test_data/BEDFILE \
        -p $HOME/python/bin/python3 \
        -o $HOME/OUTDIR \
        -n SAMPLENAME \
        -c $HOME/CNV-PG/CNV-P

The "BEDFILE" should have 5 columns: chromosome, start, end, size of CNV, type of CNV (DUP: 1, DEL: 0). It can also be generated by CNV-P (for example, CNVCALLER.SAMPLENAME.fil.mer.bed).

for example:
chr1    10482480        10483779        1300    0
chr1    16151940        16155439        3500    1
chr1    35101421        35111976        10556   0
chr1    39998214        40001244        3031    1
chr1    58743909        58744822        914     0
chr1    60048636        60049661        1026    0
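
Before running CNV-G it can help to check that a hand-made BEDFILE follows this 5-column layout. The sketch below is an illustration only and not part of CNV-PG: the function names parse_bed_line and read_bed are hypothetical. The check size == end - start + 1 is an assumption inferred from the example rows above (for instance 10483779 - 10482480 + 1 = 1300), not a documented rule; remove it if your files use a different convention.

```python
def parse_bed_line(line):
    """Parse one 5-column line into (chrom, start, end, size, cnv_type)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom = fields[0]
    start, end, size, cnv_type = (int(value) for value in fields[1:])
    if cnv_type not in (0, 1):
        raise ValueError(f"CNV type must be 0 (DEL) or 1 (DUP): {line!r}")
    # Assumption inferred from the example rows: size == end - start + 1.
    if size != end - start + 1:
        raise ValueError(f"size does not match end - start + 1: {line!r}")
    return chrom, start, end, size, cnv_type


def read_bed(path):
    """Return the parsed records of a BEDFILE, skipping blank lines."""
    with open(path) as handle:
        return [parse_bed_line(line) for line in handle if line.strip()]
```

Calling read_bed on a file raises ValueError at the first malformed line, which makes column shifts or a wrong type code easy to spot.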

Outputs:
SAMPLENAME.feature.txt  # the feature matrix
SAMPLENAME.pre.prop.txt # the genotype and probability for each CNV

Please help us improve CNV-PG by reporting bugs or ideas on how to make things better.