diff --git a/GettingStarted.md b/GettingStarted.md
index 9900aa7..99e26ac 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -28,8 +28,8 @@ in _pytor_ file.
 
 CNVpytor will detect reference genome and use internal database for GC content and 1000 genome strict mask.
 
-This works for hg19 and hg38 genomes. For other species or reference genomes you have to
-[specify reference genome](examples/AddReferenceGenome.md).
+After installation, this works for hg19 and hg38 genomes. For other species or reference genomes you have to
+[describe reference genome](examples/AddReferenceGenome.md).
 
 To check is reference genome detected use:
 
@@ -46,6 +46,7 @@ Using reference genome: hg19 [ GC: yes, mask: yes ]
 
 
 
+
 First hose bin size. It has to be divisible by 100. Here we will use 10 kbp and 100 kbp bins.
 
 To calculate binned, GC corrected RD signal type:
diff --git a/cnvpytor/genome.py b/cnvpytor/genome.py
index b16ac70..d59fcfc 100644
--- a/cnvpytor/genome.py
+++ b/cnvpytor/genome.py
@@ -282,8 +282,7 @@ def load_reference_genomes(cls, filename):
 
         """
         _logger.info("Reading configuration file '%s'." % filename)
-        import_reference_genomes = {}
-        exec(open(filename).read())
+        exec(open(filename).read(),globals())
         for g in import_reference_genomes:
             _logger.info("Importing reference genome data: '%s'." % g)
             cls.reference_genomes[g] = import_reference_genomes[g]
diff --git a/examples/AddReferenceGenome.md b/examples/AddReferenceGenome.md
new file mode 100644
index 0000000..8123ff1
--- /dev/null
+++ b/examples/AddReferenceGenome.md
@@ -0,0 +1,79 @@
+# Configuring reference genome
+
+For GC correction and 1000 genome strict mask filtering CNVpytor uses information
+related to the reference genome. After installation, two reference genomes are
+available: hg19 (GRCh37) and hg38 (GRCh38).
+
+If you want to use another reference genome, for human or another species, you first have
+to create a GC file and, optionally, a mask file.
+
+In this example we will configure the mouse reference genome MGSCv37.
+
+To create the GC file we need the sequence of the reference genome in a fasta.gz file:
+
+```
+> cnvpytor -root MGSCv37_gc_file.pytor -gc ~/hg19/mouse.fasta.gz -make_gc_file
+```
+
+This command will produce the _MGSCv37_gc_file.pytor_ file, which contains information about
+GC content in 100-base-pair bins.
+
+For reference genomes where we have a strict mask in the same format as the 1000 Genomes Project
+[strict mask](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/),
+we can create the mask file using the command:
+
+```
+> cnvpytor -root MGSCv37_mask_file.pytor -mask ~/hg19/mouse.strict_mask.whole_genome.fasta.gz -make_mask_file
+```
+
+If we do not have a mask file, we can skip this step. The mask file contains information about
+regions of the genome that are more accessible to next generation sequencing methods
+using short reads. CNVpytor uses P-marked positions to filter SNPs and the read depth signal.
+If the reference genome configuration does not contain a mask file, CNVpytor will still be fully functional,
+apart from the filtering step.
+You may also generate your own mask file by creating a fasta file that contains the character "P" if the corresponding
+base pair passes the filter and any character other than "P" if not.
+
+Now, we will create an example_ref_genome_conf.py file containing the following:
+
+```
+import_reference_genomes = {
+    "mm9": {
+        "name": "MGSCv37",
+        "species": "Mus musculus",
+        "chromosomes": OrderedDict(
+            [("chr1", (197195432, "A")), ("chr2", (181748087, "A")), ("chr3", (159599783, "A")),
+            ("chr4", (155630120, "A")), ("chr5", (152537259, "A")), ("chr6", (149517037, "A")),
+            ("chr7", (152524553, "A")), ("chr8", (131738871, "A")), ("chr9", (124076172, "A")),
+            ("chr10", (129993255, "A")), ("chr11", (121843856, "A")), ("chr12", (121257530, "A")),
+            ("chr13", (120284312, "A")), ("chr14", (125194864, "A")), ("chr15", (103494974, "A")),
+            ("chr16", (98319150, "A")), ("chr17", (95272651, "A")), ("chr18", (90772031, "A")),
+            ("chr19", (61342430, "A")), ("chrX", (166650296, "S")), ("chrY", (15902555, "S")),
+            ("chrM", (16299, "M"))]),
+        "gc_file":"/..PATH../MGSCv37_gc_file.pytor",
+        "mask_file": "/..PATH../MGSCv37_mask_file.pytor"
+    }
+}
+```
+
+The "mask_file" line can be omitted if there is no mask file.
+
+To use CNVpytor with the new reference genome, use the -conf option in each cnvpytor command, e.g.:
+```
+cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rd file.bam
+```
+
+CNVpytor will use chromosome lengths from the alignment file to detect the reference genome.
+However, if you configured the reference genome after you had already run the -rd step, you
+can assign the reference genome using -rg:
+```
+cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9
+```
+
+To avoid typing "-conf REL_PATH/example_ref_genome_conf.py" each time you run cnvpytor,
+you can create an alias. However, we would like to encourage you to send us the configuration,
+GC and mask files, and we would be glad to include them in the CNVpytor code. Or, even better,
+fork the repository on GitHub, add the configuration in cnvpytor/genome.py, data files in cnvpytor/data
+and create a pull request.
+
+
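The `chromosomes` entry in example_ref_genome_conf.py pairs each chromosome name with its length and a one-letter label, and typing it by hand is error-prone. Below is a minimal sketch of how that mapping could be generated from a FASTA index (`.fai`), whose first two tab-separated columns are the sequence name and its length. The helper `chromosomes_from_fai` is hypothetical and not part of CNVpytor. The label rule ("S" for chrX and chrY, "M" for chrM, "A" for everything else) is an assumption read off the mm9 example in the diff, and the offsets in the sample lines are illustrative values only.

```python
from collections import OrderedDict


def chromosomes_from_fai(lines, sex=("chrX", "chrY"), mito=("chrM",)):
    """Build a `chromosomes` mapping from the lines of a FASTA index (.fai).

    Hypothetical helper, not part of CNVpytor. Only the first two
    tab-separated columns (sequence name, length) are used. The labels
    follow the mm9 example: "S" for names in `sex`, "M" for names in
    `mito`, "A" for all remaining sequences (an assumption).
    """
    chromosomes = OrderedDict()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if name in sex:
            label = "S"
        elif name in mito:
            label = "M"
        else:
            label = "A"
        chromosomes[name] = (length, label)
    return chromosomes


if __name__ == "__main__":
    # Lengths are taken from the mm9 example; the remaining columns
    # (offset, bases per line, bytes per line) are made up for illustration.
    sample = [
        "chr19\t61342430\t7\t50\t51\n",
        "chrX\t166650296\t62569293\t50\t51\n",
        "chrM\t16299\t232552602\t50\t51\n",
    ]
    for name, (length, label) in chromosomes_from_fai(sample).items():
        # Print entries in the tuple form used by the configuration file.
        print('("%s", (%d, "%s")),' % (name, length, label))
```

In practice the lines would come from an open `.fai` file for your reference, and the printed tuples would be pasted into the `OrderedDict([...])` call of the configuration. For assemblies that name sex or mitochondrial chromosomes differently, pass your own `sex` and `mito` name sets.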