Home
This page will explain in detail how to run PanGenie on different datasets.
THIS PAGE IS CURRENTLY UNDER CONSTRUCTION
PanGenie expects a directed and acyclic pangenome graph as input (-v option).
This graph is represented in terms of a VCF file that needs to have certain properties:
- multi-sample - it needs to contain haplotype information of at least one known sample
- fully-phased - haplotype information of the known panel samples is represented by phased genotypes, and each sample must be phased in one single block (i.e. from start to end).
- non-overlapping variants - the VCF represents a pangenome graph. Therefore, overlapping variation must be represented in a single, multi-allelic variant record.
Note especially the third property listed above. See the figure below for an illustration of how overlapping variant alleles need to be represented in the input VCF provided to PanGenie.
We typically generate such VCFs from haplotype-resolved assemblies (see below). However, any VCF with the properties listed above can be used as input. Note again that the haplotypes must be phased into a single block. Phased VCFs generated by read-based phasing tools like WhatsHap, which usually split each sample into several phase blocks, are therefore not suitable!
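The properties listed above can be sanity-checked before running PanGenie. The following is a minimal sketch, not part of PanGenie: the function name `check_graph_vcf` is hypothetical, it assumes an uncompressed VCF that is sorted by position, and it treats more than one distinct `PS` (phase set) value per sample as a hint that the sample is phased in several blocks.

```python
def check_graph_vcf(lines):
    """Return a list of problems found in the lines of an uncompressed VCF.

    Hypothetical helper: checks sample columns, phased genotypes,
    phase blocks (via the PS tag) and overlapping records.
    """
    problems = []
    sample_names = []
    last_end = {}   # chromosome -> largest end position seen so far
    ps_seen = {}    # sample index -> set of distinct PS values
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]
            if not sample_names:
                problems.append("no sample columns: at least one known sample is required")
            continue
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        where = f"{chrom}:{pos}"
        # Non-overlapping variants: a record must start after the previous one ends.
        if chrom in last_end and pos <= last_end[chrom]:
            problems.append(f"{where}: overlaps previous record ending at {last_end[chrom]}")
        last_end[chrom] = max(last_end.get(chrom, 0), pos + len(ref) - 1)
        if len(fields) < 10:
            continue
        keys = fields[8].split(":")
        if "GT" not in keys:
            problems.append(f"{where}: FORMAT has no GT field")
            continue
        gt_index = keys.index("GT")
        ps_index = keys.index("PS") if "PS" in keys else None
        for idx, sample in enumerate(fields[9:]):
            parts = sample.split(":")
            # Fully-phased: genotypes must use '|' and never '/'.
            if "/" in parts[gt_index]:
                problems.append(f"{where}: unphased genotype {parts[gt_index]}")
            if ps_index is not None and ps_index < len(parts) and parts[ps_index] != ".":
                ps_seen.setdefault(idx, set()).add(parts[ps_index])
    for idx, values in sorted(ps_seen.items()):
        if len(values) > 1:
            problems.append(f"sample {sample_names[idx]}: {len(values)} phase blocks (PS values)")
    return problems
```

An empty result only means that none of these simple checks failed; it does not prove that the VCF is a valid PanGenie input. The function can be applied to an open file, for example `check_graph_vcf(open("graph.vcf"))`, where `graph.vcf` is a placeholder name.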
Any VCF following the format described in the previous section can be used as input to PanGenie in order to genotype bubbles in the pangenome graph. However, in many cases a bubble in the graph does not represent a single variant but rather a combination of many individual variants present in the haplotypes in the corresponding genomic region (see Figure above). In other words, bubbles often contain many nested variant alleles. In order to derive genotypes for variant alleles nested inside of graph bubbles, we typically produce PanGenie input VCFs containing special annotations that encode a decomposition of graph bubbles into the individual variant alleles they are composed of. After genotyping the bubbles with PanGenie, these annotations can be used to translate bubble genotypes to genotypes for these nested alleles. For this purpose, our pipelines producing PanGenie-ready VCFs always produce two VCF files: a multi-allelic graph-VCF representing bubbles in the graph (PanGenie input VCF) and a bi-allelic callset-VCF defining all the individual variant alleles nested inside of the graph bubbles.
In the multi-allelic graph-VCF (top in Figure above), each record represents a bubble in the graph and lists all paths covered by at least one haplotype as the alternative allele sequences. Each such alternative allele is annotated by a sequence of variant IDs (separated by a colon) in the INFO field, indicating which individual variant alleles it is composed of (since bubbles are usually composed of many individual variant alleles). The bi-allelic callset-VCF (bottom in Figure above) contains one separate record for each such variant ID. Both VCFs describe the same genetic variation, but represent it in different ways. The graph-VCFs are used as input to PanGenie for genotyping. Using the annotations, the resulting bubble genotypes can be translated into genotypes for each individual variant ID using the callset-VCF. This makes it possible to properly analyze variant alleles contained inside of bubbles. How we produce these annotations depends on the data. In the following sections, we provide precomputed VCFs as well as pipelines that can be used to generate them.
Note that this decomposition procedure is useful in many cases, but PanGenie can still be run on VCFs that do not contain these special annotations. It is just an additional downstream analysis step that helps to analyze variation encoded inside of bubbles.
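To illustrate the idea of translating a bubble genotype into genotypes of the nested variant alleles, here is a simplified sketch. It is not the actual conversion script. The function name `decompose_bubble_genotype` is hypothetical, and it assumes that the annotation holds one colon-separated group of variant IDs per alternative allele, in ALT order, with groups separated by commas.

```python
def decompose_bubble_genotype(id_annotation, genotype):
    """Translate one bubble genotype into genotypes per nested variant ID.

    id_annotation: e.g. "v1:v2,v2:v3" (assumed layout: one colon-separated
                   ID group per ALT allele, groups separated by commas).
    genotype:      bubble genotype such as "1/2", "0|1" or "./.".
    Returns a dict mapping each variant ID to a bi-allelic genotype string.
    """
    allele_ids = [set(group.split(":")) for group in id_annotation.split(",")]
    all_ids = sorted(set().union(*allele_ids))
    sep = "|" if "|" in genotype else "/"
    result = {}
    for variant_id in all_ids:
        calls = []
        for allele in genotype.split(sep):
            if allele == ".":
                calls.append(".")   # missing allele stays missing
            elif allele == "0":
                calls.append("0")   # reference path carries no variant allele
            else:
                # ALT allele k carries the variant if its ID group lists it.
                carried = variant_id in allele_ids[int(allele) - 1]
                calls.append("1" if carried else "0")
        result[variant_id] = sep.join(calls)
    return result
```

For a bubble with two alternative paths annotated as `v1:v2,v2:v3` (made-up IDs) and a bubble genotype `1/2`, this gives `1/0` for `v1`, `1/1` for `v2` and `0/1` for `v3`: the variant `v2` is present on both paths, so it becomes homozygous.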
For VCFs following the format described in this section, these commands can be used for genotyping:
```bash
# run PanGenie (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf"
PanGenie -i <input-reads> -v <graph-vcf> -r <reference-genome> -o pangenie -j 24 -t 24
# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
```
The first step is running PanGenie, and the second step uses the annotations in the VCFs to translate bubble genotypes to genotypes for all variant alleles. Thus, the final VCF contains exactly the same records as the callset-VCF, just with genotypes added for all these variants.
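Since the final VCF is expected to contain the same records as the callset-VCF, this can be verified after the conversion. The sketch below is one possible check under stated assumptions: the helper names `record_keys` and `same_records` are hypothetical, both files are assumed to be uncompressed, and records are compared by chromosome, position, REF and ALT regardless of their order in the file.

```python
def record_keys(lines):
    """Collect (chrom, pos, ref, alt) for every record of an uncompressed VCF."""
    keys = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.append((chrom, int(pos), ref, alt))
    return keys


def same_records(callset_lines, genotyped_lines):
    """True if both VCFs describe exactly the same set of variant records."""
    return sorted(record_keys(callset_lines)) == sorted(record_keys(genotyped_lines))
```

With real files this could be called as `same_records(open("callset.vcf"), open("pangenie_genotyping_biallelic.vcf"))`, where `callset.vcf` stands in for the callset-VCF that was passed to the conversion step.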
We have written a pipeline that calls variants from haplotype-resolved assemblies of human samples and generates a graph-VCF to be used as input to PanGenie. This pipeline is available here: https://bitbucket.org/jana_ebler/vcf-merging/src/master/pangenome-graph-from-assemblies/. The pipeline produces two output VCFs: a multi-allelic graph-VCF and a bi-allelic callset-VCF, formatted as described in section "Genotyping variation nested inside of bubbles" above.
For the HPRC Minigraph-Cactus graph published in https://doi.org/10.1038/s41586-023-05896-x, we have generated PanGenie-ready VCFs containing haplotype data from 44 human samples (88 haplotypes). VCFs were generated based on GRCh38 and CHM13. They are available at:
Dataset | PanGenie input VCF | Callset VCF |
---|---|---|
HPRC-GRCh38 (88 haplotypes) | graph-VCF | callset-VCF |
HPRC-CHM13 (88 haplotypes) | graph-VCF | callset-VCF |
For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column) representing the pangenome graph that is to be used as input to PanGenie, and a bi-allelic callset-VCF (third column) describing all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the same format as described in section "Decomposing bubbles".
You can also generate your own PanGenie-ready VCFs from a Minigraph-Cactus graph. To do so, you need the raw VCFs produced from the graph using vg deconstruct, as well as the GFA file of the graph itself. For the HPRC MC-graph, these VCFs are available from https://github.com/human-pangenomics/hpp_pangenome_resources/tree/main ("Raw VCF" in section "Minigraph-Cactus").
The pipeline provided here: https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC can then be used to produce a graph-VCF as well as the corresponding callset-VCF in the same format as explained in section "Using PanGenie-ready VCFs produced by HPRC" above.
For the HGSVC data published in https://www.science.org/doi/10.1126/science.abf7117, we have generated PanGenie-ready VCFs containing haplotype data from 32 human samples (64 haplotypes). VCFs were generated based on GRCh38.
Dataset | PanGenie input VCF | Callset VCF |
---|---|---|
HGSVC-GRCh38 (freeze3, 64 haplotypes) | graph-VCF | callset-VCF |
HGSVC-GRCh38 (freeze4, 64 haplotypes) | graph-VCF | callset-VCF |
For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column) representing the pangenome graph that is to be used as input to PanGenie, and a bi-allelic callset-VCF (third column) describing all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the same format as described in section "Decomposing bubbles".