diff --git a/CHANGELOG.md b/CHANGELOG.md
index fb65191..ae2428e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,11 @@
+# v0.13.1
+## Changes
+- For HLA genes, StarPhase would previously ignore any HLA allele definitions that were missing a DNA sequence in the database. StarPhase now allows these partial HLA allele definitions by default.
+- A new option was added to enable the previous behavior: `--hla-require-dna`. If this option is enabled, any HLA allele definition that is missing a DNA sequence will be ignored and never reported in StarPhase outputs.
+
+## Fixed
+- Fixed an issue where a _CYP2D6_ deletion allele (\*5) could be reported on the same haplotype as another allele. While this is biologically possible (e.g., deletion of one \*10 in a "\*10x2" haplotype), it is not considered a valid star-allele at this time. This combination will still show up in the debug log files, but it will get filtered in final reporting. For example: a "\*10+\*5" haplotype will now get reported as "\*10".
+
 # v0.13.0
 ## Changes
 - The algorithm for _HLA-A_ and _HLA-B_ has been modified to use a consensus-based approach to solve the alleles, a simpler version of the algorithm for _CYP2D6_.
diff --git a/docs/methods.md b/docs/methods.md
index e065773..aea628d 100644
--- a/docs/methods.md
+++ b/docs/methods.md
@@ -16,7 +16,7 @@ As a result, pb-StarPhase uses a different, consensus-based approach to diplotyp
 1. **Load the HLA database** - This step loads the JSON database that was created from querying the IMGTHLA GitHub repo. Each haplotype sequence is loaded into memory for the following steps.
 2. **Load fully spanning reads** - This step will parse the alignment file (BAM) and extract all reads that **fully span** the gene region of interest (including a small buffer around the defined region). We note that the haplotypes stored in IMGTHLA do not have the same mapping coordinates, so we have defined the regions of interest in the database files. Any read with an edit distance that is too high relative to the reference genome will be ignored (this removes mismapped reads).
 3. **Run a diplotype consensus on the read collection** - This step will first extract all the cDNA sequences from each read and attempt to find two unique consensuses. It checks the number of reads assigned to each consensus to determine if it passes the MAF and CDF filters. If not, the cDNA is likely identical, so it repeats this step using the DNA sequences instead to see if there is a sub-type difference. After this step, there is either one consensus (homozygous at cDNA and DNA levels) or two consensuses (heterozygous at cDNA and/or DNA levels).
-3. **Score all HLA haplotypes in the database against each consensus** - For each consensus, we generate a full-length DNA sequence and accompanying spliced cDNA sequence. We then score each defined HLA haplotype to each consensus by aligning and saving the edit distance of the mappings, prioritizing cDNA and then DNA. By default, any HLA haplotypes that are missing a DNA sequence are ignored.
+3. **Score all HLA haplotypes in the database against each consensus** - For each consensus, we generate a full-length DNA sequence and accompanying spliced cDNA sequence. We then score each defined HLA haplotype to each consensus by aligning and saving the edit distance of the mappings, prioritizing cDNA and then DNA. By default, any HLA haplotypes with at least cDNA sequence are allowed. The `--hla-require-dna` option can be used to also require DNA sequence in the HLA definition.
 4. **Call best diplotype** - For each consensus, identify the best matching haplotype as defined above. This is the haplotype with the lowest edit fraction (`edit_distance + unmapped / total_bp`). This pair of best matching haplotypes is the final diplotype result for the HLA gene.
 5. **Report findings** - In contrast to the CPIC genes, only a single diplotype result is ever reported for the HLA genes.
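
The scoring and calling steps touched by this change can be illustrated with a small Python sketch. It is not StarPhase's implementation. The `HlaDefinition` type, the function names, and the toy allele names are all hypothetical. A plain Levenshtein distance stands in for the real aligner. The edit fraction is read here as `(edit_distance + unmapped) / total_bp`, with `total_bp` assumed to be the length of the database sequence. Ranking a definition without DNA after any definition with a scored DNA sequence is also an assumption, since the text only states that cDNA is prioritized over DNA.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HlaDefinition:
    """Hypothetical stand-in for one database entry; not StarPhase's real type."""
    name: str
    cdna: str
    dna: Optional[str] = None  # partial definitions carry no DNA sequence


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used here in place of the real aligner."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_fraction(query: str, target: str, unmapped: int = 0) -> float:
    """Assumed reading of the formula: (edit_distance + unmapped) / total_bp."""
    total_bp = max(len(target), 1)  # assumption: total_bp is the target length
    return (edit_distance(query, target) + unmapped) / total_bp


def usable_definitions(defs: List[HlaDefinition],
                       require_dna: bool = False) -> List[HlaDefinition]:
    """Keep definitions with cDNA; drop DNA-less ones only if require_dna is set."""
    return [d for d in defs if d.cdna and (d.dna is not None or not require_dna)]


def best_haplotype(consensus_cdna: str, consensus_dna: str,
                   defs: List[HlaDefinition],
                   require_dna: bool = False) -> Optional[HlaDefinition]:
    """Pick the definition with the lowest (cDNA, DNA) edit-fraction pair."""
    def score(d: HlaDefinition) -> Tuple[float, float]:
        cdna_score = edit_fraction(consensus_cdna, d.cdna)
        # Assumption: a missing DNA sequence sorts after any scored DNA sequence.
        dna_score = (edit_fraction(consensus_dna, d.dna)
                     if d.dna is not None else float("inf"))
        return (cdna_score, dna_score)

    candidates = usable_definitions(defs, require_dna)
    return min(candidates, key=score) if candidates else None


# Toy data: the third definition is partial (cDNA only).
toy_defs = [
    HlaDefinition("toy*01", cdna="ACGTACGT", dna="ACGTTTACGT"),
    HlaDefinition("toy*02", cdna="ACGAACGT", dna="ACGATTACGT"),
    HlaDefinition("toy*03", cdna="ACGTACGA"),
]
```

With this toy data, a consensus whose cDNA is `ACGTACGA` matches the partial definition `toy*03` exactly, so it is selected under the new default. Passing `require_dna=True` mirrors `--hla-require-dna`: `toy*03` is excluded and the closest definition with a DNA sequence, `toy*01`, is chosen instead.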