When analysing the differences between the previous release and the newest release, we found a couple of samples where bad alignments led to missing parts at the start or end of the region. Upon further investigation, we noticed that the alignment to the region looked poor in general, so we want to evaluate how well a sampled region matches both its coordinate reference and its best blast match. To do this, we will create a new output that reports two concordance values for each sampled region: one against the coordinate reference and one against the best blast match.
To do:
For the coordinate reference concordance, step through the entries for nuc.csv and count the matches of the MAX consensus to the coordinate reference. Ignore all indels.
For the seed concordance, try using the alignment of the consensus to the coordinate regions to figure out query positions for each region, and use these to step through the consensus-seed alignment.
For the seed concordance, align each contig / consensus (for remapped) to the best blast match. Then, align the blast match to the coordinate reference. Use these mappings to identify the chunk of the blast match and the contig that correspond to the region coordinates, respectively. Iterate through all the alignment matches within the region, ignoring indels, and count how many nucleotides within the matches are concordant.
If the above method works well, the region coordinates within the individual seeds can be pre-computed instead of calculating them from scratch each time.
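The counting step shared by both concordance measures can be sketched as below. This is only an illustration, not the pipeline's code: it assumes the two sequences are already available as equal-length gapped alignment strings with `-` marking indels, and the name `count_concordance` is hypothetical.

```python
def count_concordance(aligned_query, aligned_ref, gap="-"):
    """Count concordant nucleotides between two aligned sequences.

    Both arguments are assumed to be gapped alignment rows of equal length.
    Columns where either row has a gap (an indel) are ignored.
    Returns (matches, compared), where compared is the number of columns
    in which both rows carry a nucleotide.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for query_nuc, ref_nuc in zip(aligned_query, aligned_ref):
        if query_nuc == gap or ref_nuc == gap:
            continue  # indel column: skip it entirely
        compared += 1
        if query_nuc.upper() == ref_nuc.upper():
            matches += 1
    return matches, compared


matches, compared = count_concordance("ACG-TA", "ACTTT-")
fraction = matches / compared if compared else 0.0
```

Returning the number of compared columns alongside the match count lets the caller decide how to report regions with little or no aligned sequence, instead of hiding them behind a percentage.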
Instead of aligning the best blast match to the coordinate reference, I first tried using the information we already have about the region alignments. Using the coordinate region alignments, we can find the query start and end positions that correspond to each region, and then use those positions to step through the query-seed alignment and count the concordance. However, this breaks down if part of the region did not align to the query, because the query positions for that part are then unknown. I'll therefore go back to our original idea: align the seed to the coordinate reference region and use that alignment to find the seed start and end positions for each region. This can be done either in nucleotide space or in amino acid space. Amino acid space may give a better alignment, but we have to be careful about the skipped position, and we have to align all three possible reading frames of the seed.
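A rough sketch of finding the seed start and end positions for a region in nucleotide space follows. It assumes the seed-to-reference alignment is available as two equal-length gapped strings and that region coordinates are 0-based and half-open on the ungapped reference; both conventions and the name `map_region_to_seed` are assumptions made for illustration only.

```python
def map_region_to_seed(aligned_ref, aligned_seed, region_start, region_end, gap="-"):
    """Map a reference region onto seed coordinates through an alignment.

    region_start and region_end are 0-based, half-open positions on the
    ungapped reference. Returns (seed_start, seed_end) as 0-based,
    half-open positions on the ungapped seed, or (None, None) if no seed
    nucleotide is aligned inside the region.
    """
    ref_pos = 0   # next position in the ungapped reference
    seed_pos = 0  # next position in the ungapped seed
    seed_start = None
    seed_end = None
    for ref_nuc, seed_nuc in zip(aligned_ref, aligned_seed):
        in_region = ref_nuc != gap and region_start <= ref_pos < region_end
        if in_region and seed_nuc != gap:
            if seed_start is None:
                seed_start = seed_pos  # first seed base aligned in the region
            seed_end = seed_pos + 1    # extend to the latest aligned seed base
        if ref_nuc != gap:
            ref_pos += 1
        if seed_nuc != gap:
            seed_pos += 1
    return seed_start, seed_end


# Reference ACGTAC, seed ACTGAC, region covering reference positions 2..4
seed_span = map_region_to_seed("AC-GTAC", "ACTG-AC", 2, 5)
```

Because only columns where both rows carry a nucleotide move the boundaries, a region edge that falls opposite a seed deletion snaps to the nearest aligned seed base inside the region, and a region with no aligned seed bases is reported explicitly as `(None, None)`.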