Reimplement the stitcher #1030

Closed
38 of 41 tasks
Donaim opened this issue Oct 31, 2023 · 6 comments · Fixed by #1032

@Donaim
Member

Donaim commented Oct 31, 2023

The existing stitching implementation has been shown to produce nonsensical results in certain cases. The stitched result should be a logical sum of its parts, but currently it sometimes is not. The root cause appears to be that stitching relies on regions of the reference genome rather than on the contigs produced by the assembler. When some regions have low concordance with the reference genome, they are aligned differently, which produces conflicting versions of the overlaps between them.

Objectives:

  1. In scenarios where a single contig has been assembled, stitching should return it as the stitched consensus.
  2. When multiple contigs are present, the result of putting them together should not be too surprising.
  3. Other parts of the pipeline should not be significantly affected by this change.

Tasks:

  • Review the existing stitching code and understand the alleged issues.
  • Plan the strategy for the new implementation.
    This includes determining how to handle single and multiple contig scenarios.
  • Implement the stitching algorithm.
    • Develop a CIGAR tools library to handle CIGAR strings and coordinate mappings.
      • Create functions for extending, converting, and translating coordinates between query and reference sequences.
      • Implement a CigarHit data class to manage slicing operations, overlaps, and MSA conversions.
      • Add and refine methods for handling insertions, deletions, cuts, and strip operations in the CIGAR context.
      • Resolve any floating point errors and off-by-one issues within the Cigar tools.
      • Ensure that the Cigar tools maintain the associative property during operation combinations.
      • Make the CigarHit class immutable.
      • Remove unused methods to simplify the API.
      • Improve docstrings to aid in the understanding of CIGAR operations and library usage.
    • Implement a proof of concept algorithm that works for simple cases.
    • Streamline context usage in the stitcher.
    • Prevent name collisions for names of temporary contigs.
    • Create a staging area (e.g., main() entry point) to integrate the contig stitcher and test it in an isolated environment before full pipeline integration.
    • Treat cross-alignments as anomalies.
  • Implement tools for diagnostics.
    • Develop logic to handle unaligned contig parts to ensure complete contig representation.
    • Refactor contig size calculations to simplify the visualizer interface.
    • Implement visual differentiation in the visualizer for distinct contig sections.
    • Improve visual positioning for smaller images and negative coordinate handling in the visualizer.
    • Colour-code the reference track in the visualizer based on coverage statistics.
    • Debug and correct visual representation for cross-alignments.
    • Fix numbering inconsistencies in the visualizer.
    • Optimize visualizer output for representation of discarded contigs.
    • Resolve the problem where certain contigs are incorrectly discarded when they should be visualized.
    • Investigate cases of duplicated visual representation of discarded unaligned contig parts.
    • Fix the visualizer's handling of unaligned contig parts, which is currently wrong in test_correct_processing_of_two_overlapping_and_one_separate_contig_2.svg.
    • Ensure that the strand parameter is checked every time an arrow is drawn in the visualizer.
    • Fix HCV landmarks not displaying properly.
    • Refactor the visualizer to make it at least somewhat readable. #1086
  • Integrate the new stitcher into the pipeline.
    • Basic integration into the existing denovo path.
    • Keep the old contigs.csv file producing the same output as before the stitcher by introducing a new output file, contigs_stitched.csv, to be used in downstream analyses.
    • Produce additional original versions of unstitched files, named contigs_unstitched.csv and remap_unstitched_conseq.csv.
    • Update the proviral pipeline to use the original unstitched files.
  • Thoroughly test the new implementation.
    • Add basic scenario-based tests.
    • Add basic property-based tests.
    • Add basic visualizer tests.
    • Validate the handling of cases with reverse complement alignments.
    • Develop and implement targeted tests for the visualizer specifically.
    • Ensure 100% coverage.
  • Add documentation describing the new stitcher. #1089
  • Remove old stitching code. #1087
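
To illustrate the kind of coordinate mapping that the CIGAR tools tasks above refer to, here is a minimal Python sketch. It is not the library's actual API: `ToyHit`, `parse_cigar`, and `ref_to_query` are hypothetical names chosen for this example, and only the M, I, D, =, and X operations are handled, following the usual SAM conventions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3M1I2M' into (length, op) pairs."""
    pairs = re.findall(r"(\d+)([MID=X])", cigar)
    # Reject anything the pattern did not fully consume.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"Malformed or unsupported CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


@dataclass(frozen=True)  # frozen=True makes instances immutable
class ToyHit:
    """Hypothetical alignment record; an assumption, not MiCall's CigarHit."""
    cigar: str
    r_st: int  # first reference position covered (0-based)
    q_st: int  # first query position covered (0-based)

    def ref_to_query(self) -> Dict[int, int]:
        """Map every aligned reference position to its query position."""
        mapping: Dict[int, int] = {}
        r, q = self.r_st, self.q_st
        for length, op in parse_cigar(self.cigar):
            for _ in range(length):
                if op in "M=X":      # aligned column: consumes both sequences
                    mapping[r] = q
                    r += 1
                    q += 1
                elif op == "D":      # deletion: consumes reference only
                    r += 1
                else:                # "I", insertion: consumes query only
                    q += 1
        return mapping


hit = ToyHit("3M1I2M2D1M", r_st=10, q_st=0)
print(hit.ref_to_query())
# {10: 0, 11: 1, 12: 2, 13: 4, 14: 5, 17: 6}
```

Reference positions 15 and 16 are absent from the mapping because they fall inside the deletion, and query position 3 is never a target because it is an insertion. Freezing the dataclass means that any slicing or cutting operation would have to return a new object rather than modify an existing one.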

Notes:

This reimplementation may provide opportunities for simplification in the regions alignment code, which is currently very complex.

@Donaim
Member Author

Donaim commented Nov 15, 2023

The current version of the algorithm (ea58060) works for simple cases.

@Donaim
Member Author

Donaim commented Nov 22, 2023

The current version of the algorithm (7a153c0) works on real-world examples and produces the expected results. What remains is finding and fixing individual bugs.

@Donaim
Member Author

Donaim commented Dec 8, 2023

Currently working on the diagnostics. It turned out to be useful for finding bugs... Reordering goals.

@Donaim
Member Author

Donaim commented Jan 17, 2024

Part of the diagnostics is the visualizer (diagram maker), which is based on logs. It turned out to be almost as difficult to implement as the stitcher itself.

The first version is implemented in 7e84f61.

@Donaim
Member Author

Donaim commented Jan 31, 2024

The task list in the issue description has been updated to better reflect the progress and milestones documented in our commit history for the stitcher and its diagnostics.

Donaim added this to the 7.17 milestone on May 8, 2024
@Donaim
Member Author

Donaim commented Sep 11, 2024

The introduction of the new stitcher changes the contents and handling of some input/output files.

Below is a breakdown:

Same Content

The following table lists files that have identical contents in both the old and new versions. The dash symbol (-) indicates that the contents of the old file may differ from the new file, although they might coincide occasionally.

| Old File | New File |
| --- | --- |
| g2p_csv | g2p_csv |
| g2p_summary_csv | g2p_summary_csv |
| remap_counts_csv | - |
| remap_conseq_csv | - |
| unmapped1_fastq | unmapped1_fastq |
| unmapped2_fastq | unmapped2_fastq |
| conseq_ins_csv | conseq_ins_csv |
| failed_csv | failed_csv |
| cascade_csv | - |
| nuc_csv | - |
| amino_csv | - |
| insertions_csv | - |
| conseq_csv | unstitched_conseq_csv |
| conseq_all_csv | - |
| concordance_csv | - |
| concordance_seed_csv | - |
| failed_align_csv | - |
| coverage_scores_csv | - |
| coverage_maps_tar | - |
| aligned_csv | - |
| g2p_aligned_csv | g2p_aligned_csv |
| genome_coverage_csv | - |
| genome_coverage_svg | - |
| genome_concordance_svg | - |
| contigs_csv | unstitched_contigs_csv |
| read_entropy_csv | read_entropy_csv |
| conseq_region_csv | - |
| conseq_stitched_csv | - |

Same Role

The following table lists files that serve the same purpose in the pipeline across most use cases and within the proviral pipeline specifically:

| Old File | Most Use Cases | Proviral Pipeline |
| --- | --- | --- |
| g2p_csv | g2p_csv | |
| g2p_summary_csv | g2p_summary_csv | |
| remap_counts_csv | remap_counts_csv | |
| remap_conseq_csv | remap_conseq_csv | |
| unmapped1_fastq | unmapped1_fastq | |
| unmapped2_fastq | unmapped2_fastq | |
| conseq_ins_csv | conseq_ins_csv | |
| failed_csv | failed_csv | |
| cascade_csv | cascade_csv | cascade_csv |
| nuc_csv | nuc_csv | |
| amino_csv | amino_csv | |
| insertions_csv | insertions_csv | |
| conseq_csv | conseq_csv | unstitched_conseq_csv |
| conseq_all_csv | conseq_all_csv | |
| concordance_csv | concordance_csv | |
| concordance_seed_csv | concordance_seed_csv | |
| failed_align_csv | failed_align_csv | |
| coverage_scores_csv | coverage_scores_csv | |
| coverage_maps_tar | coverage_maps_tar | |
| aligned_csv | aligned_csv | |
| g2p_aligned_csv | g2p_aligned_csv | |
| genome_coverage_csv | genome_coverage_csv | |
| genome_coverage_svg | genome_coverage_svg | |
| genome_concordance_svg | genome_concordance_svg | |
| contigs_csv | contigs_csv | unstitched_contigs_csv |
| read_entropy_csv | read_entropy_csv | |
| conseq_region_csv | conseq_region_csv | |
| conseq_stitched_csv | conseq_csv | |
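
For scripts that consume these outputs, the role table above could be encoded as a small lookup. The following is only a sketch: `same_role` and the two dictionaries are hypothetical names, not part of the pipeline, and it assumes that a file with no entry in the Proviral Pipeline column is simply not used by that pipeline.

```python
from typing import Dict, Optional

# In most use cases every old name keeps its role, except this one.
RENAMED_IN_MOST_USE_CASES: Dict[str, str] = {
    "conseq_stitched_csv": "conseq_csv",
}

# Assumption: only these three rows have a Proviral Pipeline entry.
PROVIRAL_ROLES: Dict[str, str] = {
    "cascade_csv": "cascade_csv",
    "conseq_csv": "unstitched_conseq_csv",
    "contigs_csv": "unstitched_contigs_csv",
}


def same_role(old_name: str, proviral: bool = False) -> Optional[str]:
    """Return the new file that fills the role of `old_name`.

    Returns None for the proviral pipeline when the table lists no entry.
    """
    if proviral:
        return PROVIRAL_ROLES.get(old_name)
    return RENAMED_IN_MOST_USE_CASES.get(old_name, old_name)


print(same_role("conseq_stitched_csv"))          # conseq_csv
print(same_role("contigs_csv", proviral=True))   # unstitched_contigs_csv
```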
