Reimplement stitching algorithm #1032

Merged 205 commits on Mar 11, 2024
Changes from 204 commits
Commits
cde06c8
Complete CIGAR string definition
Donaim Nov 1, 2023
062a800
Add module for CIGAR strings handling
Donaim Nov 6, 2023
647ad39
Add initial unit tests for CIGAR module
Donaim Nov 6, 2023
0164ba6
Add initial implementation of the new contig stitcher
Donaim Nov 6, 2023
d3c4aa7
Add tests for the new contig stitcher
Donaim Nov 6, 2023
3be6566
Improve concordance calculation by scanning in both directions
Donaim Nov 6, 2023
c6221d6
Contig stitcher: ensure the order of stitched contigs
Donaim Nov 6, 2023
c2cba4a
Remove unused imports in cigar_tools and contig_sticher
Donaim Nov 6, 2023
54266cb
CigarHit: add translate method
Donaim Nov 6, 2023
ea48997
Cigar: add type checking for negative op nums
Donaim Nov 6, 2023
a612a4c
Fix overlap check in contig stitcher
Donaim Nov 6, 2023
bf2e50e
Add MockAligner class to test utils
Donaim Nov 7, 2023
7f7ad3c
Use mocked Aligner in contig stitcher
Donaim Nov 7, 2023
2a18db4
Contig stitcher: fix Frankenstein cut_reference implementation
Donaim Nov 6, 2023
4c29798
Mention that CigarHit.overlaps only applicable in same refs
Donaim Nov 6, 2023
00ee9fe
Improve basic contig stitcher tests
Donaim Nov 6, 2023
4cc5b9b
Fix typo in GenotypeContig error message
Donaim Nov 6, 2023
72148b5
Improve contig stitcher class hierarchy
Donaim Nov 7, 2023
0619581
Cigar tools: do not enforce commutativity on CigarHits
Donaim Nov 7, 2023
eba4319
Cigar tools: test associativity of CigarHit addition
Donaim Nov 7, 2023
24a0cbb
Cigar tools: add lstrip and rstrip functions
Donaim Nov 7, 2023
5789ee6
Cigar tools: rename "r_len" and "q_len"
Donaim Nov 8, 2023
9469286
Cigar tools: add default alignment concept
Donaim Nov 8, 2023
615e877
Contig stitcher: fix containment check
Donaim Nov 8, 2023
3c3a95d
Fix off-by-1 errors in MockAligner
Donaim Nov 8, 2023
3b29507
Contig stitcher: keep nonconflicting parts of contigs intact
Donaim Nov 8, 2023
ba788b7
Small improvements to contig stitcher code
Donaim Nov 8, 2023
3772afe
Contig stitcher: improve cutting of Frankenstein contigs
Donaim Nov 8, 2023
e23f775
Contig stitcher: normalize concordance score between 0 and 1
Donaim Nov 8, 2023
9073573
Contig stitcher: split overlap handling and coverage handling
Donaim Nov 8, 2023
eeec6a5
Cigar tools: add getters for coordinate positions
Donaim Nov 8, 2023
be32fd0
Cigar tools: add epsilon quantity for precise cuts
Donaim Nov 9, 2023
b03cbd5
Cigar tools: return coordinates as sets
Donaim Nov 9, 2023
8b8443d
Contig stitcher: make align_to_reference a class method
Donaim Nov 10, 2023
4bee555
Add gaps() method to CigarHit
Donaim Nov 9, 2023
d1f3a9d
Contig stitcher: implement basic gap slicing algorithm
Donaim Nov 10, 2023
c389ee4
Cigar tools: distinguish mapped coordinates and all coordinates
Donaim Nov 11, 2023
3cba9d4
Contig stitcher: skip insignificant gaps
Donaim Nov 11, 2023
ba17fdc
Tests: do not auto-use mock aligner
Donaim Nov 11, 2023
017a93a
Tests: move fixed_random_seed to shared utils.py file
Donaim Nov 11, 2023
8f41cf5
Cigar tools: improve semantics of "closest"
Donaim Nov 11, 2023
6b2bbd6
Tests: add two more edge cases for Cigar tools
Donaim Nov 11, 2023
f00d2c7
Cigar tools: reimplement strip operations
Donaim Nov 11, 2023
a2f2c2c
Contig stitcher: rebase gap splitting on the new strip operation
Donaim Nov 11, 2023
8888bb7
Contig stitcher: implement the more precise gap coverage check
Donaim Nov 11, 2023
6c6786e
Cigar tools: prevent floating point errors in cut_reference
Donaim Nov 11, 2023
3fb21ff
Cigar tools: reimplement CoordinateMapping
Donaim Nov 12, 2023
2ea41b8
Cigar tools: fix some more cut edge cases
Donaim Nov 13, 2023
712ef3b
Cigar tools: fix gaps tests
Donaim Nov 13, 2023
66fcafd
Cigar tools: divide __add__ operation into connect and basic __add__
Donaim Nov 13, 2023
224fbd0
Contig stitcher: reimplement complete coverage check
Donaim Nov 14, 2023
adb7b53
Contig stitcher: throw away parts of query after cuts
Donaim Nov 14, 2023
bee4b8c
Improve documentation of CIGAR tools
Donaim Nov 14, 2023
c460e9f
Small improvements to and cleanup of contig stitcher
Donaim Nov 14, 2023
fb718be
Cigar tools: remove the unused "closest_*" methods
Donaim Nov 14, 2023
751d8f2
Add tests for connect_cigar_hits
Donaim Nov 14, 2023
57805fa
Contig stitcher: add the stitch_consensus function
Donaim Nov 14, 2023
7f3cdc0
Contig stitcher: fix munging of non-touching contigs
Donaim Nov 14, 2023
eb712a3
Add test for checking CigarHit.gaps() lengths
Donaim Nov 14, 2023
aaf2a28
Contig stitcher: fix overlap overcounting
Donaim Nov 14, 2023
98e9240
Add tests for calculate_concordance
Donaim Nov 14, 2023
38f8833
Cigar tools: fix strips of empty queries
Donaim Nov 14, 2023
5ab93bd
Add example tests for CigarHit.lstrip
Donaim Nov 14, 2023
ea58060
Cigar tools: fix strip() logic
Donaim Nov 14, 2023
4d61b41
Contig stitcher: simplify the munge operation
Donaim Nov 15, 2023
87beae3
Remove gaps() method from AlignedContig
Donaim Nov 15, 2023
6d44808
Cigar tools: fix CoordinateMapping equality operator
Donaim Nov 15, 2023
713df5d
Contig stitcher: simplify overlap seq calculation
Donaim Nov 15, 2023
87d2d4d
Contig stitcher: remove unused aligned_seq field
Donaim Nov 15, 2023
2fd20f5
Small improvements to Cigar tools and Contig stitcher
Donaim Nov 15, 2023
c40ec0e
Contig stitcher: fix return types of AlignedContig methods
Donaim Nov 15, 2023
1b0c0ca
Integrate contig stitcher structures into denovo pipeline
Donaim Nov 17, 2023
7ab3666
Contig stitcher: ensure match_fraction value for every contig
Donaim Nov 17, 2023
ce29f04
Cigar tools: fix handling of cross-alignments in connect_cigar_hits
Donaim Nov 17, 2023
877857a
Perform the new stitching in the denovo pipeline
Donaim Nov 18, 2023
e067032
Improvements to contig stitcher tests code
Donaim Nov 20, 2023
7a153c0
Contig stitcher: make mypy-compliant
Donaim Nov 20, 2023
535d03f
Add a simple fuzz-test for contig stitcher
Donaim Nov 24, 2023
a30ffe8
Fix typo in merge_intervals docstring
Donaim Nov 27, 2023
912eff8
Add structured_logger utility module
Donaim Nov 30, 2023
039dfd3
Add logging to contig stitcher
Donaim Nov 30, 2023
cb09d25
Number separately aligned parts in contig stitcher
Donaim Nov 30, 2023
f1d88db
Add more detailed logging for contig stitcher
Donaim Nov 30, 2023
060e13d
Small code style improvements for contig stitcher
Donaim Dec 1, 2023
3720af1
Test logging of contig_stitcher
Donaim Dec 1, 2023
a3d878e
Cigar tools: make CigarHit a dataclass
Donaim Dec 1, 2023
e4f221b
Make Cigar class not a dataclass
Donaim Dec 4, 2023
d031a5e
Tests: pass CigarHits as unparsed strings
Donaim Dec 4, 2023
09171e2
Cigar tools: make CigarHit immutable
Donaim Dec 4, 2023
a08b68f
Contig stitcher: add main() entry point
Donaim Dec 4, 2023
93feed4
Contig stitcher: ignore contigs that align in-reverse
Donaim Dec 5, 2023
bb3fcae
Contig stitcher: fix type error in alignments
Donaim Dec 5, 2023
0194cff
More small improvements to Contig Stitcher logging
Donaim Dec 6, 2023
eaf82f1
Move reverse flag to AlignedContig object in Contig Stitcher
Donaim Dec 6, 2023
520870b
Fix issue with reversed alignments in Contig Stitcher
Donaim Dec 6, 2023
691a8f7
Contig stitcher: log individual munge operations
Donaim Dec 6, 2023
9f05e31
Simplify structured logger utility
Donaim Dec 7, 2023
d4a30b1
Contig stitcher: do not throw away parts of queries
Donaim Dec 11, 2023
21fb8f7
Contig stitcher: only throw out query on strip() operations
Donaim Dec 11, 2023
91ab193
Contig stitcher: handle missed None case
Donaim Jan 12, 2024
522e281
Contig stitcher: output simplification and small fixes
Donaim Jan 12, 2024
7b358e2
Contig stitcher: remove the concept of a FrankensteinContig
Donaim Jan 13, 2024
4c9829a
Contig stitcher: do not munge the final contigs
Donaim Jan 13, 2024
14ea3bb
Contig stitcher: ensure no conflicting mappings in overlap
Donaim Jan 13, 2024
8527428
Contig stitcher: improve concordance handling
Donaim Jan 15, 2024
7e84f61
Implement visualizer for contig stitcher
Donaim Jan 17, 2024
82904a9
Contig stitcher: improve boundaries of cut parts
Donaim Jan 17, 2024
53ad6c1
Contig stitcher: make all structures frozen
Donaim Jan 17, 2024
0a65fb3
Contig stitcher: remove unused code
Donaim Jan 17, 2024
99dd324
Contig stitcher: throw away non-prime-end unaligned parts
Donaim Jan 17, 2024
fc3cd69
Contig stitcher: revert to the simpler version of cut_reference
Donaim Jan 17, 2024
6bb959d
Contig stitcher: do not duplicate query in AlignedContig
Donaim Jan 17, 2024
3a260ce
Contig stitcher: more documentation
Donaim Jan 17, 2024
722e04b
Contig stitcher: simplify cut_query implementation
Donaim Jan 17, 2024
88b2fdc
Contig stitcher: simplify sliding_window implementation
Donaim Jan 17, 2024
e203e9a
Contig stitcher: replace field "reverse":bool by "strand":enum
Donaim Jan 17, 2024
199d44f
Contig stitcher: improve handling of reverse complement alignments
Donaim Jan 18, 2024
c4537ed
Contig stitcher: make sure that mappy coordinates are not reversed
Donaim Jan 18, 2024
f068e2f
Contig stitcher: fix visualisation of non-overlapping contigs
Donaim Jan 19, 2024
a822472
Contig stitcher: strip unaligned parts earlier
Donaim Jan 19, 2024
5710038
Contig stitcher: improve numbering of alignments in the visualizer
Donaim Jan 19, 2024
d0928c0
Contig stitcher: fix landmarks positioning in the visualizer
Donaim Jan 19, 2024
83d6781
Contig stitcher: plot contigs even if reference is not a standard one
Donaim Jan 19, 2024
a540f93
Contig stitcher: produce visualizer plot every time --debug is used
Donaim Jan 19, 2024
90f0177
Contig stitcher: mention why contigs are dropped in the logs
Donaim Jan 19, 2024
b475e9f
Contig stitcher: visualize non-final contigs
Donaim Jan 20, 2024
a0ec6e4
Contig stitcher: fix duplicate visualization of bad contigs
Donaim Jan 20, 2024
7a842dd
Contig stitcher: fix handling of sinks in the visualizer
Donaim Jan 20, 2024
7c90109
Contig stitcher: do not assume that reduced_morphism_graph is fan-out=1
Donaim Jan 20, 2024
d775b1e
Contig stitcher: always rename children
Donaim Jan 20, 2024
c352af4
Contig stitcher: make sure every contig is mapped in the visualizer
Donaim Jan 22, 2024
9e28810
Contig stitcher: fix logging level handling
Donaim Jan 22, 2024
5875c39
Contig stitcher: check that --debug is enabled for --plot
Donaim Jan 22, 2024
39a73f1
Contig stitcher: fix type checking errors
Donaim Jan 22, 2024
16c5e12
Contig stitcher: only extend visualizer alignments in non-bad contigs
Donaim Jan 22, 2024
49c624c
Contig stitcher: add visualize every test case
Donaim Jan 22, 2024
135ef59
Contig stitcher: sort bad contigs in the visualizer
Donaim Jan 22, 2024
77d7d7c
Contig stitcher: do not re-draw same contigs
Donaim Jan 22, 2024
8e904ca
Contig stitcher: improve visualizer finals calculation
Donaim Jan 23, 2024
bf1390f
Contig stitcher: remove hanging comma in the code
Donaim Jan 23, 2024
6caf9ee
Contig stitcher: improve concordance calculations
Donaim Jan 23, 2024
68a5b82
Contig stitcher: introduce a proper context for the name generator
Donaim Jan 23, 2024
5d6a98a
Contig stitcher: use context for logs handling
Donaim Jan 24, 2024
834c89b
Contig stitcher: do not require logging=debug for the visualizer
Donaim Jan 24, 2024
ea50a6d
Contig stitcher: add missing type signatures
Donaim Jan 24, 2024
0238f04
Contig stitcher: simplify the concordance algorithm
Donaim Jan 24, 2024
6bd0bfe
Contig stitcher: simplify some visualizer code
Donaim Jan 24, 2024
3419b9b
Contig stitcher: remove all logging.info calls
Donaim Jan 24, 2024
a07611d
Remove structured logger module
Donaim Jan 24, 2024
3f0376a
Contig stitcher: fix midpoint calculation during gap split
Donaim Jan 24, 2024
ce03442
Contig stitcher: change unaligned colour
Donaim Jan 25, 2024
208316c
Cigar tools: rename "gaps()" to "deletions()"
Donaim Jan 25, 2024
e9617cb
Cigar tools: add insertions() method to CigarHit
Donaim Jan 25, 2024
f2072a0
Contig stitcher: fix a visualization of root combinations
Donaim Jan 26, 2024
8b65488
Contig stitcher: improve visualizer positions handling
Donaim Jan 26, 2024
d2a0886
Cigar tools: add *strip_reference methods
Donaim Jan 26, 2024
947b44a
Cigar tools: improve parsing of cigar hits
Donaim Jan 26, 2024
a86530f
Cigar tools: fix edge cases of strip
Donaim Jan 26, 2024
ebe1e9e
Cigar tools: swap names of query and reference strips
Donaim Jan 26, 2024
0b8bac5
Contig stitcher: improve log messages
Donaim Jan 27, 2024
19bddbf
Contig stitcher: base drawing only on the parent-child relationship
Donaim Jan 29, 2024
68aa30c
Contig stitcher: draw unaligned parts in yellow in the visualizer
Donaim Jan 30, 2024
cffb352
Contig stitcher: improve visualization of unaligned parts
Donaim Jan 30, 2024
2443278
Contig stitcher: add few more tests
Donaim Jan 30, 2024
88c900e
Contig stitcher: fix incorrect numbering case in the visualizer
Donaim Jan 30, 2024
2c203d0
Contig stitcher: fix unaligned display in cross alignment case
Donaim Jan 30, 2024
ef78c2d
Contig stitcher: colour the reference track depending on coverage
Donaim Jan 30, 2024
6faaa1e
Contig stitcher: remove unused variables in the visualizer
Donaim Jan 30, 2024
614a0f7
Contig stitcher: handle negative drawing coordinates better
Donaim Jan 30, 2024
e277bdc
Contig stitcher: improve visualizer positioning for small images
Donaim Jan 31, 2024
832e560
Contig stitcher: do not draw arrows above discarded contigs
Donaim Jan 31, 2024
63c1695
Contig stitcher: make sure that names do not repeat
Donaim Jan 31, 2024
da8cd99
Contig stitcher: simplify usage of the context
Donaim Jan 31, 2024
5321972
Contig stitcher: draw dashed lines to separate contig sections
Donaim Jan 31, 2024
fc89f7e
Contig stitcher: simplify contigs size calculations in visualizer
Donaim Jan 31, 2024
bebe947
Contig stitcher: show unaligned parts of discarded contigs
Donaim Jan 31, 2024
cd7d828
Contig stitcher: fix handling of reverse-complement alignments
Donaim Feb 2, 2024
8012f12
Contig stitcher: do not assume that overlapping contigs align to the s…
Donaim Feb 1, 2024
77df8c2
Contig stitcher: fix test input name
Donaim Feb 1, 2024
c6e1dcf
Contig stitcher: optimize transitive closure calculation
Donaim Feb 2, 2024
58aa1f9
Contig stitcher: make sure to draw rc alignments correctly
Donaim Feb 2, 2024
9397b29
Contig stitcher: remove shortcut handling of rc alignments
Donaim Feb 2, 2024
d08c221
Contig stitcher: add more tests with the real aligner
Donaim Feb 2, 2024
4108bdf
Contig stitcher: unify handling of v5s types of bad contigs
Donaim Feb 2, 2024
417ef9a
Contig stitcher: fix the issue with overreaching bad_contigs
Donaim Feb 2, 2024
1e998af
Contig stitcher: make sure to not double-draw any contigs
Donaim Feb 2, 2024
4b4fca4
Contig stitcher: add couple more simple test cases
Donaim Feb 2, 2024
ad74829
Contig stitcher: fix unaligned regions handling in the visualizer
Donaim Feb 5, 2024
2b34d33
Contig stitcher: further improve drawing of unaligned parts
Donaim Feb 5, 2024
fe391f4
Contig stitcher: increase the split gap size threshold
Donaim Feb 16, 2024
c4c3886
Contig stitcher: factor out context into a separate file
Donaim Feb 20, 2024
5ce730e
Contig stitcher: move logging into events module
Donaim Feb 20, 2024
b5bcc6a
Contig stitcher: remove unused imports
Donaim Feb 20, 2024
0f21847
Contig stitcher: factor out contig structures definitions
Donaim Feb 20, 2024
49558fc
Contig stitcher: fix theoretical bug in plot_contigs.py
Donaim Feb 26, 2024
8000704
Cigar tools: remove dead code in the tests
Donaim Feb 26, 2024
f44e405
Contig stitcher: fix all Ruff warnings
Donaim Feb 27, 2024
7aed74f
Contig stitcher: fix PyCharm warnings
Donaim Feb 28, 2024
b90beb3
Add new output file: contigs_stitched.csv
Donaim Feb 29, 2024
a770f1c
Add new output file: remap_unstitched_conseq.csv
Donaim Feb 29, 2024
14d52f7
Rename contigs.csv to contigs_unstitched.csv
Donaim Mar 4, 2024
cb28178
Contig stitcher: fix visualizer bug ignoring some strip actions
Donaim Mar 5, 2024
940b186
Contig stitcher: fix landmarks visualization
Donaim Mar 5, 2024
6138b57
Update proviral pipeline inputs
Donaim Mar 5, 2024
acc110b
Contig stitcher: various code improvements
Donaim Mar 7, 2024
5 changes: 3 additions & 2 deletions Singularity
@@ -188,11 +188,12 @@ From: centos:7
%applabels denovo
KIVE_INPUTS sample_info_csv fastq1 fastq2 bad_cycles_csv
KIVE_OUTPUTS g2p_csv g2p_summary_csv remap_counts_csv \
remap_conseq_csv unmapped1_fastq unmapped2_fastq conseq_ins_csv \
remap_conseq_csv remap_unstitched_conseq_csv unmapped1_fastq unmapped2_fastq conseq_ins_csv \
failed_csv cascade_csv nuc_csv amino_csv insertions_csv conseq_csv \
conseq_all_csv concordance_csv concordance_seed_csv failed_align_csv \
coverage_scores_csv coverage_maps_tar aligned_csv g2p_aligned_csv \
genome_coverage_csv genome_coverage_svg genome_concordance_svg contigs_csv \
genome_coverage_csv genome_coverage_svg genome_concordance_svg \
contigs_unstitched_csv contigs_csv \
read_entropy_csv conseq_region_csv conseq_stitched_csv
KIVE_THREADS 2
KIVE_MEMORY 6000
16 changes: 13 additions & 3 deletions docs/steps.md
@@ -44,8 +44,9 @@ Individual files are described after the list of steps.
* in - fastq1
* in - fastq2
* in - merged_contigs.csv
* contigs.csv - the assembled contigs, plus any merged contigs, including
* contigs_unstitched.csv - the assembled contigs, plus any merged contigs, including
the best blast results
* contigs.csv - stitched version of `contigs_unstitched`
* blast.csv - multiple blast results for each contig
* `remap`: iteratively use consensus from previous mapping as reference to try
and map more reads. See [remap design] for more details. (The denovo version
@@ -58,6 +59,8 @@ Individual files are described after the list of steps.
each stage.
* remap_conseq.csv - downloaded - consensus sequence that reads were mapped to
on the final iteration
* remap_unstitched_conseq.csv - downloaded - consensus sequence that reads were
mapped to the unstitched contigs.
* unmapped1.fastq - FASTQ format (unstructured text) reads that didn't map to
any of the final references.
* unmapped2.fastq - FASTQ
@@ -215,11 +218,15 @@ Individual files are described after the list of steps.
* pos - 1-based position in the consensus sequence that this insertion follows
* insert - the nucleotide sequence that was inserted
* qual - the Phred quality scores for the inserted sequence
* contigs.csv
* genotype - the reference name with the best BLAST result
* contigs_unstitched.csv
* ref - the reference name with the best BLAST result
* match - the fraction of the contig that matched in BLAST, negative for
reverse-complemented matches
* group_ref - the reference name chosen to best match all of
the contigs in a sample
* contig - the nucleotide sequence of the assembled contig
* contigs.csv
Same as `contigs_unstitched.csv`, but contigs are stitched by `micall/core/contig_stitcher.py`.
* coverage_scores.csv
* project - the project this score is defined by
* region - the region being displayed
@@ -343,6 +350,9 @@ Individual files are described after the list of steps.
* remap_conseq.csv
* region - the region mapped to
* sequence - the consensus sequence used
* remap_unstitched_conseq.csv
* region - the region mapped to
* sequence - the consensus sequence used
* resistance.csv
* region - the region code, like PR or RT
* drug_class - the drug class code from the HIVdb rules, like NRTI
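
For readers who consume the new output, here is a minimal sketch of reading `contigs_unstitched.csv` using the columns documented in the docs/steps.md diff above (`ref`, `match`, `group_ref`, `contig`). It assumes the file is plain comma-separated with a header row; the helper name `read_contigs`, the derived keys, and the sample rows are made up for illustration and are not part of this PR.

```python
import csv
import io

# Hypothetical sample rows; only the column names come from docs/steps.md.
SAMPLE = """\
ref,match,group_ref,contig
REF-A,0.95,REF-A,ACTGACTG
REF-B,-0.80,REF-A,TTGACA
"""


def read_contigs(handle):
    """Yield one dict per contig row (illustrative helper, not MiCall code)."""
    for row in csv.DictReader(handle):
        match = float(row["match"])
        yield {
            "ref": row["ref"],
            "group_ref": row["group_ref"],
            "contig": row["contig"],
            # Per the docs, the sign of `match` marks a reverse-complemented
            # BLAST match, so split it into a fraction and a strand flag.
            "match_fraction": abs(match),
            "reverse_complemented": match < 0,
        }


contigs = list(read_contigs(io.StringIO(SAMPLE)))
for c in contigs:
    strand = "reverse" if c["reverse_complemented"] else "forward"
    print(c["ref"], c["match_fraction"], strand, len(c["contig"]))
```

Since the docs describe `contigs.csv` as the same as `contigs_unstitched.csv` but with stitched contigs, the same reader would presumably work for both files.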