
Fix issue with RunRedup hanging on large duplicate cluster sizes #318

Merged
lvreynoso merged 4 commits into main from lvreynoso/fix_redup_grep_hang on Jan 26, 2024

Conversation

lvreynoso
Contributor

@lvreynoso lvreynoso commented Jan 20, 2024

CZID-8963

  • Fixes an issue where the RunRedup task would hang when processing samples with large amounts of duplicate sequence data. grep has been replaced with awk when processing the duplicate cluster sizes TSV and clusters CSV files (a rough sketch of the awk approach follows below this list).
  • Fixes an issue where RunRedup output fasta files were inconsistently named
  • Harmonizes file naming in RunRedup to use underscores (_) instead of dashes (-).

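For context, a minimal sketch of what an awk-based extraction from the cluster sizes TSV might look like; this is not the exact command from the PR, and the column layout (cluster size in column 1, read id in column 2) and file names are assumptions:

# sketch only: print the read id (assumed column 2) for every cluster whose
# size (assumed column 1) is greater than 1, i.e. every duplicated read
awk -F '\t' '$1 > 1 { print $2 }' duplicate_cluster_sizes.tsv > duplicated_reads.txt

Streaming matches straight into a file, rather than capturing grep output in a command substitution, avoids buffering the full result in a shell variable, which can be costly when the duplicate set is very large.
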
@lvreynoso lvreynoso marked this pull request as ready for review January 22, 2024 20:54
@lvreynoso lvreynoso requested a review from rzlim08 January 22, 2024 20:58
Collaborator

@rzlim08 rzlim08 left a comment


  • Looks good! Just some questions.

How was this reviewed?

  • Read through the code
  • Ran the code and examined output files
  • Stepped through the code

Are there files that were not reviewed?

  • No

Are there tests included?

  • Yes
  • No - and why? - Already covered

@@ -224,12 +228,13 @@ task RunRedup {
export SUBSAMPLED_READS_FILES=(~{sep=' ' subsampled_reads})

for index in ${!HOST_FILTERED_READS_FILES[@]}; do
    seqkit grep -f duplicated-pairs.txt ${HOST_FILTERED_READS_FILES[$index]} | seqkit fq2fa -o duplicate_reads_$index.fasta
    cat ${SUBSAMPLED_READS_FILES[$index]} duplicate_reads_$index.fasta > redups_$index.fasta
    part=$(($index + 1))
Collaborator


Is the change here just to ensure that the files are 1-indexed?

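As a side note on the 1-indexing question above, this is not code from the PR, just a minimal bash sketch (with made-up file names) of why adding 1 to a loop index gives 1-based part numbers:

FILES=(reads_R1.fastq reads_R2.fastq)   # hypothetical inputs
for index in "${!FILES[@]}"; do         # bash array indices start at 0
    part=$((index + 1))                 # 1-based part number: 0 -> 1, 1 -> 2
    echo "part ${part}: ${FILES[$index]}"
done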

python3 <<CODE
pair_values = []
passed_filters = set()
duplicates = set()
with open("passed_filters.txt", "r") as pf:
    passed_filters.update(pf.read().splitlines())
-with open("duplicated-reads.txt", "r") as dr:
+with open("duplicated_reads.txt", "r") as dr:
    duplicates.update(dr.read().splitlines())
with open("~{clusters}", "r") as clusters:
    for line in clusters:
        key, value = line.strip().split(",")
        if key in duplicates and key in passed_filters and key != value:
Collaborator


I know I wrote this line, but I'm a bit dubious that all of it is necessary. If a key != value I think that by definition means that it'll be a duplicate. We should merge this in to get the fix out, but possibly consider removing this check entirely.

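To make the reviewer's reasoning concrete: each line of the clusters CSV appears to map a read id to its cluster representative, so a read whose id differs from its representative is by definition a non-representative (duplicate) cluster member. A minimal awk sketch, with the file name and column layout assumed rather than taken from the PR:

# assumed layout: <read id>,<cluster representative id>
# prints the ids of non-representative cluster members, i.e. duplicates
awk -F ',' '$1 != $2 { print $1 }' clusters.csv
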
@@ -185,8 +185,12 @@ task RunRedup {
}
command <<<
set -euxo pipefail
# exit if no duplicate reads
-if [[ ! $(grep -v "^1\t" "~{cluster_sizes}") ]]; then
+# extract duplicate read ids from the cluster sizes tsv; awk looks at the first column and prints
Collaborator


Thanks for the comments!

@lvreynoso lvreynoso merged commit 034c1d5 into main Jan 26, 2024
14 checks passed
@lvreynoso lvreynoso deleted the lvreynoso/fix_redup_grep_hang branch January 26, 2024 19:52