
Fix issue with RunRedup hanging on large duplicate cluster sizes #318

Merged
lvreynoso merged 4 commits into main from lvreynoso/fix_redup_grep_hang on Jan 26, 2024

Conversation

lvreynoso
Contributor

@lvreynoso lvreynoso commented Jan 20, 2024

CZID-8963

  • Fixes an issue where the RunRedup task would hang when processing samples with large amounts of duplicate sequence data. grep has been replaced with awk when processing the duplicate cluster sizes TSV and clusters CSV files (a rough sketch of the awk approach follows below this list).
  • Fixes an issue where RunRedup output fasta files were inconsistently named
  • Harmonizes file naming in RunRedup to use underscores (_) instead of dashes (-).

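For context, a minimal sketch of what an awk-based extraction from the cluster sizes TSV might look like; this is not the exact command from the PR, and the column layout (cluster size in column 1, read id in column 2) and file names are assumptions:

# sketch only: print the read id (assumed column 2) for every cluster whose
# size (assumed column 1) is greater than 1, i.e. every duplicated read
awk -F '\t' '$1 > 1 { print $2 }' duplicate_cluster_sizes.tsv > duplicated_reads.txt

Streaming matches straight into a file, rather than capturing grep output in a command substitution, avoids buffering the full result in a shell variable, which can be costly when the duplicate set is very large.
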
@lvreynoso lvreynoso marked this pull request as ready for review January 22, 2024 20:54
@lvreynoso lvreynoso requested a review from rzlim08 January 22, 2024 20:58
Collaborator

@rzlim08 rzlim08 left a comment


  • Looks good! Just some questions.

How was this reviewed?

  • Read through the code
  • Ran the code and examined output files
  • Stepped through the code

Are there files that were not reviewed?

  • No

Are there tests included?

  • Yes
  • No - and why? - Already covered

@@ -224,12 +228,13 @@ task RunRedup {
export SUBSAMPLED_READS_FILES=(~{sep=' ' subsampled_reads})

for index in ${!HOST_FILTERED_READS_FILES[@]}; do
    seqkit grep -f duplicated-pairs.txt ${HOST_FILTERED_READS_FILES[$index]} | seqkit fq2fa -o duplicate_reads_$index.fasta
    cat ${SUBSAMPLED_READS_FILES[$index]} duplicate_reads_$index.fasta > redups_$index.fasta
    part=$(($index + 1))
Collaborator


Is the change here just to ensure that the files are 1-indexed?

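As a side note on the 1-indexing question above, this is not code from the PR, just a minimal bash sketch (with made-up file names) of why adding 1 to a loop index gives 1-based part numbers:

FILES=(reads_R1.fastq reads_R2.fastq)   # hypothetical inputs
for index in "${!FILES[@]}"; do         # bash array indices start at 0
    part=$((index + 1))                 # 1-based part number: 0 -> 1, 1 -> 2
    echo "part ${part}: ${FILES[$index]}"
done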

python3 <<CODE
pair_values = []
passed_filters = set()
duplicates = set()
with open("passed_filters.txt", "r") as pf:
    passed_filters.update(pf.read().splitlines())
-with open("duplicated-reads.txt", "r") as dr:
+with open("duplicated_reads.txt", "r") as dr:
    duplicates.update(dr.read().splitlines())
with open("~{clusters}", "r") as clusters:
    for line in clusters:
        key, value = line.strip().split(",")
        if key in duplicates and key in passed_filters and key != value:
Collaborator


I know I wrote this line, but I'm a bit dubious that all of it is necessary. If a key != value I think that by definition means that it'll be a duplicate. We should merge this in to get the fix out, but possibly consider removing this check entirely.

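To make the reviewer's reasoning concrete: each line of the clusters CSV appears to map a read id to its cluster representative, so a read whose id differs from its representative is by definition a non-representative (duplicate) cluster member. A minimal awk sketch, with the file name and column layout assumed rather than taken from the PR:

# assumed layout: <read id>,<cluster representative id>
# prints the ids of non-representative cluster members, i.e. duplicates
awk -F ',' '$1 != $2 { print $1 }' clusters.csv
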
@@ -185,8 +185,12 @@ task RunRedup {
}
command <<<
set -euxo pipefail
# exit if no duplicate reads
-if [[ ! $(grep -v "^1\t" "~{cluster_sizes}") ]]; then
+# extract duplicate read ids from the cluster sizes tsv; awk looks at the first column and prints
Collaborator


Thanks for the comments!

@lvreynoso lvreynoso merged commit 034c1d5 into main Jan 26, 2024
14 checks passed
@lvreynoso lvreynoso deleted the lvreynoso/fix_redup_grep_hang branch January 26, 2024 19:52