Fix issue with RunRedup hanging on large duplicate cluster sizes #318
Conversation
- Looks good! Just some questions.
How was this reviewed?
- Read through the code
- Ran the code and examined output files
- Stepped through the code
Are there files that were not reviewed?
- No
Are there tests included?
- Yes
- No - and why? Already covered
@@ -224,12 +228,13 @@ task RunRedup {
    export SUBSAMPLED_READS_FILES=(~{sep=' ' subsampled_reads})

    for index in ${!HOST_FILTERED_READS_FILES[@]}; do
      seqkit grep -f duplicated-pairs.txt ${HOST_FILTERED_READS_FILES[$index]} | seqkit fq2fa -o duplicate_reads_$index.fasta
      cat ${SUBSAMPLED_READS_FILES[$index]} duplicate_reads_$index.fasta > redups_$index.fasta
+     part=$(($index + 1))
Is the change here just to ensure that the files are 1-indexed?
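For context, bash arrays are zero-indexed, so deriving `part` from the loop index gives 1-based numbering for anything named with it. A minimal standalone sketch of the pattern (the file names here are illustrative, not the task's actual outputs):

```bash
#!/usr/bin/env bash
# Bash array indices start at 0; $((index + 1)) shifts them to
# 1-based part numbers, e.g. for naming per-pair output files.
FILES=(sample_R1.fastq sample_R2.fastq)  # hypothetical inputs
for index in "${!FILES[@]}"; do
  part=$((index + 1))
  echo "array index $index -> part $part"
done
# prints:
#   array index 0 -> part 1
#   array index 1 -> part 2
```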
    python3 <<CODE
    pair_values = []
    passed_filters = set()
    duplicates = set()
    with open("passed_filters.txt", "r") as pf:
        passed_filters.update(pf.read().splitlines())
-   with open("duplicated-reads.txt", "r") as dr:
+   with open("duplicated_reads.txt", "r") as dr:
        duplicates.update(dr.read().splitlines())
    with open("~{clusters}", "r") as clusters:
        for line in clusters:
            key, value = line.strip().split(",")
            if key in duplicates and key in passed_filters and key != value:
I know I wrote this line, but I'm a bit dubious that all of it is necessary. If `key != value`,
I think that by definition means it'll be a duplicate. We should merge this in to get the fix out, but possibly consider removing this check entirely.
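To make that concrete, here is a toy sketch (assuming, from the `split(",")` above, that each clusters csv line is a `read_id,cluster_representative` pair): any row whose two fields differ is by construction a non-representative member of a multi-read cluster, i.e. a duplicate.

```bash
# Hypothetical clusters csv: read id, then its cluster representative.
cat > clusters.csv <<'EOF'
read1,read1
read2,read1
read3,read3
read4,read1
EOF

# Rows where the read id differs from its representative are exactly
# the duplicate members, with no separate membership lookup needed.
awk -F ',' '$1 != $2 {print $1}' clusters.csv
# prints:
#   read2
#   read4
```

If that holds for real data, the `key in duplicates` membership test adds nothing on top of `key != value`, which is the redundancy the comment is pointing at.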
@@ -185,8 +185,12 @@ task RunRedup {
  }
  command <<<
    set -euxo pipefail
-   # exit if no duplicate reads
-   if [[ ! $(grep -v "^1\t" "~{cluster_sizes}") ]]; then
+   # extract duplicate read ids from the cluster sizes tsv; awk looks at the first column and prints
Thanks for the comments!

CZID-8963

`grep` has been replaced with `awk` when processing the duplicate cluster sizes tsv and clusters csv files. File names now use underscores (`_`) instead of dashes (`-`).
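The truncated hunk above doesn't show the new awk program, but based on this reply its shape is roughly the following sketch (not the PR's exact code; the tsv layout, cluster size in column 1 and read id in column 2, is inferred from the old `grep -v "^1\t"` pattern):

```bash
# Old check: capture every non-singleton line in $(...) just to test
# for emptiness, e.g.
#   if [[ ! $(grep -v "^1\t" cluster_sizes.tsv) ]]; then ... fi
# which buffers all matching lines in memory.

# Sketch of an awk-based replacement: exit early if every cluster has
# size 1, otherwise stream out duplicate read ids one per line.
if ! awk -F '\t' '$1 > 1 {found = 1; exit} END {exit !found}' cluster_sizes.tsv; then
  echo "no duplicate reads"
  exit 0
fi
awk -F '\t' '$1 > 1 {print $2}' cluster_sizes.tsv > duplicated_reads.txt
```

One plausible benefit: awk stops at the first non-singleton cluster for the existence test and never buffers its matches, which matters when duplicate clusters are large.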