Description
Hi, I've run sourmash gather on approx. 1,600 different metagenomes with the following:
parallel -j 64 sourmash gather -k 21 {} -o {/.}.gt ~/sourmash_dbs/nih_smgc_k21.sbt.zip ~/sourmash_dbs/gtdb-rs207.genomic-reps.dna.k21.zip ~/sourmash_dbs/genbank-2022.03-fungi-k21.zip ::: ../*.sig
My immediate goal is to know the total proportion of the metagenome that is contained in my databases. When I sum the f_orig_query column in each of the 1,000+ resulting csv files (for X in *.csv; do echo -ne "$X\t"; awk -F "," 'NR>1{sum+=$2} END{print sum}' "$X"; done | sort -nk 2,2), I get a wide distribution of proportions, and >99% are between 0 and 1.0, which is what I'd expect. However, the f_orig_query column in 7 of the csv files sums to more than 1.0, which I would not expect. Do I misunderstand the meaning of the f_orig_query column, or is there something else that causes a sum greater than 1.0?
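For anyone who wants to reproduce the check, the per-file sum can also be sketched in Python, reading the column by header name instead of by position. This is only a rough sketch under a couple of assumptions: the function names sum_column and flag_over_one are made up for illustration, and each csv is assumed to have a header row containing f_orig_query.

```python
import csv
from pathlib import Path


def sum_column(csv_path, column="f_orig_query"):
    """Sum one numeric column of a gather csv, looked up by header name."""
    total = 0.0
    # newline="" lets the csv module handle quoted fields that contain commas
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row[column])
    return total


def flag_over_one(directory=".", column="f_orig_query", threshold=1.0):
    """Return (filename, sum) pairs whose column sum exceeds threshold, sorted by sum."""
    sums = (
        (path.name, sum_column(path, column))
        for path in Path(directory).glob("*.csv")
    )
    return sorted(
        (item for item in sums if item[1] > threshold),
        key=lambda item: item[1],
    )
```

Calling flag_over_one() in the directory holding the csv files should list only the files whose sum is above 1.0, which in my case would be the 7 unexpected ones.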
I've searched the issues, read this lovely guide to interpreting gather csvs, and haven't found an answer to my question, but I'm skeptical that it hasn't been asked already, so I apologize for any redundancy.
Here is an example gather csv from one of the metagenomes with an f_orig_query column sum greater than 1.0:
ju473iap108192021_S185_adj.csv