f_orig_query sums to greater than 1.0 #2300

Closed
@nmb85

Description

Hi, I've run sourmash gather on approx 1,600 different metagenomes with the following:
parallel -j 64 sourmash gather -k 21 {} -o {/.}.gt ~/sourmash_dbs/nih_smgc_k21.sbt.zip ~/sourmash_dbs/gtdb-rs207.genomic-reps.dna.k21.zip ~/sourmash_dbs/genbank-2022.03-fungi-k21.zip ::: ../*.sig

My immediate goal is to know the total proportion of the metagenome that is contained in my databases. When I sum the f_orig_query column in each of the 1,000+ resulting csv files (for X in *.csv; do echo -ne $X"\t"; awk -F "," 'NR>1{sum=sum+$2} END{print(sum)}' $X; done | sort -nk 2,2), I get a wide distribution of proportions, and >99% are between 0 and 1.0, which is what I'd expect. However, the f_orig_query column in 7 of the csv files sums to more than 1.0, which I would not expect. Do I misunderstand the meaning of the f_orig_query column, or is there something else that causes a sum greater than 1.0?
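For anyone who wants to reproduce the sum, here is a slightly more defensive sketch of the same calculation. It looks the column up by its header name instead of assuming it is field 2. The function name sum_column is hypothetical (it is not part of sourmash), and the sketch assumes that no quoted field containing a comma appears before the target column.

```shell
# Hypothetical helper: sum one named column of a CSV file.
# Usage: sum_column FILE COLUMN_NAME
# Assumes no quoted, comma-containing field precedes the target column.
sum_column() {
  awk -F ',' -v col="$2" '
    # Header row: remember the index of the requested column.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    # Data rows: accumulate only if the column was found.
    idx     { sum += $idx }
    END     { printf "%.6f\n", sum }
  ' "$1"
}

# Example use over a directory of gather outputs (commented out here):
# for f in *.csv; do
#   printf '%s\t%s\n' "$f" "$(sum_column "$f" f_orig_query)"
# done | sort -t "$(printf '\t')" -nk 2,2
```

If the requested column is not present in the header, the function prints 0.000000 rather than failing, so it may be worth checking the header first.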

I've searched the issues, read this lovely guide to interpreting gather csvs, and haven't found an answer to my question, but I'm skeptical that it hasn't been asked already, so I apologize for any redundancy.

Here is an example gather csv from one of the metagenomes with an f_orig_query column sum greater than 1.0:
ju473iap108192021_S185_adj.csv
