consider outputting extra info in pangenome databases #14

Open
ctb opened this issue May 30, 2024 · 0 comments

ctb commented May 30, 2024

we have two problems with our current pangenome databases:

first, they present as "regular" sourmash sketches, with abundances. This could lead to misuse/mistakes.

second, it is annoying to track extra information (e.g. lineage counts as in #13) in a separate file.

there is an analogous issue over in sourmash, sourmash-bio/sourmash#2216, that talks about including taxonomy files in zip databases: the idea is that we can provide various standard lineage files in the actual .zip file databases, and then switch between them using CLI options (--gtdb and --ncbi, etc.)

so one idea here would be to produce the pangenome zip file full of sketches, and then add an extra file or two that indicate it's a pangenome database. This wouldn't necessarily prevent misuse (item 1 above) unless we adopted more metadata-in-zip-files in sourmash generally, but it would help a great deal with carting around extra files (item 2). The extra files could also help with debugging.
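
a rough sketch of what the "extra file" idea could look like, using only Python's standard `zipfile` and `json` modules. Everything specific here is an assumption for illustration: the marker filename `pangenome-info.json`, the function names, and the JSON fields are all hypothetical, and this does not check how sourmash itself handles unknown files inside a zip database.

```python
import json
import zipfile

# Hypothetical marker filename; not defined by sourmash or the pangenome plugin.
MARKER_NAME = "pangenome-info.json"


def add_pangenome_marker(zip_path, extra_info):
    """Append a JSON marker file to an existing zip file of sketches."""
    with zipfile.ZipFile(zip_path, "a") as zf:
        if MARKER_NAME in zf.namelist():
            raise ValueError(f"{zip_path} already contains {MARKER_NAME}")
        zf.writestr(MARKER_NAME, json.dumps(extra_info, indent=2))


def read_pangenome_marker(zip_path):
    """Return the marker contents, or None if no marker file is present."""
    with zipfile.ZipFile(zip_path) as zf:
        if MARKER_NAME not in zf.namelist():
            return None
        return json.loads(zf.read(MARKER_NAME))
```

a loader could then call `read_pangenome_marker(...)` and treat a `None` result as "regular database", while the returned dictionary would be one place to carry things like lineage counts instead of a separate file.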

it is kinda interesting to think about how to add more metadata generally; this is the closest thing we have over in sourmash-land: sourmash-bio/sourmash#2180
