
SNPlocs.Hsapiens.dbSNP154.GRCh38 and SNPlocs.Hsapiens.dbSNP154.GRCh37 #3

Open
AhmedArslan opened this issue Nov 3, 2023 · 16 comments


@AhmedArslan

Hello, I would like to request that you create dbSNP154.GRCh38/dbSNP154.GRCh37 packages, or provide guidance on building dbSNP154 myself. The GWAS Catalogue uses dbSNP build 154, so these packages would be helpful for anyone working on GWAS data.

Many thanks.

@Al-Murphy

Hey, I believe one of the main (if not only?) uses of these packages is MungeSumstats, hence why I'm answering this. The creation of supplementary dbSNP release packages has been discussed here.

The TLDR is that it is very RAM intensive and time consuming to create these packages (on the scale of 80 CPUs and 384 GB of RAM running for a week per package), so it isn't really feasible with the current approach. Really we need to refactor how it's done, which isn't something @hpages or I have had time to do.

@hpages
Owner

hpages commented Nov 4, 2023

At least dbSNP154 is slightly smaller than dbSNP155 (729,491,867 RS count vs 1,085,850,277) so the requirements won't be so bad.

Let me know if you want to give this a try @AhmedArslan, by following the overview of the process I provided here. I'll be happy to answer questions and provide more detailed guidance if needed.

Best,
H.

@hpages
Owner

hpages commented Nov 4, 2023

@Al-Murphy Actually, now that I look at the numbers, I see that dbSNP156 has an RS count of 1,130,597,309, which is really not that much bigger than dbSNP155 (only 4% bigger), especially compared to the growth between dbSNP154 and dbSNP155, which was 49%. So maybe I'll take a shot at forging SNPlocs.Hsapiens.dbSNP156.GRCh38 and SNPlocs.Hsapiens.dbSNP156.GRCh37 after all, in the next couple of weeks or so.
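As a quick sanity check on these growth percentages (a back-of-the-envelope calculation using only the RS counts quoted in this thread):

```python
# RS counts quoted in this thread, per dbSNP build
rs_counts = {154: 729_491_867, 155: 1_085_850_277, 156: 1_130_597_309}

def growth_pct(old: int, new: int) -> float:
    """Percent increase in RS count from one build to the next."""
    return 100 * (rs_counts[new] - rs_counts[old]) / rs_counts[old]

print(f"154 -> 155: {growth_pct(154, 155):.0f}%")  # 49%
print(f"155 -> 156: {growth_pct(155, 156):.0f}%")  # 4%
```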

@AhmedArslan
Author

> At least dbSNP154 is slightly smaller than dbSNP155 (729,491,867 RS count vs 1,085,850,277) so the requirements won't be so bad.
>
> Let me know if you want to give this a try @AhmedArslan, by following the overview of the process I provided here. I'll be happy to answer questions and provide more detailed guidance if needed.
>
> Best, H.

@hpages the only limitation is that I do not have the resources to perform such an intensive analysis. Although, if dbSNP155 is broadly different from dbSNP154 (as you mentioned) in terms of SNP ids, perhaps it's essential to produce dbSNP154?

@hpages
Owner

hpages commented Nov 6, 2023

> Although, if dbSNP155 is broadly different from dbSNP154 (as you mentioned) in terms of SNP ids

Well, all I'm saying is that dbSNP155 has a lot more SNP ids than dbSNP154. That doesn't mean that the SNP ids in the latter are not in the former.

IIUC dbSNP builds are incremental with every new build mostly adding new SNPs to the previous one and making some corrections to the existing ones. So I would expect dbSNP155 to be a superset of dbSNP154 i.e. that most of the SNP ids found in the latter are still in the former. In other words, I would imagine that using dbSNP155 would still cover your use case.
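As a purely illustrative sketch of that superset check (the rs ids below are made up; in practice one would pull the real ids from the SNPlocs packages, e.g. via `snpsById()` with `ifnotfound="drop"`):

```python
# Hypothetical rs id sets standing in for two dbSNP builds (made-up numbers)
dbsnp154_ids = {1042522, 1800497, 429358, 7412}
dbsnp155_ids = dbsnp154_ids | {2147714790, 12345}  # newer build mostly adds ids

# Fraction of the older build's ids still present in the newer one
coverage = len(dbsnp154_ids & dbsnp155_ids) / len(dbsnp154_ids)
print(f"coverage of dbSNP154 ids in dbSNP155: {coverage:.0%}")  # 100% in this toy case
```

If the real coverage is close to 100%, annotating dbSNP154-based data with a dbSNP155 package should work for most ids.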

In the unlikely case that the SNPs in dbSNP154 have changed so much in dbSNP155 that the latter cannot be used to annotate the SNPs in the former, then this would suggest that the data in dbSNP154 is outdated, and that the GWAS Catalogue should probably be updated to be based on dbSNP155 in order to remain relevant.

What's the plan anyway for the GWAS Catalogue? How often do they switch to a more recent dbSNP build? dbSNP 154 is more than 3 years old now, so maybe it's time.

@hpages
Owner

hpages commented Nov 7, 2023

So earlier today I asked the GWAS folks about their plans to map to a more recent dbSNP build and I got the following answer:

Hi Hervé,

Thanks for your interest in the GWAS Catalog. We use dbSNP mappings from Ensembl, which is currently on Build 154. However, we expect that with the next release scheduled for this month, the mapping will be updated to dbSNP 156. See Ensembl’s page here: https://www.ensembl.info/2023/09/13/whats-coming-in-ensembl-111-ensembl-genomes-58/

I understand that build 155 will be skipped.

Best wishes,
Elliot Sollis
GWAS Catalog Curator

> On 6 Nov 2023, at 18:44, Hervé Pagès via gwas-info <gwas-info@ebi.ac.uk> wrote:
>
> Hi,
>
> Are there any plans to update the GWAS catalogue to map it to dbSNP Build 155 or 156 instead of dbSNP Build 154?
>
> Is there a timeline for that?
>
> Thanks,
>
> H.
> -- 
> Hervé Pagès
>
> Bioconductor Core Team
> [hpages.on.github@gmail.com](mailto:hpages.on.github@gmail.com)

One more reason to focus on dbSNP156!

I will start working on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] this week.

@Al-Murphy

Hi @hpages, has there been any update on the work on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37]? Ideally I would love to add them to MungeSumstats when available!

@hpages
Owner

hpages commented Sep 6, 2024

Thanks for the ping.

The bad news is that we had many technical problems with the powerful server that I use for these things. The server is back, but this is not the first time the IT people have managed to bring it back; they've always done so without really addressing the root causes, so almost zero progress has been made on improving reliability.

Anyways I'm trying to run again my scripts there but my expectations are low. Fingers crossed.

The other bad news is that it turns out that these huge SNPlocs packages have contributed significantly to our egress costs in a way that is not sustainable for the Bioconductor project. The current format and mode of distribution is inadequate and will need to change. However I don't have the bandwidth at the moment for this kind of refactoring. So if our server doesn't let me down and I manage to actually produce SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] then I'll put the tarballs on an egress-free location for you to manually download.

This will be a temporary situation until I find the time to refactor these packages.

@hpages
Owner

hpages commented Sep 9, 2024

Good news is that extract_snvs_from_RefSNP_json_files.sh completed 🎉 This is by far the most resource-intensive step in the pipeline. What's surprising is that it took "only" 67h to run, which is fast compared to the 100h it took for dbSNP155 on the same server a couple of years ago. I don't want to call this good news though before I understand the reason behind such a big difference; it could actually hide something bad.

Anyways, going to run the next steps: select_GRCh38_snvs.sh + build_GRCh38_OnDiskLongTable.sh and select_GRCh37_snvs.sh + build_GRCh37_OnDiskLongTable.sh. These are expected to take a few hours only...

@hpages
Owner

hpages commented Sep 9, 2024

Oops, select_GRCh38_snvs.sh fails with dbSNP156 because some rs ids are too big to fit in an integer (e.g. rs2147714790). Switching to a double vector instead of an integer vector to store the billion or so rs ids. Unfortunately this will make the resulting SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] packages significantly bigger. 😞
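The overflow is easy to see: R's integers are 32-bit signed, so the largest representable value (`.Machine$integer.max` in R) is 2^31 - 1 = 2,147,483,647, while a double represents every integer exactly up to 2^53. A quick check of the numbers:

```python
INT32_MAX = 2**31 - 1     # R's .Machine$integer.max
DOUBLE_EXACT_MAX = 2**53  # doubles represent every integer exactly up to here

rs_id = 2147714790        # numeric part of rs2147714790, from the comment above

print(rs_id > INT32_MAX)         # True: does not fit in a 32-bit signed integer
print(rs_id < DOUBLE_EXACT_MAX)  # True: safely representable as a double
```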

@hpages
Owner

hpages commented Sep 10, 2024

The dbSNP156 packages are ready to go!

Here are some numbers:

1. Nb of SNPs (i.e. nb of rs ids):
   - 989,474,892 in SNPlocs.Hsapiens.dbSNP156.GRCh38 (vs 949,021,448 in SNPlocs.Hsapiens.dbSNP155.GRCh38)
   - 967,680,084 in SNPlocs.Hsapiens.dbSNP156.GRCh37 (vs 929,496,192 in SNPlocs.Hsapiens.dbSNP155.GRCh37)

   So not a tremendous increase between dbSNP155 and dbSNP156 (only about 4.2%).

2. Sizes of the source tarballs:
   - 6.7G for SNPlocs.Hsapiens.dbSNP156.GRCh38 (vs 5.8G for SNPlocs.Hsapiens.dbSNP155.GRCh38)
   - 6.6G for SNPlocs.Hsapiens.dbSNP156.GRCh37 (vs 5.7G for SNPlocs.Hsapiens.dbSNP155.GRCh37)

Note that the dbSNP156 packages require a machine with at least 16G of RAM, instead of 10G for the dbSNP155 packages. This increase in memory footprint is due to the fact that the rs ids are now stored in a double vector instead of an integer vector (see my previous comment above for why this change was needed). This change also slows down the loading into memory of the rs ids vector (this loading happens the first time one of the snpsBy*() functions is called). It also makes the source tarballs slightly bigger.
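Back-of-the-envelope arithmetic on the in-memory rs ids vector (using the GRCh38 count from above) shows why the jump from 10G to 16G is plausible:

```python
n_ids = 989_474_892  # rs ids in SNPlocs.Hsapiens.dbSNP156.GRCh38 (from above)

as_int32 = n_ids * 4 / 2**30   # 4-byte integers, in GiB
as_double = n_ids * 8 / 2**30  # 8-byte doubles, in GiB

print(f"int32 vector: {as_int32:.1f} GiB, double vector: {as_double:.1f} GiB")
# the double vector alone costs roughly 3.7 GiB more than the int32 vector
```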

But the craziest number about these new packages is that R CMD build takes more than 3 hours to complete, despite the fact that the packages have no vignettes! Luckily that doesn't affect the end user, only me 😓

I'll move the two packages here soon where they'll be available for download. IMPORTANT: They both require BSgenome >= 1.73.1 which will only become available in Bioconductor 3.20 in the next couple of days.

@Al-Murphy

Thank you Hervé! I will add functionality to MungeSumstats so users can supply these to use dbSNP 156.

@hpages
Owner

hpages commented Sep 11, 2024

The two packages are finally available at http://149.165.171.124/SNPlocs/

@sroener

sroener commented Jan 20, 2025

Thank you @hpages !
Is there a plan to move the two packages to an official Bioconductor repository anytime soon?

@hpages
Owner

hpages commented Jan 21, 2025

Not anytime soon, I'm afraid. Our policy for hosting such enormous data packages is being revisited because of excessive egress costs. See my previous comment above. Unfortunately I still don't have the bandwidth at the moment to work on the kind of refactoring that the SNPlocs packages would need.

@sroener

sroener commented Jan 22, 2025

Thank you for the answer. It's very understandable that you have to consider the costs of hosting huge files, especially as data sizes keep growing.

I used the hosted version and everything works fine.
