SNPlocs.Hsapiens.dbSNP154.GRCh38 and SNPlocs.Hsapiens.dbSNP154.GRCh37 #3
Comments
Hey, I believe one of the main (if not the only?) uses of these packages is for MungeSumstats, hence why I'm answering this. The creation of supplementary dbSNP release packages is something that has been discussed here. The TL;DR is that creating these packages is very RAM-intensive and time-consuming (on the scale of 80 CPUs and 384 GB of RAM running for a week for each package), so it isn't really feasible with the current approach. Really we need to refactor how this is done, which isn't something @hpages or I have had time to do.
At least dbSNP154 is slightly smaller than dbSNP155 (729,491,867 RS count vs 1,085,850,277) so the requirements won't be so bad. Let me know if you want to give this a try @AhmedArslan, by following the overview of the process I provided here. I'll be happy to answer questions and provide more detailed guidance if needed. Best,
@Al-Murphy Actually, now that I look at the numbers, I see that the size of dbSNP156 is 1,130,597,309 RS count, which is really not that much bigger than dbSNP155 (only 4% bigger), especially compared to the growth between dbSNP154 and dbSNP155, which was 49%. So maybe I'll give a shot at forging SNPlocs.Hsapiens.dbSNP156.GRCh38 and SNPlocs.Hsapiens.dbSNP156.GRCh37 after all, in the next couple of weeks or so.
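For reference, the growth percentages can be re-derived from the RS counts quoted in this thread. This is only a small arithmetic sketch; the helper name `growth_pct` is made up for illustration.

```python
# RS counts as quoted in this thread.
rs_counts = {
    "dbSNP154": 729_491_867,
    "dbSNP155": 1_085_850_277,
    "dbSNP156": 1_130_597_309,
}


def growth_pct(old: int, new: int) -> float:
    """Relative growth from `old` to `new`, in percent (hypothetical helper)."""
    return (new - old) / old * 100


g_154_155 = growth_pct(rs_counts["dbSNP154"], rs_counts["dbSNP155"])
g_155_156 = growth_pct(rs_counts["dbSNP155"], rs_counts["dbSNP156"])

print(f"dbSNP154 -> dbSNP155: about {round(g_154_155)}%")
print(f"dbSNP155 -> dbSNP156: about {round(g_155_156)}%")
```

Rounded to whole percentages, this matches the 49% and 4% figures mentioned in the comment.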
@hpages the only limitation is that I do not have the resources to perform such an intensive analysis. Although if dbSNP155 is broadly different from dbSNP154 (as you mentioned) in terms of SNP ids, perhaps it's essential to produce dbSNP154?
Well, all I'm saying is that dbSNP155 has a lot more SNP ids than dbSNP154. That doesn't mean that the SNP ids in the latter are not in the former. IIUC dbSNP builds are incremental, with every new build mostly adding new SNPs to the previous one and making some corrections to the existing ones. So I would expect dbSNP155 to be a superset of dbSNP154, i.e. that most of the SNP ids found in the latter are still in the former. In other words, I would imagine that using dbSNP155 would still cover your use case. In the unlikely case that the SNPs in dbSNP154 have changed so much in dbSNP155 that the latter cannot be used to annotate the SNPs in the former, then this would suggest that the data in dbSNP154 is outdated, and that the GWAS Catalogue should probably be updated to be based on dbSNP155 in order to remain relevant. What's the plan anyways for the GWAS Catalogue? How often do they switch to a more recent dbSNP build? dbSNP 154 is more than 3 years old now so maybe it's time.
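The superset expectation above could be checked along these lines, assuming the rs ids of both builds are available as sets. The ids below are toy values invented for illustration, not real dbSNP content, and `coverage` is a hypothetical helper.

```python
# Toy rs ids, invented for illustration only; not real dbSNP content.
dbsnp154_ids = {"rs1", "rs2", "rs3", "rs4"}
dbsnp155_ids = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}


def coverage(older: set, newer: set) -> float:
    """Fraction of ids in `older` that are still present in `newer`."""
    if not older:
        return 1.0
    return len(older & newer) / len(older)


# Ids that the newer build would fail to annotate.
missing = dbsnp154_ids - dbsnp155_ids

print(f"coverage: {coverage(dbsnp154_ids, dbsnp155_ids):.0%}")
print(f"ids missing from the newer build: {sorted(missing)}")
```

A coverage close to 100% would support using the newer build in place of the older one; a low value would point to the "unlikely case" described in the comment.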
So earlier today I asked the GWAS folks about their plans to map to a more recent dbSNP build and I got the following answer:
One more reason to focus on dbSNP156! I will start working on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] this week.
Hi @hpages, has there been any update on the work on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37]? Ideally I would love to add them to MungeSumstats when available!
Thanks for the ping. The bad news is that we had many technical problems with the powerful server that I use for these things. The server is back, but it's not the first time that the IT people have managed to bring it back. However, they've always done it without really addressing the root causes, so almost zero progress has been made to improve reliability. Anyways, I'm trying to run my scripts there again but my expectations are low. Fingers crossed. The other bad news is that it turns out that these huge SNPlocs packages have contributed significantly to our egress costs in a way that is not sustainable for the Bioconductor project. The current format and mode of distribution are inadequate and will need to change. However, I don't have the bandwidth at the moment for this kind of refactoring. So if our server doesn't let me down and I manage to actually produce SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] then I'll put the tarballs on an egress-free location for you to manually download. This will be a temporary situation until I find the time to refactor these packages.
Good news is that Anyways, going to run the next steps:
Oops
The dbSNP156 packages are ready to go! Here are some numbers:
Note that the dbSNP156 packages require a machine with at least 16G of RAM instead of 10G for the dbSNP155 packages. This increase in memory footprint is due to the fact that the rs ids are now stored in a double vector instead of an integer vector (see my previous comment above for why this change was needed). This change also slows down the loading into memory of the rs ids vector (this loading happens the first time one of the But the most crazy number about these new packages is that I'll move the two packages here soon where they'll be available for download. IMPORTANT: They both require BSgenome >= 1.73.1 which will only become available in Bioconductor 3.20 in the next couple of days.
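A rough back-of-the-envelope sketch of the integer-to-double effect on memory: R integer vectors use 4 bytes per element and double vectors use 8. The estimate below assumes one vector element per RS id, using the dbSNP156 RS count quoted earlier in the thread as an upper bound. The packages may store fewer ids than that, and they hold other data too, so this is not a measurement of the actual footprint.

```python
# Assumption: one element per RS id, using the dbSNP156 RS count from this thread
# as an upper bound on the length of the rs ids vector.
N_RS_IDS = 1_130_597_309

BYTES_PER_INTEGER = 4  # R integer vector: 32-bit elements
BYTES_PER_DOUBLE = 8   # R double vector: 64-bit elements


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


int_gib = to_gib(N_RS_IDS * BYTES_PER_INTEGER)
dbl_gib = to_gib(N_RS_IDS * BYTES_PER_DOUBLE)

print(f"as integer vector: {int_gib:.1f} GiB")
print(f"as double vector:  {dbl_gib:.1f} GiB")
print(f"extra memory:      {dbl_gib - int_gib:.1f} GiB")
```

Under these assumptions the vector doubles in size, which is in the same ballpark as the jump from 10G to 16G mentioned above.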
Thank you Herve! I will add functionality to MungeSumstats so users can supply these to use dbSNP 156
The two packages are finally available at http://149.165.171.124/SNPlocs/
Thank you @hpages !
Not anytime soon, I'm afraid. Our policy for hosting such enormous data packages is being revisited because of excessive egress costs. See my previous comment above. Unfortunately I still don't have the bandwidth at the moment to work on the kind of refactoring that the SNPlocs packages would need.
Thank you for the answer. It's very understandable that you have to consider the costs of hosting huge files, especially considering that data size is growing constantly. I used the hosted version and everything works fine.
Hello, I would like to ask whether you could create dbSNP154.GRCh38/dbSNP154.GRCh37 or provide guidance on how to build dbSNP154. The GWAS Catalogue uses dbSNP154, so this would be helpful for working with GWAS data.
Many thanks.