"The index file is older than the data file" when opening 1000 Genomes VCF #877

Closed
PlatonB opened this issue Jan 15, 2020 · 3 comments

@PlatonB

PlatonB commented Jan 15, 2020

elementary OS 5.1
pysam 0.15.3

  1. Download any 1000 Genomes VCF:
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz

  2. Download corresponding index:
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz.tbi

  3. Try to create a VariantFile object:
    from pysam import VariantFile
    variant_file_obj = VariantFile('/path_to/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz')

[W::hts_idx_load2] The index file is older than the data file: /path_to/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz.tbi
[W::hts_idx_load2] The index file is older than the data file: /path_to/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz.tbi

The previous version of pysam did not print this warning.

@PlatonB
Author

PlatonB commented Apr 27, 2020

Pysam 0.15.4: the issue is still present.

@jmarshall
Member

That warning message is from htslib, and has been in pysam's bundled htslib since 2014. It has not changed. Presumably when you saw no message from previous versions of pysam, your previous copies of these files had different timestamps.

What are the timestamps on the /path_to/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz* files on your machine? I expect the warning is accurate. You should touch or regenerate the .tbi file to resolve it.
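
A minimal sketch of the check and the "touch" fix described above, using only the Python standard library. It assumes the warning comes down to a plain comparison of file modification times and that the index really does match the data file; the helper names (`index_is_stale`, `refresh_index_if_stale`) are hypothetical and not part of pysam or htslib.

```python
import os


def index_is_stale(data_path, index_path):
    """Return True if the index's mtime is earlier than the data file's.

    Assumption: this mirrors the mtime comparison behind the htslib warning.
    """
    return os.path.getmtime(index_path) < os.path.getmtime(data_path)


def refresh_index_if_stale(data_path, index_path=None):
    """Touch the index (like `touch file.tbi`) if it looks older than the data.

    Returns True if the index timestamp was updated, False otherwise.
    Only sensible when the index content is known to match the data file;
    otherwise the index should be regenerated instead.
    """
    if index_path is None:
        # Assumption: the index sits next to the data file with a .tbi suffix.
        index_path = data_path + ".tbi"
    if index_is_stale(data_path, index_path):
        # times=None sets both access and modification times to "now".
        os.utime(index_path, None)
        return True
    return False
```

Calling `refresh_index_if_stale('/path_to/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz')` before opening the `VariantFile` would update the `.tbi` timestamp only when needed. If there is any doubt that the index matches the data, regenerating it is the safer choice.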

The timestamps for these files on ftp.1000genomes.ebi.ac.uk make it appear that the .tbi index may be out of date:

$ curl -I ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz
Last-Modified: Fri, 05 May 2017 14:26:27 GMT
Content-Length: 1061765413
Accept-ranges: bytes
$ curl -I ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6_GRCh38.genotypes.20170504.vcf.gz.tbi
Last-Modified: Fri, 05 May 2017 13:50:07 GMT
Content-Length: 161964
Accept-ranges: bytes

so if your downloaded files reflect these timestamps (e.g., you used curl -R) then pysam will correctly produce this warning.
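
To make the comparison above concrete, here is a small sketch that parses the two `Last-Modified` values quoted from the `curl -I` output and compares them. The function name `index_older_than_data` is hypothetical; the header strings are copied from the output above.

```python
from email.utils import parsedate_to_datetime


def index_older_than_data(data_last_modified, index_last_modified):
    """Compare two HTTP-style date strings; True if the index is older."""
    data_time = parsedate_to_datetime(data_last_modified)
    index_time = parsedate_to_datetime(index_last_modified)
    return index_time < data_time


# Values taken from the curl -I output for the .vcf.gz and .vcf.gz.tbi files.
DATA_LAST_MODIFIED = "Fri, 05 May 2017 14:26:27 GMT"
INDEX_LAST_MODIFIED = "Fri, 05 May 2017 13:50:07 GMT"

# How far the index timestamp lags behind the data file on the server.
gap = parsedate_to_datetime(DATA_LAST_MODIFIED) - parsedate_to_datetime(
    INDEX_LAST_MODIFIED
)
```

With these values the index is about 36 minutes older than the data file, so a download that preserves the server timestamps would reproduce the same ordering locally.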

@PlatonB
Copy link
Author

PlatonB commented Apr 28, 2020

Thanks for the explanation.

@PlatonB PlatonB closed this as completed Apr 28, 2020
PlatonB added a commit to PlatonB/ld-tools that referenced this issue Oct 28, 2020
- Last time I implemented the filtering of repeated insertions of differing sizes incorrectly. The new algorithm is definitely correct.
- The CHROM-POS-ID table of the conversion database is no longer populated in chunks. Not enough data goes into it to raise concerns about excessive RAM usage.
- Temporarily worked around the problem of indexing the chrX file with newer Pysam versions (samtools/bcftools#1154). For chrX, a ready-made index is now downloaded from the 1000 Genomes FTP. Why not do this for all 1000G archives? Because htslib would then spam you with numerous warnings (see pysam-developers/pysam#877).
- Made the code that checks whether particular files exist more elegant.