-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bcftools 1.10 fails to index files with wrong INFO/END values but bcftools 1.9 worked fine #1154
Comments
Also, more importantly, if I output the VCF to a binary format and then I try to index as follows:
I get this very weird error:
It seems like there was a failure to check for END being greater than POS in the encoding to binary format that caused some sort of overflow. |
Can you please try with the latest htslib version? This is a duplicate of samtools/htslib#1006 and was addressed by samtools/htslib@ea879e1. |
I have just re-cloned and re-compiled bcftools and I get the same exact errors. I have noticed though that the download of the VCF snippet with bcftools view often fails, sometimes yielding a VCF without variants, but if I retry it many many times it eventually works. I am not sure why this is. Maybe it is a temporary issue at the time of this writing with the EBI webiste. |
Actually I am noticing this HTSlib problem that is unrelated but it seems also like a serious HTSlib bug, so it might be worth it reporting it. At the very moment the EBI website seems to return 404 errors for files that are actually present. When I run the following command:
I get three possible outputs:
Or I get a VCF without variants or I get a VCF with both rs6640136 and rs57675010 variants. The first output is okay, as it means the EBI website provided a 404 answer. The third outcome is the desired outcome. The second outcome seems to be due to the fact that the EBI website provided the header fine but when bcftools tried to get the X:4380006-4461913 region it provided a 404 answer. This is troublesome because bcftools did not throw any error. This does not seem like a desired behavior. |
I am not able to reproduce, I only get a correct warning with the latest version:
Make sure the latest version is used and a truncated version of the remote index is not present in the local directory. |
I have the same version as you do.
Also, I am sure it is not just me as this bug was reported to me from someone in Finland that just installed HTSlib and BCFtools from github a few days ago. |
You are only showing the indexing step. Can you make sure the BCF was created by the new version as well? Either remove and recreate or look in the header for the stamp. Otherwise, after some experimentation I see that several problems can occur:
The error message in the last case could be made more informative and give a hint that the problem can be temporary, otherwise it seems to be working OK. |
I am sorry to mix two different bugs, but it is not a problem of the indexing file being truncated. That was downloaded once and for all and it copied fine. What happens to me occasionally is that I run:
which runs without any errors or warnings. And then I get:
Which means the header loaded okay, but then I get no bgzf_read or hts_idx_load3 error. Somehow EBI just did not send anything. I am not sure whether HTSlib can tell that it was a 404 error message. |
Sorry, you are right, the latest version of bcftools does not reproduce the malformed bcf file. May this got mixed with a malformed index file or something. However, the following recreates the problem with vcf.gz files:
While bcftools 1.9 was capable of indexing just fine. You can close the issue. |
In fact it does — see samtools/hts-specs#266 (comment) and samtools/hts-specs#436, and also #1153.
In fact they did not. Prior to 1.10, records with INFO/END < POS could lead to
It is unclear whether your query is returning no results due to this problem with a previously built broken index. Do your query coordinates vary, or are you saying you are intermittently getting different results for |
Okay, I have made a mess of a bug report, but here is the bug, reproducible with the latest version of bcftools:
This is quicker way to reproduce it
In both cases the error I get is the following:
The problem seems to arise from bcftools norm splitting a variant with INFO/END smaller than POS. I previously dissected the bug incorrectly. |
I see it now. The |
This is an extension of ea879e1 which added checks for reading VCF. Resolves samtools/bcftools#1154
This is an extension of ea879e1 which added checks for reading VCF. Resolves samtools/bcftools#1154
This is an extension of ea879e1 which added checks for reading VCF. Resolves samtools/bcftools#1154
@jmarshall @pd3 This bug recently hit pysam. It wasn't in v0.15.4, it appeared in v0.16.0. |
This is a feature, not a bug. The previous behaviour of not diagnosing the invalid INFO/END value and producing an unusable index was buggy. See #1154 (comment), in particular the biostars posting linked from there. What program has produced the INFO/END values that you are seeing diagnostic messages about? |
@jmarshall I tried to index the same chrX archive from 1000 Genomes as @freeseek. Just in case, I quote an error message:
|
@jmarshall, for background, the file being discussed is a liftover of 1000 Genomes phase three from GRCh37 to GRCh38. The END coordinate issue was introduced by the liftover process (which in this case was fairly complicated, going via dbSNP). Brief documentation (including this issue) is here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/README_GRCh38_liftover_20170504.txt. Ensembl share a version of this data where (I believe) these cases will have been filtered out - I notice that that isn't mentioned in the existing README (we will aim to update it). Also, probably worth highlighting that there are increasing numbers of call sets for the 1000 Genomes samples called directly on GRCh38 (where liftover artefacts are not an issue and all of GRCh38 was included in the calling) . We are happy to take questions on the data at info@1000genomes.org. |
As far as I understand, liftover was applied for all chromosomes. Then why does this problem occur only for chrX? |
I think chromosome X received special treatment. Notice for example that they forgot to liftOver the PAR1 and PAR2 regions ... |
@freeseek - regarding chromosome X, as noted, the liftover process for this data set was somewhat atypical/complicated going via dbSNP (rather than liftOver directly GRCh37 to GRCh38). Given that chromosome X frequently suffers from anomalies associated with the PARs, etc., I find it plausible that there are some problems specific to chromosome X but I'm not aware that the processing of chromosome X was any different from the others (based on the information available to me). @PlatonB - I think the safest approach is to treat these case by case. This could be down to circumstances that only occur on chromosome X but I'm afraid I can't guarantee that. You are correct that the liftover was applied to all chromosomes. |
This doesn't remove the fact that the data is erroneous otherwise the newer code wouldn't be complaining. If 1000 genomes have erroneous data then they need to be informed. Previously it used this invalid fields without checking and could silently give the wrong results. There's nothing we can do to make it give the correct results as the primary data needs fixing. |
@jkbonfield, agreed that this is a data issue and not an issue with the software - the current software behaviour described (not allowing the error to go undetected) sounds correct to me. For the data, 1000 Genomes are aware - this has been listed as a known issue for two years. We will, however, point people to the filtered version of the files hosted by Ensembl. |
- В тот раз отсев разноразмерных повторных вставок я реализовал неправильно. Новый алгоритм уж точно верный. - Пополнение CHROM-POS-ID-таблицы конвертационной БД больше не производится кусками. Не так уж и много туда загоняется данных, чтобы возникало опасение перерасхода оперативной памяти. - Временно обошёл проблему индексации chrX-файла новыми версиями Pysam (samtools/bcftools#1154). Теперь для chrX будет качаться готовый индекс из FTP 1000 Genomes. Почему я не стал так делать для всех 1000G-архивов? Тогда htslib заспамил бы вас многочисленными ворнингами (см. pysam-developers/pysam#877). - Сделал код проверки существования тех или иных файлов более элегантным.
commit a7a90fe913f8a466f32f6e284cf46653944acd6f Merge: fd0f895 cd0a84b Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Tue Sep 22 13:15:49 2020 +0100 Release 1.11 commit cd0a84b490e910c085b987e3026f724be98349bf Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Tue Sep 22 13:14:15 2020 +0100 More minor NEWS fixes commit 80161279b755d83df6e0dbc52942f625df414452 Author: James Bonfield <jkb@sanger.ac.uk> Date: Fri Sep 18 11:44:04 2020 +0100 Add back assert.h to vcf.h. This undoes a change in #1060. It's a philosphical point. On one hand we have an inclusion of a spurious file that we do not utilise ourselves. On the other hand we have an API breakage where old software (in this case bcftools 1.10) attemting to compile against current htslib no longer succeeds because we've changed which symbols we declare. This is unclean, but I think backwards compatibity trumps tidyness in this case. commit 75e2017bb097cb47722ac8796de10036d21c9625 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu Sep 17 09:43:36 2020 +0100 Minor NEWS updates Reword tabix --separate-regions description. Fix PR#1094 text -- the original has <space> in the description (edit the PR comment to see it). Credit @lczech. Trivial formatting fixes. Spelling mistakes PR surely too trivial to mention :-). commit 1748a2dd34e5ff4b4800fcf04dbd5860d8c30f22 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Sep 16 10:10:46 2020 +0100 More NEWS updates commit d50a3a531c597a24644c6878fc0068d0d1cc0a0d Author: Andrew Whitwham <whitwham@users.noreply.github.com> Date: Mon Sep 14 20:49:22 2020 +0100 News update Sept 2020. commit af6dbb49b2d522bfd6d9762bd9329b28aa0b1e70 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Wed Sep 16 10:17:50 2020 +0100 Spelling corrections [minor] (PR #1137) commit 170047656473a7fadf9f23f3afdfac46b9c52b21 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Mon Aug 17 17:18:27 2020 +0100 Handle q==0.0 case in kt_fisher_exact() and add tests When some of kt_fisher_exact()'s input arguments are very large, hypergeo(...) is dominated by the subtracted lbinom() so returns exp(-750 or less), which underflows, hence hypergeo() returns 0.0. When q is 0.0, kt_fisher_exact()'s loops never execute and which of i and j is closer to n11 is indeterminate. To avoid this problem a special case is added for q == 0.0 which determines the output values by comparing n11 to the location of the peak of the hypergeometric distribution. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit 9067dee60e91eafba657b13e502307d500dcd489 Author: James Bonfield <jkb@sanger.ac.uk> Date: Fri Sep 11 14:16:28 2020 +0100 Fixed a crash in bcf_record_check. Commit 3ac8a04 breaks the check if hdr is passed in as NULL. This never happens in htslib tests, but does in bcftools. commit 8bab82bdb8c2613e1ca7bd5573d7c12117a2dc02 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu Sep 10 07:05:16 2020 +0000 Check against the proper type lenght. commit 34ba6c726925efceaa2cc995d7d4b0409907f331 Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Sep 8 18:23:24 2020 +0100 Add sam_format_aux1 and bam_aux_get_str APIs. (PR #1134) sam_format_aux1 is extracted out from sam_format1_append, and indeed is the lion's share of this function. It is useful as an external API for anything that is generating its own tags or a partial tag list in SAM format, such as "samtools fastq". The tag key, type and data are passed in as separate pointers to aid usage with bam_aux_get which returns 2 bytes into the in-memory representation of a tag. It also allows the function to be used in cases where these are not stored together, for example the internal CRAM representation. bam_aux_get_str is a wrapper around sam_format_aux1 which simplifies the case of converting a single tag from a record in a bam1_t struct. commit 3ac8a04f8f6071be0901a9ddcda296f58b2bcf0c Author: James Bonfield <jkb@sanger.ac.uk> Date: Wed Aug 26 14:29:55 2020 +0100 Sped up bcf_check_record. This is mainly by speeding up bcf_dec_typed_int1_safe, but also by some other code in there to streamline the function and keep the loops tighter. The effect on clang is slight. On gcc9 reading an uncompressed BCF file can be a bit faster (up to 15% on one file), bringing the speed of gcc and clang to similar speeds when using -O3. (Gcc -O2 is still noticably poorer than clang -O2 on this test though.) Examples on CHM1_CHM13_2.1.gatk_raw.gvcf.ubcf and gcc9 -O3. Before: 1549.79 msec task-clock # 0.999 CPUs util 5549055377 cycles # 3.581 GHz 16764915993 instructions # 3.02 insn per After: 1345.83 msec task-clock # 0.999 CPUs util 4806341286 cycles # 3.571 GHz 13942053336 instructions # 2.90 insn per Record checking is still around 50% of the entire decode CPU cost though. commit f4f7f24914251744a928eb6f2aa8e89fa5788d0c Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Sep 1 12:28:33 2020 +0100 Improvement (personal preference) to previous C99 fix. In #498 I removed the warning: vcf.c:3030:26: warning: assignment makes pointer from integer without a cast [-Wint-conversion] d->allele[i] = tmp.l; by storing a pointer starting from a fixed "" string, and later on subtracting it back again to get the length. While that works, it's pretty icky and I think undefined behaviour as it uses pointer arithmetic outsides the bounds of an object. (In reality it's fine on all modern systems as the days of segmented pointer archictures are long gone.) Instead I now cast the length into a char* as the size of allele is never going to grow beyond the size of a pointer, and then cast back prior to adding to the d->als pointer. [Ideally we wouldn't need these shenanigans at all and keep it in index space instead of pointers so growing memory doesn't require a lot of pointer juggling, but bcf_dec_t is a public structure.] commit e9447f86ec386798bdf0cc9ee3646f982328e321 Author: James Bonfield <jkb@sanger.ac.uk> Date: Thu Aug 27 12:22:01 2020 +0100 Remove now unused ks_resize2 function. NB: This never made it into a release so it isn't an ABI breaking change and nor does it need to be listed in the NEWS file. commit e1d314931fe6b2c8d8bc5d0330b989d33adb175d Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu Aug 27 11:09:08 2020 +0300 Have realloc and free in the same layer of kstring (#1129) Bring realloc back to the inline function, so that the allocation and freeing of memory are done by the calling layer. Replace the roundup to a power of 2 with a simpler formula that multiplies the desired realloc size with 1.5. Try to allocate a maximum of SIZE_MAX/2 bytes. commit 78648e2a93aaec850ae47fa56cc5c68deb924fa0 Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Aug 25 16:52:59 2020 +0100 Bug fix to the pileup constructor. We call constructor and destructor for each new bam object that is added to and removed from the pileup. Pileup also takes its own copy of the bam1_t struct as sam_read1 loops may be using the same structure over and over again. However the constructor was called with the passed in bam1_t instead of pileups own copy of it (unlike destructor). Hence anything the constructor attempts to cache, such as the location of an aux tag, may get written over. commit 0456cec92e06f4354bbdde26dd3a3b9d3b8bb0ae Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Fri Jul 31 11:16:59 2020 +0100 Clarify that htsFile fields are not to be used directly commit adeeb8870eadf3e005ae0f17e773d7d90cf58624 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu Jul 30 11:27:20 2020 +0100 Add bcf_ prefix to variant_t struct This struct is used only as the type of a field within bcf_dec_t, which itself is an internal field of bcf1_t. So user code is very unlikely to be using this identifier directly. commit 71b9e4ff763e97f66e0a139a4f3b2974075fb317 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Sat Jul 11 12:34:42 2020 +0100 Rename public HTSlib struct tags to omit leading underscores File-scope identifiers starting with an underscore are reserved to the compiler, and there is nothing wrong with a struct tag matching a typedef name. In cases where foo_t is actually a typedef to a pointer, rename the struct tag from __foo_t to foo_s to distinguish it from foo_t. commit 2fb3b7e178304eff76381c630e54e6408a94f834 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu Jul 30 10:53:04 2020 +0100 Remove unused aux_key_t typedef This was added during PR #608 but accidentally not removed when the code using the typedef was removed later in the PR's development. commit 3c6f04be2fa45ee8eb4c2f9a171f2ba8a6032588 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Wed Jul 29 14:09:14 2020 +0100 Add struct tags for public typedeffed structs This enables user code to forward declare these types with e.g. "struct htsFile;" without including <htslib/hts.h>. This commit adds tags for typedefs that previously didn't have one. commit 7ecaaf90d4515fa9747e7536234722e5d31e514a Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Aug 20 17:43:27 2020 +0100 Ignore errors when closing the input file in the fuzzer Since 3855184b, hts_close() returns non-zero if earlier errors were detected on the file handle. As this frequently happens with fuzz input, it should not be treated as a fuzzing failure. commit 4162046b28a7d9d8a104ce28086d9467cc05c212 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Aug 13 15:45:59 2020 +0100 Tidy bcf_hdr_subset() error handling Gather clean-up code in one place, fixing some cases that were incompletely handled. Catch more memory errors directly. commit 564baf0364762cdd7fadde03ff333c379ec078f8 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Aug 13 12:21:29 2020 +0100 Improve clean-up in hts_tpool_init() Allocate hts_tpool::t_stack earlier so it's easier to clean up if it fails. Add error handling code to tidy up any threads that got created prior to a pthread_create() failure. Rename variable i to t_idx so it's easier to see how it links between the thread creation and the clean-up code. Adds a dependency on htslib/hts_log.h so it can use hts_log_error(). commit 260c5a35ad855788baba16c5419ae50b6d23061a Author: James Bonfield <jkb@sanger.ac.uk> Date: Mon Aug 10 09:59:50 2020 +0100 Improve bcf_hdr_format error checking. It now handles ksprintf failing, and all the calls to it now also check for failure and bubble it up one layer. Everything that calls that layer already had checks, although mostly these are absent in bcftools. (That's an entirely new repository and PR though.) Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit 6e759d374751f2743811bedfb761d6feacbced6b Author: James Bonfield <jkb@sanger.ac.uk> Date: Mon Aug 10 09:46:04 2020 +0100 Improved return from bcf_hdr_parse_line to enable error detection. The function returns NULL for both failure and also for running out of text. Hence it's executed in a loop until it returns NULL, assuming success. On successful end it also sets *len to 0, so now on error we explicitly set *len to non-zero so it is possible to use that as a side channel for disambiguating EOT from error. Values returned in *len are documented in the header file. If bcf_hdr_parse_line() returns NULL due to a malformed line, bcf_hdr_parse() can use the returned value in *len to skip to the next one instead of searching for the newline itself. To stop multiple warning messages from being printed on finding malformed lines, make hts_hdr_parse_line() responsible for logging both cases that it traps and remove the corresponding logging from hts_hdr_parse(). Use hts_strprint() when printing bad lines to sanitise any unprintable characters. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit 3855184b58ff51735e9a70282a286702a24ae7f3 Author: James Bonfield <jkb@sanger.ac.uk> Date: Thu Aug 6 11:11:35 2020 +0100 More mem-checking & threading hardening (write side). - Failing to allocate the structure to store a thread result now shuts down all processing queues. This avoids potential deadlock. - Failure in the thread pool now sets the shutdown signal to 2 instead of 1, offering a simple way to determine error-induced shutdown instead of normal shutdown. - BGZF now checks for errors in bgzf_close. Previously some errors were caught, but still returned 0 from close and weren't propagated back to the application. - khash failure in cram_encode_aux could leave fd->metrics_lock locked. Delete unused hash entry if cram_new_metrics() fails. - Improved hts_tpool_process_flush to avoid deadlocks by periodically checking for shutdown. Also fixed a bug in a previous commit where a shutdown queue could lead to flush returning before n_processing hits zero, which in turn could lead to memory being freed for running jobs. - Check for string_ndup failure in cram_encode_aux. - Removed double free in error handling of cram_stats_encoding. - As with SAM reader, the threaded writer now uses the hts_tpool_process_shutdown function instead of ..._destroy to avoid a race condition between that and fd->q = NULL. - Don't attempt to free h->target_name elements when they havn't been initialised yet due to error in creation. - Fix potential double free in sam_format_worker (on error). commit 8ae25df4bfba9c2bc9b4ac1f6c9fa1de70d5eec9 Author: James Bonfield <jkb@sanger.ac.uk> Date: Wed Aug 5 12:22:55 2020 +0100 More protection against running out of memory (all formats read side). These have been found by using a malloc replacement (added to the Makefile plus header include using CFLAGS="-g -include malloc_fuzz.h") so that the Nth memory allocation will always fail. We then loop from 1 upwards until we succeed, checking every place for a proper handling of the error. Eg: for i in `seq 1 20000` do echo $i FUZZ=$i ./test/test_view -@4 -B in.bam 2>&1 test "$?" ">" 1 && break done Changes include - bgzf MT reading now sets mt->command = CLOSE and has additional calls to shutdown the queue on memory errors. This is checked for in bgzf_check_EOF to avoid deadlock. - bgzf now has a pool_create() error check. - Distinguish between errors vs would-block in cram_decode_slice_mt - Make cram_drain_rqueue cope with failed jobs. - Improve cram_close to cope with shuting down after failed initialisation, such as rqueue not being initialised. - Check for realloc failure in refs_load_fai. - Fix segfault in cram_decode_slice if cram_get_ref failed. - Removed abort in load_hfile_plugins. It now returns -1 and is checked for in find_scheme_handler. - Documented return values in hts_getline - Attempt to propagate memory error in kgetline2 to file pointer, but this only works for hgets (not gets). Fundamentally this is unsolvable with the current API. - Fixed memory tear-down in sam_parse_worker so it doesn't claim to have allocated more than it managed. - sam_set_thread_pool now copes with failure to create the queue. - Tabix index_load now checks calloc return. - Extra allocation return checks in hts_tpool_process_init and hts_tpool_init. - Check ksprintf worked in vcf fix_chromosome. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit 773b6ed3c962c434901e0194607bbc8fc6c2fde2 Author: James Bonfield <jkb@sanger.ac.uk> Date: Mon Aug 3 11:25:03 2020 +0100 Fixed deadlocks and crashes in threaded SAM parsing. There was a race with the closing code, caused by early termination (eg failure during parsing). This has been solved by waiting for the sam_dispatcher_read to acknowledge and return a SAM_CLOSE_DONE call. There were also some crashes caused by aborting even earlier, eg failing to launch all the threads due to running out of memory. Note extreme low memory may still lead to crashes in this code (but no more deadlocks). It's challenging to cover even the basics failing to initialise. In practical scenarios it should be robust. (Via Rob Davies): sam_dispatcher_read now uses hts_tpool_process_shutdown instead of hts_tpool_process_destroy. It also removes any change from SAM_CLOSE_DONE to SAM_CLOSE or SAM_NONE to avoid race conditions or deadlocks. Fixed a thread pool bug where worker threads could still launch new jobs after a queue (process) had been shutdown provided the pool itself hadn't. We now check both pool and process. Similarly fixed hts_tpool_process_flush to work after shutdown instead of wedging. However we don't use this so the change is not strictly needed for this particular fix. Finally improved on the error handling so we use errno over EIO if errno is set, to avoid masking the true source of error. commit 0e8a6e21e8e30182041cb0b83407de952b521ffd Author: James Bonfield <jkb@sanger.ac.uk> Date: Wed Aug 5 10:08:43 2020 +0100 Fix bug in the line reallocation for threaded SAM decoding. (The +NM/2 on the wrong side of the comparison!) The +8 changes here may not be necessary, but we did it for some allocs so I'm just being consistent and making sure it's always got 8 more than it uses. Either it's not necessary in which case those changes are harmless, or it is necessary in which case we were sometimes not always leaving at least 8 bytes remaining. I believe this fixes /~https://github.com/samtools/samtools/issues/1293 commit a79009b38ce83e39bcbc8f54c00cf203621aa5bb Author: James Bonfield <jkb@sanger.ac.uk> Date: Wed Aug 5 10:17:46 2020 +0100 Don't trash errno in bam_tag2cigar. bam_aux_get sets errno=ENOENT when we look for a tag and don't find it. Restore it to the old value when failing to find CG tags as not finding the tag is not an error. The consequence of samtools/samtools#1296 was that some errors found while decoding a BAM file got reported as "No such file or directory" prior to this commit. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit ab045d8c9d13a023ed873bd4788e851dd52937f2 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Aug 4 15:24:20 2020 +0100 Add missing Makefile dependency [minor] commit 31f0a76d338c9bf3a6893b71dd437ef5fcaaea0e Author: Andrew Whitwham <whitwham@users.noreply.github.com> Date: Wed Jul 22 09:55:16 2020 +0100 Update dates in LICENSE and program versions. commit f098e4da5728e8ba4ca4351b9a1b945d17dd8982 Author: Andrew Whitwham <whitwham@users.noreply.github.com> Date: Fri Jul 17 12:01:59 2020 +0100 Update the copyright to 2020 where needed. commit a2fdf3b2fa29844a2d48c4d786471f05bd7821a9 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Sat May 9 13:42:28 2020 +0100 Consider alignments which consume no reference bases to have length 1 In bam_endpos(), similarly to reads that are unmapped or have no CIGAR, consider reads whose CIGARs consume no reference bases to have length 1. Thus consider them to cover one reference position (rather than zero) for indexing purposes. Fixes samtools/samtools#1240. Use bam_cigar2rlen() directly in bam_plp_push() to avoid this and other adjustments. (This function already ignores unmapped reads separately.) Update code that sets the bin field accordingly: use bam_endpos() where applicable; when not all fields have been set up, add similar code to make adjustments when rlen==0. commit 302843d739d65d3192c52558b154c25f2a00c103 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Tue Jul 28 15:59:14 2020 +0300 [tabix] Add --separate-regions option (#1108) * Add tabix --separate-regs option. * Rename option to --separate-regions. Print the region name on top of the corresponding group. Add unit test and minor refactoring. commit 3a359cd61322a2fc546198f72cb36988924ab765 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jul 28 12:01:52 2020 +0100 Download index files atomically in idx_test_and_fetch() (PR #1112) * Reuse cram_populate_ref()'s atomic downloading in idx_test_and_fetch() Extract filename uniquification into a new hts_open_tmpfile() function declared in hts_internal.h. No code in hts.c currently uses pthreads directly, so instead of mixing pthread_self() into tmpname, mix in the pointer value of one of the arguments -- which will vary across threads. Test whether the local index file already exists using access() rather than a trial fopen() into local_fp, which is no longer a FILE*. * Improve cram_populate_ref() download error handling Clean up the temporary file if any of hwrite/hclose/chmod/rename fail. Putting hclose() first in the if condition ensures it is always executed, and putting rename() last ensures that the file is still at path_tmp in any failing case. commit da595880d134b29fb62f726b72bdc6302ffc9ccc Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Jul 21 10:19:56 2020 +0100 Fixed a (harmless) CRAM uninitialised data access. This was caused by the fix in PR #1094. The problem there was data without explicit quality scores being set (CF data series with "preserve qual scores" not set) could be coupled with explicit Q or q feature codes to add qualities at specific sites. Without quals, the default is "*" (a run of Q255), but we can't just amend a few qualities here and there as that leads to a mix of Q255 and Q-something-else. We memset from Q255 to Q-something-else the on the first explicit quality change. The flaw is that this check is applied even on data where the "preserve qual scores" CF value is set, but we haven't initialised the quality string to all-Q255 in that scenario. However it was totally harmless, as if we preserve quality scores, then the entire array of quality values are then subsequently overwritten (after applying any per-base qualities - dumb I know, but it was never intended for that scenario). So irrespective of the outcome of the uninitialised check and whether or not the memset to Q30 happened, the data still ends up the same thing. Note curiously this does not show up with -fsanitize=address, but valgrind does detect it. Hence the test harness was passing fine. commit dcd4b7304941a8832fba2d0fc4c1e716e7a4e72c Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Mon Jul 13 11:48:49 2020 +0100 Fix check for VCF record size The check for excessive record size in vcf_parse_format() only looked at individual fields. It was therefore possible to exceed the limit and overflow fmt_aux_t::offset by having multiple fields with a combined size that went over INT_MAX. Fix by including the amount of memory used so far in the check. Credit to OSS-Fuzz Fixes oss-fuzz 24097 commit df31e599a3d3778804b6a7fda9bb5fdc683ee34c Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Fri Jul 10 16:53:43 2020 +0100 Avoid crash for `bcf_update_alleles(hdr, rec, NULL, 0)` Updating a bcf1_t to have 0 alleles is not particularly meaningful, but it is a valid state as that's what bcf_init() and bcf_clear() produce. Hence it is a valid thing for bcf_update_alleles() to do too, so calling it with nals==0 should not crash. Instead set rlen the same way that bcf_init() and bcf_clear() leave it. Fixes #994. Also fix bcf_get_format_values() allocation failure check [minor]. Hat tip @reedacartwright. commit c7c7fb56dba6f81a56a5ec5ea20b8ad81ce62a43 Author: daviesrob <rmd+git@sanger.ac.uk> Date: Wed Jul 15 09:03:25 2020 +0100 Fix hfile_libcurl breakage when using libcurl 7.69.1 or later (#1105) Curl update /~https://github.com/curl/curl/pull/5050 (pause: bail out on bad input) made curl_easy_pause() return an error if passed a disconnected CURL handle. In hfile_libcurl this could happen when libcurl_read() or libcurl_close() was called after all bytes had been read; or all the time in restart_from_position(). Fix by checking for fp->finished in libcurl_read() and libcurl_close(); and by removing the curl_easy_pause() in restart_from_position(). Thanks to @jmarshall for tracking down the exact libcurl change that caused the incompatibility. commit 9c357445c5dab64db4000497193975fa12f63225 Author: Ilya Vorontsov <vorontsov.i.e@gmail.com> Date: Wed Jul 8 15:52:24 2020 +0300 typo "Number=R" instead of "Number=G" in error msg commit 2c49d1913b05e754e1f180fc64d187ba0baba1a8 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Jun 24 18:44:24 2020 +0100 Don't call exit(1) on invalid INFO numbers in vcf_format() Instead, set errno = EINVAL and return -1, as for other types of invalid record. commit 6fc307b777cb20ea52820d88e7e7c2c645edfd4b Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Jun 24 17:02:14 2020 +0100 Catch invalid dictionary indexes in vcf_format() bcf_record_check() can't check the validity of references to the header dictionaries when using the iterator interface as it doesn't have access to the header. Add a check to vcf_format() to catch any invalid records that make it that far. Changes the existing, incomplete, check for FORMAT ids so that it returns -1 instead of calling abort(). Adds 'const' qualifier on 'rec' to bcf_seqname() and bcf_seqname_safe() they can be used without casts in vcf_format(). commit 1d6b68263a58786b6ab88f579e2866ad9e13c17b Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Jun 24 16:29:12 2020 +0100 Better catch BCF CHROM/FILTER/FORMAT ids absent from the header It is possible to make a sparse set of ids in the VCF header using IDX= annotations. This can cause problems if a BCF record refers to one of the indexes that does not exist. Update bcf_record_check() so that these bad records while files are being read. It already did this for INFO; this adds similar checks for CHROM, FILTER and FORMAT data. Credit to OSS-Fuzz Fixes oss-fuzz 23670 commit 832e4ba7b43b68a53ef2277cfbc643b85c8d32fd Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Tue Jul 7 18:46:05 2020 +0100 Fix paths for Windows tests. commit 818b271f8be6d71258364aff69075ede97a49054 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Jul 1 18:33:18 2020 +0100 Add synced reader no-index tests commit 6e0ed10ef42f393628ee4a0e628a8acc5023ed0a Author: Petr Danecek <pd3@sanger.ac.uk> Date: Wed Jun 24 14:46:17 2020 +0100 Support for simultaneous reading of unindexed VCF/BCF files. In order for this to work, the caller must ensure that the input files have chromosomes in the same order and consistent with the order of sequences in the header. Replaces #1076. commit c9fdcbf2bb1e700f388a3302914c6f272e3f84ff Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Fri Jun 12 16:43:36 2020 +0100 Don't dlclose() plugins in atexit() handler Prevents a segfault when HTSlib has been imported using dlopen(), a plugin that itself links against HTSlib is in use and hts_lib_shutdown() was not called before dlclose(htslib). If hts_lib_shutdown() is not called, dlclose(htslib) will not remove libhts.so if any plugins linking it are installed. In this case the atexit() handler runs at program exit where it can clean up any memory still in use. There's no need to remove the plugin shared libraries (which causes the segfault as it removes libhts.so while code inside it is running) as the runtime will tidy them up anyway. If hts_lib_shutdown() has been called, the plugin shared libraries will already have been removed. commit 913fccf7edeb4e959428996e040b958645dc0b2b Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Fri Jun 12 12:55:13 2020 +0100 Add --enable-plugins to travis (and fix -ldl configure test breakage) Using Address Sanitizer on Ubuntu Xenial with gcc-8 somehow breaks the configure test to see if -ldl is needed, making it incorrectly decide that it isn't. This causes a link failure when building test/plugins-dlhts when the compiler decides that it really did want -ldl after all. Oddly, other outputs that reference dlopen(), dlsym() etc. have already built successfully by this point. Fix by making configure test for dlsym() instead of dlopen(), which seems to work better. commit 9a15df9d394777c45abcdebf27dd65fbaf1a8b79 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jun 23 13:40:45 2020 +0100 Log when calling hts_lib_shutdown/dlclose/etc Facilitates adding debugging printfs to e.g. hfile_shutdown() to observe exactly where in the sequence they occur. commit 00c61d2c8d83324494d4edab42c452c2d3cc26fc Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jun 16 08:06:49 2020 +0100 Provide separate dlsym() wrappers for functions and data Avoid -pedantic warnings about data pointer <-> function pointer conversions (which are required to work on POSIX platforms so that dlsym() works, but are nonetheless warned about by compilers). As we're providing wrappers around dlsym() anyway, we can provide separate function and data wrappers; change load_plugin()'s return type as it (practically) always returns a function. This uses the workaround recommended in POSIX Issue 6 dlsym() (since removed in Issue 7 by [1]; at least for GCC, type-punning through a union might be preferable). Hat tip Rob Davies. [1] https://www.austingroupbugs.net/view.php?id=74 commit 2cb229aedbdf81c968a9030521a9b7ed42cce599 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Sat May 16 09:01:12 2020 +0100 Link libhts.so into plugins on Unix&macOS; add hts_lib_shutdown() When libhts.so/.dylib has been dynamically loaded with RTLD_LOCAL (which is the default on Linux), plugins that use hts_always_remote() etc will not be loadable unless they make libhts's symbols available themselves. When libhts.so/.dylib has been dynamically loaded, some plugins have been opened, and dlclose(htslib) is called: htslib is not actually unloaded (or its atexit functions e.g. hfile_exit() called) as the plugins are still using libhts symbols. Eventually at program exit, all atexit functions are called; so the plugins are closed by hfile_exit(), htslib is suddenly unloaded when the final plugin is unloaded, and a segfault occurs because this all happens within hfile_exit() -- which has just been unloaded. To break this cycle, introduce hts_lib_shutdown() which must be called before dlclose(htslib), if that is called explicitly prior to program exit. This unloads the plugins so that dlclose() can immediately unload htslib. Add test exercising libhts via dlopen(). This ensures that all available hFILE plugins have been loaded, then calls hts_lib_shutdown() and dlclose(htslib). As this is the first htslib test that tests things linked to the shared library, we need to set $LD_LIBRARY_PATH etc so that the freshly-built libhts.so/.dylib/etc is used. Add with-shlib.sh script that creates a suitable temporary libdir containing *only* the freshly-built shared libhts. On Windows platforms, this directory is added to %PATH% so shouldn't contain anything else. On macOS, having hfile_*.bundle files in $DYLD_LIBRARY_PATH would reduce the ability to test against different sets of plugins by varying $HTS_PATH. commit 82d01ba63b20eed8905ea60c2e1cfe0f96c73bab Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Fri Jul 3 14:41:19 2020 +0300 Implement vcf_open_mode method (PR #1096) Which can detect the opening mode of a file, based on its extension. commit cb44caaebb25ec334f257b96f9c207cfa6f346c7 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Wed Jun 24 07:32:51 2020 +0100 Check for null pointer in source string. commit 29d06c2f84dc20fbc5f94ca1b181292433b69a21 Author: daviesrob <rmd+git@sanger.ac.uk> Date: Thu Jul 2 13:07:21 2020 +0100 Improve bam_aux_update_str() (#1088) Makes bam_aux_update_str() easier to use by not insisting that the byte count passed in via the len parameter should include the NUL-terminator for the string. This makes it less likely that an invalid tag will be generated by forgetting to include the NUL, and also allows tags to be made from part of a longer string. To support this, it is made to handle adding new tags itself instead of delegating to bam_aux_append(). Where existing tags are replaced, bam_aux_update_str() is changed to more accurately calculate how much extra space is required. A call to bam_aux_del() is removed and the tag shuffling code updated to reduce the number of memmove() calls needed to get the following tags in the right place. Adds some extra tests to ensure growing and shrinking tags both work correctly. Also fixes a bug in the test/sam.c test_mempolicy() function, which was not adding the correct number of bytes when appending the ZZ tag to each alignment record. commit 33bfcfa37df8beae67d3b0874c7253fe14d8929e Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Jun 30 17:01:27 2020 +0100 Fix mpileup regression between 1.9 and 1.10. When RNEXT / PNEXT / TLEN are set to the uninitialised values (* / 0 / 0) the overlap_push function was escaping the overlap removal code. We now still look for overlaps in this case. It's not as efficient when it doesn't have these values as we can't optimise the case of detecting non-overlapping reads, but we do get the correct answer again. commit 3d55140fead0a42762abc7c62032e3ee1265f30f Author: James Bonfield <jkb@sanger.ac.uk> Date: Fri Jun 26 12:29:48 2020 +0100 Fix support for decoding CRAM 'q' feature. This was incorrectly marching along seq / ref / cig-len, which broke the CIGAR string and base calls. Test file: /~https://github.com/jkbonfield/hts-specs/blob/CRAM_tests/test/cram/passed/1005_qual.cram commit 146e6a28c948e935add16e106f77482cfc7e383d Author: James Bonfield <jkb@sanger.ac.uk> Date: Thu Jun 25 14:18:07 2020 +0100 Set CRAM default qual for lossy qual modes. If we have lossy quality enabled and couple that with the use of 'B', 'q' or 'Q' features, CRAM starts off with QUAL being all 255 (as per BAM spec and "*" quality) and then starts to modify individual qualities as dictated by the specific features. However that then produces ASCII quality <space> (q=-1) for the unmodified bases. Instead we use ASCII quality "?" (q=30), as per htsjdk. We use have quality 255 for sequences with no modifications at all though. NB: Neither htslib nor scramble can produce test data to demonstrate this bug. I used gdb to tweak the CF value (b cram_encode.c:2769 and then set cr->cram_flags=0) along with using a SAM file using IUPAC codes in the sequence. commit 808affd6c25d2df7c59ab8d463dc76015bcbc31a Author: Damien Zammit <damien@zamaudio.com> Date: Sat Jun 27 02:17:23 2020 +1000 Fix dirty default build by including latest pkg.m4 instead of using aclocal.m4 (PR #1091) commit 1e1186923f67d82edb181ce6da9ab7d38a20e3eb Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu Jun 25 14:07:30 2020 +0100 Fix a bug, which caused cram_codec_decoder2encoder to return an error code. commit 313a9fc1e3df139bb32d81d9b660a43a5dc9483a Author: James Bonfield <jkb@sanger.ac.uk> Date: Fri Jun 19 15:27:08 2020 +0100 CRAM fix for MD tag generation with "b" FC. We don't normally generate these features, using "B" instead. However some hand crafted CRAM files have shown we didn't create the correct MD tag due to forgetting to clear the distance since last edit, giving rise to the last numeric in MD being high. Eg: MD:Z:0A98T0 came out as MD:Z:0A98T98 if that last T edit was a "b" feature. commit 16f62c5663bc09fda7c87f78322d1c561f483055 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Fri Jun 5 19:23:19 2020 +0100 Avoid warnings from recent and upcoming Clang and GCC releases Prevent GCC 10 strncpy() warning by telling the compiler that cram_file_def::file_id is not a C-style NUL-terminated string. GCC 10 warns for `int_var = uint32_var = (uint32_t) -1`. Clang 11 warns that float doesn't have enough precision to represent RAND_MAX (0x7fffffff) exactly. Recode using uint64_t arithmetic rather than (32-bit) float arithmetic. commit 34c15969fa49d78b07c649374fb008fb3ed1e98a Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Fri Jun 5 14:35:42 2020 +0100 Add documentation to hts_itr_t structure. Add a few minor changes. commit d3147150b4f90a75746500137357f63bb8530d81 Author: James Bonfield <jkb@sanger.ac.uk> Date: Wed May 6 15:30:03 2020 +0100 Fix the multi-threaded CRAM multi-region iterator regression. Fixes #1061 This appears to have been introduced during bfc9f0d and fixed for BAM only in 6149ea6. The effect on multi-threading decoding for CRAM was significant. This fix takes a totally different approach, which not only fixes the regression but passes it. CRAM can do CRAM_OPT_RANGE queries, used by the single version of the iterator, which informs the decode of both start and end range positions. When threading this means we don't preemptively decode the next container unless it'll actually be used. This same logic is now used in the multi-region iterator, although it's complex. The general strategy is as follows: - Ensure we know the next container start from index. This needs a small tweak to cram_index struct as the next container isn't quite the same as this container + slice offset + slice size (sadly). I think it's missing the size of the container struct itself. - When being given a start..end offset, step into the reg_list to find the corresponding chr:start-end range. - Produce a new CRAM_OPT_RANGE_NOSEEK option that does the same job as the old CRAM_OPT_RANGE opt but without the seek. This is necessary hts.c does the seek for us and the pseek/ptell only work with each other and not within the cram subdir code itself). - Identify neighbouring file offsets and merge together their correponding ranges so a block of adjacent offsets becomes a single CRAM_OPT_RANGE query. We cache the end offset (.v) in iter->end so we can avoid duplicating the seek / range request in subsequent intervals. - Manage the EOF vs EOR (end of range) return values. For EOR we have do the incrementing ourselves as we need to restart the loop without triggering the end-of-file or end-of-multi-iterator logic below it. - Tweak the region sorting a bit so ties are resolved by the max value. We want the index into reg_list to also be sorted, so we can accurately step through them. Note this logic is CRAM specific, as the sorting wouldn't work on BAI anyway due to the R-tree. Some benchmarks on ~37,000 regions returning ~110 million seqs across NA06985.final.cram: CRAM-1.10 no threads: real 4m37.361s user 4m21.128s sys 0m7.988s CRAM-1.10 -@16: real 3m14.371s user 28m48.872s sys 4m52.442s CRAM-dev -@16: real 5m55.670s user 61m0.005s sys 11m23.147s CRAM-current -@16: real 1m55.701s user 5m54.234s sys 0m51.699s The increase in user time between unthreaded and 16 threads is modest. System time increase is considerable, but this appears to primarily be down to glibc malloc inefficiencies. Using libtcmalloc.so instead gives: 1M CRAM-current -@16: real 1m30.882s user 6m9.103s sys 0m4.960s We're still nowhere near using 16 threads, but this is because the average size of each region is significantly smaller than 16 containers worth. commit 9de45b72b3a3f3df7a0ffda3554aeab1f6da50ba Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue May 12 17:28:21 2020 +0100 Further speed up of VCF parser (formats). The if elseif checks are now a switch statement and juggled in order a bit. Also the fmt[j].max_m type code is now f->max_m with f incremented along with j. The impact on gcc builds is minor (maybe 1%), but for clang this was 8-9% speed improvement on a 1000 genome multi-sample VCF. Example timings for clang: Previous commit 23333.77 msec task-clock # 1.000 CPUs utilized 83667512727 cycles # 3.586 GHz 199145555089 instructions # 2.38 insn per cycle 43099743981 branches # 1847.097 M/sec 665687093 branch-misses # 1.54% of all branches This commit 75967289857 cycles # 3.585 GHz 195076309580 instructions # 2.57 insn per cycle 43265084488 branches # 2041.736 M/sec 640008186 branch-misses # 1.48% of all branches commit b1acab6219c12c3a6cc127b37e0b03bcf4b90c3f Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu Apr 23 13:23:03 2020 +0100 Implement and use the `hts_str2dbl` method (by @jkbonfield) For faster conversion of strings to floating point values. It speeds up the conversion of values which match the regexp [+-]?[0-9]*[.]?[0-9]* and have at least one and no more than 15 digits in total. Values that don't match (for example they include an exponent) are handled by passing to strtod(). Co-Authored-By: James Bonfield <jkb@sanger.ac.uk> Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk> commit 6d7da635eaa744f8140e10be90995864da381c58 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Tue Apr 21 11:42:23 2020 +0100 Speed up GT parsing * Parse numeric GTs as unsigned integers with at most 30 bits. (no more then 30 bits are needed due to the maximum permitted value of the field). * Pull error checking out of the innermost loop. commit 9c4c64d86a1711a1f67add84a42e8d7a823d33c5 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Wed Feb 19 14:27:37 2020 +0000 Extract functionality from `vcf_parse` into individual parsing methods. commit 4a783b84276997187c3a168d161429a23a2e1548 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu Feb 6 12:58:11 2020 +0000 Use hts_str2int, instead of strtol, in the VCF parser. commit 08488f7a7910d8fdedf9c0d5ba7a882cdcabedc0 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu May 21 00:18:17 2020 +0100 Use including-file-relative paths to top-level private headers Similarly to the previous commits, includes of these headers from files in subdirectories can be subverted if $CPATH/etc or $(CFLAGS) activates a directory containing them. This is unlikely, but the particularly misguided might perhaps add a previous HTSlib source directory to $C_INCLUDE_PATH. To avoid that scenario, and for consistency with all the other intra-HTSlib #includes, add ../ to #includes of top-level headers in subdirectories. With this, the only remaining #includes in HTSlib requiring `-I.` (as is hard-coded in Makefile) are each *.c file's <config.h> inclusion. (Using <> for that is as recommended by the autoconf documentation.) commit 1ce80dd6e006b6554d1387d013bbfba942eaf824 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jan 7 16:06:15 2020 +0000 Use including-file-relative paths to cram/*.h headers Similarly to the previous commit, includes of cram/*.h from files in subdirectories can be subverted if $CPATH/etc or $(CFLAGS) activates a directory containing cram/*.h files. This is less likely than the previous case of previously-installed public htslib/*.h headers, but note that e.g. Debian for a time shipped cram/*.h headers, and the particularly misguided might perhaps add a previous HTSlib source directory to $C_INCLUDE_PATH. Avoid this by removing cram/ from cram/*.h #includes in cram/*.[ch] (and adding ../ elsewhere) and converting any instances of <> to "". commit e569d806ea84b70793d08f290b85e294257d7814 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jan 7 15:32:47 2020 +0000 Use including-file-relative paths to public htslib/*.h headers The full preprocessor search path for #include "..." is typically: 1) directory of the file containing the #include directive 2) environment variables $CPATH, $C_INCLUDE_PATH, etc 3) command line options -I etc 4) system include directories When a top-level *.[ch] file #includes "htslib/foo.h", the nearby header file within the source tree is found due to (1). When a file in a subdirectory (htslib/*.h, cram/*.[ch], or test/*.c) does the same #include, the nearby header file is intended to be found due to (3) via the hard-coded `-I.` in the Makefile. (`.` here being the compiler process's $PWD, i.e., the top-level HTSlib source directory.) However the latter can be subverted if there is an "htslib" directory containing foo.h in $CPATH, $C_INCLUDE_PATH, or on the command line prior to the hard-coded `-I.`. This will be the case if users have made a previous HTSlib installation accessible via $CPATH/etc or by adding a -I option to $(CFLAGS) (rather than $(CPPFLAGS)). Avoid this by adding ../ to #includes in subdirectories (and always using "" rather than <>), so that all includes of htslib/*.h within HTSlib are found due to (1). Fixes #347. commit e3cdb1df911116978c40c7fbfb3fd69ad20c3c1e Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Wed May 20 07:36:36 2020 +0100 Use the new hts_strprint() instead of dump_char() commit 3a73df587add7a2bc7600a57a3ab06ff629475f5 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Jan 21 14:07:05 2020 +0000 Improve SAM parser unrecognised RNAME warnings Fix tid vs mtid typo in RNEXT parsing. Fix "urecognized" typo. Use __VA_ARGS__ to simplify the _parse_err macros. Use hts_strprint() to add the unrecognised reference names to the warning messages. commit 9d5c6533edbdf8eba7f4bc621695a802099ffa37 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue May 19 22:36:16 2020 +0100 Add new hts_strprint() utility function Prints user-supplied text, optionally with quote marks, truncated to a sensible length and with unprintable characters escaped. Thus it is usable on potentially malicious input data. commit e05d8dcd435876c6ec7635fd3819ca7915e302f3 Author: James Bonfield <jkb@sanger.ac.uk> Date: Mon May 18 15:27:16 2020 +0100 Fix test data to have "=" for RNEXT when matching RNAME. commit c0e65a1ec4db767808ffcde1b44305051aea11c2 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Fri May 15 13:41:41 2020 +0100 Mention TBI chromosome length limit in tabix man page commit 382867a850b74e7285166a67ee3243560cd974ac Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Fri May 15 17:19:35 2020 +0100 Fix reference length check. (PR #1068) Credit to OSS-Fuzz Fixes oss-fuzz 22231 commit fc431ebc252166f47f798e20954c5ccf8be83a8c Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu May 14 09:28:17 2020 +0100 Take even more care in converting uintN_t to intN_t This code converts positive and negative values from uintN_t to intN_t separately, but was being undone by having different types on the two sides of the colon, and hence their common type being uintN_t -- thus requiring an implicit conversion to intN_t after all! (int8_t and int16_t are smaller than int so this matters even less for these types, but it's good to follow the pattern anyway.) Uncovered via -Wsign-compare, and grepping the code for /:.*(int/ etc. Followup to dc3303b6da542f74c965b3f3f50a58776f87104b. commit ea16e9d736911f757d830b7e718a7281758123ed Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Wed May 13 22:58:19 2020 +0100 Always check for malformed end < beg ranges Check this too in the case that the "change of chromosome" if-statement above was also executed. commit 315d08dca68bdb7fcbd7386108307a4ab48d2246 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Wed May 13 22:43:25 2020 +0100 Prevent -Wsign-compare warnings in public htslib/*.h headers (Also fix nearby htslib/hfile.h whitespace.) commit b6d48aaa3809767417422595277eac8fbfe56d34 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Tue May 12 12:44:46 2020 +0100 Tolerate PG lines with invalid PP entries, already present in the file, but log a warning. commit cdf5d18768ba29975b70de3194e1fd9ab0c5f062 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Thu May 7 12:57:15 2020 +0100 Fix a test case. commit e49b6665ec0d1383a5574459959c7a7a925daeed Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Wed May 6 16:30:04 2020 +0100 Compute the PG chain lengths and don't count leaves as starting points. If there are only leaves (no chain longer than 1), choose the last entry. Redo the PG links before adding a new PG line. commit cb8a05138e3904f7993d71026d3ee463e0cc80e9 Author: Petr Danecek <pd3@sanger.ac.uk> Date: Wed Apr 22 11:11:58 2020 +0100 Print position of offending VCF/BCF records to help with debugging Adds bcf_seqname_safe which returns "(unknown)" when the sequence name cannot be determined. commit 8859b092f23ece485391a251be616feb2422ad70 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Feb 18 11:13:43 2020 +0000 Remove spurious system headers from public HTSlib headers Reduce unnecessary inclusion of in particular <assert.h> and <stdio.h>. commit 0344174128c7a234c6a2d2a23e2ae9483ee4b434 Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Thu Mar 19 11:11:05 2020 +0000 Use ks_expand() instead of kroundup32+realloc [minor] Use newer kstring utility function to avoid explicit low-level munging. Report any memory allocation error in bgzf_getline(); ks_getuntil2() has no error handling ability, so ignores ks_expand()'s return value. commit 48c4e9a4e6b7ded874630f4ff84f458c77402266 Author: daviesrob <rmd+git@sanger.ac.uk> Date: Wed Apr 29 12:35:04 2020 +0100 Add tabix --cache option; make bgzf cache work better with hfile_libcurl (#1053) * Add tabix bgzf block caching. * Preserve buffer contents for deferred seek in hfile_libcurl. Commit ad2e24f added delayed seeks to hfile_libcurl so that calling hseek() lots of times without an hread() in between would not cause unnecessary web requests (the bgzf block cache tends to do this). While it mostly worked, a small amount of data that has been read but not consumed yet was being dropped by hseek(). Where this data was actually needed after all the hseek()ing, it would cause hfile_libcurl() to have to rewind slightly resulting in a new web request. Again, this can frequently happen when using the bgzf block cache. To prevent this, make hfile_libcurl take a copy of the data that would have been lost. If a later read with a deferred seek hits the preserved region, it can be returned from the cached data without having to start a new web request. Once this cached data has been used up, the original web connection can carry on reading from where it left off. * Fix libcurl_read() when not all preserved data is used up fp->preserved_bytes should not be altered. Assertion at line 780 needs to be changed as it's no longer true that fp->base.begin and fp->base.end will always point to the start of the buffer. commit 2e36fa6ea4d753fc9aac0c6259078bdc9720a930 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Tue Mar 31 15:29:30 2020 +0100 Fix memory leak on failure in cram_decode_slice() Make the error case decrement the reference count on any used refs and free the refs array. Credit to OSS-Fuzz Fixes oss-fuzz 21393 commit f58a6f3796f94f3b7ed933b24af57c13e0215282 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Tue Mar 31 12:17:03 2020 +0100 Document special END tag handling in bcf_update_info() commit 88a588a7b083c1a24051ce8b9ef16c70e78b9f65 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Mar 4 11:18:33 2020 +0000 Add tests for VCF file invalid END tag handling commit e784f64950b91cdba8fc8e6f203218957139fd91 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Feb 27 10:00:21 2020 +0000 Try to handle BCF files with negative rlen fields. This can happen where VCF files with invalid END tag values (less than the reference position) have been converted to BCF. bcf_read1_core() is changed to treat rlen as a signed value (which is what the BCF2 specification says). This means the maximum rlen now supported by HTSlib for BCF files is reduced to 2^31. If rlen is negative, bcf_record_check() will now print out a warning. It will also attempt to fix up rlen by substituting the length of the reference allele. A call to bcf_record_check() is added to bcf_readrec() so reads using iterators will be protected from bad input data. Unfortunately checks involving the header have to be disabled for this interface as it isn't available. commit 6a14289be5b349afcb99bf730bbcbb38ffa6703a Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Feb 27 09:54:42 2020 +0000 Add check for invalid VCF END tags to tabix Where END is less than or equal to the reference position, a warning is printed and the length of the REF allele (which has already been measured) will be used instead. commit d704389e051e5fdc8e69530233b10b33c653176f Author: Petr Danecek <pd3@sanger.ac.uk> Date: Thu Feb 20 09:46:19 2020 +0000 Make sure rlen is updated only from non-missing END values. commit db079d5be536d170d9868127857b66834de3caa3 Author: Petr Danecek <pd3@sanger.ac.uk> Date: Wed Feb 12 13:50:12 2020 +0800 Prevent negative rlen on erroneous INFO/END in bcf_update_info() This is an extension of ea879e19b8 which added checks for reading VCF. Resolves /~https://github.com/samtools/bcftools/issues/1154 commit 0bff7c1cb60f3465e86e6e5c826f37a5948d8d04 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Mar 11 11:17:43 2020 +0000 Fix missing newline in error messages [minor] commit b34a1ae1a91f7878eabfd288de92a4c7d29dab2e Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Mar 11 11:06:23 2020 +0000 Make hts_itr_next() return an error on seek failure Return code needs to be less that -1, or the caller will incorrectly interpret it as end of data, causing it to silently finish too early. Also adds error logging messages. commit 90d3002401c8813c0ed64c314e0877a65b68bcd2 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Mar 11 10:25:16 2020 +0000 Make bgzf_read_block() print more error messages To make it easier to find out where BGZF reads have failed. It prints the offset of the BGZF block where the read / decompress failed, which makes it much easier to locate bad parts of files. Unfortunately it can't report the file name as it's not stored by bgzf or hfile, but hopefully one of the callers will divulge it. This introduces a small change of behaviour as block_address is now updated after an empty block has been read. This may change the location stored in indexes of files with empty blocks, however the index should be more efficient as it will now point to the location where data starts instead of the previous empty block. commit 04b71e6b10bdc55cd288a45118a9a563b67add5f Author: John Marshall <John.W.Marshall@glasgow.ac.uk> Date: Tue Mar 17 18:52:37 2020 +0000 Move the bulk of ks_resize() to a non-inlined helper function commit ba8d1a11bed8f660594ffd104149df31bb31ba0c Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Tue Mar 17 12:41:00 2020 +0000 Unify kroundup macros and fix gcc warnings Replace several kroundup32 and kroundup_size_t macro definitions with a new unified implementation. The new one will work correctly for all integer types (both signed and unsigned) up to 64 bits. An attempt to round up that would overflow results in the maximum value that can be stored in the given type. The implementation avoids conditionals involving large values in an attempt to fix some gcc warnings in samtools and bcftools. They appeared after commit 29c294e which fixed a bug where kroundup would wrap around to zero on large values. The warnings are caused by jump threading optimisations that create code paths for the large value cases, which gcc then warns about. See: https://developers.redhat.com/blog/2019/03/13/understanding-gcc-warnings-part-2/ Includes extra tests to ensure the new kroundup works as desired. commit 9a1035586382ef3272e81c2a5f9eeba876986115 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Tue Mar 17 18:07:37 2020 +0000 Prevent a double free in case of failure. (PR #1047) commit b22e03d805866c04ea02a4b04c9b58a167853418 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Fri Mar 6 17:02:31 2020 +0000 Add crypt4gh file detection Adds format detection for crypt4gh files (see https://samtools.github.io/hts-specs/crypt4gh.pdf). If hts_hopen() finds a crypt4gh file, it will try to re-open it via the hfile_crypt4gh() plug-in. Detection occurs at the same point as htsget, so it does both in a loop to allow for crypt4gh served via htsget. Actual decryption of the file requires an external plug-in, available from /~https://github.com/samtools/htslib-crypt4gh As this may not be present, a dummy handler is included that prints an error message indicating that it is needed. This is overridden by the real one when it is available. commit c6ce820d5ccfeca1c38e6644d72ca6543da01eb1 Author: Petr Danecek <pd3@sanger.ac.uk> Date: Wed Mar 11 15:30:28 2020 +0100 Acknowledge the existence of the symbolic <NON_REF> allele produced by GATK (PR #1045) commit 29c294e6842a56ba3b9a24a24a5f6de1575b0961 Author: James Bonfield <jkb@sanger.ac.uk> Date: Tue Mar 10 10:31:45 2020 +0000 Fixed a raft of integer overflows in VCF land. - Cast data into size_t before multiplication to avoid wrapping around int32. - Added checks for return values to align_mem and ks_resize - Simplified the byzantine calculation in align_mem - Fixed kroundup_size_t and kroundup32 so they cannot wrap around to zero and turn the realloc into a free. - Also added a check for ~2Gb on total length of FORMAT fields, which nullifies the need for some of the above. We may wish to remove this at some point if we want to cope with truely mammoth multi-sample data, and the above fixes means doing so will not expose bugs. However for now this check adds protection against malformed data creating excessive memory usage and CPU requirements. Credit to OSS-Fuzz Fixes oss-fuzz 21139 Fixes oss-fuzz 20881 commit 8ff8bb328f5d4c2981933ff7847374ddc747d572 Author: James Bonfield <jkb@sanger.ac.uk> Date: Mon Mar 9 11:45:22 2020 +0000 Improved CRAM error messages in cram_decode_slice (PR #1042) To help diagnose #1038 (Embedded reference issues) by reporting which ref_id and region is affected. commit 3b1ff68575c9db582a122048a7e95aa2b1a47b11 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Thu Mar 5 21:39:48 2020 +0000 Add tabix --verbosity option commit 31b23540491e114e7cbfde790f5129c26c18eef4 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Wed Mar 4 14:12:20 2020 +0000 Add more error checking to the tabix application Check for error returns from bcf_itr_next(), hts_getline() and tbx_itr_next() so failures can be reported properly. Add checks in other places that could fail, mostly memory allocations and writes to the output file. Note that bcf_itr_querys() and tbx_itr_querys() currently returns NULL for both general failures and those caused by the requested region not being present. As tabix silently skips missing regions, it is currently not possible to detect errors in these functions. commit 9979fac560319507cefe5d5682617304d609b589 Author: Valeriu Ohan <vo2@sanger.ac.uk> Date: Mon Mar 2 08:16:20 2020 -0800 Free allocated SN strings when encountering an error. (PR #1034) Upon successful completion of sam_hdr_create, ownership of the newly created SN strings is given to sam_hdr_t::target_name, which takes care of freeing them at the end. But if an error occurs, the handover might fail and the allocated strings need to be cleaned up by iterating through the hash table instead. Credit to OSS-Fuzz Fixes oss-fuzz 20888 commit 6149ea6f779939451b56e6f63b28409baead2990 Author: Rob Davies <rmd+git@sanger.ac.uk> Date: Mon Feb 24 09:41:15 2020 +0000 Make multi-region iterator better at skipping unneeded data Revert change to hts_itr_t::off in bfc9f0d so the `max` field can be used to store information about which interval goes with each entry. Don't merge hts_itr_t::off entries for different intervals. Make hts_itr_multi_next() skip offsets that will return no data as they are for intervals that have already been completed. Deal with possible failure of reg2intervals() in hts_itr_multi_bam(). commit 27b670444a372809586eabd921d1f568c8cc6c30 Author: James Bonfield <jkb@sanger.ac.uk> Date: Fri Feb 21 11:22:11 2020 +0000 Add BAI linear index usage for multi-region iterator commit 45715e421e1ce9d15d8880a24454ec7dc42bf4bc Author: James Bonfield <jkb@sanger.ac.uk> Date: Thu Feb 20 14:21:12 2020 +0000 Add use of BAI linear index to BGZF querys. Previously it used the bin "loff" field for BAI, to match the CSI data structure (which no longer has a linear index). Unfortunately when coupled with the compress_binning() function this causes BAI to seek to data earlier than is necessary when bins have been compressed. This generally only affects low coverage data set and has minimal impact on deep data as the bins are too large to remove. For CSI we have no option of using the linear index, so this does not change behaviour. Sadly this means CSI indices are less efficient to use than BAI for many small queries on shallow data sets. Some benchmarks on a shallow 1.9x coverage data set, using 1000 randomly picked regions on chr1 each between 1 bp and 10k bp in size. Indexer Branch Time(s) BAM I/O(Mb) Idx I/O(Mb) ----------------------------------------------------- htslib-BAI develop 1.488 164.1 2.43 htsjdk-BAI develop 0.502 53.6 8.27 htslib-BAI this 0.498 53.5 2.43 htsjdk-BAI this 0.530 53.3 8.27 htslib-CSI this 1.532 163.3 0.49 For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the default of 10000, demonstrating the usefulness of changing this if you need fine grained random access. CRAM-s10k this 22.12 602.1 0.11 CRAM-s1k this 6.062 92.7 0.93 CRAM-s300 this 5.115 42.1 2.98 Htslib BAI index, samtools 1.10 ------------------------------- $ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R File: 9827_2#49.bam Num. open 1 Num. close 1 Num. seeks 969 Num. read 6040 Bytes read 164144977 File: 9827_2#49.bam.bai Num. open 1 Num. close 1 Num. read 40 Bytes read 2431088 real 0m1.488s user 0m1.448s sys 0m0.036s Htslib BAI index, this commit ----------------------------- io_trace -S -r 9827_2 -- …
The following code reproduces the error:
The problem here is that the 1000 Genomes project team made a mess and forgot to liftOver the END coordinates of the VCF. I am not sure whether this counts as a bcftools bug. But the VCF specification does not really call for the END field to always be bigger than the POS field, which is what is causing the problem here. I do find odd that tabix and previous versions of bcftools worked fine with these odd cases though.
If this is considered appropriate bcftools behavior, then I think that to a minimum bcftools norm should also check than END is always larger than POS.
The text was updated successfully, but these errors were encountered: