Skip to content

Query tab separated files

Hassan Foroughi edited this page Aug 14, 2018 · 1 revision

VCF files generally require some sort of specific tools to present the output. Although there are numerous ways to query them (e.g. bcftools, vt, and gatk), they don't provide a proper human readable format. Sometimes, merging and/or intersecting multiple VCF files can be a tricky task. A possible solution is to generate a tab or comma separated file to make it easier for downstream analysis (load the whole table in R, Python, or any other plotting tool). One of the major drawbacks of this approach is the loss of information. Another concern is that there are already Python (cyvcf) or R (vcfR) to load and query VCF files. Since they need an extra layer of nonstandard libraries and the effort to tidy the data is almost the same as preparing a tabulated plaintext file, they don't offer an increased productivity.

Now to tabQuery! It's included in the kentutils and it is available from UCSC for linux and mac os x : http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/tabQuery . It can query tab separated files with SQL like statements (without join and other complex commands).

The manual and example are as following:

tabQuery - Run sql-like query on a tab separated file.
usage:
   tabQuery rqlStatement
where rqlStatement is much like a SQL statement, but with no joins and no commands
other than select allowed.  The input file name is taken from the 'from' clause.
examples
    tabQuery select file,date from manifest.tsv
This will output the file and date fields from the manifest.tsv file
    tabQuery select file,date,lab from manifest.tsv where lab like 'myLab%'
This will output the selected three fields from the file where the lab starts with myLab
    tabQuery select file,data from manifest.tsv where lab='myLab'
This will output the selected two fields where the lab field is exactly myLab.
    tabQuery select * from manifest.tsv where lab='myLab'
This will output all fields where the lab field is exactly myLab.
    tabQuery select a*,b* from manifest.tsv where lab='myLab'
This will output all fields starting with a or b where the lab field is exactly myLab.
    tabQuery select count(*) from manifest.tsv where type='fastq' and size < 1000
This will count the number of records where type is fastq and size less than 1000