Lemane, Téo, Rayan Chikhi, and Pierre Peterlongo. "kmdiff, large-scale and user-friendly differential k-mer analyses." Bioinformatics 38.24 (2022): 5443-5445
kmdiff provides differential k-mers analysis between two populations (control and case). Each population is represented by a set of short-read sequencing. Outputs are differentially represented k-mers between controls and cases.
Statistical methods used in kmdiff are from:
- Rahman, Atif, Ingileif Hallgrímsdóttir, Michael Eisen, and Lior Pachter. "Association mapping from sequencing reads using k-mers." Elife 7 (2018): e32920. https://doi.org/10.7554/eLife.32920.001
- Mehrab Z, Mobin J, Tahmid IA, Rahman A (2021) Efficient association mapping from k-mers—An application in finding sex-specific sequences. PLOS ONE 16(1): e0245058. https://doi.org/10.1371/journal.pone.0245058
- Patterson N, Price AL, Reich D (2006) Population Structure and Eigenanalysis. PLOS Genetics 2(12): e190. https://doi.org/10.1371/journal.pgen.0020190
- Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909 (2006). https://doi.org/10.1038/ng1847
Population stratification correction needs:
Ubuntu / Debian
sudo apt-get install libgsl-dev libopenblas-dev liblapacke-dev libbz2-dev zlib1g-dev zlib1g
Fedora
sudo dnf install openblas openblas-devel lapack lapack-devel gsl gsl-devel bzip2-devel
Arch
sudo pacman -S lapack lapacke openblas gsl bzip2 zlib
macOS
brew install gsl lapack openblas bzip2 zlib
For convenience, all kmdiff other build dependencies are included in thirdparty directory.
conda create -p /kmdiff-env
conda activate ./kmdiff-env
conda install -c bioconda -c tlemane kmdiff
- cmake >= 3.13
- gcc >= 8.1 or XCode >= 11.0 or clang >= 7
- zlib
- bzip2
- GSL + Lapacke + OpenBLAS (only with
-p 1
, see build script).
git clone --recursive /~https://github.com/tlemane/kmdiff
kmdiff build script.
Usage:
./install.sh [-r str] [-k LIST[int]] [-t int] [-c int] [-j int] [-s int] [-p] [-e] [-d] [-h]
Options:
-r <Release|Debug> -> build type {Release}.
-k <LIST[INT]> -> k-mer size {"32 64 96 128"}.
-t <0|1|2> -> tests: 0 = disabled, 1 = compile, 2 = compile and run {2}.
-c <1|2|4> -> byte per count {4}.
-j <INT> -> nb threads {8}.
-s <0|1> -> population stratification correction 0 = disabled, 1 = enabled {1}
(-s 1 requires GSL + lapacke + OpenBLAS)
-p -> compile with plugins support {disabled}
-e -> use conda to install compilers/dependencies {disabled}
-d -> delete cmake cache {disabled}
-h -> show help.
If you are unable to install the prerequisites on your system, use -e
. Compilers and build dependencies will thus be provided by a conda environment.
If you have already built a matrix with kmtricks, you can skip the first step kmdiff count
. Note that the --hist
parameter is required when constructing the matrix in order to be compatible with kmdiff diff
.
Input file: one sample per line (controls must appear first)
control1: /path/to/control1_read1.fastq ; /path/to/control1_read2.fastq
control2: /path/to/control2_read1.fastq ; /path/to/control2_read2.fastq
case1: /path/to/case1_read1.fastq ; /path/to/case1_read2.fastq
case2: /path/to/case2_read1.fastq ; /path/to/case2_read2.fastq
Supported files: fasta/fastq, gzipped or not.
Options
kmdiff count v1.1.0
DESCRIPTION
Count k-mers with kmtricks.
USAGE
kmdiff count -f/--file <FILE> -d/--run-dir <DIR> [-k/--kmer-size <INT>] [-c/--hard-min <INT>]
[-r/--recurrence-min <INT>] [--minimizer-type <INT>]
[--minimizer-size <INT>] [--repartition-type <INT>] [--nb-partitions <INT>]
[-t/--threads <INT>] [-v/--verbose <STR>] [-h/--help] [--version]
OPTIONS
[global]
-f --file - fof that contains path of read files
-d --run-dir - output directory.
-k --kmer-size - size of k-mers [8, 127] {31}
-c --hard-min - min abundance to keep a k-mer {1}
-r --recurrence-min - min recurrence to keep a k-mer {1}
[advanced performance tweaks]
--minimizer-type - minimizer type (0=lexi, 1=freq) {0}
--minimizer-size - size of minimizer {10}
--repartition-type - minimizer repartition (0=unordered, 1=ordered) {0}
--nb-partitions - number of partitions (0=auto) {0}
[common]
-t --threads - number of threads. {8}
-h --help - show this message and exit. [⚑]
--version - show version and exit. [⚑]
-v --verbose - Verbosity level [debug|info|warning|error]. {info}
kmdiff diff v1.1.0
DESCRIPTION
Differential k-mers analysis.
USAGE
kmdiff diff -d/--km-run <DIR> -1/--nb-controls <INT> -2/--nb-cases <INT> [-o/--output-dir <DIR>]
[-s/--significance <FLOAT>] [-u/--cutoff <INT>] [-c/--correction <STR>]
[--gender <FILE>] [--kmer-pca <FLOAT>] [--ploidy <INT>] [--n-pc <INT>]
[-t/--threads <INT>] [-v/--verbose <STR>] [-f/--kff-output] [-m/--in-memory]
[--keep-tmp] [--pop-correction] [-h/--help] [--version]
OPTIONS
[global]
-d --km-run - kmtricks run directory.
-o --output-dir - output directory. {./kmdiff_output}
-1 --nb-controls - number of controls.
-2 --nb-cases - number of cases.
-s --significance - significance threshold. {0.05}
-u --cutoff - Divide the significance threshold by N.
Since a large number of k-mers are tested, k-mers with p-values too close to the significance
threshold will not pass the last steps of correction.
It allows to discard some k-mers a bit earlier and thus save space and time. {100000}
-c --correction - significance correction. (bonferroni|benjamini|sidak|holm|disabled) {bonferroni}
-f --kff-output - output significant k-mers in kff format. [⚑]
-m --in-memory - in-memory correction. [⚑]
--keep-tmp - keep tmp files. [⚑]
--save-sk - build the matrix of significant k-mers. [⚑]
[population stratification]
--pop-correction - apply correction for population stratification. [⚑]
--gender - gender file, one sample per line with the id and the gender (M,F,U), space-separated.
--kmer-pca - proportion of k-mers used for PCA (in [0.0, 0.05]). {0.001}
--ploidy - ploidy level. {2}
--n-pc - number of principal components (in [2, 10]). {2}
[common]
-t --threads - number of threads. {8}
-h --help - show this message and exit. [⚑]
--version - show version and exit. [⚑]
-v --verbose - Verbosity level [debug|info|warning|error]. {info}
Outputs
- control significant k-mers:
<output_dir>/control_kmers.[fasta|kff]
- case significant k-mers:
<output_dir>/case_kmers.[fasta|kff]
--save-sk
: Outputs a matrix with the significant k-mers before correction. You can dump it in text with kmtricks aggregate --run-dir <output-dir>/positive_kmer_matrix --matrix kmer --cpr-in
.
Abundances and p-values are provided in fasta headers.
An example on a small dataset is available here.
If you encounter a problem, please open an issue with the return of kmdiff infos
, as well as the content of kmdiff-backtrace.log
if it exists.
Téo Lemane: teo[dot]lemane[at]inria[dot]fr Rayan Chikhi: rayan[dot]chikhi[at]pasteur[dot]fr Pierre Peterlongo: pierre[dot]peterlongo[at]inria[dot]fr