MAF Counter is a multithreaded tool designed to efficiently extract and count k-mers from multiple genome alignments in MAF format. It is a suite of five programs: maf_counter_count, maf_counter_dump, maf_proteomes_count, maf_proteomes_dump and maf_counter_tools.
- maf_counter_count: Performs k-mer counting on MAF files to produce a binary database file that holds all k-mers.
- maf_counter_dump: Converts the binary file produced by maf_counter_count to text, written either to a single file or to multiple files.
- maf_proteomes_count: The equivalent of maf_counter_count for protein alignments.
- maf_proteomes_dump: The equivalent of maf_counter_dump for protein alignments.
- maf_counter_tools: Used for querying, filtering and computing statistics on the binary database file.
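To make concrete what "counting k-mers per genome" means, here is a minimal single-threaded Python sketch. It is not MAF Counter's implementation. It assumes that the genome ID is the part of the MAF source name before the first dot (for example `CHM13.chr1`) and that alignment gaps (`-`) are removed before k-mers are extracted; the function name `count_kmers_in_maf` is hypothetical.

```python
from collections import Counter, defaultdict


def count_kmers_in_maf(lines, k):
    """Count k-mers per genome from the 's' lines of a MAF alignment."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        # Sequence lines: s <src> <start> <size> <strand> <srcSize> <text>
        if len(fields) != 7 or fields[0] != "s":
            continue
        # Assumption: sources are named "<genome>.<chromosome>".
        genome_id = fields[1].split(".", 1)[0]
        # Assumption: gap characters are dropped and case is ignored.
        seq = fields[6].replace("-", "").upper()
        for i in range(len(seq) - k + 1):
            counts[genome_id][seq[i:i + k]] += 1
    return counts
```

The result maps each genome ID to a `Counter` of its k-mers, which is conceptually the information stored in the binary database.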
git clone --recursive https://github.com/Georgakopoulos-Soares-lab/MAF-Counter
cd MAF-Counter
mkdir build && cd build
cmake ..
make -j$(nproc)
./maf_counter_count
Usage:
./maf_counter_count [ -c ]
[--purge_intermediate[=true|false]]
[--binary_file_output <filename>]
[--genome_ids=all|name1,name2,...]
[--min_a_score=<val>] [--max_a_score=<val>]
[--min_q_level=0..9|F] [--max_q_level=0..9|F]
--k <VAL>
[--reader_threads <VAL>] [--package_manager_threads <VAL>] | [--threads <VAL>]
[--temp_files_dir <DIR>] [--output_directory <DIR>]
<MAFfile>
Notes:
--k is required. Threads can be specified either by:
--reader_threads and --package_manager_threads (both)
OR
--threads (which is then split ~2:1 for readers:PMs).
--temp_files_dir is where intermediate bin files go (default: current dir).
--output_directory is where final.bin and final.metadata go (default: current dir)
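As an illustration of the approximate 2:1 split mentioned in the notes, the following Python sketch divides a total thread count between readers and package managers. The exact rounding used by maf_counter_count is not documented here, so the rounding rule and the name `split_threads` are assumptions.

```python
def split_threads(total):
    """Split a thread budget roughly 2:1 between readers and package managers."""
    if total < 2:
        raise ValueError("need at least one reader and one package manager thread")
    # Assumption: round two thirds of the budget to the nearest integer.
    readers = (2 * total + 1) // 3
    # Always leave at least one thread for the package managers.
    readers = min(readers, total - 1)
    return readers, total - readers
```

Under this rule `--threads 12` would correspond to 8 reader threads and 4 package manager threads.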
./maf_counter_count --k 20 \
--threads 12 \
--binary_file_output final.bin \
--temp_files_dir ./tmp \
--output_directory ./analysis_20mers \
./hprc.maf
./maf_counter_count --k 20 \
--threads 12 \
--binary_file_output final.bin \
--temp_files_dir ./tmp \
--min_a_score=10000 \
--max_a_score=20000 \
--min_q_level=7 \
--max_q_level=F \
--genome_ids=CHM13 \
--output_directory ./analysis_20mers \
./hprc.maf
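The score options in the command above restrict counting to alignment blocks whose score falls within a range. The Python sketch below shows one way such a filter could be expressed over MAF `a` lines. It assumes inclusive bounds and that blocks without a score are rejected; both are guesses rather than documented behaviour, and the function names are hypothetical.

```python
def parse_alignment_score(a_line):
    """Return the score from a MAF 'a' line such as 'a score=15000.0', or None."""
    for field in a_line.split()[1:]:
        key, _, value = field.partition("=")
        if key == "score":
            return float(value)
    return None


def score_in_range(a_line, min_score=None, max_score=None):
    """Check whether an alignment block's score lies within the given bounds."""
    score = parse_alignment_score(a_line)
    if score is None:
        return False  # assumption: blocks without a score are skipped
    if min_score is not None and score < min_score:
        return False
    if max_score is not None and score > max_score:
        return False
    return True
```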
./maf_counter_dump --help
Usage:
./maf_counter_dump [options] <final.bin>
Options:
--output_mode=<single|multiple>
Output mode: single = single-file output using mmap (default),
multiple = per-genome output.
--threads <VAL> Number of threads to use (default: 1).
--output_file <filename> Output file name for single-file mode (default: final_sorted_<k>_dump.txt).
--output_directory <DIR> Directory for per-genome output files (default: current directory).
--temp_files_dir <DIR> Directory for intermediate files (default: current directory).
<final.bin> Input binary file generated by maf_counter_count.
./maf_counter_dump \
--output_mode=multiple \
--threads 10 \
--output_directory /home/user/genome_dumps \
--temp_files_dir ./tmp \
./analysis_20mers/final.bin
./maf_counter_dump \
--output_mode=single \
--threads 10 \
--temp_files_dir ./tmp \
./analysis_20mers/final.bin
./maf_counter_proteomes_count
Usage:
./maf_counter_proteomes_count --k <kmer_size> [--threads N]
[--binary_file_output <file>] <MAFFile>
./maf_counter_proteomes_count --k 5 --threads 4 ./hprc.maf
./maf_counter_proteomes_dump --help
Usage:
./maf_counter_dump [options] <final.bin>
Options:
--output_mode=<single|multiple>
Output mode: single = single-file output using mmap (default),
multiple = per-genome output using mmap (updated version).
--threads <VAL> Number of threads to use (default: 1).
--output_file <filename> Output file name for single-file mode (default: final_sorted_<k>_dump.txt).
--output_directory <DIR> Directory for per-genome output files (default: current directory).
--temp_files_dir <DIR> Directory for intermediate files (default: current directory).
<final.bin> Input binary file generated by maf_counter_count.
./maf_counter_proteomes_dump \
--output_mode=multiple \
--threads 10 \
--output_directory ./protein_10mers \
--temp_files_dir ./tmp \
./final.bin
./maf_counter_proteomes_dump \
--output_mode=single \
--threads 10 \
--temp_files_dir ./tmp \
./final.bin
Usage:
./maf_counter_tools --std <topCount> --threads <N> --metadata_file <META> --binary_database <BIN>
./maf_counter_tools --expr <EXPR> --threads <N> --metadata_file <META> --binary_database <BIN>
./maf_counter_tools --gstats --threads <N> --metadata_file <META> --binary_database <BIN>
./maf_counter_tools --query <kmerListOr@file> --threads <N> --metadata_file <META> --binary_database <BIN>
./maf_counter_tools --query_regex <REGEX> --threads <N> --metadata_file <META> --binary_database <BIN>
Examples:
./maf_counter_tools --std 20 --threads 4 --metadata_file final.metadata --binary_database final.bin
./maf_counter_tools --expr "CHM13>100 && GRCh38>20" --threads 6 --metadata_file final.metadata --binary_database final.bin
./maf_counter_tools --gstats --threads 4 --metadata_file final.metadata --binary_database final.bin
./maf_counter_tools --query "ACGTACGT,AAAAAAAC" --threads 2 --metadata_file final.metadata --binary_database final.bin
./maf_counter_tools --query_regex "([gG]{3,}\w{1,7}){3,}[gG]{3,}" --threads 2 --metadata_file final.metadata --binary_database final.bin
Find the top 50 k-mers by standard deviation of their counts across different genome IDs, using 8 threads:
./maf_counter_tools --std 50 \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
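The ranking performed by `--std` can be pictured with the short Python sketch below. It assumes each k-mer has one count per genome, that genomes in which the k-mer is absent contribute 0, and that the population standard deviation is used; the tool may compute it differently. The name `top_by_std` is hypothetical.

```python
import statistics


def top_by_std(counts, genomes, top_n):
    """Rank k-mers by the standard deviation of their per-genome counts."""
    scored = []
    for kmer, per_genome in counts.items():
        # Assumption: a genome without the k-mer contributes a count of 0.
        values = [per_genome.get(g, 0) for g in genomes]
        scored.append((statistics.pstdev(values), kmer))
    # Highest deviation first; ties are broken alphabetically by k-mer.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(kmer, sd) for sd, kmer in scored[:top_n]]
```

K-mers with a high standard deviation are those whose abundance differs most between genomes.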
./maf_counter_tools --expr "HG002>50 || (HG003<10 && CHM13>30)" \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
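The expression in the command above reads naturally as a predicate over per-genome counts. The following Python sketch spells out that reading for this one expression. It assumes each identifier refers to the count of the k-mer in that genome and that a missing genome counts as 0; it does not reproduce the tool's expression parser, and the names are hypothetical.

```python
def matches_example_expr(per_genome):
    """Python equivalent of: HG002>50 || (HG003<10 && CHM13>30)."""
    hg002 = per_genome.get("HG002", 0)
    hg003 = per_genome.get("HG003", 0)
    chm13 = per_genome.get("CHM13", 0)
    return hg002 > 50 or (hg003 < 10 and chm13 > 30)


def filter_kmers(counts):
    """Keep the k-mers whose per-genome counts satisfy the expression."""
    return sorted(k for k, per_genome in counts.items()
                  if matches_example_expr(per_genome))
```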
./maf_counter_tools --gstats \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
./maf_counter_tools --query "AAAAA,TTTTT" \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
./maf_counter_tools --query "@./kmers.txt" \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
./maf_counter_tools --query_regex '^A{4}T$' \
--threads 8 \
--metadata_file ./final.metadata \
--binary_database ./final.bin
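To preview which k-mers a pattern would select before running a query, the two patterns used in the examples can be tried with Python's `re` module. This sketch assumes the tool's regex dialect treats these particular patterns the same way Python does and that a k-mer matches when the pattern is found anywhere in it; neither is confirmed here. The helper `matching_kmers` is hypothetical.

```python
import re

# Runs of three or more G separated by short spacers (pattern from the examples).
G_RUNS = re.compile(r"([gG]{3,}\w{1,7}){3,}[gG]{3,}")
# Exactly four A followed by a single T, anchored at both ends.
ANCHORED = re.compile(r"^A{4}T$")


def matching_kmers(kmers, pattern):
    """Return the k-mers in which the pattern is found."""
    return [kmer for kmer in kmers if pattern.search(kmer)]
```

Because `^A{4}T$` is anchored at both ends, it can only match the 5-mer `AAAAT`.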
This project is licensed under the GNU GPL v3.
For any questions or support, please contact