Georgakopoulos-Soares-lab/MAFcounter

MAF Counter

MAF Counter is a multithreaded tool that extracts and counts k-mers from multiple genome alignments in MAF format. It is a suite of five programs: maf_counter_count, maf_counter_dump, maf_proteomes_count, maf_proteomes_dump, and maf_counter_tools.

  • maf_counter_count: Performs k-mer counting on MAF files and produces a binary database file that holds all k-mers.
  • maf_counter_dump: Converts the binary file produced by maf_counter_count to text, written either to a single file or to one file per genome.
  • maf_proteomes_count: The equivalent of maf_counter_count for protein alignments.
  • maf_proteomes_dump: The equivalent of maf_counter_dump for protein alignments.
  • maf_counter_tools: Queries, filters, and computes statistics on the binary database file.
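
To make the counting step concrete, here is a minimal Python sketch of per-genome k-mer counting over the s lines of a MAF block. It is not MAF Counter's implementation. The function name count_kmers_in_maf, the rule that the genome ID is the part of the source name before the first '.', the removal of gap characters, and the skipping of k-mers that contain non-ACGT symbols are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def count_kmers_in_maf(maf_text, k):
    """Count k-mers per genome ID from the 's' lines of a MAF alignment."""
    counts = defaultdict(Counter)
    for line in maf_text.splitlines():
        fields = line.split()
        # Sequence lines look like: s <src> <start> <size> <strand> <srcSize> <text>
        if len(fields) != 7 or fields[0] != "s":
            continue
        genome_id = fields[1].split(".", 1)[0]    # assumption: ID precedes the first '.'
        seq = fields[6].replace("-", "").upper()  # assumption: alignment gaps are dropped
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):          # assumption: skip N and other symbols
                counts[genome_id][kmer] += 1
    return counts

# Toy alignment block, invented for this example.
example = """##maf version=1
a score=100.0
s CHM13.chr1 0 6 + 100 ACG-TAC
s HG002.chr1 0 7 + 100 ACGNTAC
"""

per_genome = count_kmers_in_maf(example, 3)
for genome_id, kmers in per_genome.items():
    print(genome_id, dict(kmers))
```

With k=3, CHM13 yields ACG, CGT, GTA and TAC once each, while HG002 yields only ACG and TAC because every window containing N is skipped.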

Build binaries

git clone --recursive https://github.com/Georgakopoulos-Soares-lab/MAF-Counter
cd MAF-Counter
mkdir build && cd build
cmake ..
make -j$(nproc)

maf_counter_count Usage

./maf_counter_count 
Usage:
  ./maf_counter_count [ -c ]
        [--purge_intermediate[=true|false]]
        [--binary_file_output <filename>]
        [--genome_ids=all|name1,name2,...]
        [--min_a_score=<val>] [--max_a_score=<val>]
        [--min_q_level=0..9|F] [--max_q_level=0..9|F]
        --k <VAL>
        [--reader_threads <VAL>] [--package_manager_threads <VAL>] | [--threads <VAL>]
        [--temp_files_dir <DIR>] [--output_directory <DIR>]
        <MAFfile>

Notes:
  --k is required. Threads can be specified either by:
       --reader_threads and --package_manager_threads (both)
       OR
       --threads (which is then split ~2:1 for readers:PMs).
  --temp_files_dir is where intermediate bin files go (default: current dir).
  --output_directory is where final.bin and final.metadata go (default: current dir)
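
The ~2:1 split of --threads can be pictured with a small Python sketch. The rounding rule and the minimum of two threads are assumptions (the notes only say the split is approximately 2:1), and split_threads is a hypothetical helper, not part of MAF Counter.

```python
def split_threads(total):
    """Split a thread budget roughly 2:1 between readers and package managers."""
    if total < 2:
        # Assumption: each role needs at least one thread.
        raise ValueError("need at least one reader and one package manager thread")
    readers = (2 * total + 1) // 3  # nearest integer to two thirds of the budget
    return readers, total - readers

print(split_threads(12))
```

Under these assumptions, --threads 12 corresponds to 8 reader threads and 4 package manager threads.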

maf_counter_count Examples

Basic Example

./maf_counter_count --k 20 \
    --threads 12 \
    --binary_file_output final.bin \
    --temp_files_dir ./tmp \
    --output_directory ./analysis_20mers \
    ./hprc.maf

Advanced Example

./maf_counter_count --k 20 \
    --threads 12 \
    --binary_file_output final.bin \
    --temp_files_dir ./tmp \
    --min_a_score=10000 \
    --max_a_score=20000 \
    --min_q_level=7 \
    --max_q_level=F \
    --genome_ids=CHM13 \
    --output_directory ./analysis_20mers \
    ./hprc.maf
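
The score filters in the advanced example can be read as a range test on the score of each alignment block's a line. The Python sketch below shows that reading on a toy MAF string. Inclusive bounds, the helper names, and the toy data are assumptions, and the quality-level filters are not modeled.

```python
def split_blocks(maf_text):
    """Group MAF lines into blocks, each starting at an 'a' line."""
    blocks, current = [], None
    for line in maf_text.splitlines():
        if line == "a" or line.startswith("a "):
            current = [line]
            blocks.append(current)
        elif not line.strip():
            current = None  # a blank line ends the current block
        elif current is not None:
            current.append(line)
    return blocks

def block_score(block):
    """Return the score=<val> value from the block's 'a' line, or None."""
    for token in block[0].split()[1:]:
        key, _, value = token.partition("=")
        if key == "score":
            return float(value)
    return None

def filter_by_score(maf_text, min_score, max_score):
    """Keep blocks whose score lies in [min_score, max_score]."""
    kept = []
    for block in split_blocks(maf_text):
        score = block_score(block)
        # Assumption: both bounds are inclusive and unscored blocks are dropped.
        if score is not None and min_score <= score <= max_score:
            kept.append(block)
    return kept

# Toy data, invented for this example.
toy_maf = """##maf version=1

a score=5000
s CHM13.chr1 0 4 + 100 ACGT

a score=15000
s CHM13.chr1 4 4 + 100 TTGA

a score=25000
s CHM13.chr1 8 4 + 100 CCAT
"""

kept = filter_by_score(toy_maf, 10000, 20000)
print([block_score(block) for block in kept])
```

Only the block with score 15000 falls inside the 10000 to 20000 range, so it is the only one that would contribute k-mers under this reading.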

maf_counter_dump Usage

./maf_counter_dump --help
Usage:
  ./maf_counter_dump [options] <final.bin>

Options:
  --output_mode=<single|multiple>
        Output mode: single = single-file output using mmap (default),
                     multiple = per-genome output.
  --threads <VAL>           Number of threads to use (default: 1).
  --output_file <filename>  Output file name for single-file mode (default: final_sorted_<k>_dump.txt).
  --output_directory <DIR>  Directory for per-genome output files (default: current directory).
  --temp_files_dir <DIR>    Directory for intermediate files (default: current directory).

  <final.bin>              Input binary file generated by maf_counter_count.

maf_counter_dump Examples

Multiple Outputs Example

./maf_counter_dump \
    --output_mode=multiple \
    --threads 10 \
    --output_directory /home/user/genome_dumps \
    --temp_files_dir ./tmp \
    ./analysis_20mers/final.bin

Single Output Example

./maf_counter_dump \
    --output_mode=single \
    --threads 10 \
    --temp_files_dir ./tmp \
    ./analysis_20mers/final.bin

maf_counter_proteomes_count Usage

./maf_counter_proteomes_count
Usage:
  ./maf_counter_proteomes_count --k <kmer_size> [--threads N]
       [--binary_file_output <file>] <MAFFile>

maf_counter_proteomes_count Examples

./maf_counter_proteomes_count --k 5 --threads 4 ./hprc.maf

maf_counter_proteomes_dump Usage

./maf_counter_proteomes_dump --help
Usage:
  ./maf_counter_dump [options] <final.bin>

Options:
  --output_mode=<single|multiple>
        Output mode: single = single-file output using mmap (default),
                     multiple = per-genome output using mmap (updated version).
  --threads <VAL>           Number of threads to use (default: 1).
  --output_file <filename>  Output file name for single-file mode (default: final_sorted_<k>_dump.txt).
  --output_directory <DIR>  Directory for per-genome output files (default: current directory).
  --temp_files_dir <DIR>    Directory for intermediate files (default: current directory).

  <final.bin>              Input binary file generated by maf_counter_count.

maf_counter_proteomes_dump Examples

Multiple Outputs Example

./maf_counter_proteomes_dump \
    --output_mode=multiple \
    --threads 10 \
    --output_directory ./protein_10mers \
    --temp_files_dir ./tmp \
    ./final.bin

Single Output Example

./maf_counter_proteomes_dump \
    --output_mode=single \
    --threads 10 \
    --temp_files_dir ./tmp \
    ./final.bin

maf_counter_tools Usage

Usage:
  ./maf_counter_tools --std <topCount> --threads <N> --metadata_file <META> --binary_database <BIN>
  ./maf_counter_tools --expr <EXPR> --threads <N> --metadata_file <META> --binary_database <BIN>
  ./maf_counter_tools --gstats --threads <N> --metadata_file <META> --binary_database <BIN>
  ./maf_counter_tools --query <kmerListOr@file> --threads <N> --metadata_file <META> --binary_database <BIN>
  ./maf_counter_tools --query_regex <REGEX> --threads <N> --metadata_file <META> --binary_database <BIN>

Examples:
  ./maf_counter_tools --std 20 --threads 4 --metadata_file final.metadata --binary_database final.bin
  ./maf_counter_tools --expr "CHM13>100 && GRCh38>20" --threads 6 --metadata_file final.metadata --binary_database final.bin
  ./maf_counter_tools --gstats --threads 4 --metadata_file final.metadata --binary_database final.bin
  ./maf_counter_tools --query "ACGTACGT,AAAAAAAC" --threads 2 --metadata_file final.metadata --binary_database final.bin
  ./maf_counter_tools --query_regex "([gG]{3,}\w{1,7}){3,}[gG]{3,}" --threads 2 --metadata_file final.metadata --binary_database final.bin

maf_counter_tools Examples

Find the top 50 k-mers ranked by the standard deviation of their counts across genome IDs, using 8 threads

./maf_counter_tools --std 50 \
   --threads 8 \
   --metadata_file ./final.metadata \
   --binary_database ./final.bin
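
As a rough picture of what ranking by standard deviation means, the Python sketch below scores each k-mer by the spread of its counts across genome IDs and keeps the top entries. The counts are made up, top_by_std is a hypothetical helper, and the use of the population standard deviation is an assumption; the tool may use a different estimator.

```python
from statistics import pstdev

def top_by_std(kmer_counts, top_n):
    """Return the top_n (kmer, std) pairs, highest standard deviation first."""
    scored = [
        (kmer, pstdev(list(counts.values())))
        for kmer, counts in kmer_counts.items()
    ]
    # Sort by descending spread; break ties alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]

# Invented per-genome counts for three k-mers.
toy_counts = {
    "AAAAA": {"CHM13": 10, "HG002": 10},  # identical counts, std 0
    "ACGTA": {"CHM13": 0, "HG002": 8},    # std 4
    "TTTTT": {"CHM13": 3, "HG002": 5},    # std 1
}

print(top_by_std(toy_counts, 2))
```

Here ACGTA ranks first because its counts differ most between the two genomes, and AAAAA is dropped because its counts are identical.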

Filter k-mers based on genomeId counts

./maf_counter_tools --expr "HG002>50 || (HG003<10 && CHM13>30)" \
    --threads 8 \
    --metadata_file ./final.metadata \
    --binary_database ./final.bin
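
The filter expression above can be read as an ordinary boolean predicate over per-genome counts. The Python sketch below spells out that reading on invented counts. It assumes && and || mean logical AND and OR as in C; matches is a hypothetical helper, and this is not how MAF Counter evaluates expressions internally.

```python
def matches(counts):
    """Python reading of: HG002>50 || (HG003<10 && CHM13>30)"""
    return counts["HG002"] > 50 or (counts["HG003"] < 10 and counts["CHM13"] > 30)

# Invented per-genome counts for three k-mers.
toy_counts = {
    "AAAAA": {"HG002": 60, "HG003": 40, "CHM13": 5},  # passes: HG002 > 50
    "ACGTA": {"HG002": 10, "HG003": 2, "CHM13": 35},  # passes: HG003 < 10 and CHM13 > 30
    "TTTTT": {"HG002": 10, "HG003": 2, "CHM13": 30},  # fails: CHM13 is not > 30
}

selected = [kmer for kmer, counts in toy_counts.items() if matches(counts)]
print(selected)
```

The comparisons are strict, so a k-mer with CHM13 equal to 30 does not satisfy CHM13>30.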

Compute genomeId statistics (min, max)

./maf_counter_tools --gstats \
    --threads 8 \
    --metadata_file ./final.metadata \
    --binary_database ./final.bin
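
One plausible reading of per-genome min/max statistics is sketched below in Python: for each genome ID, take the smallest and largest count over all k-mers. The counts are invented and genome_min_max is a hypothetical helper; the statistics actually reported by --gstats may differ.

```python
def genome_min_max(kmer_counts):
    """Map each genome ID to its (min, max) k-mer count."""
    stats = {}
    for counts in kmer_counts.values():
        for genome_id, count in counts.items():
            low, high = stats.get(genome_id, (count, count))
            stats[genome_id] = (min(low, count), max(high, count))
    return stats

# Invented per-genome counts for three k-mers.
toy_counts = {
    "AAAAA": {"CHM13": 10, "HG002": 3},
    "ACGTA": {"CHM13": 0, "HG002": 8},
    "TTTTT": {"CHM13": 4, "HG002": 5},
}

for genome_id, (low, high) in genome_min_max(toy_counts).items():
    print(genome_id, "min:", low, "max:", high)
```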

Query k-mers using a comma-separated list of k-mers

./maf_counter_tools --query "AAAAA,TTTTT" \
    --threads 8 \
    --metadata_file ./final.metadata \
    --binary_database ./final.bin

Query k-mers using a file that lists one k-mer per line

./maf_counter_tools --query "@./kmers.txt" \
    --threads 8 \
    --metadata_file ./final.metadata \
    --binary_database ./final.bin
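
A k-mer list file for the @file form is plain text with one k-mer per line. The Python sketch below writes such a file after checking that every entry has the same length. write_kmer_file is a hypothetical helper, and the equal-length check is an assumption based on the database being built for a single value of k.

```python
from pathlib import Path

def write_kmer_file(path, kmers):
    """Write k-mers to a text file, one per line."""
    lengths = {len(kmer) for kmer in kmers}
    if len(lengths) != 1:
        # Assumption: all queried k-mers should share the database's k.
        raise ValueError(f"expected k-mers of one length, got lengths {sorted(lengths)}")
    Path(path).write_text("\n".join(kmers) + "\n")

write_kmer_file("kmers.txt", ["AAAAA", "TTTTT", "ACGTA"])
print(Path("kmers.txt").read_text(), end="")
```

The resulting file can then be passed to the query above as "@./kmers.txt".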

Query k-mers using a regular expression (equivalent regex for AAAAT)

./maf_counter_tools --query_regex "^A{4}T$" \
    --threads 8 \
    --metadata_file ./final.metadata \
    --binary_database ./final.bin
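
To check a pattern before running a query, it can help to try it on a few strings first. The Python sketch below does this for ^A{4}T$ using Python's re module, which is an assumption: the regex dialect used by maf_counter_tools is not stated here and may differ in details.

```python
import re

# Anchored pattern: exactly four A characters followed by a single T.
pattern = re.compile(r"^A{4}T$")

candidates = ["AAAAT", "AAAAA", "AAAATT", "TAAAAT"]
matched = [kmer for kmer in candidates if pattern.search(kmer)]
print(matched)
```

Only AAAAT matches, because the ^ and $ anchors rule out longer strings that merely contain it.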

License

This project is licensed under the GNU GPL v3.

Contact

For any questions or support, please contact