This script estimates the relative abundance of Escherichia coli phylogroups in metagenomic data by detecting E. coli k-mers in fastq files. It reads the phylogroup-specific k-mers stored in the ecoli_unique_kmers directory and writes the results in CSV format.
- Phylogroup detection: Identifies and counts specific k-mers associated with various E. coli phylogroups.
- Flexible input: Supports multiple fastq files for processing.
- Log tracking: Provides detailed execution logs for easy troubleshooting.
- Normalization: Outputs normalized relative abundance percentages for each phylogroup.
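The detection step listed above can be pictured as a sliding-window lookup. The following is a minimal sketch, not the script's actual code: it assumes one k-mer per line in each k-mer file, a fixed k, exact matching, and no reverse-complement handling. The function names `load_kmers`, `read_has_kmer`, and `count_reads_with_kmer` are hypothetical, and the toy sequences are invented for illustration.

```python
from typing import Iterable, Set


def load_kmers(lines: Iterable[str]) -> Set[str]:
    """Collect upper-cased k-mers, assuming one k-mer per non-empty line."""
    return {line.strip().upper() for line in lines if line.strip()}


def read_has_kmer(read: str, kmers: Set[str], k: int) -> bool:
    """Return True if any length-k window of the read is in the k-mer set."""
    read = read.upper()
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))


def count_reads_with_kmer(reads: Iterable[str], kmers: Set[str], k: int) -> int:
    """Count reads that contain at least one k-mer from the set."""
    return sum(1 for read in reads if read_has_kmer(read, kmers, k))


# Toy data (hypothetical; k = 4 only to keep the example readable).
toy_kmers = load_kmers(["ACGT", "TTGA"])
toy_reads = ["GGACGTCC", "CCCCCCCC", "ATTGAT", "ACG"]
print(count_reads_with_kmer(toy_reads, toy_kmers, k=4))  # 2
```

Storing the k-mers in a set keeps each window lookup constant-time on average, which matters when a phylogroup has hundreds of thousands of k-mers.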
- Python 3.11
- Biopython library
- Pandas library
- Clone the repository:
git clone https://github.com/marsfro/ecoli_counter.git
- Install the required Python libraries if you haven't already:
pip install biopython pandas
python kmers_ecoli_counter.py -i <input_fastq_directory> -o <output_csv_file> -k <kmers_directory> -l <log_file_path>
-i, --input: Path to the directory containing the metagenome fastq files.
-o, --output: Path to the output CSV file where the results will be saved.
-k, --kmers: Path to the directory containing the E. coli phylogroup-specific k-mer files.
-l, --log: Path to the log file to track script execution (default is "log_file.txt").
python kmers_ecoli_counter.py -i ./fastq -o output.csv -k ./ecoli_unique_kmers
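For readers who want to see how such an interface is typically wired up, here is a minimal `argparse` sketch of the options listed above. It is an illustration, not the script's source: the flag names and the log default come from this README, while marking `-i`, `-o`, and `-k` as required and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Count E. coli phylogroup k-mers in metagenome fastq files."
    )
    # Assumption: the three path options are mandatory.
    parser.add_argument("-i", "--input", required=True,
                        help="directory containing the metagenome fastq files")
    parser.add_argument("-o", "--output", required=True,
                        help="path of the output CSV file")
    parser.add_argument("-k", "--kmers", required=True,
                        help="directory containing the phylogroup k-mer files")
    # Default taken from the option description above.
    parser.add_argument("-l", "--log", default="log_file.txt",
                        help="path of the log file")
    return parser


# Parse the example command line shown above.
args = build_parser().parse_args(
    ["-i", "./fastq", "-o", "output.csv", "-k", "./ecoli_unique_kmers"]
)
print(args.input, args.output, args.kmers, args.log)
```

Because `-l` is omitted in the example, `args.log` falls back to `log_file.txt`.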
The script generates a CSV file with the following columns:
- Phylogroups: Names of the E. coli phylogroups.
- number_kmers: The number of k-mers identified for each phylogroup.
- The number of k-mers found for each phylogroup.
- The percentage of reads containing k-mers of each phylogroup, normalized by the total number of reads and by the number of k-mers in that phylogroup.
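The percentage of reads with k-mers in a phylogroup is calculated using the formula:
$$\text{Percentage} = \frac{N_{\text{reads}} \times \dfrac{500,000}{N_{\text{kmers}}}}{N_{\text{metagenome}}} \times 100$$
Where: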
- $N_{\text{reads}}$ = Number of reads containing at least one k-mer of the specified phylogroup.
- $500,000$ = The average number of k-mers across all E. coli phylogroups.
- $N_{\text{kmers}}$ = Number of k-mers in this phylogroup.
- $N_{\text{metagenome}}$ = Total number of reads in the metagenome.
This formula gives the percentage of metagenome reads that contain at least one k-mer of a given phylogroup. The read count is normalized by that phylogroup's k-mer count relative to the average of 500,000, so that phylogroups with different numbers of k-mers can be compared.
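As a worked example, the calculation can be sketched in a few lines of Python. This assumes the normalization takes the form N_reads × (500,000 / N_kmers) / N_metagenome × 100, which is a reading of the symbol definitions above rather than code taken from the script. The function name `normalized_percentage` and the input numbers are hypothetical.

```python
AVERAGE_KMERS = 500_000  # average number of k-mers across phylogroups, per the definitions above


def normalized_percentage(n_reads: int, n_kmers: int, n_metagenome: int,
                          average_kmers: int = AVERAGE_KMERS) -> float:
    """Percentage of reads with phylogroup k-mers, scaled by average_kmers / n_kmers.

    Assumed form: n_reads * (average_kmers / n_kmers) / n_metagenome * 100.
    """
    if n_kmers <= 0 or n_metagenome <= 0:
        raise ValueError("n_kmers and n_metagenome must be positive")
    # Multiply first and divide once to limit floating-point rounding.
    return 100.0 * n_reads * average_kmers / (n_kmers * n_metagenome)


# Hypothetical numbers: 500 matching reads, 250,000 phylogroup k-mers,
# 100,000 reads in the metagenome.
print(normalized_percentage(500, 250_000, 100_000))  # 1.0
```

In this example the phylogroup has half the average number of k-mers, so its raw share of 0.5% is scaled up by a factor of two.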
If you use this script in your research, please cite the following article:
- Title of the article