Welcome to the DeepChopper tutorial! This guide will walk you through the process of identifying and removing chimeric artificial reads in Nanopore direct-RNA sequencing data. Whether you're new to bioinformatics or an experienced researcher, this tutorial will help you get the most out of DeepChopper.
Before we begin, ensure you have the following installed:
- DeepChopper (latest version)
- Dorado (Oxford Nanopore's basecaller)
- Samtools (for BAM to FASTQ conversion)
- Sufficient storage space for Nanopore data
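Before starting, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of DeepChopper itself; the helper name check_tools is hypothetical, and the tool names are simply the commands used later in this tutorial.

```shell
# Report any required command that is not on PATH.
# Prints one "missing: <tool>" line per absent tool and returns non-zero
# if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical call for this tutorial:
# check_tools deepchopper dorado samtools
```

If the call prints nothing and returns 0, all listed tools were found.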
Start by obtaining your Nanopore direct-RNA sequencing data (POD5 files).
# Example: Download sample data (replace with your actual data source)
wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/200cases.pod5
💡 Tip: Organize your data in a dedicated project folder for easy management.
Convert raw signal data to nucleotide sequences using Dorado.
# Install Dorado (if not already installed)
# Run Dorado without trimming
dorado basecaller --no-trim rna002_70bps_hac@v3 200cases.pod5 > raw_no_trim.bam
# Convert BAM to FASTQ
samtools view raw_no_trim.bam -d dx:0 | samtools fastq > raw_no_trim.fastq
Use the --no-trim option to preserve potential chimeric sequences.
Replace 200cases.pod5 with the directory containing your POD5 files.
The output will be a FASTQ file containing the basecalled sequences.
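A quick sanity check after conversion is to count the reads in the FASTQ file. The sketch below assumes the usual four-lines-per-record FASTQ layout with unwrapped sequence lines; the helper name count_fastq_reads is hypothetical.

```shell
# Count records in an uncompressed FASTQ file, assuming each record
# occupies exactly four lines (header, sequence, "+", qualities).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Example:
# count_fastq_reads raw_no_trim.fastq
```

A count of 0 usually means the conversion step produced no output and is worth investigating before encoding.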
Note: For convenience, you can download a pre-prepared FASTQ file directly:
wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/raw_no_trim.fastq
Prepare your data for the prediction model:
# Encode the FASTQ file
deepchopper encode raw_no_trim.fastq
For large datasets, use chunking to avoid memory issues:
deepchopper encode raw_no_trim.fastq --chunk --chunk-size 100000
🔍 Output: Look for raw_no_trim.parquet or multiple .parquet files under raw_no_trim.fq_chunks if chunking.
Analyze the encoded data to identify potential chimeric reads:
# Predict artificial sequences for reads
deepchopper predict raw_no_trim.parquet --output predictions
# Predict artificial sequences for reads using GPU
deepchopper predict raw_no_trim.parquet --output predictions --gpus 2
For chunked data:
deepchopper predict raw_no_trim.fq_chunks/raw_no_trim.fq_0.parquet --output predictions_chunk1
deepchopper predict raw_no_trim.fq_chunks/raw_no_trim.fq_1.parquet --output predictions_chunk2
📊 Results: Check the predictions folder for output files.
This step analyzes the encoded data and writes predictions indicating, for each read, whether it is likely to be chimeric.
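When encoding produces many chunks, typing one predict command per file gets tedious. The per-chunk commands shown above can be generated in a loop, sketched here under a few assumptions: the helper name predict_chunks and the DRY_RUN switch are hypothetical, and the output folders are simply numbered in the order the shell lists the chunk files.

```shell
# Run (or, with DRY_RUN=1, only print) one predict command per chunk file.
# $1: directory holding the chunked .parquet files
# $2: prefix for the per-chunk output folders (e.g. predictions_chunk)
predict_chunks() {
    chunk_dir=$1
    out_prefix=$2
    i=1
    for f in "$chunk_dir"/*.parquet; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "deepchopper predict $f --output ${out_prefix}${i}"
        else
            deepchopper predict "$f" --output "${out_prefix}${i}"
        fi
        i=$((i + 1))
    done
}

# Example (prints the commands without running them):
# DRY_RUN=1 predict_chunks raw_no_trim.fq_chunks predictions_chunk
```

Note that the shell sorts file names lexicographically, so with ten or more chunks the folder numbers will not follow the chunk indices. Each chunk still gets its own output folder.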
Remove identified artificial sequences:
# Chop artificial sequences
deepchopper chop predictions/0 raw_no_trim.fastq
For chunked predictions:
deepchopper chop predictions_chunk1/0 predictions_chunk2/0 raw_no_trim.fastq
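With many chunks, the list of prediction folders passed to chop can be collected automatically. This is a minimal sketch that assumes every per-chunk output folder shares a common prefix and contains a 0 entry, as in the commands above; the helper name chop_chunks and the DRY_RUN switch are hypothetical.

```shell
# Run (or, with DRY_RUN=1, only print) a single chop command covering
# every per-chunk prediction path that matches <prefix>*/0.
# $1: prefix of the per-chunk prediction folders
# $2: original FASTQ file
chop_chunks() {
    out_prefix=$1
    fastq=$2
    set --                        # reuse the positional parameters as a list
    for d in "$out_prefix"*/0; do
        [ -e "$d" ] || continue   # skip the literal pattern when nothing matches
        set -- "$@" "$d"
    done
    if [ "$#" -eq 0 ]; then
        echo "no prediction folders found for prefix: $out_prefix" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "deepchopper chop $* $fastq"
    else
        deepchopper chop "$@" "$fastq"
    fi
}

# Example (prints the command without running it):
# DRY_RUN=1 chop_chunks predictions_chunk raw_no_trim.fastq
```

Running it with DRY_RUN=1 first lets you confirm that every chunk is included before chopping.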
🎉 Success: Look for the output file with the .chop.fq.bgz suffix.
This command takes the original FASTQ file (raw_no_trim.fastq) and the predictions (predictions), and produces a new FASTQ file (with suffix .chop.fq.bgz) with the chimeric artifacts chopped.
The default output is a compressed file in BGZIP format.
We can use zless -S OUTPUT to view the output file contents in a terminal.
The -S flag prevents line wrapping, making it easier to read long sequences.
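Beyond paging through the file, a short summary of read count and length can be handy. BGZIP output is gzip-compatible, so standard gzip can decompress it. The sketch below assumes four-line FASTQ records; the helper name fastq_gz_length_summary is hypothetical, and OUTPUT stands for your .chop.fq.bgz file.

```shell
# Summarize a gzip- or BGZIP-compressed FASTQ file: number of reads,
# total bases, and mean read length (four lines per record assumed).
fastq_gz_length_summary() {
    gzip -dc < "$1" | awk '
        NR % 4 == 2 { n++; total += length($0) }
        END {
            if (n == 0) print "reads=0 bases=0"
            else printf "reads=%d bases=%d mean_len=%.1f\n", n, total, total / n
        }'
}

# Example:
# fastq_gz_length_summary OUTPUT
```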
The default parameters used in DeepChopper are optimized based on extensive testing and validation during our research, as detailed in our paper. These parameters have been shown to provide robust and reliable results across a wide range of sequencing data.
In general, the whole process will take around 20-30 minutes for the demo data, but processing time may vary depending on your machine's specifications and whether you use CPU or GPU acceleration.
- Explore advanced DeepChopper options with deepchopper --help
- Use your cleaned data for downstream analyses
- Check our documentation for integration with other bioinformatics tools
- Issue: Out of memory errors when encoding
  Solution: Try using the --chunk option in the encode step
- Issue: Out of memory errors for CPU or CUDA (GPU) when predicting
  Solution: Try lowering the --batch-size value
- Issue: Slow processing
  Solution: Ensure you're using GPU acceleration if available
- Issue: Unexpected results
  Solution: Verify input data quality and check DeepChopper version
- Issue: GPU driver compatibility error
  Solution: Update your GPU driver or install a compatible PyTorch version, e.g., pip install torch --force-reinstall --index-url https://download.pytorch.org/whl/cu118 to install a CUDA 11.8 compatible version.
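For the out-of-memory case during prediction, lowering the batch size by hand can be automated. The following is only a sketch: the helper name predict_with_backoff and the PREDICT_CMD override are hypothetical, the batch sizes in the example are illustrative rather than recommended values, and the loop retries after any failure, not just memory errors.

```shell
# Retry prediction with each batch size in turn until one succeeds.
# $1: encoded .parquet input
# $2: output folder
# remaining arguments: batch sizes to try, in order
# PREDICT_CMD may be set to substitute another command (e.g. for testing);
# it defaults to "deepchopper predict".
predict_with_backoff() {
    input=$1
    output=$2
    shift 2
    for bs in "$@"; do
        echo "trying --batch-size $bs" >&2
        if ${PREDICT_CMD:-deepchopper predict} "$input" --output "$output" --batch-size "$bs"; then
            return 0
        fi
    done
    return 1
}

# Example:
# predict_with_backoff raw_no_trim.parquet predictions 16 8 4
```

If every attempt fails, the function returns non-zero, which suggests the problem is something other than batch size.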
For more help, visit our GitHub Issues page.
Happy sequencing, and may your data be artificial-chimera-free! 🧬🔍