A Snakemake workflow for processing PacBio raw subreads.bam
into polished mRNA isoforms in FASTA format.
Optionally, long assembled mRNAs can be aligned against a genomic reference to generate a genomic annotation in GFF3 format.
The workflow follows the standard Iso-Seq analysis, which consists of the following steps:
- Get Circular Consensus Sequence (CCS) reads.
- Get Full Length (FL) reads.
- Get refined Full-Length, Non-Concatemer (FLNC) reads.
- Get transcript isoforms from (refined and clustered) FLNC reads.
- Optionally, align these transcript isoforms to a genome reference and create a GFF3 annotation file.
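
The core steps above can be sketched as a list of command lines. This is only an illustration, not the workflow's actual rules: the tool names (`ccs`, `lima`, `isoseq3 refine`, `isoseq3 cluster`) come from the PacBio pbbioconda distribution, while the function name, the file names, and the choice of flags are assumptions made for this example. The optional alignment step is left out because the aligner is not named here.

```python
from pathlib import Path


def isoseq_commands(subreads: str, primers: str, outdir: str = "results") -> list[list[str]]:
    """Build (but do not run) the command lines for the four core Iso-Seq steps.

    File names below are hypothetical placeholders, not the workflow's real outputs.
    """
    out = Path(outdir)
    ccs_bam = str(out / "sample.ccs.bam")
    fl_bam = str(out / "sample.fl.bam")
    flnc_bam = str(out / "sample.flnc.bam")
    clustered_bam = str(out / "sample.clustered.bam")
    return [
        # 1. Circular Consensus Sequence (CCS) reads from the raw subreads
        ["ccs", subreads, ccs_bam],
        # 2. Full-Length (FL) reads: remove the 5' and 3' cDNA primers
        ["lima", "--isoseq", ccs_bam, primers, fl_bam],
        # 3. FLNC reads: trim polyA tails and remove concatemers
        ["isoseq3", "refine", "--require-polya", fl_bam, primers, flnc_bam],
        # 4. Cluster the FLNC reads into transcript isoforms
        ["isoseq3", "cluster", "--use-qvs", flnc_bam, clustered_bam],
    ]


if __name__ == "__main__":
    for cmd in isoseq_commands("movie.subreads.bam", "primers.fasta"):
        print(" ".join(cmd))
```

Note that `lima` typically adds a primer-pair suffix to the output name it is given, so in a real run the file passed to `isoseq3 refine` would likely differ from `sample.fl.bam`.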
| name | abbreviation | explanation |
|------|--------------|-------------|
| Full-Length Reads | FL reads | CCS reads with 5’ and 3’ cDNA primers removed. |
| Full-Length, Non-Concatemer Reads | FLNC reads | CCS reads with 5’ and 3’ cDNA primers, polyA tail, and concatemers removed. |
| High-Quality Isoforms | HQ isoforms | Polished transcript sequences with predicted accuracy ≥99% and ≥2 FLNC reads. |
| Low-Quality Isoforms | LQ isoforms | Polished transcript sequences with predicted accuracy <99% and ≥2 FLNC reads. |
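
The HQ/LQ thresholds in the table can be written as a small classification rule. This is a minimal sketch: the function name and the `"unclassified"` label for isoforms supported by fewer than two FLNC reads are assumptions made for illustration, since the table does not say how such isoforms are reported.

```python
def classify_isoform(accuracy: float, n_flnc: int) -> str:
    """Classify a polished isoform using the thresholds from the table.

    accuracy: predicted accuracy as a fraction (0.99 means 99%).
    n_flnc:   number of FLNC reads supporting the isoform.
    """
    if n_flnc < 2:
        # Assumption: the table only covers isoforms with at least 2 FLNC reads.
        return "unclassified"
    return "HQ" if accuracy >= 0.99 else "LQ"


if __name__ == "__main__":
    for acc, n in [(0.995, 3), (0.98, 5), (0.999, 1)]:
        print(acc, n, classify_isoform(acc, n))
```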
The usage of this workflow is described in the Snakemake Workflow Catalog and also here.
If you use this workflow in a paper, please credit the authors by citing the URL of this (original) repository and its DOI (see above).
For each rule, a dedicated Conda/Mamba environment On the crunchomics cluster,
To install the `conda` package manager from the lightweight Miniconda distribution, follow the instructions here.
To install the `mamba` package manager, follow the instructions here.
This will be your starting environment with:
To create it, run `mamba env create -f config/environment.yaml` to install these three Python dependencies.
Snakemake will use the conda environments defined in `envs/` for each rule. It installs these environments with `mamba`, so make sure `mamba` is available by running `which mamba`.
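
The same availability check can be done from Python before launching the workflow. This is a minimal sketch using the standard library's `shutil.which`; the helper name `require_tool` is hypothetical and not part of this repository.

```python
import shutil


def require_tool(name: str = "mamba") -> str:
    """Return the full path of an executable found on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"'{name}' was not found on PATH; install it before running the workflow")
    return path


if __name__ == "__main__":
    try:
        print("mamba found at:", require_tool("mamba"))
    except RuntimeError as err:
        print(err)
```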
To run Snakemake interactively, execute `snakemake --use-conda -j X`, where `X` is the number of cores to use.
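
If you prefer not to hard-code `X`, the core count can be looked up at run time. The sketch below only builds the command line shown above; the helper name `snakemake_command` is an assumption for this example, and using every available core may not be appropriate on a shared machine.

```python
import os
from typing import Optional


def snakemake_command(cores: Optional[int] = None) -> list[str]:
    """Build the interactive Snakemake command, defaulting to all detected cores."""
    n = cores if cores is not None else (os.cpu_count() or 1)
    if n < 1:
        raise ValueError("the number of cores must be at least 1")
    return ["snakemake", "--use-conda", "-j", str(n)]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(snakemake_command()))
```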
Otherwise, submit your jobs through the SLURM job manager: `sbatch pacbio_snakemake_sbatch.sh`.
- Tijs Bliek, technician, Plant Development and Epigenetics, SILS, University of Amsterdam.
- Marc Galland, support data scientist, Plant Physiology, SILS, University of Amsterdam.
https://github.com/PacificBiosciences/pbbioconda
- Replace `<owner>` and `<repo>` everywhere in the template (also under `.github/workflows`) with the correct `<repo>` name and owning user or organization.
- Replace `<name>` with the workflow name (can be the same as `<repo>`).
- Replace `<description>` with a description of what the workflow does.
- The workflow will appear in the snakemake-workflow-catalog once it has been made public. Then the link under "Usage" will point to the usage instructions if `<owner>` and `<repo>` were correctly set.