feat: batch correction (#87)
* feat: providing ngs-test-data

* fix: deleted own old test data

* feat: allowing for batch correction

* feat: renaming 'batch_effect' to 'batch'

* fix: linter did not account for latest changes

* feat: actual consideration of the batch effect in the DE script

* fix: attempting with a mini default profile

* fix: test with workflow-profile flag

* fix: added missing unzip package to curl.env

* fix: typo

* fix: added missing design_factors

* fix: removed old test data

* fix: only considering one confounding variable during CI tests

* fix: renamed and added ncbi-datasets-cli as a package to env/reference.yml

* feat: replaced curl downloads with ncbi-datasets-cli download by accession number to avoid unstable URLs
cmeesters authored Sep 16, 2024
1 parent cd25504 commit f174574
Showing 15 changed files with 56 additions and 16,018 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -45,7 +45,7 @@ jobs:
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config-simple/config.yml --use-conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache --all-temp"
args: "--configfile .test/config-simple/config.yml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache --all-temp --workflow-profile .test/profile/"

# - name: Test report
# uses: snakemake/snakemake-github-action@v1.24.0
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule ".test/ngs-test-data"]
path = .test/ngs-test-data
url = git@github.com:snakemake-workflows/ngs-test-data.git
8,000 changes: 0 additions & 8,000 deletions .test/01.fq

This file was deleted.

8,000 changes: 0 additions & 8,000 deletions .test/02.fq

This file was deleted.

5 changes: 5 additions & 0 deletions .test/config-simple/config.yml
@@ -67,6 +67,11 @@ min_feature_expr: 3

# This section defines the deseq2 plot and data handling parameters
#
# the "design factors" are the confounding variables to be adjusted for
# during the normalization. They must be given in the configuration (samples.csv)
design_factors:
- "condition"
#
# The (log2) log fold change under the null hypothesis. (default: 0).
lfc_null: 0.1
#
2 changes: 1 addition & 1 deletion .test/config-simple/samples.csv
@@ -1,3 +1,3 @@
sample condition condition2 batch_effect platform purity
sample condition condition2 batch platform purity
01 male condition2 batch1 NANOPORE 1
02 female condition2 batch1 NANOPORE 1
1 change: 1 addition & 0 deletions .test/ngs-test-data
Submodule ngs-test-data added at 5166ea
2 changes: 2 additions & 0 deletions .test/profile/config.yaml
@@ -0,0 +1,2 @@
default-resources:
cpus_per_task: 2
8 changes: 7 additions & 1 deletion config/Mainz-MogonNHR/config.yml
@@ -66,7 +66,13 @@ min_gene_expr: 10
# Minimum transcript counts
min_feature_expr: 3

# This section defines the deseq2 plot and data handling parameters
# This section defines the pyDESeq2 plot and data handling parameters
#
# the "design factors" are the confounding variables to be adjusted for
# during normalization. They must be given in the configuration (samples.csv).
design_factors:
- "batch"
- "condition"
#
# The (log2) log fold change under the null hypothesis. (default: 0).
lfc_null: 0.1
2 changes: 1 addition & 1 deletion config/Mainz-MogonNHR/samples.csv
@@ -1,4 +1,4 @@
sample condition condition2 batch_effect platform purity
sample condition condition2 batch platform purity
m18_bc01 male condition2 batch1 NANOPORE 1
m18_bc02 male condition2 batch1 NANOPORE 1
m18_bc03 female condition2 batch1 NANOPORE 1
3 changes: 2 additions & 1 deletion workflow/envs/curl.yml → workflow/envs/reference.yml
@@ -1,4 +1,5 @@
channels:
- conda-forge
dependencies:
- curl>=8.8.0
- ncbi-datasets-cli
- unzip
2 changes: 1 addition & 1 deletion workflow/rules/commons.smk
@@ -20,7 +20,7 @@ samples = (
config["samples"],
),
sep=r"\s+",
dtype={"sample": str, "condition": str, "condition2": str, "batch_effect": str},
dtype={"sample": str, "condition": str, "condition2": str, "batch": str},
header=0,
comment="#",
)
40 changes: 30 additions & 10 deletions workflow/rules/ref.smk
@@ -5,33 +5,53 @@ localrules:

rule get_genome:
output:
genome="references/genomic.fa",
# generic name:
temp("ncbi_dataset.zip"),
params:
accession=config["accession"],
log:
"logs/refs/get_genome.log",
conda:
"../envs/curl.yml"
"../envs/reference.yml"
shell:
"""
curl -s -o data_genome.zip https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{params.accession}/download?include_annotation_type=GENOME_FASTA &> {log};
unzip -p data_genome.zip ncbi_dataset/data/{params.accession}/*.fna > references/genomic.fa 2> {log};
rm data_genome.zip &> {log}
datasets download genome accession {params.accession} --include gff3,genome &> {log}
"""


rule get_annotation:
rule extract_genome:
input:
rules.get_genome.output,
output:
"references/genomic.fna",
group:
"reference"
params:
accession=config["accession"],
log:
"logs/refs/extract_genome.log",
conda:
"../envs/reference.yml"
shell:
"""
unzip -p {input} ncbi_dataset/data/{params.accession}/*.fna > {output} 2> {log}
"""


rule extract_annotation:
input:
rules.get_genome.output,
output:
"references/genomic.gff",
group:
"reference"
params:
accession=config["accession"],
log:
"logs/refs/get_annotation.log",
conda:
"../envs/curl.yml"
"../envs/reference.yml"
shell:
"""
curl -s -o data_annotation.zip https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{params.accession}/download?include_annotation_type=GENOME_GFF &> {log};
unzip -p data_annotation.zip ncbi_dataset/data/{params.accession}/*.gff > references/genomic.gff 2> {log};
rm data_annotation.zip &> {log}
unzip -p {input} ncbi_dataset/data/{params.accession}/*.gff > references/genomic.gff 2> {log};
"""
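
The two extraction rules above stream members of the downloaded archive with `unzip -p ARCHIVE PATTERN > OUTPUT`. As a rough illustration of what that step does, the following Python sketch performs the same selection with the standard library. It is not part of the commit; the function name `extract_members` and the paths in the usage note are assumptions made for the example.

```python
import fnmatch
import zipfile


def extract_members(archive: str, pattern: str, destination: str) -> int:
    """Concatenate every archive member matching `pattern` into `destination`.

    Hypothetical helper that mimics `unzip -p ARCHIVE PATTERN > DESTINATION`.
    Returns the number of members written.
    """
    with zipfile.ZipFile(archive) as zf:
        # fnmatch's `*` also matches "/" in member names.
        matches = [name for name in zf.namelist() if fnmatch.fnmatch(name, pattern)]
        if not matches:
            # Fail loudly instead of leaving an empty output file behind.
            raise FileNotFoundError(f"no member matching {pattern!r} in {archive}")
        with open(destination, "wb") as out:
            for name in matches:
                out.write(zf.read(name))
    return len(matches)
```

Called as `extract_members("ncbi_dataset.zip", f"ncbi_dataset/data/{accession}/*.fna", "references/genomic.fna")`, with `accession` taken from the workflow configuration, this would write the genome FASTA. Raising when nothing matches avoids handing an empty reference file to downstream rules.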
2 changes: 1 addition & 1 deletion workflow/schemas/samples.schema.yaml
@@ -41,5 +41,5 @@ properties:
required:
- sample
- condition
- batch_effect
- batch
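
Since the required column is now `batch` instead of `batch_effect`, sample sheets that still use the old header will fail schema validation. The sketch below shows one way to check a whitespace-separated sample sheet for the required columns before running the workflow. It uses only the standard library rather than the workflow's own pandas and schema code, and the helper name `read_sample_sheet` is hypothetical.

```python
import re

# Columns listed as required in workflow/schemas/samples.schema.yaml after this commit.
REQUIRED_COLUMNS = ("sample", "condition", "batch")


def read_sample_sheet(text: str) -> list[dict[str, str]]:
    """Parse a whitespace-separated sample sheet and check required columns.

    Hypothetical helper: blank lines and lines starting with "#" are skipped,
    and the first remaining line is taken as the header.
    """
    rows = [
        re.split(r"\s+", line.strip())
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not rows:
        raise ValueError("sample sheet is empty")
    header, *records = rows
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"sample sheet lacks required column(s): {missing}")
    return [dict(zip(header, record)) for record in records]
```

Passing the contents of an updated `samples.csv` returns one dictionary per sample, while a sheet that still carries the `batch_effect` header raises a `ValueError` naming `batch` as missing.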

2 changes: 1 addition & 1 deletion workflow/scripts/de_analysis.py
@@ -41,7 +41,7 @@
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design_factors=["condition"],
design_factors=snakemake.config["design_factors"],
refit_cooks=True,
n_cpus=ncpus,
)
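
With this change the design factors come from the configuration rather than being hard-coded, so a name listed in `design_factors` that is not a column of the sample sheet (for example a leftover `batch_effect`) would presumably only surface once the DE script runs. A minimal sketch of an early check follows. The helper `check_design_factors` is hypothetical and not part of the commit or of pyDESeq2.

```python
def check_design_factors(design_factors, metadata_columns):
    """Return the design factors as a list if each one is a metadata column.

    Hypothetical guard: raises ValueError naming every factor that is absent
    from the sample metadata.
    """
    available = set(metadata_columns)
    missing = [factor for factor in design_factors if factor not in available]
    if missing:
        raise ValueError(
            f"design factor(s) {missing} not found in metadata columns "
            f"{sorted(available)}"
        )
    return list(design_factors)
```

In the script this could wrap the configuration value, for instance `check_design_factors(snakemake.config["design_factors"], metadata.columns)`, assuming `metadata` is the data frame passed to `DeseqDataSet`.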