This repository is an example of genomics data analysis with open science in mind. It can be used as a starting point. The sections below present best practices to make your data analysis more reproducible.
Write a Bash script that downloads any resource files, such as reference genomes and databases, so that file versions and locations are documented. The script also automates preparation of the analysis environment.
bash download_files.sh
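As a rough illustration of what such a script might contain, the sketch below downloads a file and appends its SHA-256 checksum, path, source URL, and download date to a manifest. The function names fetch_resource and record_resource, the manifest file name, and the example URL are assumptions made for illustration, not the contents of this repository's download_files.sh. It also assumes that wget and sha256sum (GNU coreutils) are installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical manifest file recording what was downloaded and from where.
MANIFEST="${MANIFEST:-resources.manifest.tsv}"

# record_resource FILE URL
# Appends checksum, file path, source URL and UTC date to the manifest.
record_resource() {
    local file="$1" url="$2" checksum
    checksum="$(sha256sum "$file" | cut -d ' ' -f 1)"
    printf '%s\t%s\t%s\t%s\n' "$checksum" "$file" "$url" "$(date -u +%Y-%m-%d)" >> "$MANIFEST"
}

# fetch_resource URL DEST
# Downloads URL to DEST (skipping files that already exist) and records it.
fetch_resource() {
    local url="$1" dest="$2"
    mkdir -p "$(dirname "$dest")"
    if [ ! -f "$dest" ]; then
        wget --quiet --output-document "$dest" "$url"
    fi
    record_resource "$dest" "$url"
}

# Example call (placeholder URL, replace with a real, versioned resource):
# fetch_resource "https://example.org/genome/v1/reference.fa.gz" "resources/reference.fa.gz"
```

Committing the manifest alongside the script gives readers a way to check that they are working with the same file versions.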
Provide files describing the data processing workflows (WDL, CWL, Nextflow) and their input files (normally JSON). This example uses the QA workflow available at https://github.com/labbcb/workflows, version 1.1.0. Also provide the command that executes the workflow.
wget https://github.com/broadinstitute/cromwell/releases/download/54/cromwell-54.jar
java -jar cromwell-54.jar run --options options.json --inputs inputs/qc.inputs.json workflows/qc.wdl
Use a .gitignore file to skip downloaded or generated files that you do not want to keep in the repository. Examples: input data, Cromwell files, and temporary files.
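A possible starting point is sketched below. The cromwell-executions/ and cromwell-workflow-logs/ directories are where Cromwell writes its outputs and logs by default when run from the working directory. The data/ and resources/ directory names are assumptions and should be adapted to the actual project layout.

```shell
# Append ignore rules to .gitignore (creates the file if it does not exist).
cat >> .gitignore <<'EOF'
# Cromwell executable, execution directories and logs
cromwell-*.jar
cromwell-executions/
cromwell-workflow-logs/

# Downloaded resources and input data (hypothetical directory names)
resources/
data/

# Temporary and R session files
*.tmp
.Rhistory
.RData
.Rproj.user/
EOF
```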
Use Rocker or Bioconductor base images to build your own Docker image containing all packages and software libraries required to run the data analysis. Keep Docker-related files separate from other files to reduce the Docker build context. Do not add data files to the Docker image. Tag your Docker image according to your analysis version.
Build Docker image:
docker build -t reproducible-analysis:1.1.0 docker
Run RStudio:
docker container run \
  --rm \
  --detach \
  -e DISABLE_AUTH=true \
  --volume "$PWD":/home/rstudio \
  --publish 8787:8787 \
  -e USERID="$(id -u)" \
  -e GROUPID="$(id -g)" \
  reproducible-analysis:1.1.0
RStudio Server will be available at http://localhost:8787.
Replace -e DISABLE_AUTH=true with -e PASSWORD=secret to set a password.
Compile the R Markdown file without running RStudio:
docker container run \
  --rm \
  --volume "$PWD":/home/rstudio \
  --user "$(id -u):$(id -g)" \
  -w /home/rstudio \
  reproducible-analysis:1.1.0 \
  R -e "rmarkdown::render('data-analysis.Rmd')"
Alternatively, use Docker Compose:
docker-compose up
RStudio Server will be available at http://localhost:8787.
Add --build to force Docker to rebuild the image. Add -d to detach the process from the terminal. Run docker-compose down to stop the service.
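For readers unfamiliar with Compose, the sketch below writes a Compose file that mirrors the docker container run options shown earlier. It is a hypothetical example, written to docker-compose.example.yml so that it does not overwrite the repository's own Compose file, whose actual contents may differ. The service name rstudio is an assumption.

```shell
# Write a hypothetical Compose file mirroring the "docker container run" options.
cat > docker-compose.example.yml <<'EOF'
version: "3"                           # needed by older docker-compose, ignored by recent releases
services:
  rstudio:
    build: docker                      # directory holding the Dockerfile
    image: reproducible-analysis:1.1.0
    environment:
      DISABLE_AUTH: "true"             # replace with PASSWORD to require login
    ports:
      - "8787:8787"
    volumes:
      - .:/home/rstudio
EOF

# Start the service from this file instead of the default docker-compose.yml:
# docker-compose -f docker-compose.example.yml up
```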
Use semantic versioning and GitHub releases.
For example, x.y.z, where:
x: major version of the data analysis. Change only when some software is replaced, such as the sequence aligner.
y: minor version. Change when some software is updated to a newer version.
z: patch version. Change when you have fixed a bug or typo.
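The versioning rule above can be sketched as a small Bash helper that computes the next version number. The function name bump_version is an assumption made for illustration. Resetting the lower components to zero follows the Semantic Versioning specification.

```shell
# bump_version VERSION PART
# Prints VERSION (x.y.z) with PART (major, minor or patch) incremented.
bump_version() {
    local version="$1" part="$2" major minor patch
    IFS=. read -r major minor patch <<< "$version"
    case "$part" in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown part: $part" >&2; return 1 ;;
    esac
    echo "$major.$minor.$patch"
}

# Example: the aligner was replaced, so the major version changes.
bump_version 1.1.0 major   # prints 2.0.0
# Example: a tool was updated to a newer release.
bump_version 1.1.0 minor   # prints 1.2.0
```

The result can then be used to tag both the Git release and the Docker image, so that the two stay in step.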
Data should be published to a public repository, such as NCBI GEO and NCBI, according to the type of data.
The repository is meant to stay private until the paper describing the analysis is published.