Example of reproducible genomics data analysis

This repository is an example of genomics data analysis with open science in mind. It can be used as a starting point. The sections below present best practices to make your data analysis more reproducible.

Download files

Write a Bash script that downloads any resource files, such as reference genomes and databases, to document file versions and locations. The script also automates preparation of the analysis environment.

bash download_files.sh
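As a rough sketch of what such a script might contain (not the actual contents of download_files.sh), the helper below downloads a file once and appends its URL, checksum, and date to a manifest. The names fetch, RESOURCE_DIR, and MANIFEST, as well as the example URL, are assumptions made for illustration. It also assumes curl and GNU sha256sum are installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; directory names and URLs are placeholders.
set -eu

RESOURCE_DIR="${RESOURCE_DIR:-resources}"
MANIFEST="${MANIFEST:-$RESOURCE_DIR/MANIFEST.tsv}"

# Download a URL into RESOURCE_DIR unless the file already exists,
# then record URL, SHA-256 checksum and UTC date in the manifest.
fetch() {
  url="$1"
  dest="$RESOURCE_DIR/$(basename "$url")"
  mkdir -p "$RESOURCE_DIR"
  if [ ! -f "$dest" ]; then
    curl --fail --location --silent --show-error --output "$dest" "$url"
  fi
  sum="$(sha256sum "$dest")"
  sum="${sum%% *}"   # keep only the hash, drop the file name
  printf '%s\t%s\t%s\n' "$url" "$sum" "$(date -u +%F)" >> "$MANIFEST"
}

# Example call with a placeholder URL; replace with a real resource:
# fetch "https://example.org/reference/genome.fa.gz"
```

Committing the manifest alongside the script gives later readers a record of which file versions the analysis used.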

Workflow and input files

Provide files describing data processing workflows (WDL, CWL, Nextflow) and input files (normally JSON). In this example we use the QA workflow available at https://github.com/labbcb/workflows, version 1.1.0. Provide the command to execute the workflow.

wget https://github.com/broadinstitute/cromwell/releases/download/54/cromwell-54.jar
java -jar cromwell-54.jar run --options options.json --inputs inputs/qc.inputs.json workflows/qc.wdl
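One possible way to keep the Cromwell version in a single place is sketched below. The variable names and the run_qc helper are hypothetical and not part of this repository. The file paths are the ones used in the command above.

```shell
# Pin the Cromwell release once so the jar name and URL stay in sync.
CROMWELL_VERSION="${CROMWELL_VERSION:-54}"
CROMWELL_JAR="cromwell-${CROMWELL_VERSION}.jar"
CROMWELL_URL="https://github.com/broadinstitute/cromwell/releases/download/${CROMWELL_VERSION}/${CROMWELL_JAR}"

# Hypothetical helper: download the jar only if it is missing,
# then run the workflow with the options and inputs shown above.
run_qc() {
  [ -f "$CROMWELL_JAR" ] || wget "$CROMWELL_URL"
  java -jar "$CROMWELL_JAR" run \
    --options options.json \
    --inputs inputs/qc.inputs.json \
    workflows/qc.wdl
}

# run_qc   # uncomment to execute the workflow
```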

Ignored files

Use a .gitignore file to skip downloaded or generated files that you do not want to keep in the repository. Examples: input data, Cromwell files, and temporary files.
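For illustration, the snippet below appends a few ignore patterns to a .gitignore file. The patterns are assumptions rather than the contents of this repository's .gitignore: data/ is a hypothetical input directory, and cromwell-executions/ and cromwell-workflow-logs/ are the directories Cromwell creates by default when run locally.

```shell
# Append illustrative ignore patterns; adjust them to your project layout.
GITIGNORE="${GITIGNORE:-.gitignore}"
cat >> "$GITIGNORE" <<'EOF'
# Cromwell jar and runtime directories
cromwell-*.jar
cromwell-executions/
cromwell-workflow-logs/
# Input data (hypothetical directory name)
data/
# R session files
.Rhistory
.RData
.Rproj.user/
EOF
```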

Docker image for data analysis

Use rocker or Bioconductor images as the base for your own Docker image containing all packages and software libraries required to run the data analysis. Keep Docker-related files separate from other files to reduce the Docker build context. Do not add data files to the Docker image. Tag your Docker image according to your analysis version.

Build Docker image:

docker build -t reproducible-analysis:1.1.0 docker

Run RStudio:

docker container run \
  --rm \
  --detach \
  -e DISABLE_AUTH=true \
  --volume "$PWD":/home/rstudio \
  --publish 8787:8787 \
  -e USERID="$(id -u)" \
  -e GROUPID="$(id -g)" \
  reproducible-analysis:1.1.0

RStudio Server will be available at http://localhost:8787.

Replace -e DISABLE_AUTH=true with -e PASSWORD=secret to set a password.

Compile the R Markdown file without running RStudio:

docker container run \
  --rm \
  --volume "$PWD":/home/rstudio \
  --user "$(id -u):$(id -g)" \
  -w /home/rstudio \
  reproducible-analysis:1.1.0 \
  R -e "rmarkdown::render('data-analysis.Rmd')"

Use Docker Compose:

docker-compose up

RStudio Server will be available at http://localhost:8787.

Add --build to force Docker to rebuild the image.

Add -d to detach the process from the terminal. Run docker-compose down to stop the service.

Versioning

Use semantic versioning and GitHub releases. For example, x.y.z, where:

x: major version of the data analysis. Change only when some software is replaced, such as the sequence aligner.
y: minor version. Change when some software is updated to a newer version.
z: patch version. Change when you fix a bug or typo.
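The scheme above can be sketched as a small shell function that increments one part of an x.y.z version and resets the parts to its right. The function name bump_version is hypothetical and only meant to illustrate the rule.

```shell
# Hypothetical helper: bump one part of an x.y.z version string.
# The body runs in a subshell so its variables do not leak.
bump_version() (
  version="$1"
  part="$2"
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;  # software replaced
    minor) minor=$((minor + 1)); patch=0 ;;           # software updated
    patch) patch=$((patch + 1)) ;;                    # bug or typo fixed
    *) echo "unknown part: $part" >&2; exit 1 ;;
  esac
  echo "${major}.${minor}.${patch}"
)

# Example: updating a tool in analysis version 1.1.0
bump_version 1.1.0 minor
```

The resulting string can then be used both as the Git tag for the GitHub release and as the Docker image tag, so the two stay aligned.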

Data sharing

Data should be published to a public repository, such as NCBI GEO and NCBI, according to the type of data.

Licensing

Repositories are meant to be private until the paper describing the analysis is published.

