Developing and sharing reproducible bioinformatics workflows

Python

Written by tdarde

February 22, 2022

There’s a snake in my boot

Life-science research is nowadays conducted through multi-disciplinary, multi-centre studies. In this context, the same software components must be deployed in multiple environments, which most of the time results in reproducibility and scalability issues. Besides, data analysis pipelines are usually composed of multiple, continuously evolving components, which leads to maintenance and long-term support challenges. To promote the FAIR principles, providing controlled software environments (Docker, Conda, Singularity) and data analysis workflows (Snakemake, Nextflow, Galaxy, CWL) should be mandatory.

When you’re about to publish your work in a scientific journal, you have to explain to future readers how you got your results. In theory, this should allow a bioinformatician reader to start from your raw data and obtain the same results as you. Here we touch on a critical point in bioinformatics: reproducibility. What tools were used, in which versions, with which parameters? It seems anecdotal, but this information is what allows anyone to duplicate the analysis described in your paper. In my opinion, a scientific article using or describing a bioinformatics analysis SHOULD be associated with a pipeline or script that reproduces the results identically.

One way to deal with this is to use a workflow engine such as Snakemake inside a controlled environment like Conda.

Conda — Welcome to my environment

According to the Conda website:

Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer.

Briefly, Conda makes it possible to install different versions of a program side by side on the same computer. An environment can be exported and imported on a different machine, which lets you test new software without breaking any other running program (very useful, for instance, for R packages that are only available in a specific version or are incompatible with other packages).
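As a quick sketch of this idea (the version numbers below are only illustrative, pick the ones you actually need), you could keep two samtools versions side by side in separate environments:

# Two environments with different samtools versions living side by side
conda create -n samtools-1.9 -c bioconda samtools=1.9
conda create -n samtools-1.15 -c bioconda samtools=1.15
# Switch between them as needed
conda activate samtools-1.9
samtools --version
conda activate samtools-1.15
samtools --version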

Installing Conda

I will not describe the installation process in this post. You can find how to install Conda on your computer on the Conda installation page. Select your system and follow the steps to complete the installation. To learn more about how to use Conda, you can read the manual page or the cheat sheet.

After installation, Conda already provides a first environment called base. You can list all environments with:

conda env list

For this example, let’s create a new environment called snakemake-test, activate it, and then install the latest versions of Snakemake, samtools, and bwa:

conda create --name snakemake-test
conda activate snakemake-test
conda install -c bioconda samtools bwa snakemake

Once your environment is set up, you can still install new packages using the commands below. You can find the list of available packages here.

conda install -c bioconda fastqc
conda search star # to search available packages

Once you have a complete environment that you would like to share with someone, you can export its settings and the list of installed packages:

conda env export -n snakemake-test > snakemake-test.yml
# Create the same environment on another system
conda env create -f snakemake-test.yml
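The exported file is plain YAML. It will look roughly like the sketch below; on a real system the versions will differ, and conda env export also pins every transitive dependency and build string:

# snakemake-test.yml (illustrative content only)
name: snakemake-test
channels:
  - bioconda
  - defaults
dependencies:
  - bwa=0.7.17
  - samtools=1.15
  - snakemake=7.3.8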

Enhance analysis reproducibility with Snakemake

Snakemake is a Pythonic variant of GNU Make. It’s a workflow management system that allows you to create reproducible and scalable data analyses. Combined with Conda, it becomes easy to share and reproduce the data analysis described in a scientific paper.

Just like GNU Make, Snakemake rests on the principle of rules. A rule is a set of elements leading to the creation of a file; in other words, it corresponds to a step in the pipeline. Each rule has at least one output file (output) and one or more input files (input).

Snakemake allows you to create a set of rules, each one defining a step of your analysis. The rules need to be written in a file called Snakefile. For each step you need to provide:

  • The input: Data files, scripts, executables, or any other files.
  • The expected output. It’s not required to list all possible outputs, just those that you want to monitor or that are used by a subsequent step as input.
  • A command to run to process the input and create the output.

Here is an example of a basic rule:

rule myname:
    input: ['myinput1', 'myinput2']
    output: ['myoutput']
    shell: 'Some command to go from in to out'

An example: if you want to copy some text from a file called input.txt to output.txt, you can do:

rule copy:
    input: 'input.txt'
    output: 'output.txt'
    shell: 'cp input.txt output.txt'

You can even avoid typos by using the {input} and {output} placeholders instead of typing the filenames twice:

rule merge_files:
    input: ['input_1.txt', 'input_2.txt']
    output: 'output.txt'
    shell: 'cat {input[0]} > {output} && cat {input[1]} >> {output}'

Input and output can also be parametrised using wildcards:

rule copy_and_echo:
    input: 'input/{filename}.txt'
    output: 'output/{filename}.txt'
    shell: 'echo {wildcards.filename} && cp {input} {output}'

If you then write another rule that takes output/a_file.txt and output/another_file.txt as inputs, they will be automatically created by the copy_and_echo rule.

rule all:
    input: ['output/a_file.txt', 'output/another_file.txt']

This makes rules reusable: for example, a single rule can be used to process data from different samples or conditions.

Notice that:

  • Inputs and outputs can be of any type
  • You can provide Python code after the tags, e.g. input: glob("*.root") (see the sketch after this list)
  • If a single file is input or output you are allowed to omit the brackets
  • Wildcards must always be present in the output of a rule (else it wouldn’t be possible to know what they should be)
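For instance, here is a minimal sketch of a Snakefile that uses plain Python to discover samples on disk and build the list of targets. The raw_data/ layout, the sample naming, and the FastQC step are only assumptions for the sake of illustration:

# A Snakefile is a Python file: code outside rules is executed as-is
from glob import glob

# Hypothetical layout: raw_data/sampleA.fastq.gz, raw_data/sampleB.fastq.gz, ...
SAMPLES = [path.split("/")[-1].replace(".fastq.gz", "") for path in glob("raw_data/*.fastq.gz")]

rule all:
    # expand() builds one target path per sample discovered above
    input: expand("fastqc/{sample}_fastqc.html", sample=SAMPLES)

rule fastqc:
    input: "raw_data/{sample}.fastq.gz"
    output: "fastqc/{sample}_fastqc.html"
    shell: "fastqc {input} -o fastqc"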

And now that your Snakefile is done, it’s time to run! Just type:

snakemake --cores 1 rulename_or_filename

This will:

  1. Check that the inputs exist (if an input does not exist or has changed, Snakemake will check whether there is another rule that produces it)
  2. Run the command you defined in rulename_or_filename (or the rule that generates the filename that is given)
  3. Check that the output was actually produced.
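Before launching anything for real, it can be useful to do a dry run first, which only prints the jobs and shell commands Snakemake would execute:

# -n: dry run (nothing is executed), -p: also print the shell commands
snakemake -n -p --cores 1 rulename_or_filename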

A few comments on which rules are run:

  • If you want to run a chain of rules only up to a certain point, just put the name of the rule up to which you want to run on the snakemake command line.
  • If you want a rule to be “standalone”, just do not use its inputs/outputs as the outputs/inputs of any other rule.
  • It is common practice to put as the first rule a dummy rule that only takes as inputs all the “final” outputs you want to be created by the other rules. This way, when you run snakemake with no target, it will run all the rules (in the correct order).

A basic Snakemake recipe: BWA alignment

First create a working directory:

mkdir -p ~/snakemake_test
cd ~/snakemake_test
mkdir raw_data
cd raw_data

Download some data available here:

wget 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR228/004/ERR2281864/ERR2281864_1.fastq.gz'
wget 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR228/004/ERR2281864/ERR2281864_2.fastq.gz'

Download the reference genome:

cd ..
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gzip -d hg38.fa.gz

Then edit your Snakefile:

################################ BWA mapping example ################################

# Expected output files
rule target:
    input:
        "hg38.fa.bwt",
        "mapping/hg38/ERR2281864.sort.bam.bai"

# Reference genome index for bwa alignment
rule bwa_index:
    input:
        "{genome}.fa"
    output:
        "{genome}.fa.bwt"
    message:
        "Indexing reference genome: {wildcards.genome}"
    shell:
        "bwa index {input}"

# Reads alignment
rule bwa_aln:
    input:
        bwt = "{genome}.fa.bwt",
        fasta = "{genome}.fa",
        fq = "raw_data/{samples}.fastq.gz"
    output:
        temp("mapping/{genome}/{samples}.sai")
    log:
        "mapping/{genome}/{samples}.bwa_aln.log"
    message:
        "Alignment of {wildcards.samples} on {wildcards.genome}"
    shell:
        "bwa aln -t {threads} {input.fasta} {input.fq} > {output} 2> {log}"

# Create bam file for paired-end files
rule bwa_sampe_to_bam:
    input:
        fasta = "{genome}.fa",
        sai1 = "mapping/{genome}/{samples}_1.sai",
        fq1 = "raw_data/{samples}_1.fastq.gz",
        sai2 = "mapping/{genome}/{samples}_2.sai",
        fq2 = "raw_data/{samples}_2.fastq.gz"
    output:
        temp("mapping/{genome}/{samples}.bam")
    log:
        "mapping/{genome}/{samples}.sampe.log"
    message:
        "PE: sai --> bam ({wildcards.genome}/{wildcards.samples})"
    shell:
        "bwa sampe {input.fasta} {input.sai1} {input.sai2} "
        "{input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"

# Sort bam file
rule sort_bam:
    input:
        "{base}.bam"
    output:
        protected("{base}.sort.bam")
    log:
        "{base}.sort.log"
    message:
        "Sort {wildcards.base}.bam"
    params:
        compression = "9",
        max_mem = "1G"
    shell:
        "samtools sort -l {params.compression} -m {params.max_mem} {input} -o {output} 2> {log}"

# Index bam file
rule index_bam:
    input:
        "{base}.sort.bam"
    output:
        "{base}.sort.bam.bai"
    shell:
        "samtools index {input}"

This tells Snakemake to align the two downloaded fastq files on the human reference genome, then to sort and index the resulting alignment file. It’s a good time to take a closer look at the resulting workflow, by executing:

snakemake --dag | dot -Tpdf > dag.pdf

Workflow representation

The graph contains a node for each job and edges representing the dependencies. Jobs that don’t need to be run because their output is up-to-date are dashed. For rules with wildcards, the value of the wildcard for the particular job is displayed in the job node.

Now you can run the workflow:

snakemake -p --cores 1

Conclusion

This article aims to present one of the many ways to improve reproducibility in data analysis in bioinformatics. You are now able to create a controlled development environment and an analysis workflow, and to share them with whoever you want.

The approach presented here is still very simplistic. I haven’t addressed certain interesting concepts, such as using Conda or Docker environments directly in the rules of the Snakefile, all the other options available for a rule, or how to use such a workflow on computing infrastructure (cluster, AWS, Google Cloud …).
