Snakefiles

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How do I write a simple workflow?

Objectives

Understand the components of a Snakefile: rules, inputs, outputs, and actions.

Write a simple Snakefile.

Run Snakemake from the shell.

Create a file, called Snakefile, with no file extension and containing the following content:

# Count words.
rule count_words:
    input:    'books/isles.txt'
    output:   'isles.dat'
    shell:    'python wordcount.py books/isles.txt isles.dat'

This is a build file, which for Snakemake is called a Snakefile — a file executed by Snakemake. Note that aside from a few keyword additions like rule, it follows standard Python 3 syntax.

Let us go through each line in turn:

# denotes a comment. Any text from # to the end of the line is ignored by Snakemake.
isles.dat is a target, a file to be created, or built. In Snakemake, these are called “outputs”, for simplicity’s sake.
books/isles.txt is a dependency, a file that is needed to build or update the target. Targets can have zero or more dependencies. Dependencies in Snakemake are called “inputs”.
python wordcount.py books/isles.txt isles.dat is an action, a command to run to build or update the target using the dependencies. In this case the action is a set of shell commands (we can also use Python code… more on that later).
Like Python, you can use either tabs or spaces for indentation — don’t use both!
Together, the target, dependencies, and actions form a rule. A rule is a recipe for how to make things.

Our rule above describes how to build the target isles.dat using the action python wordcount.py and the dependency books/isles.txt.

Information that was implicit in our shell script - that we are generating a file called isles.dat and that creating this file requires books/isles.txt - is now made explicit by Snakemake’s syntax.

Let’s first ensure we start from scratch and delete the .dat and .png files we created earlier:

$ rm *.dat *.png

By default, Snakemake looks for a file called Snakefile, and we can run Snakemake as follows:

$ snakemake

By default, Snakemake tells us what it’s doing as it executes actions:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

Depending on your setup, you may receive the error “Error: you need to specify the maximum number of CPU cores to be used at the same time with --cores.” This can be fixed by passing an argument to the snakemake command. Try running the following:

$ snakemake --cores 1

If you see a different error, check your syntax and the filepaths of the files in your Snakefile. You can check your present working directory using the command pwd.

Remember, aside from keywords such as rule and input, Snakemake follows Python syntax. Let’s see if we got what we expected:

$ head -5 isles.dat

The first 5 lines of isles.dat should look exactly like before.

Snakefiles Do Not Have to be Called Snakefile

We don’t have to call our Snakefile Snakefile. However, if we call it something else, we need to tell Snakemake where to find it. We can do this using the -s flag. For example, if our Snakefile is named MyOtherSnakefile:

$ snakemake -s MyOtherSnakefile

When we re-run our Snakefile, Snakemake now informs us that:

Nothing to be done.

This is because our target, isles.dat, has now been created, and Snakemake will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

$ touch books/isles.txt

If we compare the timestamps of books/isles.txt and isles.dat,

$ ls -l books/isles.txt isles.dat

then we see that isles.dat, the target, is now older than books/isles.txt, its dependency:

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 books/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 isles.dat

If we run Snakemake again,

$ snakemake

then it recreates isles.dat:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

When it is asked to build a target, Snakemake checks the “last modification time” of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
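The timestamp comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Snakemake's actual implementation: the function name needs_rebuild is hypothetical, and the sketch assumes that targets and dependencies are ordinary files on disk.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    Hypothetical helper illustrating the timestamp rule; it is not
    part of Snakemake.
    """
    if not os.path.exists(target):
        # A target that does not exist yet always has to be built.
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any dependency was modified after the target.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

With the files from this lesson, needs_rebuild('isles.dat', ['books/isles.txt']) would be expected to return True right after touching books/isles.txt, and False once isles.dat has been regenerated.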

Snakefiles as Documentation

By explicitly recording the inputs to and outputs from steps in our analysis and the dependencies between files, Snakefiles act as a type of documentation, reducing the number of things we have to remember.

Let’s add another rule to the end of Snakefile. Note that rules cannot have the same name, so we’ll call this one count_words_abyss.

rule count_words_abyss:
    input:    'books/abyss.txt'
    output:   'abyss.dat'
    shell:    'python wordcount.py books/abyss.txt abyss.dat'

If we run Snakemake,

$ snakemake

then we get:

Nothing to be done.

Nothing happens because Snakemake attempts to build the first target it finds in the Snakefile, the default target. Here that is isles.dat, which is already up to date. We need to explicitly tell Snakemake we want to build abyss.dat:

$ snakemake abyss.dat

Now, we get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words_abyss
        1

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done
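The choice of default target seen in this example can be pictured with a short Python sketch. It is an illustration only: select_targets and the list of rule names are hypothetical and not part of Snakemake, and the sketch assumes the two rules defined so far.

```python
# Rule names in the order they appear in the Snakefile at this point
# (an assumption for illustration).
rule_names = ["count_words", "count_words_abyss"]


def select_targets(requested, rules_in_file_order):
    """Use the requested targets if any were given, else the first rule."""
    if requested:
        return list(requested)
    # No target on the command line: fall back to the first rule in the file.
    return [rules_in_file_order[0]]
```

Calling select_targets([], rule_names) picks count_words, which is why a plain snakemake call only considers isles.dat, while select_targets(["abyss.dat"], rule_names) keeps the explicitly requested file.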

“Up to Date” Versus “Nothing to be Done”

If we ask Snakemake to build a file that already exists and is up to date, then Snakemake informs us that:

Nothing to be done

If we ask Snakemake to build a file that exists but for which there is no rule in our Snakefile, then we get a message like:

$ snakemake wordcount.py

MissingRuleException:
No rule to produce wordcount.py (if you use input functions make sure
that they don't raise unexpected exceptions).

When we see this error, we should double-check that we have a rule to produce that file, and also that the filename has been specified correctly. Even a small difference in a filename will result in a MissingRuleException.
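Why a small difference in a filename matters can be shown with a minimal Python sketch of an exact-match lookup. The names MissingRuleError, RULES_BY_OUTPUT, and find_rule are hypothetical stand-ins for illustration and are not Snakemake's API. The sketch only covers targets given as file names.

```python
class MissingRuleError(Exception):
    """Stand-in for the error raised when no rule produces a file."""


# Hypothetical in-memory picture of the rules in this lesson:
# each output file maps to the name of the rule that creates it.
RULES_BY_OUTPUT = {
    "isles.dat": "count_words",
    "abyss.dat": "count_words_abyss",
}


def find_rule(target):
    """Return the name of the rule whose output matches target exactly."""
    try:
        return RULES_BY_OUTPUT[target]
    except KeyError:
        raise MissingRuleError(f"No rule to produce {target}") from None
```

In this sketch, find_rule("isles.dat") succeeds, while find_rule("Isles.dat") or find_rule("wordcount.py") raises the error, because the lookup compares file names exactly.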

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, to do this. We will call it clean, as this is a common name for rules that delete auto-generated files, like our .dat files:

rule clean:
    shell: 'rm -f *.dat'

This is an example of a rule that has no inputs or outputs! We just want to remove the data files whether or not they exist. If we run Snakemake and specify this target,

$ snakemake clean

then we get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       clean
        1

rule clean:
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

An ls of our current directory reveals that all of our troublesome output files are now gone (as planned)!

We can add a similar rule to create all the data files. If we put it at the top of our Snakefile, it becomes the default target, which is built when no target is given to the snakemake command:

rule dats:
    input:
        'isles.dat',
        'abyss.dat'

This is an example of a rule that has dependencies that are targets of other rules. When Snakemake runs, it checks whether the dependencies exist and, if not, whether rules are available that will create them. If such rules exist, it invokes them first; otherwise Snakemake raises an error.

Dependencies

The order of rebuilding dependencies is arbitrary. You should not assume that they will be built in the order in which they are listed.

Dependencies must form a directed acyclic graph. A target cannot depend on a dependency which itself, or one of its dependencies, depends on that target.
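The acyclic requirement can be illustrated with Python's standard graphlib module (Python 3.9 or later). This is a sketch of the general idea, not of how Snakemake resolves jobs internally; the dependencies dictionary and the build_order helper are assumptions made for illustration, using the file names from this lesson.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Each target maps to the set of things it depends on.
dependencies = {
    "dats": {"isles.dat", "abyss.dat"},
    "isles.dat": {"books/isles.txt"},
    "abyss.dat": {"books/abyss.txt"},
}


def build_order(graph):
    """Return one valid build order, with dependencies before targets."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(dependencies)

# A graph in which two files depend on each other has no valid order.
cyclic = {"a.dat": {"b.dat"}, "b.dat": {"a.dat"}}
try:
    build_order(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The relative position of isles.dat and abyss.dat in order is not guaranteed, which mirrors the point that dependencies may be rebuilt in any order. Only the constraint that each file comes after its own dependencies holds.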

The dats rule is also an example of a rule that has no actions. It is used purely to trigger the build of its dependencies, if needed.

If we run,

$ snakemake dats

then Snakemake creates the data files:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1       count_words_abyss
        1       dats
        3

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 1

Finished job 1.
1 of 3 steps (33%) done

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 2

Finished job 2.
2 of 3 steps (67%) done

localrule dats:
    input: isles.dat, abyss.dat
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done

If we run dats again, then Snakemake will see that the dependencies (isles.dat and abyss.dat) are already up to date. Given that the target dats has no actions, there is nothing to be done:

$ snakemake dats

Nothing to be done

Our Snakefile now looks like this:

rule dats:
    input:
        'isles.dat',
        'abyss.dat'


# delete everything so we can re-run things
rule clean:
    shell: 'rm -f *.dat'


# count words in one of our "books"
rule count_words:
    input:    'books/isles.txt'
    output:   'isles.dat'
    shell:    'python wordcount.py books/isles.txt isles.dat'


rule count_words_abyss:
    input:    'books/abyss.txt'
    output:   'abyss.dat'
    shell:    'python wordcount.py books/abyss.txt abyss.dat'

The following figure shows a graph of the dependencies embodied within our Snakefile, involved in building the dats target:

Figure: Dependencies represented within the Snakefile

At this point, it becomes important to see what Snakemake is doing behind the scenes. What commands is Snakemake actually running? The -p option prints every shell command Snakemake is about to run. We can also perform a dry run with -n, which reports what would be done without executing anything. Combining the two prints the commands without running them. Very useful for debugging!

$ snakemake clean
$ snakemake -n -p isles.dat

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 0

python wordcount.py books/isles.txt isles.dat
Job counts:
	count	jobs
	1	count_words
	1
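The difference between printing commands and running them can be sketched in Python. This is a generic illustration of the dry-run pattern, not Snakemake's code: run_commands and its dry_run and print_commands parameters are hypothetical names that loosely mirror the -n and -p options.

```python
import subprocess


def run_commands(commands, dry_run=False, print_commands=False):
    """Run shell commands, optionally printing them or skipping execution.

    Returns the list of commands that were (or would have been) run.
    """
    planned = []
    for command in commands:
        if print_commands:
            # Show the command before it runs (loosely like -p).
            print(command)
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(command, shell=True, check=True)
        planned.append(command)
    return planned
```

Calling run_commands(["python wordcount.py books/isles.txt isles.dat"], dry_run=True, print_commands=True) would print the command and return it without creating isles.dat.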

Write Two New Rules

Write a new rule for last.dat, created from books/last.txt.

Update the dats rule with this target.

Write a new rule for results.txt, which creates the summary table. The rule needs to:

Depend upon each of the three .dat files.

Invoke the action python zipf_test.py abyss.dat isles.dat last.dat > results.txt.

Put this rule at the top of the Snakefile so that it is the default target.

Update clean so that it removes results.txt.

The following figure shows the dependencies embodied within our Snakefile, involved in building the results.txt target:

Figure: results.txt dependencies represented within the Snakefile

Key Points

Snakemake follows Python syntax

Rules can have inputs and/or outputs, and a command to be run.
