Wildcards
Overview
Teaching: 30 min
Exercises: 20 min
Questions
How can I abbreviate the rules in my pipeline?
Objectives
Use Snakemake wildcards to simplify our rules.
Understand that outputs depend not only on the input data files but also on the scripts or code.
After the exercise at the end of the previous episode, our Snakefile looked like this:
# Generate summary table
rule zipf_test:
    input:
        'isles.dat',
        'abyss.dat',
        'last.dat'
    output: 'results.txt'
    shell: 'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'

rule dats:
    input:
        'isles.dat',
        'abyss.dat',
        'last.dat'

# delete everything so we can re-run things
rule clean:
    shell: 'rm -f *.dat results.txt'

# Count words in one of the books
rule count_words:
    input: 'books/isles.txt'
    output: 'isles.dat'
    shell: 'python wordcount.py books/isles.txt isles.dat'

rule count_words_abyss:
    input: 'books/abyss.txt'
    output: 'abyss.dat'
    shell: 'python wordcount.py books/abyss.txt abyss.dat'

rule count_words_last:
    input: 'books/last.txt'
    output: 'last.dat'
    shell: 'python wordcount.py books/last.txt last.dat'
This has a lot of duplication. For example, the names of text files and data files are repeated in many places throughout the Snakefile. Snakefiles are a form of code and, in any code, repetition can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).
D.R.Y. (Don’t Repeat Yourself)
In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python, R, or Java, such as user-defined variables and functions, are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.
Let us set about removing some of the repetition from our Snakefile. In our
zipf_test rule we duplicate the names of the data files and the name of the
results file:
rule zipf_test:
    input: 'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell: 'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'
Looking at the results file name first, we can replace it in the action with
{output}:
rule zipf_test:
    input: 'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell: 'python zipf_test.py abyss.dat isles.dat last.dat > {output}'
{output} is a Snakemake wildcard which is equivalent to the
value we specified for the rule output. When Snakemake runs the rule, it
replaces {output} in the action with that value.
We can replace the dependencies in the action with {input}:
rule zipf_test:
    input: 'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell: 'python zipf_test.py {input} > {output}'
{input} is another wildcard which means ‘all the inputs of the current rule’.
Again, when Snakemake runs it will replace this variable with the actual inputs.
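The substitution described above can be imitated in plain Python. This is only a rough sketch of the idea, not Snakemake's actual implementation: the `SpaceJoined` class and the `render` function are hypothetical helpers invented for this illustration. It assumes that a list of inputs is rendered as the file names separated by spaces, which matches the command we wrote by hand earlier.

```python
class SpaceJoined(list):
    """A list that renders as its items separated by spaces (hypothetical helper)."""

    def __str__(self):
        return " ".join(self)


def render(template, inputs, output):
    """Fill the {input} and {output} placeholders of a shell template."""
    return template.format(input=SpaceJoined(inputs), output=output)


command = render(
    "python zipf_test.py {input} > {output}",
    inputs=["abyss.dat", "last.dat", "isles.dat"],
    output="results.txt",
)
print(command)
# prints: python zipf_test.py abyss.dat last.dat isles.dat > results.txt
```

Note that the inputs appear in the order in which they are listed in the rule, so changing the order of the `input:` entries changes the command that is run.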
Let’s update our text files and re-run our rule:
touch books/*.txt
snakemake -c 1 results.txt
We get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       count_words
    1       count_words_abyss
    1       count_words_last
    1       zipf_test
    4

rule count_words_last:
    input: books/last.txt
    output: last.dat
    jobid: 1

Finished job 1.
1 of 4 steps (25%) done

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
2 of 4 steps (50%) done

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 3

Finished job 3.
3 of 4 steps (75%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
4 of 4 steps (100%) done
Update Dependencies
What will happen if you now execute:
touch *.dat
snakemake -c 1 results.txt
1. nothing
2. all files recreated
3. only .dat files recreated
4. only results.txt recreated
Solution
4. Only results.txt recreated.
The rules for *.dat are not executed because their corresponding .txt files haven’t been modified.
If you run:
touch books/*.txt
snakemake -c 1 results.txt
you will find that the .dat files as well as results.txt are recreated.
As we saw, {input} means ‘all the dependencies of the current rule’. This
works well for zipf_test as its action treats all the dependencies the same:
as the input for the zipf_test.py script.
Time for you to update all the rules that build a .dat file to use the
{input} and {output} wildcards.
Rewrite .dat rules to use wildcards
Rewrite each .dat rule to use the {input} and {output} wildcards.
Solution
Only one rule is shown here, the others will have an identical action (the
shell: line):
rule count_words:
    input: 'books/isles.txt'
    output: 'isles.dat'
    shell: 'python wordcount.py {input} {output}'
Handling dependencies differently
For many rules, we will need to make finer distinctions between inputs. It is
not always appropriate to pass all inputs as a lump to your action. For example,
our rules for .dat use their first (and only) dependency specifically as the
input file to wordcount.py. If we add additional dependencies (as we will soon
do) then we don’t want these being passed as input files to wordcount.py: it
expects just one input file.
Let’s see this in action. We need to add wordcount.py as a dependency of each
of our data files so that the rules will be executed if the script changes. In
this case, we can use {input[0]} to refer to the first dependency, and
{input[1]} to refer to the second:
rule count_words:
    input: 'wordcount.py', 'books/isles.txt'
    output: 'isles.dat'
    shell: 'python {input[0]} {input[1]} {output}'
Alternatively, we can name our dependencies:
rule count_words_abyss:
    input:
        cmd='wordcount.py',
        book='books/abyss.txt'
    output: 'abyss.dat'
    shell: 'python {input.cmd} {input.book} {output}'
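The `{input[0]}` and `{input.cmd}` forms look just like the index and attribute lookups that Python's own `str.format` supports. The sketch below uses a hypothetical `Inputs` class, made up for this illustration and not part of Snakemake, to show how both styles of lookup could resolve to the same file names.

```python
class Inputs(list):
    """Hypothetical stand-in for a rule's inputs: a list with optional names."""

    def __init__(self, *positional, **named):
        super().__init__(positional)
        for name, path in named.items():
            self.append(path)          # named inputs are still part of the list
            setattr(self, name, path)  # ...and are also reachable as attributes

    def __str__(self):
        return " ".join(self)


# Indexed access, as in the count_words rule.
indexed = Inputs("wordcount.py", "books/isles.txt")
indexed_cmd = "python {input[0]} {input[1]} {output}".format(
    input=indexed, output="isles.dat"
)
print(indexed_cmd)
# prints: python wordcount.py books/isles.txt isles.dat

# Named access, as in the count_words_abyss rule.
named = Inputs(cmd="wordcount.py", book="books/abyss.txt")
named_cmd = "python {input.cmd} {input.book} {output}".format(
    input=named, output="abyss.dat"
)
print(named_cmd)
# prints: python wordcount.py books/abyss.txt abyss.dat
```

Named inputs tend to be easier to maintain, because reordering the entries under `input:` does not silently change which file each placeholder refers to.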
Let’s mark wordcount.py as updated, and re-run the pipeline:
touch wordcount.py
snakemake -c 1
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       count_words
    1       count_words_abyss
    1       zipf_test
    3

rule count_words_abyss:
    input: wordcount.py, books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done
Notice how last.dat (which does not depend on wordcount.py) is not
rebuilt.
Intuitively, we should also add wordcount.py as a dependency of
results.txt, as the final table should be rebuilt if we remake the .dat
files. However, it turns out we don’t have to! Let’s see what happens to
results.txt when we update wordcount.py:
touch wordcount.py
snakemake -c 1 results.txt
then we get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       count_words
    1       count_words_abyss
    1       zipf_test
    3

rule count_words_abyss:
    input: wordcount.py, books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done
The whole pipeline is triggered, even the creation of the results.txt file!
To understand this, note that according to the dependency graph, results.txt
depends on the .dat files. The update of wordcount.py triggers an update of
the *.dat files. Thus, Snakemake sees that the dependencies (the .dat
files) are newer than the target file (results.txt) and it therefore recreates
results.txt. This is an example of the power of Snakemake: updating a subset
of the files in the pipeline triggers rerunning the appropriate downstream
steps.
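The reasoning above can be written down as a small Python sketch. It is a deliberately simplified, timestamp-only model and not Snakemake's real scheduler, which may take more than modification times into account. The function name `stale_targets`, the `rules` dictionary, and the numeric timestamps are all assumptions made up for this illustration.

```python
def stale_targets(rules, mtimes, target):
    """Return the files that must be rebuilt to bring `target` up to date.

    rules  maps each output file to the list of files it depends on.
    mtimes maps each file to its modification time (any comparable number).
    """
    rebuilt = []

    def visit(name):
        deps = rules.get(name, [])  # source files have no dependencies
        # Bring the dependencies up to date first; a rebuilt dependency
        # always counts as newer than the file that depends on it.
        dep_rebuilt = [visit(dep) for dep in deps]
        newer = any(mtimes[dep] > mtimes[name] for dep in deps)
        if any(dep_rebuilt) or newer:
            if name not in rebuilt:
                rebuilt.append(name)
            return True
        return False

    visit(target)
    return rebuilt


# Dependency graph of the pipeline at this point in the lesson.
rules = {
    "results.txt": ["abyss.dat", "last.dat", "isles.dat"],
    "isles.dat": ["wordcount.py", "books/isles.txt"],
    "abyss.dat": ["wordcount.py", "books/abyss.txt"],
    "last.dat": ["books/last.txt"],
}

# Made-up timestamps: everything is up to date, then wordcount.py is touched.
mtimes = {
    "books/isles.txt": 1, "books/abyss.txt": 1, "books/last.txt": 1,
    "isles.dat": 2, "abyss.dat": 2, "last.dat": 2,
    "results.txt": 3,
    "wordcount.py": 4,  # newest file, as after `touch wordcount.py`
}

print(stale_targets(rules, mtimes, "results.txt"))
# prints: ['abyss.dat', 'isles.dat', 'results.txt']
```

As in the Snakemake output shown earlier, last.dat is left alone because nothing it depends on has changed, while results.txt is rebuilt only because two of its dependencies were.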
Updating One Input File
What will happen if you now execute:
touch books/last.txt
snakemake -c 1 results.txt
1. only last.dat is recreated
2. all .dat files are recreated
3. only last.dat and results.txt are recreated
4. all .dat and results.txt are recreated
Solution
3. only last.dat and results.txt are recreated
Update count_words_last to depend on wordcount.py
Use either indexed or named inputs.
Updating zipf_test rule
Add zipf_test.py as a dependency of results.txt. We haven’t yet covered the techniques required to do this with named wildcards so you will have to use indexing. Yes, this will be clunky, but we’ll fix that part later!
Remember that you can do a dry run with snakemake -n -p!
Solution
rule zipf_test:
    input: 'zipf_test.py', 'isles.dat', 'abyss.dat', 'last.dat'
    output: 'results.txt'
    shell: 'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'
Key Points
Use {output} to refer to the output of the current rule.
Use {input} to refer to the dependencies of the current rule.
You can use Python indexing to retrieve individual outputs and inputs (example: {input[0]}).
Wildcards can be named (example: {input.file1}).
Naming the code or scripts used by a rule as inputs ensures that the rule is executed if the code or script changes.