Basic syntax
Overview
Teaching: 15 min
Exercises: 15 min
Questions
Where do I start?
Objectives
Understand basic Python syntax and data types.
The most basic use of Python is to use it as a fancy calculator. It is very easy to do basic maths in Python.
print(5 + 1)
6
Note that we don’t always have to use the print()
statement.
Notice how leaving out print()
gives us the same result as above.
5 + 1
6
Python can do all of the normal basic maths operations you’d expect.
5 + 3
2 - 9
4 * 6
14 / 3
8
-7
24
4.666666666666667
You can also use it for more complicated operations, like exponentiation (**
):
5 ** 2
25
Python also supports floor division and remainder division.
Floor division (//
) gives the results of division, rounded down.
Remainder division (%
) gives the remainder after division.
5 // 2 # floor division
5 % 2 # remainder division
2
1
Python follows the normal order of operations for maths.
4 + 1 * 6
10
However, if you want Python to change the order it does things in, you can use parentheses to specify what to do first. Note that there is no limit to the number of parentheses you can use.
(4 + 1) * 6
30
Variables
Of course, we will probably want to save our answers at some point.
We can do this by assigning a variable.
In Python, a variable is a name for a saved result.
We can set them with the =
sign.
weight_kg = 55
If we want to retrieve the information we’ve stored, we can do it by simply typing the name of the variable again.
weight_kg
55
We can perform maths on variables the same way we would normally.
print('weight in pounds:', 2.2 * weight_kg)
weight in pounds: 121.00000000000001
As the example above shows, we can print several things at once by separating them with commas. Note that in this case, the number might appear as 121.00000000000001 due to the way numbers are internally represented in Python.
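If those trailing digits are distracting, one option (a small illustrative sketch, not something the lesson requires) is to round the result before printing it, using Python's built-in round() function:
# weight_kg was set to 55 above
# round to 2 decimal places before printing
print('weight in pounds:', round(2.2 * weight_kg, 2))
weight in pounds: 121.0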
We can also change a variable’s value by assigning it a new one:
weight_lb = 2.2 * weight_kg
print(weight_lb)
121.00000000000001
What happens when we change a variable?
Let’s update weight_kg
and see what happens to weight_lb
.
print('weight_kg starting value is', weight_kg)
weight_kg = 10000
print('after updating, weight_kg ending value is', weight_kg)
print('weight in lb ending value is', weight_lb)
weight_kg starting value is 55
after updating, weight_kg ending value is 10000
weight in lb ending value is 121.00000000000001
Notice how even though we changed the value of weight_kg
, weight_lb
did not
update.
This demonstrates a very important property of programming languages:
a computer will not do anything unless you specifically tell it to —
nothing ever happens automatically.
This is different from the behaviour of a spreadsheet,
where a cell will automatically update when the cells it refers to are updated.
If we want to tell Python to update weight_lb
to reflect the new value of
weight_kg
, we will need to perform this operation explicitly.
weight_lb = weight_kg * 2.2
print('new value for weight_lb is', weight_lb)
new value for weight_lb is 22000.0
One more thing to note: what we just did is the best way to learn Python. Don’t know how something works? Try it and find out!
Where are variables stored?
Your computer has two places where it stores information: hard disk and memory. What are they and what are they used for? Where do variables get stored?
Memory is where temporary information on your computer gets placed. It is very fast and easy to access, but has one important drawback: data here is erased when your program quits or your computer shuts down. All information you save as variables in Python will be stored in memory! When programming, we always need to save our data as a file (on our hard disk) if we want to keep it!
Your computer’s hard disk is used to store information long-term. This is where files get stored, and the information on your hard drive is more or less permanent. Hard drives can also store lots of data very cheaply — a terabyte of hard drive space is very cheap, whereas the same amount of memory costs a lot more. So if hard drive space is permanent and super-cheap, why don’t we use it to store all of our data? The biggest reason is speed — memory is typically hundreds, if not thousands of times faster to access. If we stored our variables to our hard disk, our programs would be incredibly slow!
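As a minimal sketch of what "saving to a file" can look like (the filename weight.txt is just an example, and this is not part of the lesson's exercises), we can write a variable's value to disk and read it back later:
weight_kg = 55
# write the value to a file on the hard disk
with open('weight.txt', 'w') as f:
    f.write(str(weight_kg))
# read it back in (file contents always come back as a string)
with open('weight.txt') as f:
    print(f.read())
55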
Errors
Of course, not everything will always work perfectly. We are going to run into errors. For instance, what happens if we accidentally don’t finish a command?
1 +
SyntaxError: invalid syntax (<ipython-input-15-70475fc083df>, line 1)
  File "<ipython-input-15-70475fc083df>", line 1
1 +
^
SyntaxError: invalid syntax
This is an error. Errors are good! When we do something that Python doesn’t like, it will give us an error message. These error messages are called tracebacks, and often tell us exactly how to fix our stuff!
Let’s walk through this error:
SyntaxError: invalid syntax
All errors have types.
This one is a SyntaxError
, indicating, well… an error in our syntax.
Syntax is “computer-speak” for how things are supposed to be typed.
Python only understands certain commands, and typos will mess it up.
If we type a command in such a way that Python can’t understand it, we need to
fix our syntax (make sure we’ve typed a valid command).
Takeaway message: We made an error when typing things.
File "<ipython-input-15-70475fc083df", line 1
1 +
^
Python is trying to be helpful and tell us exactly where our error occurred.
The first thing it does is tell us which file had the error in it.
Since we are using the terminal, it gives us the semi-confusing
<ipython-input-15-70475fc083df>
instead of a filename.
The line 1
bit tells us that our error was on line 1 of our last command.
Specifically, Python has printed out the offending line for us,
and pointed an arrow (^
) at the bad part.
Takeaway message: The error came right after we typed the +
sign.
Different types of data
Computers are not smart, and have to be explicitly told how to handle different types of data. Although a human might know that you can’t do maths on a word, our computer does not. To work around this problem, programming languages store different types of data in different ways.
For reasons that are likely obvious, we will need to store text differently than numbers. What is less obvious is that Python also has special ways of handling integers vs. decimals, Boolean values (True/False), and a special value used to indicate no data whatsoever.
Strings
We’ve already encountered three of these “data types”.
The first is strings, which are used to store text.
Strings are indicated with either single ('
) or double ("
) quotes.
To see what data type something is, we can use the type()
command.
It will print the data type of whatever is inside the parentheses.
type('this is a string')
type("this is also a string")
str
str
We can also make multiline strings with 3 of either set of quotes.
multiline = '''
This string
spans
multiple
lines
!!!!
'''
print(multiline)
type(multiline)
This string
spans
multiple
lines
!!!!
str
Python makes it very easy to work with basic text.
For instance, we can even use the +
sign to put strings together!
'some text' + 'MORE TEXT'
'repetition' * 3
'some textMORE TEXT'
'repetitionrepetitionrepetition'
Note that maths operations on strings will only work within reason. Attempting to add a string to a number doesn’t work!
'5' + 5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-f9dbf5f0b234> in <module>()
----> 1 '5' + 5
TypeError: Can't convert 'int' object to str implicitly
This error indicates that Python won’t convert between strings and integers on its own (it needs our help to do that)!
Numbers
Integers are used to store any whole number, either positive or negative. Any number without a decimal place is an int, or integer.
type(5)
type(-1000)
type(6 + -33)
int
int
int
But what happens when we perform a maths operation that would result in a decimal?
type(10 / 3)
float
Any operation that would result in a decimal being created converts the number to a “float”. Floats are used to represent decimals in Python. To explicitly set a number as a float, just add a decimal point.
type(1.3)
type(22.)
float
float
Other data types
Python has two special “Boolean” values to indicate whether or not something is
true or false.
Unsurprisingly, these are defined as True
and False
.
type(True)
type(False)
bool
bool
Finally, there is a special value called None
used to indicate no data.
We will revisit None
in more detail later, so for now, just be aware it
exists.
type(None)
NoneType
Converting between data types
Data often isn’t in the format you want it to be. For instance, we got an error earlier while attempting to perform addition between a string and a number ('5' + 5). What if we really needed to do that? Fortunately, Python makes it rather easy to convert between data types. Each data type has a function used to convert another piece of data. To convert a string to an integer, for instance, we can use the int() command:
print(int('5') + 5)
10
Likewise, we can use the following commands to convert data to other types:
str() - creates a string
int() - creates an integer
float() - creates a float
bool() - creates a Boolean
Using this information, see if you can fix the left side of these statements to equal the right side of each statement. Use only the commands shown above.
1 + '1' == '11'
'6' - 7 == -1
7.23 == 7
'5' == True
4 / 1.3 == 4
Data type conversion pitfalls
You may have noticed something weird when converting a float to an int in the last example. Is Python simply rounding floats to the nearest integer, or is it doing something else?
Key Points
Errors are there to help us.
Scripts and imports
Overview
Teaching: 15 min
Exercises: 15 min
Questions
What is a Python program?
Objectives
Explain what constitutes a Python program.
Import Python modules.
Everything we’ve learned so far is pretty cool. But what if we want to run a set of commands more than once? How do we write a program in Python?
Python programs are simply a set of Python commands saved in a file.
No compiling required!
To demonstrate this, let’s write our first program!
Enter the following text in a text editor and save it under any name you like
(Python files are typically given the extension .py
).
print('it works!!!')
We can now run this program in several ways.
If we were to open up a terminal in the folder where we had saved our program,
we could run the command python3 our-script-name.py
to run it.
it works!!!
What’s the point of print()?
We saw earlier that there was no difference between printing something with
print()
and just entering a command on the command line. But is this really the case? Is there a difference after all? Try executing the following code:
print('this involves print')
'this does not'
What gets printed if you execute this as a script? What gets printed if you execute things line by line? Using this information, what’s the point of
print()
?
import-ing things
IPython has a neat trick to run command line commands without exiting IPython.
Any command that begins with !
gets run on your computer’s command line, and
not the IPython terminal.
We can use this fact to run the command python3 our-script-name.py
.
I’ve called my script test.py
as an example.
!python3 test.py
it works!!!
What if we wanted to pass additional information to Python?
For example, what if we want Python to print whatever we type back at us?
To do this, we’ll need to use a bit of extra functionality:
the sys
package.
Python includes a lot of extra features in the form of packages,
but not all of them get loaded by default.
To access a package, we need to import
it.
import sys
You’ll notice that there’s no output.
Only one thing is changed:
We can now use the bonuses provided by the sys
package.
For now, all we will use is sys.argv
.
sys.argv
is a special variable
that stores any additional arguments we provide on the command-line
after python3 our-script-name.py
.
Let’s make a new script called command-args.py
to try this out.
import sys
print('we typed: ', sys.argv)
We can then execute this program with:
!python3 command-args.py word1 word2 3
we typed: ['command-args.py', 'word1', 'word2', '3']
You’ll notice that sys.argv
looks different from other data types we’ve seen
so far. sys.argv
is a list (more about this in the next session).
Key Points
To run a Python program, use
python3 program_name.py
.
Numpy arrays and lists
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do we store large amounts of data?
Objectives
Learn to use lists and Numpy arrays, and explain the difference between each.
At the end of the last lesson, we noticed that sys.argv
gave us a new data
structure: a list.
A list is a set of objects enclosed by a set of square brackets ([]
).
example = [1, 2, 4, 5]
example
[1, 2, 4, 5]
Note that a list can hold any type of item, even other lists!
example = [1, True, None, ["word", 123], "test"]
example
[1, True, None, ['word', 123], 'test']
We can get different pieces of a list via indexing. We add a set of square brackets after the list in question along with the index of the values we want. Note that in Python, all indices start from 0 — the first element is actually the 0th element (this is different from languages like R or MATLAB). The best way to think about array indices is that they are the number of offsets from the first position — the first element does not require an offset to get to.
A few examples of this in action:
# first element
example[0]
# second element
example[1]
# fetch the list inside the list
example[3]
1
True
['word', 123]
Note that we can index a range using the colon (:
) operator.
A colon by itself means fetch everything.
example[:]
[1, True, None, ['word', 123], 'test']
A colon on the right side of an index means everything after the specified index.
example[2:]
[None, ['word', 123], 'test']
A colon on the left side of an index means everything before, but not including, the index.
example[:2]
[1, True]
And if we use a negative index, it means get elements from the end, going backwards.
# last element
example[-1]
# everything except the last two elements
example[:-2]
'test'
[1, True, None]
Note that we can use the index multiple times to retrieve information from nested objects.
example[3][0]
'word'
If we index out of range, it is an error:
example[5]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-12-98429cb6526b> in <module>()
----> 1 example[5]
IndexError: list index out of range
We can also add two lists together to create a larger list.
[45, 2] + [3]
[45, 2, 3]
Lists as objects
Like other objects in Python, lists have a unique behaviour that can catch a lot of people off guard. What happens when we run the following code?
list1 = [1, 2, 3, 4]
list2 = list1
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
List 2 is: [1, 2, 3, 4, 5, 6, 7]
List 1 is: [1, 2, 3, 4, 5, 6, 7]
Modifying list2
actually modified list1
as well.
In Python, lists are objects.
Objects are not copied when we assign them to a new name (unlike in R).
This is an important optimisation, as we won’t accidentally fill up all of our
computer’s memory by renaming a variable a couple of times.
When we ran list2 = list1
, it just created a new name for list1
.
list1
still points at the same underlying object.
We can verify this with the id()
function.
id()
prints an object's unique identifier.
Two objects will not have the same ID unless they are the same object.
id(list1)
id(list2)
140319556870408
140319556870408
In order to create list2 as an independent copy of list1,
we have to use the .copy() method.
list1 = [1, 2, 3, 4]
list2 = list1.copy()
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
id(list2)
id(list1)
List 2 is: [1, 2, 3, 4, 5, 6, 7]
List 1 is: [1, 2, 3, 4]
140319554648072
140319554461896
.copy()
is a method.
Methods are special functions associated with an object and define what it can
do.
They always follow the syntax object.method(arg1, arg2) and have a predefined
number of arguments, many of which have default values. We can also specify just a
subset of the arguments by name, e.g. object.method(arg1, arg4=some_value).
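For example (a quick illustration using the built-in .sort() method), a keyword argument lets us override a method's default behaviour:
numbers = [3, 1, 2]
numbers.sort()              # default behaviour: ascending order
print(numbers)
numbers.sort(reverse=True)  # keyword argument overrides the default
print(numbers)
[1, 2, 3]
[3, 2, 1]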
Other frequently used methods of lists include .append()
:
list1.append(77)
[1, 2, 3, 4, 77]
# this adds a one-element list
list1.append([88])
[1, 2, 3, 4, 77, [88]]
And .extend()
(combines two lists, instead of adding the second list as an
element):
list1.extend([99, 88, 101])
[1, 2, 3, 4, 77, [88], 99, 88, 101]
And of course, .remove()
and .clear()
(both do exactly what you think they
should do):
list1.remove([88])
print(list1)
list1.clear()
print(list1)
[1, 2, 3, 4, 77, 99, 88, 101]
[]
Dynamic resizing of lists
Python’s lists are an extremely optimised data structure. Unlike R’s vectors, there is no time penalty to continuously adding elements to a list. You never need to pre-allocate a list at a certain size for performance reasons.
Iterating through lists
We’ll very frequently want to iterate over lists and perform an operation with every element. We do this using a for loop.
A for loop generally looks like the following:
for variable in things_to_iterate_over:
    do_stuff_with(variable)
An example of an actually functioning for loop is shown below:
for i in range(10):
    print(i)
0
1
2
3
4
5
6
7
8
9
In this case we are iterating over the values provided by range()
.
range()
is a special generator function we can use to provide
a sequence of numbers.
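range() can also take optional start, stop, and step values; a couple of quick examples (wrapped in list() so we can see the numbers it generates):
list(range(5))         # 0 up to (but not including) 5
list(range(2, 10, 2))  # from 2 up to 10, counting by 2
[0, 1, 2, 3, 4]
[2, 4, 6, 8]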
We can also iterate over a list, or any collection of elements:
for element in ['a', True, None]:
    print(type(element))
<class 'str'>
<class 'bool'>
<class 'NoneType'>
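Putting for loops and list methods together (a short illustrative sketch), we can build up a new list one element at a time with .append():
squares = []
for i in range(5):
    squares.append(i ** 2)  # grow the list as we go
print(squares)
[0, 1, 4, 9, 16]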
Vectorised operations with Numpy
Numpy is a numerical library designed to make working with numbers easier than it would otherwise be.
For example, say we had a list of a thousand numbers. There’s no way to do vector maths without iterating through all the elements!
vals = list(range(1000))
new_vals = vals.copy()
print(new_vals[:5])
for idx in range(1000):
    new_vals[idx] += 10
print(new_vals[:5])
[0, 1, 2, 3, 4]
[10, 11, 12, 13, 14]
That was a lot of work.
Numpy lets us do vector maths like in R, saving us a lot of effort.
The most basic function is np.array()
which creates a numerical
array from a list.
A numpy array is a collection of numbers that can have any number of
dimensions.
In this case, there is only one dimension, since we created the array from a
list.
import numpy as np
new_vals = np.array(vals)
new_vals += 10
new_vals[:5]
array([10, 11, 12, 13, 14])
One very nice thing about Numpy is that it’s much more performant than ordinary
Python lists.
A nice trick we can use with IPython to measure execution times is the
%timeit
magic function.
Anything following the %timeit
gets measured for speed.
Adding %%
to the timeit
command instead of %
means that timeit
is run
on the entire cell, not just a single line. Note that %%timeit
must be on the
first line of an IPython/Jupyter cell for it to work, whereas the %timeit
command can be used anywhere.
Using Python’s lists:
%%timeit
for idx in range(1000):
    vals[idx] + 10
10000 loops, best of 3: 165 µs per loop
Using numpy:
%timeit new_vals + 10
The slowest run took 22.13 times longer than the fastest.
This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.63 µs per loop
Numpy was about 100x faster, though %timeit
did mention that Numpy could be
cheating a bit.
Even in Numpy’s worst case scenario however, it still ran 5x faster than using
Python’s basic lists.
Working with multiple dimensions
Sometimes, you’ll encounter a dataset with multiple dimensions and will need to be able to retrieve elements from it as such.
arr2d = np.arange(0, 40) # sequence of numbers from 0 to 39
arr2d = arr2d.reshape([5, 8]) # reshape so it has 5 rows and 8 columns
arr2d
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39]])
In this case, we must index using multiple indices, separated by a comma.
To grab the first element, we would use [0, 0]
arr2d[0, 0]
0
The first index corresponds to rows, the second to columns, and any further indices to the remaining dimensions.
arr2d[0, :]
arr2d[:, 0]
array([0, 1, 2, 3, 4, 5, 6, 7])
array([ 0, 8, 16, 24, 32])
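We can also combine row and column slices to pull out a rectangular block (an illustrative slice, not the answer to the exercise below):
arr2d[1:3, 2:5]  # rows 1-2, columns 2-4
array([[10, 11, 12],
       [18, 19, 20]])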
Practising indexing
Retrieve everything defined in the range of rows 4-5 and columns 1-4.
Key Points
Lists store a sequence of elements.
Numpy allows vector maths in Python.
Storing data with dicts
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do I store structured data?
Objectives
Be able to store data using Python’s dict objects.
Dictionaries (also called dicts) are another key data structure we’ll need to use to write a pipeline. In particular, dicts allow efficient key-value storage of any type of data.
To create a dict, we use syntax like the following.
example = {}
type(example)
dict
We can then store values in our dict using indexing. The index is referred to as the “key”, and the stored data is referred to as the “value”.
example['key'] = 'value'
example['key']
'value'
In addition, both keys and values can be of almost any type. Let's add several more entries to demonstrate this.
example[1] = 2
example[4] = False
example['test'] = 5
example[7] = 'myvalue'
To retrieve all keys in the dictionary, we can use the .keys()
method.
Note how we used the list()
function to turn our resulting output into a
list.
list(example.keys())
['key', 1, 4, 'test', 7]
Likewise, we can retrieve all the values at once, using .values()
list(example.values())
['value', 2, False, 5, 'myvalue']
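We can also loop over a dict directly. Iterating over a dict gives us its keys, and each key can be used to look up the matching value (a short sketch using the dict above; on Python 3.7+ the keys come back in the order they were added):
for key in example:
    print(key, '->', example[key])
key -> value
1 -> 2
4 -> False
test -> 5
7 -> myvalue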
Dictionary order
Note that the order of keys and values in a dictionary should not be relied upon. We'll create a dictionary another way to demonstrate this:
unordered = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
Depending on your version of Python, the dictionary will either be in order or out of order. If you are on Python 3.6+, dictionaries preserve the order in which keys were added.
Iterate through and print the dictionary’s keys in both forward and reverse order.
(To iterate through the dict in a specific order, you will need to sort the keys using the
sorted()
function.)
Key Points
Dicts provide key-value storage of information.
Functions and Conditions
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do I write functions?
Objectives
Be able to write our own functions and use basic functional programming constructs like
map()
and filter()
.
Of course, at some point, we are going to want to define our own functions rather than just use the ones provided by Python and its various modules.
The general syntax for defining a function is as follows:
def function(arg1):
    # do stuff with arg1
    return answer
So, an example function that adds two numbers together might look a little like this:
def adder(x, y):
    return x + y
adder(1, 2)
3
We can also add a default argument (say if we wanted y to be equal to 10 unless we otherwise specified), by using an equals sign and a default value in our function definition:
def adder(x, y=10):
    return x + y
adder(5)
15
Practice defining functions
Define a function that converts from temperatures in Fahrenheit to temperatures in Kelvin, and another function that converts back again.
The general formula for the conversion from Fahrenheit to Kelvin is:
kelvin = (fahr - 32) * 5 / 9 + 273.15
Conditional statements
We may also need to have our functions do specific things in some conditions, but not in others. This relies upon comparisons between items:
In Python, comparison is done using the ==
operator:
True == True
True == False
'words' == 'words'
True
False
True
not
indicates the opposite of True or False, and !=
means not equal to.
not True == False
True != False
True
True
As with other programming languages, we can make the usual comparisons with the
>
and <
operators.
Adding an equals sign (>=
, <=
) indicates greater than or equal to
or less than or equal to, respectively.
5 < 10
5 > 10
-4 >= -4
1 <= 2
True
False
True
True
These statements can be combined with the if
statement to produce code that
executes at various times.
number = 5
if number <= 10:
    print('number was less than 10')
number was less than 10
If the condition following the if
statement does not evaluate to True,
the indented code does not execute:
number = 11
if number <= 10:
    print('number was less than 10')
However, we can add code to execute when the if
condition is not met by
adding an else
statement.
number = 11
if number <= 10:
    print('number was less than 10')
else:
    print('number was greater than 10')
number was greater than 10
And if we want to check an additional statement,
we can use the elif
keyword (else-if):
number = 10
if number < 10:
    print('number was less than 10')
elif number == 10:
    print('number was equal to 10')
else:
    print('number was greater than 10')
One final note, to check if a value is equal to None
in Python
we must use is None
and is not None
.
Normal ==
comparisons are not reliable here (and are discouraged by Python's style guide).
None is None
5 is not None
True
True
Additionally, we can check if one value is in another set of values with the
in
operator:
5 in [4, 5, 6]
43 in [4, 5, 6]
True
False
map(), filter(), and anonymous (lambda) functions
Python has good support for functional programming, and has its own equivalents for map/reduce-style functionality. To “map” a function means to apply it to a set of elements. To “reduce” means to collapse a set of values to a single value. Finally, “filtering” means returning only a set of elements where a certain value is true.
Let’s explore what that means with our own functions. The syntax of map/reduce/filter is identical:
map(function, thing_to_iterate_over, next_thing_to_iterate_over)
Let’s apply this to a few test cases using map. Note that when selecting which function we are going to “map” with, we pass just the function's name, without parentheses, so that the function is not called right away.
import math
values = [0, 1, 2, 3, 4, 5, 6]
map(math.sin, values)
<map object at 0x7f31c246cba8>
To retrieve the actual values, we typically need to make the resulting output a list.
list(map(math.sin, values))
[0.0,
0.8414709848078965,
0.9092974268256817,
0.1411200080598672,
-0.7568024953079282,
-0.9589242746631385,
-0.27941549819892586]
filter()
applies a similar operation,
but instead of applying a function to every piece,
it only returns points where a function returns true.
def less_than_3(val):
    return val < 3
list(filter(less_than_3, values))
[0, 1, 2]
That was very inconvenient. We had to define an entire function just to use it once. The solution is to write a one-time-use function that has no name. Such functions are called either anonymous functions or lambda functions (both mean the same thing).
To define a lambda function in Python, the general syntax is as follows:
lambda x: x + 54
In this case, lambda x:
indicates we are defining a lambda function with a
single argument, x
.
Everything following the :
is our function.
Whatever value this evaluates to is automatically returned.
So lambda x: x + 54
equates to:
def some_func(x):
    return x + 54
Rewriting our filter statement to use a lambda function:
list(filter(lambda x: x < 3, values))
[0, 1, 2]
And a side-by-side example that demonstrates the difference between map()
and
filter()
.
list(map(lambda x: x+100, [1,2,3,4,5]))
list(filter(lambda x: x<3, [1,2,3,4,5]))
[101, 102, 103, 104, 105] # map()
[1, 2] # filter()
Using lambdas in practice
Add
'-cheesecake'
to every word in the following list usingmap()
.
['new york', 'chocolate', 'new york', 'ketchup', 'mayo']
Using
filter()
, remove the items which would be absolutely terrible to eat.
map/filter style functionality with Numpy arrays
Although you could use a for-loop to apply a custom function to a Numpy array
element by element, there is a handy np.vectorize() function you can use to
convert your functions to a vectorised Numpy equivalent that operates on the
whole array in a single go.
Note that this is purely for convenience — this uses a for-loop
Note that this is purely for convenience — this uses a for-loop
internally.
import numpy as np
# create a function to perform cubes of a number
vector_cube = np.vectorize(lambda x: x ** 3)
vector_cube(np.array([1, 2, 3, 4, 5]))
array([ 1, 8, 27, 64, 125])
To perform an operation similar to filter()
,
you can actually specify a conditional statement inside the []
when indexing a Numpy array.
arr = np.array([1, 2, 3, 4, 5])
arr[arr >= 3]
Removing np.nan values
Remove all of the
np.nan
values from the following sequence using logical indexing.
np.array([np.nan, np.nan, 2, 3, 4, np.nan])
Key Points
map()
applies a function to every object in a data structure.
filter()
returns only the data objects for which some condition is true.
Introduction to parallel computing
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do I run code in parallel?
Objectives
Understand how to run parallel code with
multiprocessing
.
The primary goal of these lesson materials is to accelerate your workflows by executing them in a massively parallel (and reproducible!) manner. Of course, what does this actually mean?
The basic concept of parallel computing is simple to understand: we divide our job into tasks that can be executed at the same time, so that we finish the job in a fraction of the time that it would have taken if the tasks were executed one by one. There are a lot of different ways of parallelizing things, however, and we need to cover these concepts before running our workflows in parallel.
Let’s start with an analogy: suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem into 4 different tasks: paint each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.
Concurrent vs. parallel execution
If there is only one painter, they could work for a while in one wall, then start painting another one, then work for a little bit in the third one, and so on. The tasks are being executed concurrently but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel.
In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited.
Synchronous vs. asynchronous execution
Now imagine that all workers have to obtain their paint from a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced.
In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous.
Distributed vs. shared memory
Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete its task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, on different houses in the city, and different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A needs a colour that is only available in the dispenser of worker B — worker A should request the paint from worker B, and worker B should respond by sending the required colour.
Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a coarse-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independently of each other (no communications required).
Processes vs. threads
Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter.
In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously.
How does parallelization work in practice?
These concepts translate into several different types of parallel computing, each good at certain types of tasks:
Asynchronous programming
Oftentimes, certain computations involve a lot of waiting. Perhaps you sent some information to a webserver on the internet and are waiting for a response. In this case, if you needed to make lots of requests over the internet, your program would spend ages just waiting to hear back. In this scenario, it would be very advantageous to fire off a bunch of requests to the internet, and then instead of waiting on each one, check back periodically to see if the request has completed before processing each request individually.
This is an example of asynchronous programming. One thread executes many tasks at the same time, periodically checking on each one, and only taking an action once some external task has completed. Asynchronous programming is very important when programming for the web, where lots of waiting around happens. To do this in Python, you’d typically want to use something like the asyncio module. It’s not very useful for scientific programming, because only one core/thread is typically doing any work — a normal program that doesn’t run in parallel at all would be just as fast!
Shared memory programming
Shared memory programming means using the resources on a single computer,
and having multiple threads or processes work together on a single copy of a
dataset in memory.
This is the most common form of parallel programming and is relatively easy to
do.
We will cover basic shared-memory programming in Python using the
multiprocess
/ multiprocessing
packages in this lesson.
Distributed memory programming
Shared memory programming, although very useful, has one major limitation: we can only use the number of CPU cores present on a single computer. If we want to increase speed any more, we need a better computer. Big computers cost lots and lots of money. Wouldn’t it be more efficient to just use a lot of smaller, cheap computers instead?
This is the rationale behind distributed memory programming — a task is farmed out to a large number of computers, each of which tackles an individual portion of a problem. Results are communicated back and forth between compute nodes.
This is most advantageous when a dataset is too large to fit into a computer’s memory (depending on the hardware you have access to this can be anything from several dozen gigabytes, to several terabytes). Frameworks like MPI, Hadoop, and Spark see widespread use for these types of problems (and are not covered in this lesson).
Serial farming
In many cases, we’ll need to repeat the same computation multiple times. Maybe we need to run the same set of steps on 10 different samples. There doesn’t need to be any communication at all, and each task is completely independent of the others.
In this scenario, why bother with all of these fancy parallel programming techniques, let’s just start the same program 10 times on 10 different datasets on 10 different computers. The work is still happening in parallel, and we didn’t need to change anything about our program to achieve this. As an extra benefit, this works the same for every program, regardless of what it does or what language it was written in.
This technique is known as serial farming, and is the primary focus of this lesson. We will learn to use Snakemake to coordinate the parallel launch of dozens, if not hundreds or thousands of independent tasks.
Parallelization in Python
Python does not thread very well.
Specifically, Python has a very nasty drawback known as a Global Interpreter
Lock (GIL).
The GIL ensures that only one compute thread can run at a time.
This makes multithreaded processing very difficult.
Instead, the best way to go about doing things is to use multiple independent
processes to perform the computations.
This method skips the GIL,
as each individual process has its own GIL that does not block the others.
This is typically done using the multiprocessing
module.
Before we start, we will need the number of CPU cores in our computer.
To get the number of cores in our computer, we can use the psutil
module.
We are using psutil
instead of multiprocessing
because psutil
counts
cores instead of threads.
Long story short, cores are the actual computation units,
while threads allow additional multitasking using the cores you have.
For heavy compute jobs, you are generally interested in cores.
import psutil
# logical=True counts threads, but we are interested in cores
psutil.cpu_count(logical=False)
8
Using this number, we can create a pool of worker processes with which to parallelize our jobs:
from multiprocessing import Pool
pool = Pool(psutil.cpu_count(logical=False))
The pool
object gives us a set of parallel workers we can
use to parallelize our calculations.
In particular, there is a map function
(with identical syntax to the map()
function used earlier),
that runs a workflow in parallel.
Let’s try map()
out with a test function that just runs sleep.
import time
def sleeping(arg):
    time.sleep(0.1)
%timeit list(map(sleeping, range(24)))
1 loop, best of 3: 2.4 s per loop
Now let’s try it in parallel:
%timeit pool.map(sleeping, range(24))
If you are using a Jupyter notebook, this will fail:
# more errors omitted
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>
Differences between Jupyter notebooks and the Python interpreter
The last command may have succeeded if you are running in a Python or IPython shell. This is due to a difference in the way Jupyter executes user-defined functions:
1 loop, best of 3: 302 ms per loop
Jupyter notebooks define user functions under a special Python module called
__main__
. This does not work with multiprocessing
. However, these issues are not limited to Jupyter notebooks — a similar error will occur if you use a lambda function instead:
pool.map(lambda x: time.sleep(0.1), range(24))
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-df8237b4b421> in <module>()
----> 1 pool.map(lambda x: time.sleep(0.1), range(24))
# more errors omitted
The multiprocessing
module has a major limitation:
it only accepts certain functions, and in certain situations.
For instance, any class methods, lambdas, or functions defined in __main__
won't work.
This is due to the way Python “pickles” (read: serialises) data
and sends it to the worker processes.
“Pickling” simply can’t handle a lot of different types of Python objects.
Fortunately, there is a fork of the multiprocessing
module called
multiprocess
that works just fine (pip install --user multiprocess
).
multiprocess
uses dill
instead of pickle
to serialise Python objects
(read: send your data and functions to the Python workers),
and does not suffer the same issues.
Usage is identical:
# shut down the old workers
pool.close()
from multiprocess import Pool
pool = Pool(8)
%timeit pool.map(lambda x: time.sleep(0.1), range(24))
pool.close()
1 loop, best of 3: 309 ms per loop
This is a general purpose parallelization recipe that you can use for your Python projects.
# make sure to always use multiprocess
from multiprocess import Pool
# start your parallel workers at the beginning of your script
pool = Pool(number_of_cores)
# execute a computation(s) in parallel
result = pool.map(your_function, something_to_iterate_over)
result2 = pool.map(another_function, more_stuff_to_iterate_over)
# turn off your parallel workers at the end of your script
pool.close()
Parallel workers (with their own copy of everything) are created, data are sent
to these workers, and then results are combined back together again.
There is also an optional chunksize
argument (for pool.map()
) that lets you
control how big each chunk of data is before it’s sent off to each worker.
A larger chunk size means that less time is spent shuttling data to and from
workers, and will be more useful if you have a large number of very fast
computations to perform.
When each iteration takes a very long time to run, you will want to use a
smaller chunk size.
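As an illustrative sketch (the worker count and function are made up for the example), using chunksize with pool.map() looks like this:
from multiprocess import Pool

def square(x):
    return x ** 2

pool = Pool(4)
# send work to the workers in batches of 100 items at a time
result = pool.map(square, range(10000), chunksize=100)
pool.close()
print(result[:5])
[0, 1, 4, 9, 16]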
Key Points
Pool.map()
will perform an operation in parallel.
Introduction to Snakemake
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I make my results easier to reproduce?
Objectives
Understand our example problem.
Let’s imagine that we’re interested in seeing the frequency of various words in various books.
We’ve compiled our raw data (i.e. the books we want to analyse) and have prepared several Python scripts that together make up our analysis pipeline.
Let’s take a quick look at one of the books using the command
$ head books/isles.txt
By default, head
displays the first 10 lines of the specified file.
A JOURNEY TO THE WESTERN ISLANDS OF SCOTLAND
INCH KEITH
I had desired to visit the Hebrides, or Western Islands of Scotland, so
long, that I scarcely remember how the wish was originally excited; and
was in the Autumn of the year 1773 induced to undertake the journey, by
finding in Mr. Boswell a companion, whose acuteness would help my
Our directory has the Python scripts and data files we will be working with:
|- books
| |- abyss.txt
| |- isles.txt
| |- last.txt
| |- LICENSE_TEXTS.md
| |- sierra.txt
|- plotcount.py
|- wordcount.py
|- zipf_test.py
The first step is to count the frequency of each word in a book.
The first argument (books/isles.txt
) to wordcount.py is the file to analyse,
and the last argument (isles.dat
) specifies the output file to write.
$ python wordcount.py books/isles.txt isles.dat
Let’s take a quick peek at the result.
$ head -5 isles.dat
This shows us the top 5 lines in the output file:
the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181
We can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.
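Although we won't modify wordcount.py in this lesson, the counting it performs might look roughly like the following sketch (assuming simple whitespace splitting, which is not exactly what wordcount.py does):
from collections import Counter

with open('books/isles.txt') as f:
    words = f.read().lower().split()

counts = Counter(words)
total = len(words)

# show the five most common words, their counts, and their share of all words
for word, count in counts.most_common(5):
    print(word, count, 100 * count / total)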
We can do the same thing for a different book:
$ python wordcount.py books/abyss.txt abyss.dat
$ head -5 abyss.dat
the 4044 6.35449402891
and 2807 4.41074795726
of 1907 2.99654305468
a 1594 2.50471401634
to 1515 2.38057825267
Let’s visualise the results.
The script plotcount.py
reads in a data file and plots the 10 most
frequently occurring words as a text-based bar plot:
$ python plotcount.py isles.dat ascii
the ########################################################################
of ##############################################
and ################################
to ############################
a #########################
in ###################
is #################
that ############
by ###########
it ###########
plotcount.py
can also show the plot graphically:
$ python plotcount.py isles.dat show
Close the window to exit the plot.
plotcount.py
can also create the plot as an image file (e.g. a PNG file):
$ python plotcount.py isles.dat isles.png
Finally, let’s test Zipf’s law for these books:
$ python zipf_test.py abyss.dat isles.dat
Book First Second Ratio
abyss 4044 2807 1.44
isles 3822 2460 1.55
Zipf’s Law
Zipf’s Law is an empirical law formulated using mathematical statistics that refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions.
Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. For example, in the Brown Corpus of American English text, the word the is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word of accounts for slightly over 3.5% of words (36,411 occurrences), followed by and (28,852). Only 135 vocabulary items are needed to account for half the Corpus.
Source: Wikipedia
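The Ratio column in the zipf_test.py output above compares the count of the most frequent word with the count of the second most frequent word (an ideal Zipf distribution would give a ratio of about 2). A rough sketch of that calculation, assuming the .dat format shown earlier:
# read the first two lines of a .dat file and compare the word counts
with open('isles.dat') as f:
    first = int(f.readline().split()[1])
    second = int(f.readline().split()[1])

print('Ratio:', round(first / second, 2))
Ratio: 1.55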
Together these scripts implement a common workflow:
- Read a data file.
- Perform an analysis on this data file.
- Write the analysis results to a new file.
- Plot a graph of the analysis results.
- Save the graph as an image, so we can put it in a paper.
- Make a summary table of the analyses.
Running wordcount.py
and plotcount.py
at the shell prompt, as we
have been doing, is fine for one or two files. If, however, we had 5
or 10 or 20 text files,
or if the number of steps in the pipeline were to expand, this could turn into
a lot of work.
Plus, no one wants to sit and wait for a command to finish, even just for 30
seconds.
The most common solution to the tedium of data processing is to write a shell script that runs the whole pipeline from start to finish.
Using your text editor of choice (e.g. nano), add the following to a new file
named run_pipeline.sh
.
# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table for the Zipf's law tests
python wordcount.py books/isles.txt isles.dat
python wordcount.py books/abyss.txt abyss.dat
python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png
# Generate summary table
python zipf_test.py abyss.dat isles.dat > results.txt
Run the script and check that the output is the same as before:
$ bash run_pipeline.sh
$ cat results.txt
This shell script solves several problems in computational reproducibility:
- It explicitly documents our pipeline, making communication with colleagues (and our future selves) more efficient.
- It allows us to type a single command,
bash run_pipeline.sh
, to reproduce the full analysis.
- It prevents us from repeating typos or mistakes. You might not get it right the first time, but once you fix something it’ll stay fixed.
Despite these benefits it has a few shortcomings.
Let’s adjust the width of the bars in our plot produced by plotcount.py
.
Edit plotcount.py
so that the bars are 0.8 units wide instead of 1 unit.
(Hint: replace width = 1.0
with width = 0.8
in the definition of
plot_word_counts
.)
Now we want to recreate our figures.
We could just bash run_pipeline.sh
again.
That would work, but it could also be a big pain if counting words takes
more than a few seconds.
The word counting routine hasn’t changed; we shouldn’t need to recreate
those files.
Alternatively, we could manually rerun the plotting for each word-count file. (Experienced shell scripters can make this easier on themselves using a for-loop.)
$ for book in abyss isles; do python plotcount.py $book.dat $book.png; done
With this approach, however, we don’t get many of the benefits of having a shell script in the first place.
Another popular option is to comment out a subset of the lines in
run_pipeline.sh
:
# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table
# These lines are commented out because they don't need to be rerun.
#python wordcount.py books/isles.txt isles.dat
#python wordcount.py books/abyss.txt abyss.dat
python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png
# This line is also commented out because it doesn't need to be rerun.
# python zipf_test.py abyss.dat isles.dat > results.txt
Then, we would run our modified shell script using bash run_pipeline.sh
.
But commenting out these lines, and subsequently un-commenting them, can be a hassle and source of errors in complicated pipelines. What happens if we have hundreds of input files? No one wants to enter the same command a hundred times, and then edit the result.
What we really want is an executable description of our pipeline that allows software to do the tricky part for us: figuring out what tasks need to be run where and when, then performing those tasks for us.
What is Snakemake and why are we using it?
There are many different tools that researchers use to automate this type of work. Snakemake is a very popular tool, and the one we have selected for this tutorial. There are several reasons this tool was chosen:
- It’s free, open-source, and installs in about 5 seconds flat via pip.
- Snakemake works cross-platform (Windows, MacOS, Linux) and is compatible with all HPC schedulers. More importantly, the same workflow will work and scale appropriately regardless of whether it’s on a laptop or cluster without modification.
- Snakemake uses pure Python syntax. There is no tool-specific language to learn like in GNU Make, NextFlow, WDL, etc. Even if students end up not liking Snakemake, you’ve still taught them how to program in Python at the end of the day.
- Anything that you can do in Python, you can do with Snakemake (since you can pretty much execute arbitrary Python code anywhere).
- Snakemake was written to be as similar to GNU Make as possible. Users already familiar with Make will find Snakemake quite easy to use.
- It’s easy. You can (hopefully!) learn Snakemake in an afternoon!
The rest of these lessons aim to teach you how to use Snakemake by example. Our goal is to automate our example workflow, and have it do everything for us in parallel regardless of where and how it is run (and have it be reproducible!).
Key Points
Bash scripts are not an efficient way of storing a workflow.
Snakemake is one method of managing a complex computational workflow.
Snakefiles
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do I write a simple workflow?
Objectives
Understand the components of a Snakefile: rules, inputs, outputs, and actions.
Write a simple Snakefile.
Run Snakemake from the shell.
Create a file, called Snakefile
, with no file extension and containing the
following content:
# Count words.
rule count_words:
    input: 'books/isles.txt'
    output: 'isles.dat'
    shell: 'python wordcount.py books/isles.txt isles.dat'
This is a build file, which for
Snakemake is called a Snakefile —
a file executed by Snakemake. Note that aside from a few keyword additions like
rule
, it follows standard Python 3 syntax.
Let us go through each line in turn:
- # denotes a comment. Any text from # to the end of the line is ignored by Snakemake.
- isles.dat is a target, a file to be created, or built. In Snakemake, these are called “outputs”, for simplicity’s sake.
- books/isles.txt is a dependency, a file that is needed to build or update the target. Targets can have zero or more dependencies. Dependencies in Snakemake are called “inputs”.
- python wordcount.py books/isles.txt isles.dat is an action, a command to run to build or update the target using the dependencies. In this case the action is a set of shell commands (we can also use Python code… more on that later).
- Like Python, you can use either tabs or spaces for indentation — don’t use both!
- Together, the target, dependencies, and actions form a rule. A rule is a recipe for how to make things.
Our rule above describes how to build the target isles.dat
using the
action python wordcount.py
and the dependency books/isles.txt
.
Information that was implicit in our shell script - that we are
generating a file called isles.dat
and that creating this file
requires books/isles.txt
- is now made explicit by Snakemake’s syntax.
Let’s first ensure we start from scratch and delete the .dat
and .png
files we created earlier:
$ rm *.dat *.png
By default, Snakemake looks for a file called Snakefile
, and we can
run Snakemake as follows:
$ snakemake
By default, Snakemake tells us what it’s doing as it executes actions:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1
rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Depending on your setup, you may receive an error
Error: you need to specify the maximum number of CPU cores to be used at the
same time with --cores.
This can be fixed by passing an argument to the snakemake command.
Try running the following:
$ snakemake --cores 1
If you see a different error, check your syntax and the filepaths of the files
in your Snakefile.
You can check your present working directory using the command pwd
.
Remember, aside from stuff like rule
and input
,
Snakemake follows Python syntax.
Let’s see if we got what we expected:
$ head -5 isles.dat
The first 5 lines of isles.dat
should look exactly like before.
Snakefiles Do Not Have to be Called
Snakefile
We don’t have to call our Snakefile
Snakefile
. However, if we call it something else we need to tell Snakemake where to find it. This we can do using the -s
flag. For example, if our Snakefile is named MyOtherSnakefile:
$ snakemake -s MyOtherSnakefile
When we re-run our Snakefile, Snakemake now informs us that:
Nothing to be done.
This is because our target, isles.dat
, has now been created, and
Snakemake will not create it again. To see how this works, let’s pretend to
update one of the text files. Rather than opening the file in an
editor, we can use the shell touch
command to update its timestamp
(which would happen if we did edit the file):
$ touch books/isles.txt
If we compare the timestamps of books/isles.txt
and isles.dat
,
$ ls -l books/isles.txt isles.dat
then we see that isles.dat
, the target, is now older
than books/isles.txt
, its dependency:
-rw-r--r-- 1 mjj Administ 323972 Jun 12 10:35 books/isles.txt
-rw-r--r-- 1 mjj Administ 182273 Jun 12 09:58 isles.dat
If we run Snakemake again,
$ snakemake
then it recreates isles.dat
:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1
rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
When it is asked to build a target, Snakemake checks the “last modification time” of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
Snakefiles as Documentation
By explicitly recording the inputs to and outputs from steps in our analysis and the dependencies between files, Snakefiles act as a type of documentation, reducing the number of things we have to remember.
Let’s add another rule to the end of Snakefile
.
Note that rules cannot have the same name,
so we’ll call this one count_words_abyss
.
rule count_words_abyss:
    input: 'books/abyss.txt'
    output: 'abyss.dat'
    shell: 'python wordcount.py books/abyss.txt abyss.dat'
If we run Snakemake,
$ snakemake
then we get:
Nothing to be done.
Nothing happens because Snakemake attempts to build the first target it
finds in the Snakefile, the default target,
which is isles.dat
, which is already up-to-date.
We need to explicitly tell Snakemake we want to build abyss.dat
:
$ snakemake abyss.dat
Now, we get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words_abyss
1
rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
“Up to Date” Versus “Nothing to be Done”
If we ask Snakemake to build a file that already exists and is up to date, then Snakemake informs us that:
Nothing to be done
If we ask Snakemake to build a file that exists but for which there is no rule in our Snakefile, then we get a message like:
$ snakemake wordcount.py
MissingRuleException: No rule to produce wordcount.py (if you use input functions make sure that they don't raise unexpected exceptions).
When we see this error, double-check that you have a rule to produce that file, and also that the filename has been specified correctly. Even a small difference in a filename will result in a
MissingRuleException
.
We may want to remove all our data files so we can explicitly recreate
them all. We can introduce a new target, and associated rule, to do
this. We will call it clean
, as this is a common name for rules that
delete auto-generated files, like our .dat
files:
rule clean:
    shell: 'rm -f *.dat'
This is an example of a rule that has no inputs or outputs! We just want to remove the data files whether or not they exist. If we run Snakemake and specify this target,
$ snakemake clean
then we get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 clean
1
rule clean:
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
An ls
of our current directory reveals that all of our troublesome output
files are now gone (as planned)!
We can add a similar rule to create all the data files. We can put
this at the top of our Snakefile so that it is the default target, which is executed by default
if no target is given to the snakemake
command:
rule dats:
input:
'isles.dat',
'abyss.dat'
This is an example of a rule that has dependencies that are targets of
other rules. When snakemake
runs, it will check to see if the dependencies
exist and, if not, will see if rules are available that will create
these. If such rules exist it will invoke these first, otherwise
snakemake
will raise an error.
Dependencies
The order of rebuilding dependencies is arbitrary. You should not assume that they will be built in the order in which they are listed.
Dependencies must form a directed acyclic graph. A target cannot depend on a dependency which itself, or one of its dependencies, depends on that target.
This rule is also an example of a rule that has no actions. It is used purely to trigger the build of its dependencies, if needed.
If we run,
$ snakemake dats
then snakemake creates the data files:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1 count_words_abyss
1 dats
3
rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 1
Finished job 1.
1 of 3 steps (33%) done
rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 2
Finished job 2.
2 of 3 steps (67%) done
localrule dats:
input: isles.dat, abyss.dat
jobid: 0
Finished job 0.
3 of 3 steps (100%) done
If we run dats
again, then snakemake will see that the dependencies
(isles.dat
and abyss.dat
) are already up to date.
Given the target dats
has no actions, there is nothing to be done
:
$ snakemake dats
Nothing to be done
Our Snakefile now looks like this:
rule dats:
input:
'isles.dat',
'abyss.dat'
# delete everything so we can re-run things
rule clean:
shell: 'rm -f *.dat'
# count words in one of our "books"
rule count_words:
input: 'books/isles.txt'
output: 'isles.dat'
shell: 'python wordcount.py books/isles.txt isles.dat'
rule count_words_abyss:
input: 'books/abyss.txt'
output: 'abyss.dat'
shell: 'python wordcount.py books/abyss.txt abyss.dat'
The following figure shows a graph of the dependencies embodied within
our Snakefile, involved in building the dats
target:
At this point, it becomes important to see what Snakemake is doing behind the
scenes.
What commands is Snakemake actually running?
Snakemake has a special option (-p
),
which prints every command it is about to run.
Additionally, we can also perform a dry run with -n
.
A dry run does nothing, and simply prints out commands instead of actually
executing them.
Very useful for debugging!
$ snakemake clean
$ snakemake -n -p isles.dat
rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0
python wordcount.py books/isles.txt isles.dat
Job counts:
count jobs
1 count_words
1
Write Two New Rules
- Write a new rule for last.dat, created from books/last.txt.
- Update the dats rule with this target.
- Write a new rule for results.txt, which creates the summary table. The rule needs to:
  - Depend upon each of the three .dat files.
  - Invoke the action python zipf_test.py abyss.dat isles.dat last.dat > results.txt.
  - Put this rule at the top of the Snakefile so that it is the default target.
- Update clean so that it removes results.txt.
The following figure shows the dependencies embodied within our
Snakefile, involved in building the results.txt
target:
Key Points
Snakemake follows Python syntax.
Rules can have inputs and/or outputs, and a command to be run.
Wildcards
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How can I abbreviate the rules in my pipeline?
Objectives
Use snakemake wildcards to simplify our rules.
Output files are a product not only of input files but of the scripts or code that created the output files.
After the exercise at the end of the previous episode, our Snakefile looked like this:
# generate summary table
rule zipf_test:
input: 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell: 'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'
rule dats:
input: 'isles.dat', 'abyss.dat', 'last.dat'
# delete everything so we can re-run things
rule clean:
shell: 'rm -f *.dat results.txt'
# count words in one of our "books"
rule count_words:
input: 'books/isles.txt'
output: 'isles.dat'
shell: 'python wordcount.py books/isles.txt isles.dat'
rule count_words_abyss:
input: 'books/abyss.txt'
output: 'abyss.dat'
shell: 'python wordcount.py books/abyss.txt abyss.dat'
rule count_words_last:
input: 'books/last.txt'
output: 'last.dat'
shell: 'python wordcount.py books/last.txt last.dat'
Our Snakefile has a lot of duplication. For example, the names of text files and data files are repeated in many places throughout the Snakefile. Snakefiles are a form of code and, in any code, repeated code can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).
D.R.Y. (Don’t Repeat Yourself)
In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python or R or Java, such as user-defined variables and functions are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.
Let us set about removing some of the repetition from our Snakefile.
In our zipf_test
rule we duplicate the data file names and the
name of the results file:
rule zipf_test:
input:
'abyss.dat',
'last.dat',
'isles.dat'
output: 'results.txt'
shell: 'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'
Looking at the results file name first, we can replace it in the action
with {output}
:
rule zipf_test:
input: 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell: 'python zipf_test.py abyss.dat isles.dat last.dat > {output}'
{output}
is a Snakemake wildcard which is equivalent to the value
we specified for the output
section of the rule.
We can replace the dependencies in the action with {input}
:
rule zipf_test:
input: 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell: 'python zipf_test.py {input} > {output}'
{input}
is another wildcard which means ‘all the dependencies of the current
rule’. Again, when Snakemake is run it will replace this variable with the
dependencies.
Let’s update our text files and re-run our rule:
$ touch books/*.txt
$ snakemake results.txt
We get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1 count_words_abyss
1 count_words_last
1 zipf_test
4
rule count_words_last:
input: books/last.txt
output: last.dat
jobid: 1
Finished job 1.
1 of 4 steps (25%) done
rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 2
Finished job 2.
2 of 4 steps (50%) done
rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 3
Finished job 3.
3 of 4 steps (75%) done
rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0
Finished job 0.
4 of 4 steps (100%) done
Update Dependencies
What will happen if you now execute:
$ touch *.dat
$ snakemake results.txt
- nothing
- all files recreated
- only .dat files recreated
- only results.txt recreated
Solution
4. Only results.txt is recreated.
The rules for the *.dat files are not executed because their corresponding .txt files haven't been modified.
If you run:
$ touch books/*.txt
$ snakemake results.txt
you will find that the .dat files as well as results.txt are recreated.
As we saw, {input}
means ‘all the dependencies of the current rule’.
This works well for results.txt
as its action treats all the dependencies the
same — as the input for the zipf_test.py
script.
Rewrite .dat rules to use wildcards
Rewrite each .dat rule to use the {input} and {output} wildcards.
Handling dependencies differently
For many rules, we may want to treat some dependencies
differently. For example, our rules for .dat
use their first (and
only) dependency specifically as the input file to wordcount.py
. If
we add additional dependencies (as we will soon do) then we don’t want
these being passed as input files to wordcount.py
as it expects only
one input file to be named when it is invoked.
Snakemake provides several solutions to this. Depending on what we want to do, it’s possible to both index and name our wildcards.
Suppose we want to add wordcount.py
as a dependency of each data file.
In this case, we can use {input[0]}
to refer to the first dependency,
and {input[1]}
to refer to the second.
rule count_words:
input: 'wordcount.py', 'books/isles.txt'
output: 'isles.dat'
shell: 'python {input[0]} {input[1]} {output}'
Alternatively, we can name our dependencies.
rule count_words_abyss:
input:
wc='wordcount.py',
book='books/abyss.txt'
output: 'abyss.dat'
shell: 'python {input.wc} {input.book} {output}'
Let’s mark wordcount.py
as updated, and re-run the pipeline.
$ touch wordcount.py
$ snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1 count_words_abyss
1 zipf_test
3
rule count_words_abyss:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 2
Finished job 2.
1 of 3 steps (33%) done
rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 1
Finished job 1.
2 of 3 steps (67%) done
rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0
Finished job 0.
3 of 3 steps (100%) done
Notice how last.dat
(which does not depend on wordcount.py
) is not rebuilt.
Intuitively, we should also add wordcount.py
as a dependency for
results.txt
, as the final table should be rebuilt as we remake the
.dat
files. However, it turns out we don’t have to! Let’s see what
happens to results.txt
when we update wordcount.py
:
$ touch wordcount.py
$ snakemake results.txt
then we get:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1 count_words_abyss
1 zipf_test
3
rule count_words_abyss:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 2
Finished job 2.
1 of 3 steps (33%) done
rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 1
Finished job 1.
2 of 3 steps (67%) done
rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0
Finished job 0.
3 of 3 steps (100%) done
The whole pipeline is triggered, even the creation of the
results.txt
file! To understand this, note that according to the
dependency figure, results.txt
depends on the .dat
files. The
update of wordcount.py
triggers an update of the *.dat
files. Thus, snakemake
sees that the dependencies (the .dat
files) are
newer than the target file (results.txt
) and thus it recreates
results.txt
. This is an example of the power of snakemake
: updating a
subset of the files in the pipeline triggers rerunning the appropriate
downstream steps.
Updating One Input File
What will happen if you now execute:
$ touch books/last.txt
$ snakemake results.txt
- only last.dat is recreated
- all .dat files are recreated
- only last.dat and results.txt are recreated
- all .dat and results.txt are recreated
More dependencies…
Add zipf_test.py as a dependency of results.txt.
Which method do you prefer here, indexing or named input files? Yes, this will be clunky, but we'll fix that part later! Remember that you can do a dry run with snakemake -n -p!
Key Points
Use {output} to refer to the output of the current rule.
Use {input} to refer to the dependencies of the current rule.
You can use Python indexing to retrieve individual outputs and inputs (example: {input[0]}).
Wildcards can be named (example: {input.file1}).
Pattern Rules
Overview
Teaching: 15 min
Exercises: 0 min
Questions
How can I define rules to operate on similar files?
Objectives
Write Snakemake pattern rules.
Our Snakefile still has a ton of repeated content.
The rules for each .dat
file all do the same thing, for the most part.
We can replace these rules with a single pattern rule which can be used to build any
.dat
file from a .txt
file in books/
:
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: '{file}.dat'
shell: 'python {input.wc} {input.book} {output}'
{file}
is another arbitrary wildcard,
that we can use as a placeholder for any generic book to analyse.
Note that we don’t have to use {file}
as the name of our wildcard —
it can be anything we want!
This rule can be interpreted as:
“In order to build a file named something.dat
(the output)
find a file named books/something.txt
(the input)
and run wordcount.py input output
.”
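To see that the wildcard name really is arbitrary, here is a sketch of the same rule written with a wildcard called {book_name}. It behaves identically; keep only one version of the rule in your Snakefile, since rule names must be unique.
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{book_name}.txt'
    output: '{book_name}.dat'
    shell: 'python {input.wc} {input.book} {output}'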
$ snakemake clean
# use the -p option to show that it is running things correctly!
$ snakemake -p dats
We should see the same output as before.
Note that we can still use snakemake to build individual .dat
targets as
before, and that our new rule will work no matter what stem is being matched.
$ snakemake -p sierra.dat
which gives the output below:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 count_words
1
rule count_words:
input: wordcount.py, books/sierra.txt
output: sierra.dat
jobid: 0
wildcards: file=sierra
python wordcount.py books/sierra.txt sierra.dat
Finished job 0.
1 of 1 steps (100%) done
Using wildcards
Our arbitrary wildcards like {file} can only be used in the input: and output: fields. They cannot be used in actions.
Our Snakefile is now much shorter and cleaner:
# generate summary table
rule zipf_test:
input: 'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell: 'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'
rule dats:
input:
'isles.dat', 'abyss.dat', 'last.dat'
# delete everything so we can re-run things
rule clean:
shell: 'rm -f *.dat results.txt'
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: '{file}.dat'
shell: 'python {input.wc} {input.book} {output}'
Key Points
Use any named wildcard ({some_name}) as a placeholder in targets and dependencies.
Snakefiles are Python code
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How can I automatically manage dependencies and outputs?
How can I use Python code to add features to my pipeline?
Objectives
Use variables, functions, and imports in a Snakefile.
Learn to use the
run
action to execute Python code as an action.
Despite our efforts, our pipeline still has repeated content,
for instance the names of output files/dependencies.
Our zipf_test
rule, for instance, is extremely clunky.
What happens if we want to analyse books/sierra.txt
as well?
We’d have to update everything!
rule zipf_test:
input: 'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell: 'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'
First, let’s cut down on a little bit of the clunkiness of the shell
directive.
One thing you’ve probably noticed is that all of our rules are using Python
strings.
Other data structures work too — let’s try a list:
rule zipf_test:
input:
zipf='zipf_test.py',
books=['abyss.dat', 'last.dat', 'isles.dat']
output: 'results.txt'
shell: 'python {input.zipf} {input.books} > {output}'
(snakemake clean
and snakemake -p
should show that the pipeline still
works!)
This illustrates a key feature of Snakemake.
Snakefiles are just Python code.
We can make our list into a variable to demonstrate this.
Let’s create the variable DATS
and use it in our zipf_test
and dats
rules.
DATS=['abyss.dat', 'last.dat', 'isles.dat']
# generate summary table
rule zipf_test:
input:
zipf='zipf_test.py',
books=DATS
output: 'results.txt'
shell: 'python {input.zipf} {input.books} > {output}'
rule dats:
input: DATS
Try re-creating both the dats
and results.txt
targets
(run snakemake clean
in between).
When are Snakefiles executed?
The last example illustrated that we can use arbitrary Python code in our
Snakefile.
It’s important to understand when this code gets executed.
Let’s add a print
instruction to the top of our Snakefile.
print('Snakefile is being executed!')
DATS=['abyss.dat', 'last.dat', 'isles.dat']
# generate summary table
rule zipf_test:
input:
# more output below
Now let's clean up our workspace with snakemake clean:
$ snakemake clean
Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 clean
1
rule clean:
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Now let’s re-run the pipeline…
$ snakemake
Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
3 count_words
1 zipf_test
4
rule count_words:
input: wordcount.py, books/last.txt
output: last.dat
jobid: 3
wildcards: file=last
Finished job 3.
1 of 4 steps (25%) done
rule count_words:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 1
wildcards: file=abyss
Finished job 1.
2 of 4 steps (50%) done
rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 2
wildcards: file=isles
Finished job 2.
3 of 4 steps (75%) done
rule zipf_test:
input: zipf_test.py, abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0
Finished job 0.
4 of 4 steps (100%) done
Let’s do a dry-run:
$ snakemake -n
Snakefile is being executed!
Nothing to be done.
In every case, the print()
statement ran before any of the actual
pipeline code was run.
What we can take away from this is that Snakemake executes the entire Snakefile
every time we run snakemake
(regardless of whether it's a dry run!).
Because of this, we need to be careful,
and only put tasks that do “real work” (changing files on disk) inside rules.
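As a sketch of the difference (the parse_marker.txt file name is made up for illustration), code at the top level of the Snakefile runs on every invocation, even dry runs, while the same work placed inside a rule only runs when Snakemake decides that rule is needed:
# BAD: this would write a file every time the Snakefile is parsed, even during a dry run
# with open('parse_marker.txt', 'w') as f:
#     f.write('parsed\n')

# BETTER: work that changes files on disk belongs inside a rule
rule make_marker:
    output: 'parse_marker.txt'
    shell: 'echo "made by the make_marker rule" > {output}'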
Using functions in Snakefiles
In our example here, we only have 4 books.
But what if we had 700 books to be processed?
It would be a massive effort to update our DATS
variable to
add the name of every single book’s corresponding .dat
filename.
Fortunately, Snakemake ships with several functions that make working with
large numbers of files much easier.
The two most helpful ones are glob_wildcards()
and expand()
.
Let’s start an interactive Python session to see how they work:
$ python3
Python 3.6.1 (default, Jun 27 2017, 14:35:15)
Type "copyright", "credits" or "license" for more information.
In this example, we will import these Snakemake functions directly in our interactive Python session. It is not necessary to import these functions within your Snakefile — these functions are always imported for you.
from snakemake.io import expand, glob_wildcards
Generating file names with expand()
The first function we’ll use is expand()
.
expand()
is used quite literally,
to expand one or more snakemake wildcards into a set of filenames.
>>> expand('folder/{wildcard1}_{wildcard2}.txt',
... wildcard1=['a', 'b', 'c'],
... wildcard2=[1, 2, 3])
['folder/a_1.txt',
'folder/a_2.txt',
'folder/a_3.txt',
'folder/b_1.txt',
'folder/b_2.txt',
'folder/b_3.txt',
'folder/c_1.txt',
'folder/c_2.txt',
'folder/c_3.txt']
In this case, expand()
created every possible combination of filenames from
the two wildcards. Useful!
Of course, this still leaves us needing some way to get the values for wildcard1
and wildcard2
in the first place.
Get wildcard values with glob_wildcards()
To get a set of wildcards from a list of files, we can use the
glob_wildcards()
function.
Let’s try grabbing all of the book titles in our books
folder.
>>> glob_wildcards('books/{example}.txt')
Wildcards(example=['isles', 'last', 'abyss', 'sierra'])
glob_wildcards()
returns a Wildcards
object as output.
Wildcards
is a special object defined by Snakemake that
provides named lists.
In this case, there is only one wildcard, {example}
.
We can extract the values for the file names by getting the example
property from the output of glob_wildcards():
>>> glob_wildcards('books/{example}.txt').example
['isles', 'last', 'abyss', 'sierra']
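Combining the two functions gives us the matching .dat file names directly. A sketch, continuing the same interactive session (the ordering of the list may differ on your machine):
>>> books = glob_wildcards('books/{book}.txt').book
>>> expand('{book}.dat', book=books)
['isles.dat', 'last.dat', 'abyss.dat', 'sierra.dat']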
Putting it all together
Using the expand() and glob_wildcards() functions, modify the pipeline so that it automatically detects and analyses all the files in the books/ folder.
Using Python code as actions
One very useful feature of Snakemake is the ability to execute Python code
instead of just shell commands.
Instead of shell:
as an action, we can use run:
instead.
Add the following to our Snakefile:
# at the top of the file
import os
import glob
# add this wherever
rule print_book_names:
run:
print('These are all the book names:')
for book in glob.glob('books/*.txt'):
print(book)
Upon execution of the corresponding rule, Snakemake dutifully runs our Python
code in the run:
block:
$ snakemake print_book_names
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 print_book_names
1
rule print_book_names:
jobid: 0
These are all the book names:
books/isles.txt
books/last.txt
books/abyss.txt
books/sierra.txt
Finished job 0.
1 of 1 steps (100%) done
Moving output locations
Alter the rules in your Snakefile so that the .dat files are created in their own dats/ folder.
Note that creating this folder beforehand is unnecessary. Snakemake automatically creates any folders for you, as needed.
Creating PNGs
Add new rules and update existing rules to:
- Create .png files from .dat files using plotcount.py.
- Remove all auto-generated files (.dat, .png, results.txt).
Finally, many Snakefiles define a default target called all as the first target, which will build what the Snakefile has been written to build (in our case, the .png files and the results.txt file). Add an all target to your Snakefile (Hint: this rule has the results.txt file and the .png files as dependencies, but no actions). With that in place, instead of running snakemake results.txt, you should now run snakemake all, or simply snakemake.
Creating an Archive
Update your pipeline to:
- Create an archive, zipf_analysis.tar.gz, to hold all our .dat files, plots, and the Zipf summary table.
- Update all to expect zipf_analysis.tar.gz as input.
- Remove zipf_analysis.tar.gz when snakemake clean is called.
The syntax to create an archive is shown below:
tar -czvf zipf_analysis.tar.gz file1 directory2 file3 etc
After these exercises our final workflow should look something like the following:
Adding more books
We can now do a better job at testing Zipf’s rule by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free e-books to download.
Exercise instructions
- Go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
- Download the 'Plain Text UTF-8' version and save it to the books folder; choose a short name for the file.
- Optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase End of Project Gutenberg's [title], by [author]).
- Run snakemake and check that the correct commands are run.
- Check the results.txt file to see how this book compares to the others.
Key Points
Snakefiles are Python code.
The entire Snakefile is executed whenever you run snakemake.
All actual work should be done by rules.
Resources and parallelism
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How do I scale a pipeline across multiple cores?
How do I manage access to resources while working in parallel?
Objectives
Modify your pipeline to run in parallel.
After the exercises at the end of our last lesson, our Snakefile looks something like this:
# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book
rule all:
input:
'zipf_analysis.tar.gz'
# delete everything so we can re-run things
rule clean:
shell:
'''
rm -rf results dats plots
rm -f results.txt zipf_analysis.tar.gz
'''
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
shell: 'python {input.wc} {input.book} {output}'
# create a plot for each book
rule make_plot:
input:
plotcount='plotcount.py',
book='dats/{file}.dat'
output: 'plots/{file}.png'
shell: 'python {input.plotcount} {input.book} {output}'
# generate summary table
rule zipf_test:
input:
zipf='zipf_test.py',
books=expand('dats/{book}.dat', book=DATS)
output: 'results.txt'
shell: 'python {input.zipf} {input.books} > {output}'
# create an archive with all of our results
rule make_archive:
input:
expand('plots/{book}.png', book=DATS),
expand('dats/{book}.dat', book=DATS),
'results.txt'
output: 'zipf_analysis.tar.gz'
shell: 'tar -czvf {output} {input}'
At this point, we have a complete data analysis pipeline. Very cool. But how do we make it run as efficiently as possible?
Running in parallel
Up to this point, Snakemake has printed out an interesting message whenever we run our pipeline.
Provided cores: 1
Rules claiming more threads will be scaled down.
So far, Snakemake has been run with one core.
Let’s scale up our pipeline to run in parallel.
The only change we need to make is run Snakemake with the -j
argument.
-j
is used to indicate the number of CPU cores available,
and, on a cluster, the maximum number of jobs (we'll get to that part later).
Note that 4 cores is usually a safe assumption when working on a laptop.
$ snakemake clean
$ snakemake -j 4
Provided cores: 4
Rules claiming more threads will be scaled down.
# more output follows
Our pipeline ran in parallel and finished roughly 4 times as quickly!
The takeaway here is that all we need to do to scale from a
serial pipeline is run snakemake
with the -j
option.
How many CPUs does your computer have?
Now that we can have our pipeline use multiple CPUs, how do we know how many CPUs to provide to the
-j
option? Note that for all of these options, it's best to use CPU cores, and not CPU threads.
Linux: You can use the lscpu command.
All platforms: Python's psutil module can be used to fetch the number of cores in your computer. Using logical=False returns the number of true CPU cores, while logical=True gives the number of CPU threads on your system.
import psutil
psutil.cpu_count(logical=False)
Managing CPUs
Each rule has a number of optional keywords aside from the usual
input
, output
, and shell
/run
.
The threads
keyword is used to specify how many CPU cores a rule
needs while executing.
Though in reality CPU threads are not quite the same as CPU cores,
the two terms are interchangeable when working with Snakemake.
Let’s pretend that our count_words
rule is actually very CPU-intensive.
We’ll say that it needs a whopping 4 CPUs per run.
We can specify this with the threads
keyword in our rule.
We will also modify the rule to print out the number of threads it thinks it is
using.
Please note that just giving a Snakemake rule 4 threads does not automatically
make its action run in parallel! The action also needs to be threads-capable
and to explicitly use the {threads}
information for this to happen.
In this case wordcount.py
is actually still running with 1 core;
we are simply using it as a demonstration of how to go about
running something with multiple cores.
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
shell:
'''
echo "Running {input.wc} with {threads} cores."
python {input.wc} {input.book} {output}
'''
Now when we run snakemake -j 4
, the jobs from count_words
are run one at a
time so as to give each job the resources it needs.
Since each job of the count_words
rule requires 4 threads (as per the newly
added thread directive), and because all jobs have a maximum of 4 cores
available to them as per the -j 4
option, the count_words jobs are run one at
a time.
All of our other rules will still run in parallel since they default to
requesting a single thread.
Unless otherwise specified with the threads keyword, rules will use 1 core by default.
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
4 count_words
1 make_archive
4 make_plot
1 zipf_test
11
rule count_words:
input: wordcount.py, books/last.txt
output: dats/last.dat
jobid: 3
wildcards: file=last
threads: 4
Running wordcount.py with 4 cores.
Finished job 3.
1 of 11 steps (9%) done
# other output follows
What happens when we don’t have 4 cores available? What if we tell Snakemake to run with 2 cores instead?
$ snakemake -j 2
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
4 count_words
1 make_archive
4 make_plot
1 zipf_test
11
rule count_words:
input: wordcount.py, books/last.txt
output: dats/last.dat
jobid: 6
wildcards: file=last
threads: 2
Running wordcount.py with 2 cores.
Finished job 6.
1 of 11 steps (9%) done
# more output below
The key bit of output is Rules claiming more threads will be scaled down.
.
When Snakemake doesn’t have enough cores to run a rule (as defined by
{threads}
),
Snakemake will run that rule with the maximum available number of cores
instead.
After all, Snakemake’s job is to get our workflow done.
It automatically scales our workload to match the maximum number of cores
available without us editing the Snakefile.
Chaining multiple commands
Up until now, all of our commands have fit on one line.
To execute multiple bash commands, the only modification we need to make
is to use a Python multiline string (begin and end with ''').
One important addition we should be aware of is the &&
operator.
&&
is a bash operator that runs commands as part of a chain.
If the first command fails, the remaining steps are not run.
This is safer than bash's default "hit an error and keep going"
behaviour.
After all, if the first command failed,
it’s unlikely the other steps will work.
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
shell:
'''
echo "Running {input.wc} with {threads} cores on {input.book}." &&
python {input.wc} {input.book} {output}
'''
Managing other types of resources
Not all compute resources are CPUs.
Examples might include limited amounts of RAM, number of GPUs, database locks,
or perhaps we simply don’t want multiple processes writing to the same file at
once.
All non-CPU resources are handled using the resources
keyword.
For our example, let’s pretend that creating a plot with plotcount.py
requires dedicated access to a GPU (it doesn’t),
and only one GPU is available.
How do we indicate this to Snakemake so that it knows to give dedicated access
to a GPU for rules that need it?
Let’s modify the make_plot
rule as an example:
# create a plot for each book
rule make_plot:
input:
plotcount='plotcount.py',
book='dats/{file}.dat'
output: 'plots/{file}.png'
resources: gpu=1
shell: 'python {input.plotcount} {input.book} {output}'
We can execute our pipeline using the following (using 8 cores and 1 gpu):
$ snakemake clean
$ snakemake -j 8 --resources gpu=1
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: gpu=1
# other output removed for brevity
Resources are entirely arbitrary — like wildcards,
they can be named anything.
Snakemake knows nothing about them aside from the fact that they have a name
and a value.
In this case gpu
indicates simply that there is a resource called gpu
used
by make_plot
.
We provided 1 gpu
to the workflow,
and the gpu
is considered in use as long as the rule is running.
Once the make_plot
rule completes,
the gpu
it consumed is added back to the pool of available gpu
s.
To be extra clear: gpu
in this case does not actually represent a GPU,
it is an arbitrary limit used to prevent multiple tasks that use a gpu
from
executing at the same time.
But what happens if we run our pipeline without specifying the number of GPUs?
$ snakemake clean
$ snakemake -j 8
Provided cores: 8
Rules claiming more threads will be scaled down.
Unlimited resources: gpu
If you have specified that a rule needs a certain resource, but do not specify how many you have, Snakemake will assume that the resources in question are unlimited.
Other uses for
resources
Resources do not have to correspond to actual compute resources. Perhaps one rule is particularly I/O heavy, and it’s best if only a limited number of these jobs run at a time. Or maybe a type of rule uses a lot of network bandwidth as it downloads data. In all of these cases,
resources
can be used to constrain access to arbitrary compute resources so that each rule can run at its most efficient. Snakemake will run your rules in such a way as to maximise throughput given your resource constraints.
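As a sketch (the download_data rule, the io_heavy resource name, and the URL are made up for illustration), a bandwidth-hungry rule could be throttled like this and then run with something like snakemake -j 8 --resources io_heavy=2, so that at most two downloads run at once:
rule download_data:
    output: 'raw/{dataset}.csv'
    resources: io_heavy=1
    shell: 'wget -O {output} https://example.com/{wildcards.dataset}.csv'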
Key Points
Use
threads
to indicate the number of cores used by a rule.Resources are arbitrary and can be used for anything.
The
&&
operator is a useful tool when chaining bash commands.
Scaling a pipeline across a cluster
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How do I run my workflow on an HPC system?
Objectives
Understand the Snakemake cluster job submission workflow.
Right now we have a reasonably effective pipeline that scales nicely on our local computer. However, for the sake of this course, we’ll pretend that our workflow actually takes significant computational resources and needs to be run on a cluster.
HPC cluster architecture
Most HPC clusters are run using a scheduler. The scheduler is a piece of software that handles which compute jobs are run on which compute nodes and where. It allows a set of users to share a shared computing system as efficiently as possible. In order to use it, users typically must write their commands to be run into a shell script and then “submit” it to the scheduler.
A good analogy would be a university’s room booking system. No one gets to use a room without going through the booking system. The booking system decides which rooms people get based on their requirements (# of students, time allotted, etc.).
Normally, moving a workflow to be run by a cluster scheduler requires a lot of work. Batch scripts need to be written, and you’ll need to monitor and babysit the status of each of your jobs. This is especially difficult if one batch job depends on the output from another. Even moving from one cluster to another (especially ones using a different scheduler) requires a large investment of time and effort — all the batch scripts from before need to be rewritten.
Snakemake does all of this for you. All details of running the pipeline through the cluster scheduler are handled by Snakemake — this includes writing batch scripts, submitting, and monitoring jobs. In this scenario, the role of the scheduler is limited to ensuring each Snakemake rule is executed with the resources it needs.
We’ll explore how to port our example Snakemake pipeline. Our current Snakefile is shown below:
# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book
rule all:
input:
'zipf_analysis.tar.gz'
# delete everything so we can re-run things
rule clean:
shell:
'''
rm -rf results dats plots
rm -f results.txt zipf_analysis.tar.gz
'''
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
shell:
'''
python {input.wc} {input.book} {output}
'''
# create a plot for each book
rule make_plot:
input:
plotcount='plotcount.py',
book='dats/{file}.dat'
output: 'plots/{file}.png'
resources: gpu=1
shell: 'python {input.plotcount} {input.book} {output}'
# generate summary table
rule zipf_test:
input:
zipf='zipf_test.py',
books=expand('dats/{book}.dat', book=DATS)
output: 'results.txt'
shell: 'python {input.zipf} {input.books} > {output}'
# create an archive with all of our results
rule make_archive:
input:
expand('plots/{book}.png', book=DATS),
expand('dats/{book}.dat', book=DATS),
'results.txt'
output: 'zipf_analysis.tar.gz'
shell: 'tar -czvf {output} {input}'
To run Snakemake on a cluster, we need to create a profile to tell snakemake
how to submit jobs to our cluster.
We can then submit jobs with this profile using the --profile
argument
followed by the name of our profile.
In this configuration,
Snakemake runs on the cluster head node and submits jobs.
Each cluster job executes a single rule and then exits.
Snakemake detects the creation of output files,
and submits new jobs (rules) once their dependencies are created.
Transferring our workflow
Let’s port our workflow to Compute Canada’s Graham cluster as an example (you will probably be using a different cluster, adapt these instructions to your cluster). The first step will be to transfer our files to the cluster and log on via SSH. Snakemake has a powerful archiving utility that we can use to bundle up our workflow and transfer it.
$ snakemake clean
$ tar -czvf pipeline.tar.gz .
# transfer the pipeline via scp
$ scp pipeline.tar.gz [email protected]:
# log on to the cluster
$ ssh -X [email protected]
snakemake --archive
and Conda deployment
Snakemake has a built-in method to archive all input files and scripts under version control: snakemake --archive.
What's more, it also installs any required dependencies if they can be installed using Anaconda's conda package manager. You can use this feature for this tutorial (I've already added all of the files to version control for you), but if you want to use this feature in your own work, you should familiarise yourself with a version control tool like Git.
For more information on how to use this feature, see http://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html
At this point we’ve archived our entire pipeline, sent it to the cluster, and logged on. Let’s create a folder for our pipeline and unpack it there.
$ mkdir pipeline
$ mv pipeline.tar.gz pipeline
$ cd pipeline
$ tar -xvzf pipeline.tar.gz
If Snakemake and Python are not already installed on your cluster, you can install them in an Anaconda Python environment using the following commands:
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
This is an interactive installation through the command line. Review and accept
the license agreement, then work through the prompts. The defaults are probably
fine. Accept its offer to initialize your environment (conda init
), then run
the suggested command to load the conda
base environment so you can use it
straight away. Finally, install Snakemake from the bioconda channel:
$ conda install -y -c bioconda graphviz matplotlib numpy snakemake
Assuming you’ve transferred your files and everything is set to go,
the command snakemake -n
should work without errors.
Creating a cluster profile
Snakemake uses a YAML-formatted configuration file to retrieve cluster submission parameters; we will use the SLURM scheduler as an example. When we use the --profile slurm argument, snakemake looks for a directory with the name of our profile (slurm) containing a config.yaml file such as the one below.
cluster: "sbatch --time={resources.time_min} --mem={resources.mem_mb}
-c {resources.cpus} -o slurm/logs/{rule}_{wildcards}
-e slurm/logs/{rule}_{wildcards}"
jobs: 25
default-resources: [cpus=1, mem_mb=1000, time_min=5]
resources: [cpus=100, mem_mb=100000]
This file has several components.
cluster
and the arguments that follow tell snakemake how to submit jobs to
the cluster.
Here we’ve used SLURM’s sbatch
command and arguments for setting time limits
and resources with snakemake wildcards defining the requested values.
We've also specified where to save SLURM logs and what to call them. Note that this folder must already exist; if it doesn't, Snakemake will hang.
Values for any command line argument to snakemake can be defined in our
profile, although a value is required (e.g. the --use-conda
argument could be
included in our profile with use-conda: true
).
jobs
specifies the maximum number of jobs that will be submitted at one time.
We also specified the default-resources
that will be requested for each job,
while resources
defines the resource limits.
With these parameters, snakemake will use no more than 100 cpus and 100000 MB (100 GB) at a time between all currently submitted jobs. While it does not come into play here, a generally sensible default for jobs is slightly above the maximum number of jobs you are allowed to have submitted at a time.
The defaults won’t always be perfect, however — chances are some rules
may need to run with non-default amounts of memory or time limits.
We are using the count_words
rule as an example of this.
To request non-default resources for a job, we can modify the rule in our
snakefile to include a resources
section like this:
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
resources: cpus=4, mem_mb=8000, time_min=20
shell:
'''
python {input.wc} {input.book} {output}
'''
Local rule execution
Some Snakemake rules perform trivial tasks where job submission might be
overkill (i.e. less than 1 minute worth of compute time).
It would be a better idea to have these rules execute locally
(i.e. where the snakemake
command is run)
instead of as a job.
Let’s define all
, clean
, and make_archive
as localrules near the top of
our Snakefile
.
localrules: all, clean, make_archive
Running our workflow on the cluster
OK, time for the moment we’ve all been waiting for — let’s run our workflow on the cluster with the profile we’ve created. Use this command:
$ snakemake --profile slurm
While things execute, you may wish to SSH to the cluster in another window so
you can watch the pipeline’s progress with watch squeue -u $(whoami)
.
Notes on
$PATH
As with any cluster job, jobs started by Snakemake need to have the commands they are running on $PATH. For some schedulers (SLURM), no modifications are necessary — variables are passed to the jobs by default. Other schedulers (SGE) need to have this enabled through a command line flag when submitting jobs (-V for SGE). If this is possible, just run the module load commands you need ahead of the job and run Snakemake as normal.
If this is not possible, you have several options:
- You can edit your .bashrc file to modify $PATH for all jobs and sessions you start on a cluster.
- Inserting shell.prefix('some command') in a Snakefile means that all rules run will be prefixed by that command. You can use this to modify $PATH, e.g., shell.prefix('PATH=/extra/directory:$PATH ').
- You can modify rules directly to run the appropriate module load commands beforehand. This is not recommended, if only because it is more work than the other options available.
Submitting a workflow with nohup
nohup some_command &
runs a command in the background and lets it keep running if you log off. Try running the pipeline in cluster mode using nohup (run snakemake clean beforehand). Where does the Snakemake log go to? Why might this technique be useful? Can we also submit the snakemake --profile slurm pipeline as a job? Where does the Snakemake command run in each scenario?
You can kill the running Snakemake process with killall snakemake. Notice that if you try to run Snakemake again, it says the directory is locked. You can unlock the directory with snakemake --unlock.
Key Points
Snakemake generates and submits its own batch scripts for your scheduler.
localrules
defines rules that are executed on the Snakemake head node.
$PATH
must be passed to Snakemake rules.
nohup <command> &
prevents <command>
from exiting when you log off.
Final notes
Overview
Teaching: 15 min
Exercises: 15 min
Questions
What are some tips and tricks I can use to make this easier?
Objectives
Understand how to make a DAG graph.
Now that we know how to write and scale a pipeline, here are some tips and tricks for making the process go more smoothly.
snakemake -n
is your friend
Whenever you edit your Snakefile, run snakemake -n
immediately afterwards.
This will check for errors and make sure that the pipeline is able to run.
The most common source of errors is a mismatch in filenames
(Snakemake doesn’t know how to produce a particular output file) —
snakemake -n
will catch this as long as the troublesome output files haven’t
already been made.
Configuring logging
By default, Snakemake prints all output from stderr and stdout from rules. This is useful, but if a failure occurs (or we otherwise need to inspect the logs) it can be extremely difficult to determine what happened or which rule had an issue, especially when running in parallel.
The solution to this issue is to redirect the output from each rule/
set of inputs to a dedicated log file.
We can do this using the log
keyword.
Let’s modify our count_words
rule to be slightly more verbose and redirect
this output to a dedicated log file.
Two things before we start:
&> is a handy operator in bash that redirects both stdout and stderr to a file.
&>> does the same thing as &>, but appends to a file instead of overwriting it.
# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
log: 'dats/{file}.log'
shell:
'''
echo "Running {input.wc} with {threads} cores on {input.book}." &> {log}
python {input.wc} {input.book} {output} &>> {log}
'''
$ snakemake clean
$ snakemake -j 8
$ cat dats/abyss.log
# snakemake output omitted
Running wordcount.py with 4 cores on books/abyss.txt.
Notice how the rule's output no longer appears in Snakemake's terminal output, and is instead redirected to a dedicated log file.
Choosing a good log file location
Though you can put a log anywhere (and name it anything), it is often a good practice to put the log in the same directory where the rule’s output will be created. If you need to investigate the output for a rule and associated log files, this means that you only have to check one location!
Token files
Often, a rule does not generate a unique output, and merely modifies a file.
In these cases it is often worthwhile to create a placeholder, or “token file”
as output.
A token file is simply an empty file that you can create with the touch command
(touch some_file.txt
creates an empty file called some_file.txt
).
An example rule using this technique is shown below:
rule token_example:
input: 'some_file.txt'
output: 'some_file.tkn' # marks some_file.txt as modified
shell:
'''
some_command --do-things {input} &&
touch {output}
'''
Directory locks
Only one instance of Snakemake can run in a directory at a time.
If a Snakemake run fails without unlocking the directory
(if you killed the process, for instance), you can run
snakemake --unlock
to unlock it.
Python as a fallback
Remember, you can use Python imports and functions anywhere in a Snakefile.
If something seems a little tricky to implement - Python can do it.
The os
, shutil
, and subprocess
packages are useful tools for using Python
to execute command line actions.
In particular, os.system('some command')
will run a command on the
command-line and block until execution is complete.
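For example (a sketch; the backup_results rule and file names are made up for illustration), a run: block can combine Python file handling with the rule's input and output values:
# at the top of the Snakefile
import os
import shutil

rule backup_results:
    input: 'results.txt'
    output: 'backup/results.txt'
    run:
        # copy the results file into a backup folder using plain Python
        os.makedirs('backup', exist_ok=True)
        shutil.copy(input[0], output[0])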
Creating a workflow diagram
Assuming graphviz is installed (conda install graphviz
),
you can create a diagram of your workflow with the command:
snakemake --dag | dot -Tsvg > dag.svg
.
This creates a plot of your “directed acyclic graph”
(a plot of all of the rules Snakemake thinks it needs to complete),
which you can view using any picture viewing program.
In fact this was the tool used to create all of the diagrams in this lesson:
snakemake --dag | dot -Tsvg > dag.svg
eog dag.svg # eog is an image viewer installed on many Linux systems
Rules that have yet to be completed are indicated with solid outlines.
Already completed tasks will be indicated with dashed outlines.
In this case, I ran snakemake clean
,
just before creating the diagram — no rules have been run yet.
Viewing the GUI
Snakemake has an experimental web browser GUI. I personally haven’t used it for anything, but it’s cool to know it’s there and can be used to view your workflow on the fly.
snakemake --gui
Where to go for documentation / help
The Snakemake documentation is located at snakemake.readthedocs.io
Key Points
Token files can be used to take the place of output files if none are created.
snakemake --dag | dot -Tsvg > dag.svg
creates a graphic of your workflow.
snakemake --gui
opens a browser window with your workflow.