Content from Introduction to Chapel
Last updated on 2024-11-18
Overview
Questions
- “What is Chapel and why is it useful?”
Objectives
- “Write and execute our first Chapel program.”
Chapel is a modern, open-source programming language that supports HPC via high-level abstractions for data parallelism and task parallelism. These abstractions allow users to express parallel code in a natural, almost intuitive manner. In contrast with other high-level parallel languages, however, Chapel was designed around a multi-resolution philosophy: users can incrementally add more detail to their original code prototype, optimising it for a particular computer as closely as required.
In a nutshell, with Chapel we can write parallel code with the simplicity and readability of scripting languages such as Python or MATLAB, while achieving performance comparable to compiled languages like C or Fortran (combined with traditional parallel libraries such as MPI or OpenMP).
In this lesson we will learn the basic elements and syntax of the language; then we will study task parallelism, the first level of parallelism in Chapel; and finally we will use parallel data structures and data parallelism, the higher level of abstraction in parallel programming offered by Chapel.
Getting started
Chapel is a compiled language, which means that we must compile our source code to generate a binary or executable that we can then run on the computer.
Chapel source code must be written in text files with the extension `.chpl`. Let's write a simple "hello world"-type program to demonstrate how we write Chapel code! Using your favourite text editor, create the file `hello.chpl` with the following content:
writeln('If we can see this, everything works!');
This program can then be compiled with the following bash command:
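BASH
chpl --fast hello.chpl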
The flag `--fast` tells the compiler to optimise the binary to run as fast as possible on the given architecture. By default, the compiler will produce a program with the same name as the source file; in our case, the program will be called `hello`. The `-o` option can be used to change the name of the generated binary.
To run the code, you execute it as you would any other program:
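BASH
./hello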
OUTPUT
If we can see this, everything works!
Running on a cluster
Depending on the code, it might utilise several or even all cores on the current node. The command above implies that you are allowed to utilise all cores. This might not be the case on an HPC cluster, where a login node is shared by many people at the same time, and where it might not be a good idea to occupy all cores on a login node with CPU-intensive tasks. Instead, you will need to submit your Chapel run as a job to the scheduler asking for a specific number of CPU cores.
Use `module avail chapel` to list Chapel packages on your HPC cluster, and select the best fit, e.g. a single-locale Chapel module (module names vary between clusters):
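BASH
module load chapel-multicore # hypothetical module name; check 'module avail chapel' on your cluster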
Then, for running a test code on a cluster, you would submit an interactive job to the queue.
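For example, with the Slurm scheduler (the exact flags vary between clusters):
BASH
salloc --time=0:30:0 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1000M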
Once the job starts, compile and run the test code inside it:
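BASH
chpl --fast hello.chpl
./hello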
For production jobs, you would compile the code and then submit a batch script to the queue:
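BASH
sbatch script.sh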
where the script `script.sh` would set all Slurm variables and call the executable `mybinary`.
Case study
Throughout these Chapel lessons, we will use the following case study as the guiding thread of the discussion. Essentially, we will build, step by step, a Chapel code to solve the heat transfer problem described below. Then we will parallelize the code to improve its performance.
Suppose that we have a square metallic plate with some initial heat distribution or initial conditions. We want to simulate the evolution of the temperature across the plate when its border is in contact with a different heat distribution that we call the boundary conditions.
The Laplace equation is the mathematical model for the evolution of the temperature in the plate. To solve this equation numerically, we need to discretise it, i.e. to consider the plate as a grid, or matrix of points, and to evaluate the temperature on each point at each iteration, according to the following difference equation:
temp_new[i,j] = 0.25 * (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1])
Here `temp_new` stands for the new temperature at the current iteration, while `temp` contains the temperature calculated at the past iteration (or the initial conditions if we are at the first iteration). The indices `i` and `j` indicate that we are working on the point of the grid located at the i-th row and the j-th column.
So, our objective is to:
Goals
Write a code to implement the difference equation above. The code should meet the following requirements:
- It should work for any given number of rows and columns in the grid.
- It should run for a given number of iterations, or until the difference between `temp_new` and `temp` is smaller than a given tolerance value.
- It should output the temperature at a desired position on the grid every given number of iterations.
Use task parallelism to improve the performance of the code and run it on the cluster.
Use data parallelism to improve the performance of the code and run it on the cluster.
Key Points
- “Chapel is a compiled language - any programs we make must be compiled with `chpl`.”
- “The `--fast` flag instructs the Chapel compiler to optimise our code.”
- “The `-o` flag tells the compiler what to name our output (otherwise it gets named after the source file).”
Content from Basic syntax and variables
Last updated on 2024-11-18
Overview
Questions
- “How do I write basic Chapel code?”
Objectives
- “Perform basic maths in Chapel.”
- “Understand Chapel’s basic data types.”
- “Understand how to read and fix errors.”
- “Know how to define and use data stored as variables.”
Using basic maths in Chapel is fairly intuitive. Try compiling the following code to see how the different mathematical operators work.
writeln(4 + 5);
writeln(4 - 5);
writeln(4 * 5);
writeln(4 / 5); // integer division
writeln(4.0 / 5.0); // floating-point division
writeln(4 ** 5); // exponentiation
In this example, our code is called `operators.chpl`. You can compile and run it with the following commands:
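BASH
chpl operators.chpl
./operators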
You should see output that looks something like the following:
OUTPUT
9
-1
20
0
0.8
1024
Code beginning with `//` is interpreted as a comment; it does not get run. Comments are very valuable when writing code, because they allow us to write notes to ourselves about what each piece of code does. You can also create block comments with `/*` and `*/`:
/* This is a block comment.
It can span as many lines as you want!
(like this) */
Variables
Granted, we probably want to do more than basic maths with Chapel. We will need to store the results of complex operations using variables. Variables in programming are not the same as the mathematical concept. In programming, a variable represents (or references) a location in the memory of the computer where we can store information or data while executing a program. A variable has three elements:
- a name or label, to identify the variable
- a type, that indicates the kind of data that we can store in it, and
- a value, the actual information or data stored in the variable.
Variables in Chapel are declared with the `var` or `const` keywords. When a variable declared as `const` is initialised, its value cannot be modified anymore during the execution of the program. What happens if we try to modify a constant variable like `test` below?
const test = 100;
test = 200;
writeln('The value of test is: ', test);
writeln(test / 4);
ERROR
variables.chpl:2: error: cannot assign to const variable
The compiler threw an error and did not compile our program. This is a feature of compiled languages: if there is something wrong, we will typically see an error at compile time, instead of while running the program. Although we already know roughly why the error was caused (we tried to reassign the value of a `const` variable, which by definition cannot be changed), let's walk through the error as an example of how to troubleshoot our programs.
- `variables.chpl:2:` indicates that the error was caused on line 2 of our `variables.chpl` file.
- `error:` indicates that the issue was an error, which blocks compilation. Sometimes the compiler will just give us a warning or information, not necessarily an error. When we see something that is not an error, we should carefully read the output and consider whether it necessitates changing our code. Errors must be fixed, as they will block the code from compiling.
- `cannot assign to const variable` indicates that we were trying to reassign a `const` variable, which is explicitly not allowed in Chapel.
To fix this error, we can change `const` to `var` when declaring our `test` variable. `var` indicates a variable that can be reassigned.
var test = 100;
test = 200;
writeln('The value of test is: ', test);
writeln(test / 4);
OUTPUT
The value of test is: 200
50
In Chapel, to declare a variable we must specify its type, or initialise it in place with some value. The common variable types in Chapel are:
- integer `int` (positive or negative whole numbers)
- floating-point number `real` (decimal values)
- Boolean `bool` (true or false)
- string `string` (any type of text)
The two variables below are declared with a type but no value. If no initial value is given, Chapel will initialise the variable with a default value depending on the declared type, for example 0 for integers and 0.0 for real variables.
var counter: int;
var delta: real;
writeln("counter is ", counter, " and delta is ", delta);
OUTPUT
counter is 0 and delta is 0.0
If a variable is initialised with a value but without a type, Chapel will infer its type from the given initial value:
const test = 100;
writeln('The value of test is ', test, ' and its type is ', test.type:string);
OUTPUT
The value of test is 100 and its type is int(64)
When initialising a variable, we can also assign its type in addition to its value:
const tolerance: real = 0.0001;
const outputFrequency: int = 20;
Callout
Note that these two notations below are different, but produce the same result in the end:
var a: real = 10.0; // we specify both the type and the value
var a = 10: real; // we specify only the value (10 converted to real)
Callout
In the following code (saved as `variables.chpl`) we have not initialised the variable `test` before trying to use it in line 2:
const test; // declare 'test' variable
writeln('The value of test is: ', test);
ERROR
variables.chpl:1: error: 'test' is not initialized and has no type
variables.chpl:1: note: cannot find initialization point to split-init this variable
variables.chpl:2: note: 'test' is used here before it is initialized
Now we know how to set, use, and change a variable, as well as the implications of using `var` and `const`. We also know how to read and interpret errors.
Let's practice defining variables and use this as the starting point of our simulation code. The code will be stored in the file `base_solution.chpl`. We will be solving the heat transfer problem introduced in the previous section, starting with some initial temperature and computing a new temperature at each iteration. We will then compute the greatest difference between the old and the new temperature and check if it is smaller than a preset `tolerance`. If not, we will continue iterating; if so, we will stop iterating and print the final temperature. We will also stop if we reach the maximum number of iterations `niter`.
Our grid will be of size `rows` by `cols`, and every `outputFrequency`-th iteration we will print the temperature at coordinates `x` and `y`.
The variable `delta` will store the greatest difference in temperature from one iteration to another. The variable `tmp` will store some temporary results when computing the temperatures.
Let’s define our variables:
const rows = 100; // number of rows in the grid
const cols = 100; // number of columns in the grid
const niter = 500; // maximum number of iterations
const x = 50; // row number for a printout
const y = 50; // column number for a printout
var delta: real; // greatest difference in temperature from one iteration to another
var tmp: real; // for temporary results
const tolerance: real = 0.0001; // smallest difference in temperature that would be accepted before stopping
const outputFrequency: int = 20; // the temperature will be printed every outputFrequency iterations
Key Points
- “A comment is preceded with `//` or surrounded by `/*` and `*/`.”
- “All variables in Chapel have a type, whether assigned explicitly by the user, or chosen by the Chapel compiler based on its value.”
- “Reassigning a new value to a `const` variable will produce an error during compilation. If you want to assign a new value to a variable, declare that variable with the `var` keyword.”
Content from Ranges and arrays
Last updated on 2024-11-18
Overview
Questions
- “How do I define and use ranges and arrays?”
Objectives
- “Learn to define and use ranges and arrays.”
Ranges and Arrays
A series of integers (1,2,3,4,5, for example) is called a range. Ranges are generated with the `..` operator. Let's examine what a range looks like; we store the following code as `ranges.chpl`. Here we introduce a very simple loop, cycling through all elements of the range and printing their values (we will study `for` loops in a separate section):
var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
for x in example_range do writeln(x);
OUTPUT
Our example range was set to: 0..10
0
1
...
9
10
Among other uses, ranges can be used to declare arrays of variables. An array is a multidimensional collection of values of the same type. Arrays can be of any size. Let's define a 1-dimensional array of the size `example_range` and see what it looks like. Notice how the size of an array is included with its type.
var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
We can reassign the values in our example array the same way we would reassign a variable. An array can either be set all to a single value, or to a sequence of values.
var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);
OUTPUT
Our example range was set to: 0..10
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
Notice how ranges are "right inclusive": the last number of a range is included in the range. This is different from languages like Python, where the upper bound is excluded.
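To make this explicit, here is a one-line check (a minimal sketch you can add to ranges.chpl):
for i in 1..5 do write(i, ' '); // prints: 1 2 3 4 5 (the upper bound 5 is included)
writeln();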
Indexing elements
We can retrieve and reset specific values of an array using `[]` notation. Note that we use the same square bracket notation in two different contexts: (1) to declare an array, with the square brackets containing the array's full index range `[example_range]`, and (2) to access specific array elements, as we will see below. Let's try retrieving and setting a specific value in our example so far:
var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);
// retrieve the 5th index
writeln(example_array[5]);
// set index 5 to a new value
example_array[5] = 99999;
writeln(example_array);
OUTPUT
Our example range was set to: 0..10
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
6.0
1.0 2.0 3.0 4.0 5.0 99999.0 7.0 8.0 9.0 10.0 11.0
One very important thing to note - in this case, index 5 was actually the 6th element. This was caused by how we set up our array: since we defined it using a range starting at 0, index 5 corresponds to the 6th element. Unlike most other programming languages, arrays in Chapel do not start at a fixed index - they can start at any number depending on how we define them! For instance, let's redefine `example_range` to start at 5:
var example_range = 5..15;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);
// retrieve the 5th index
writeln(example_array[5]);
// set index 5 to a new value
example_array[5] = 99999;
writeln(example_array);
OUTPUT
Our example range was set to: 5..15
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
1.0
99999.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
Back to our simulation
Let’s define a two-dimensional array for use in our simulation and set its initial values:
// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real;
temp[1..rows,1..cols] = 25; // set the initial temperature on the internal grid
This is a matrix (2D array) with (`rows + 2`) rows and (`cols + 2`) columns of real numbers. The ranges `0..rows+1` and `0..cols+1` used here not only define the size and shape of the array, they also stand for the indices with which we can access particular elements of the array using the `[ , ]` notation. For example, `temp[0,0]` is the real variable located at the first row and first column of the array `temp`, while `temp[3,7]` is the one at the 4th row and 8th column; `temp[2,3..15]` accesses the 4th to 16th columns of the 3rd row of `temp`, and `temp[0..3,4]` corresponds to the first 4 rows on the 5th column of `temp`.
We divide our "plate" into two parts: (1) the internal grid `1..rows,1..cols` on which we set the initial temperature to 25.0, and (2) the surrounding layer of ghost points with row indices equal to `0` or `rows+1` and column indices equal to `0` or `cols+1`. The temperature in the ghost layer is equal to 0.0 by default, as we do not assign a value there.
We should now be ready to start coding our simulation. Let's print some information about the initial configuration, compile the code, and execute it to see if everything is working as expected.
const rows = 100;
const cols = 100;
const niter = 500;
const x = 50; // row number of the desired position
const y = 50; // column number of the desired position
const tolerance = 0.0001; // smallest difference in temperature that would be accepted before stopping
const outputFrequency: int = 20; // the temperature will be printed every outputFrequency iterations
// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real;
temp[1..rows,1..cols] = 25; // set the initial temperature on the internal grid
writeln('This simulation will consider a matrix of ', rows, ' by ', cols, ' elements.');
writeln('Temperature at start is: ', temp[x, y]);
OUTPUT
This simulation will consider a matrix of 100 by 100 elements.
Temperature at start is: 25.0
Key Points
- “A range is a sequence of integers.”
- “An array holds a non-negative number of values of the same type.”
- “Chapel arrays can start at any index, not just 0 or 1.”
- “You can index arrays with the `[]` brackets.”
Content from Conditional statements
Last updated on 2024-11-18
Overview
Questions
- “How do I add conditional logic to my code?”
Objectives
- “You can use the `==`, `>`, `>=`, etc. operators to make a comparison that returns true or false.”
Chapel, like most high-level programming languages, has different statements to control the flow of the program. The conditional statements are the if statement and the while statement. These statements both rely on comparisons between values. Let's try a few comparisons to see how they work (`conditionals.chpl`):
writeln(1 == 2);
writeln(1 != 2);
writeln(1 > 2);
writeln(1 >= 2);
writeln(1 < 2);
writeln(1 <= 2);
OUTPUT
false
true
false
false
true
true
You can combine comparisons with the `&&` (AND) and `||` (OR) operators. `&&` only returns `true` if both conditions are true, while `||` returns `true` if either condition is true.
writeln(1 == 2);
writeln(1 != 2);
writeln(1 > 2);
writeln(1 >= 2);
writeln(1 < 2);
writeln(1 <= 2);
writeln(true && true);
writeln(true && false);
writeln(true || false);
OUTPUT
false
true
false
false
true
true
true
false
true
Control flow
The general syntax of a while statement is:
// single-statement form
while condition do
instruction
// multi-statement form
while condition
{
instructions
}
The code flows as follows: first, the condition is evaluated, and then, if it is satisfied, all the instructions within the curly brackets or after `do` are executed one by one. This is repeated over and over again until the condition no longer holds.
The main loop in our simulation can be programmed using a while statement like this:
//this is the main loop of the simulation
var c = 0;
delta = tolerance;
while (c < niter && delta >= tolerance)
{
c += 1;
// actual simulation calculations will go here
}
Essentially, what we want is to repeat all the code inside the curly brackets until the number of iterations is greater than or equal to `niter`, or the difference in temperature between iterations is less than `tolerance`. (Note that in our case, as `delta` was not initialised when declared, Chapel assigned it the default real value 0.0; we need to assign it a value greater than or equal to `tolerance`, otherwise the condition of the while statement will never be satisfied. A good starting point is to simply set `delta` equal to `tolerance`.)
To count iterations we just need to keep adding 1 to the counter variable `c`. We could do this with `c = c + 1`, or with the compound assignment `+=`, as in the code above. To program the rest of the logic inside the curly brackets, on the other hand, we will need more elaborate instructions.
Let's focus first on printing the temperature every `outputFrequency = 20` iterations. To achieve this, we only need to check whether `c` is a multiple of `outputFrequency`, and in that case print the temperature at the desired position. This is the type of control that an if statement gives us. The general syntax is:
// single-statement form
if condition then
instruction A
else
instruction B
// multi-statement form
if condition
{instructions A}
else
{instructions B}
The set of instructions A is executed once if the condition is satisfied; the set of instructions B is executed otherwise (the else part of the if statement is optional).
So, in our case this would do the trick:
if (c % outputFrequency == 0)
{
writeln('Temperature at iteration ', c, ': ', temp[x, y]);
}
Note that when only one instruction will be executed, there is no need to use the curly brackets. `%` is the modulo operator; it returns the remainder after division (i.e. it returns zero when `c` is a multiple of `outputFrequency`).
Let's compile and execute our code to see what we get so far:
const rows = 100;
const cols = 100;
const niter = 500;
const x = 50; // row number of the desired position
const y = 50; // column number of the desired position
const tolerance = 0.0001; // smallest difference in temperature that
// would be accepted before stopping
const outputFrequency: int = 20; // the temperature will be printed every outputFrequency iterations
var delta: real; // greatest difference in temperature from one iteration to another
var tmp: real; // for temporary results
// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real = 25;
writeln('This simulation will consider a matrix of ', rows, ' by ', cols, ' elements.');
writeln('Temperature at start is: ', temp[x, y]);
//this is the main loop of the simulation
var c = 0;
delta = tolerance;
while (c < niter && delta >= tolerance)
{
c += 1;
if (c % outputFrequency == 0)
{
writeln('Temperature at iteration ', c, ': ', temp[x, y]);
}
}
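BASH
chpl base_solution.chpl
./base_solution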
OUTPUT
This simulation will consider a matrix of 100 by 100 elements.
Temperature at start is: 25.0
Temperature at iteration 20: 25.0
Temperature at iteration 40: 25.0
Temperature at iteration 60: 25.0
Temperature at iteration 80: 25.0
Temperature at iteration 100: 25.0
Temperature at iteration 120: 25.0
Temperature at iteration 140: 25.0
Temperature at iteration 160: 25.0
Temperature at iteration 180: 25.0
Temperature at iteration 200: 25.0
Temperature at iteration 220: 25.0
Temperature at iteration 240: 25.0
Temperature at iteration 260: 25.0
Temperature at iteration 280: 25.0
Temperature at iteration 300: 25.0
Temperature at iteration 320: 25.0
Temperature at iteration 340: 25.0
Temperature at iteration 360: 25.0
Temperature at iteration 380: 25.0
Temperature at iteration 400: 25.0
Temperature at iteration 420: 25.0
Temperature at iteration 440: 25.0
Temperature at iteration 460: 25.0
Temperature at iteration 480: 25.0
Temperature at iteration 500: 25.0
Of course the temperature is always 25.0 at any iteration other than the initial one, as we haven’t done any computation yet.
Key Points
- “Use `if <condition> {instructions A} else {instructions B}` syntax to execute one set of instructions if the condition is satisfied, and the other set of instructions if the condition is not satisfied.”
- “This syntax can be simplified to `if <condition> {instructions}` if we only want to execute the instructions within the curly brackets when the condition is satisfied.”
- “Use `while <condition> {instructions}` to repeatedly execute the instructions within the curly brackets while the condition is satisfied. The instructions will be executed over and over again until the condition does not hold anymore.”
Content from Getting started with loops
Last updated on 2024-11-18
Overview
Questions
- “How do I run the same piece of code repeatedly?”
Objectives
- “Learn to use `for` loops to run over every element of an iterand.”
- “Learn the difference between using `for` loops and using a `while` statement to repeatedly execute a code block.”
To compute the new temperature, i.e. each element of `temp_new`, we need to add all the surrounding elements in `temp` and divide the result by 4. And, essentially, we need to repeat this process for all the elements of `temp_new`, or, in other words, we need to iterate over the elements of `temp_new`. When it comes to iterating over a given number of elements, the for-loop is what we want to use.
The for-loop has the following general syntax:
// single-statement version
for index in iterand do
instruction;
// multi-statement version
for index in iterand
{instructions}
The iterand is a function or statement that expresses an iteration; it could be the range `1..15`, for example. `index` is a variable that exists only in the context of the for-loop, and it will take the different values yielded by the iterand. The code flows as follows: `index` takes the first value yielded by the iterand and keeps it while all the instructions inside the curly brackets are executed one by one; then, `index` takes the second value yielded by the iterand, and the instructions are executed again. This pattern is repeated until `index` has taken all the different values expressed by the iterand.
This `for` loop, for example,
// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
// do this for every row
}
will allow us to iterate over the rows of `temp_new`. Now, for each row we also need to iterate over all the columns in order to access every single element of `temp_new`. This can be done with nested `for` loops like this:
// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
// do this for every row
for j in 1..cols
{
// and this for every column in the row i
}
}
Now, inside the inner loop, we can use the indices `i` and `j` to perform the required computations as follows:
// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
// do this for every row
for j in 1..cols
{
// and this for every column in the row i
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
}
}
temp=temp_new;
Note that at the end of the outer `for` loop, when all the elements in `temp_new` are already calculated, we update `temp` with the values of `temp_new`; this way everything is set up for the next iteration of the main `while` statement.
We're ready to execute our code, but the conditions we have initially set up will not produce interesting output, because the plate has a temperature value of `25` everywhere. We can change the boundaries to have temperature `0` so that the middle will start cooling down. To do this, we should change the declaration of `temp` to:
var temp: [0..rows+1, 0..cols+1] real = 0; // the whole plate starts at 0
temp[1..rows,1..cols] = 25; // set the non-boundary coordinates to 25
Now let’s compile and execute our code again:
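BASH
chpl base_solution.chpl
./base_solution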
OUTPUT
The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (50,50) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 25.0
Temperature at iteration 40: 25.0
Temperature at iteration 60: 25.0
Temperature at iteration 80: 25.0
Temperature at iteration 100: 25.0
Temperature at iteration 120: 25.0
Temperature at iteration 140: 25.0
Temperature at iteration 160: 25.0
Temperature at iteration 180: 25.0
Temperature at iteration 200: 25.0
Temperature at iteration 220: 24.9999
Temperature at iteration 240: 24.9996
Temperature at iteration 260: 24.9991
Temperature at iteration 280: 24.9981
Temperature at iteration 300: 24.9963
Temperature at iteration 320: 24.9935
Temperature at iteration 340: 24.9893
Temperature at iteration 360: 24.9833
Temperature at iteration 380: 24.9752
Temperature at iteration 400: 24.9644
Temperature at iteration 420: 24.9507
Temperature at iteration 440: 24.9337
Temperature at iteration 460: 24.913
Temperature at iteration 480: 24.8883
Temperature at iteration 500: 24.8595
As we can see, the temperature in the middle of the plate (position 50,50) is slowly decreasing as the plate is cooling down.
Challenge 1: Can you do it?
What would be the temperature at the top right corner of the plate? In our current setup we have a layer of ghost points around the internal grid. While the temperature on the internal grid was initially set to 25.0, the temperature at the ghost points was set to 0.0. Note that during our iterations we do not compute the temperature at the ghost points – it is permanently set to 0.0. Consequently, any point close to the ghost layer will be influenced by this zero temperature, so we expect the temperature near the border of the plate to decrease faster. Modify the code to see the temperature at the top right corner.
To see the evolution of the temperature at the top right corner of the plate, we just need to modify `x` and `y`. This corner corresponds to the first row (`x=1`) and the last column (`y=cols`) of the plate.
OUTPUT
The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the position (1,100) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 1.48171
Temperature at iteration 40: 0.767179
...
Temperature at iteration 460: 0.068973
Temperature at iteration 480: 0.0661081
Temperature at iteration 500: 0.0634717
Challenge 2: Can you do it?
Now let's have some more interesting boundary conditions. Suppose that the plate is heated by a source of 80 degrees located at the bottom right corner, and that the temperature on the rest of the border decreases linearly as one gets farther from the corner. Use for loops to set up the described boundary conditions. Compile and run your code to see how the temperature changes now.
To get the linear distribution, the 80 degrees must be divided by the number of rows or columns in our plate. So, the following couple of `for` loops at the start of the time iteration will give us what we want:
// set the boundary conditions
for i in 1..rows do
temp[i,cols+1] = i*80.0/rows; // right side
for j in 1..cols do
temp[rows+1,j] = j*80.0/cols; // bottom side
Note that 80 degrees is written as the real number 80.0. The division of two integers in Chapel returns an integer; since `rows` and `cols` are integers, we must write 80 as a real number so that the result is not truncated.
OUTPUT
The simulation will consider a matrix of 100 by 100 elements, it will run
up to 500 iterations, or until the largest difference in temperature
between iterations is less than 0.0001. You are interested in the evolution
of the temperature at the position (1,100) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152
Challenge 3: Can you do it?
Let us increase the maximum number of iterations to `niter = 10_000`. The code now does 10,000 iterations:
OUTPUT
...
Temperature at iteration 9960: 0.79214
Temperature at iteration 9980: 0.792139
Temperature at iteration 10000: 0.792139
So far, `delta` has always been equal to `tolerance`, which means that our main `while` loop will always run `niter` iterations. So let's update `delta` after each iteration. Use what we have studied so far to write the required piece of code.
The idea is simple: after each iteration of the while loop, we must compare all elements of `temp_new` and `temp`, find the greatest difference, and update `delta` with that value. The following nested for loops do the job:
// update delta, the greatest difference between temp_new and temp
delta=0;
for i in 1..rows
{
for j in 1..cols
{
tmp = abs(temp_new[i,j]-temp[i,j]);
if tmp > delta then delta = tmp;
}
}
Clearly there is no need to keep the difference at every single position in the array; we just need to update `delta` whenever we find a greater one.
OUTPUT
The simulation will consider a matrix of 100 by 100 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 7460: 0.792283
Temperature at iteration 7480: 0.792281
Temperature at iteration 7500: 0.792279
Final temperature at the desired position after 7505 iterations is: 0.792279
The difference in temperatures between the last two iterations was: 9.99834e-05
Now, after Exercise 3 we should have a working program to simulate our heat transfer equation. Let’s just print some additional useful information,
// print final information
writeln('\nFinal temperature at the desired position after ',c,' iterations is: ',temp[x,y]);
writeln('The difference in temperatures between the last two iterations was: ',delta,'\n');
and compile and execute our final code:
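BASH
chpl base_solution.chpl
./base_solution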
OUTPUT
The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
Temperature at iteration 60: 1.20229
Temperature at iteration 80: 1.09044
Temperature at iteration 100: 1.02391
Temperature at iteration 120: 0.980011
Temperature at iteration 140: 0.949004
Temperature at iteration 160: 0.926011
Temperature at iteration 180: 0.908328
Temperature at iteration 200: 0.894339
Temperature at iteration 220: 0.88302
Temperature at iteration 240: 0.873688
Temperature at iteration 260: 0.865876
Temperature at iteration 280: 0.85925
Temperature at iteration 300: 0.853567
Temperature at iteration 320: 0.848644
Temperature at iteration 340: 0.844343
Temperature at iteration 360: 0.840559
Temperature at iteration 380: 0.837205
Temperature at iteration 400: 0.834216
Temperature at iteration 420: 0.831537
Temperature at iteration 440: 0.829124
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152
Final temperature at the desired position after 500 iterations is: 0.823152
The greatest difference in temperatures between the last two iterations was: 0.0258874
Key Points
- “You can organize loops with `for` and `while` statements. Use a `for` loop to run over every element of the iterand, e.g. `for i in 1..rows { ... }` will run over all integers from 1 to `rows`. Use a `while` statement to repeatedly execute a code block until the condition does not hold anymore, e.g. `while (c < niter && delta >= tolerance) { ... }` will repeatedly execute the commands in curly braces until one of the two conditions turns false.”
Content from Procedures
Last updated on 2024-11-18
Overview
Questions
- “How do I write functions?”
Objectives
- “Be able to write our own procedures.”
Similar to other programming languages, Chapel lets you define your own functions. These are called ‘procedures’ in Chapel and have an easy-to-understand syntax:
proc addOne(n) { // n is an input parameter
return n + 1;
}
To call this procedure, you would use its name:
writeln(addOne(10));
Procedures can be recursive, as demonstrated below. In this example the procedure takes an integer number as a parameter and returns an integer number; more on this below. If the input parameter is 1 or 0, `fibonacci` will return the same input parameter. If the input parameter is 2 or larger, `fibonacci` will call itself recursively.
proc fibonacci(n: int): int { // input parameter type and procedure return type, respectively
if n <= 1 then return n;
return fibonacci(n-1) + fibonacci(n-2);
}
writeln(fibonacci(10));
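OUTPUT
55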
The input parameter type `n: int` is enforced at compilation time. For example, if you try to pass a real number to the procedure with `fibonacci(10.2)`, you will get the error "error: unresolved call". Similarly, the return type is also enforced at compilation time: replacing `return n` with `return 1.0` in line 2 will result in "error: cannot initialize return value of type 'int(64)'". While specifying these types might be optional (see the callout below), we highly recommend doing so in your code, as it will add additional checks for your program.
Callout
If not specified, the procedure return type is inferred from the type of the returned variable. This might not be possible with a recursive procedure, where the return type is needed to resolve the recursive call and is therefore not yet known to the compiler; in this case (as in the `fibonacci` example above) we need to specify the procedure return type explicitly.
Procedures can take a varying number of parameters. In this example the procedure `maxOf` takes two or more parameters of the same type. This group of parameters is referred to as a tuple and is named `x` inside the procedure. The number of elements `k` in this tuple is inferred from the number of parameters passed to the procedure and is used to organize the calculations inside the procedure:
proc maxOf(x ...?k) { // take a tuple of one type with k elements
var maximum = x[0];
for i in 1..<k do maximum = if maximum < x[i] then x[i] else maximum;
return maximum;
}
writeln(maxOf(1, -5, 123, 85, -17, 3));
writeln(maxOf(1.12, 0.85, 2.35));
OUTPUT
123
2.35
Procedures can have default parameter values. If a parameter with a default value (like `y` in the example below) is not passed to the procedure, it takes the default value inside the procedure. If it is passed with another value, then this new value is used inside the procedure.
In Chapel a procedure always returns a single value or a single data structure. In this example the procedure returns a tuple (a structure) with two numbers inside, one integer and one real:
proc returnTuple(x: int, y: real = 3.1415926): (int,real) {
return (x,y);
}
writeln(returnTuple(1));
writeln(returnTuple(x=2));
writeln(returnTuple(x=-10, y=10));
writeln(returnTuple(y=-1, x=3)); // the parameters can be named out of order
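For reference, these calls should print something like the following (the exact formatting of the real values may differ):
OUTPUT
(1, 3.14159)
(2, 3.14159)
(-10, 10.0)
(3, -1.0)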
Chapel procedures have many other useful features; however, they are not essential for learning task and data parallelism, so we refer the interested reader to the official Chapel documentation.
Key Points
- “Functions in Chapel are called procedures.”
- “Procedures can take a varying number of parameters.”
- “Optionally, you can specify input parameter types and the return variable type.”
- “Procedures can have default parameter values.”
- “Procedures can be recursive. Recursive procedures require specifying the return variable type.”
Content from Using command-line arguments
Last updated on 2024-10-08
Overview
Questions
- “How do I use the same program for multiple use-cases?”
Objectives
- “Modifying code’s constant parameters without re-compiling the code.”
From the last run of our code, we can see that 500 iterations is not enough to reach a steady state (a state where the difference in temperature between iterations does not vary much, i.e. `delta` < `tolerance`). Now, if we wanted to change the number of iterations, we would need to modify `niter` in the code and compile it again. What if we wanted to change the number of rows and columns in our grid for more precision, or to see the evolution of the temperature at a different point (x,y)? The answer would be the same: modify the code and compile it again!
Needless to say, this would be very tedious and inefficient. A better scenario would be to pass the desired configuration values to our binary when it is called at the command line. The Chapel mechanism for this is config variables. When a variable is declared with the `config` keyword, in addition to `var` or `const`, like this:
config const niter = 500; //number of iterations
it can be initialised with a specific value when executing the code at the command line, using the syntax:
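BASH
./base_solution --niter=3000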
OUTPUT
The simulation will consider a matrix of 100 by 100 elements,
it will run up to 3000 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 2980: 0.793969
Temperature at iteration 3000: 0.793947
Final temperature at the desired position after 3000 iterations is: 0.793947
The greatest difference in temperatures between the last two iterations was: 0.000350086
Challenge 4: Can you do it?
Make `outputFrequency`, `x`, `y`, `tolerance`, `rows` and `cols` configurable variables, and test the code simulating different configurations. What can you conclude about the performance of the code?
Let's prepend `config` to the following lines in our code:
config const rows = 100; // number of rows in the grid
config const cols = 100; // number of columns in the grid
config const niter = 10_000; // maximum number of iterations
config const x = 1; // row number for a printout
config const y = cols; // column number for a printout
config const tolerance: real = 0.0001; // smallest difference in temperature that would be accepted before stopping
config const outputFrequency: int = 20; // the temperature will be printed every outputFrequency iterations
We can then recompile the code and try modifying some of these parameters from the command line. For example, let’s use a 650 x 650 grid and observe the evolution of the temperature at the position (200,300) for 10,000 iterations or until the difference of temperature between iterations is less than 0.002; also, let’s print the temperature every 1000 iterations.
BASH
chpl base_solution.chpl
./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000
OUTPUT
The simulation will consider a matrix of 650 by 650 elements, it will run up to 10000 iterations, or until
the largest difference in temperature between iterations is less than 0.002. You are interested in the
evolution of the temperature at the position (200,300) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
Key Points
- “Config variables accept values from the command line at runtime, without you having to recompile the code.”
Content from Measuring code performance
Last updated on 2024-10-08
Overview
Questions
- “How do I know how fast my code is?”
Objectives
- “Measuring code performance by instrumenting the code.”
The code generated after Exercise 4 is the basic implementation of our simulation. We will use it as a benchmark, to see how much we can improve the performance when introducing the parallel programming features of the language in the following lessons.
But first, we need a quantitative way to measure the performance of our code. The easiest way to do it is to see how long the code takes to finish a simulation. The UNIX command `time` can be used for this purpose:
BASH
time ./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000
OUTPUT
The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the
position (200,300) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
real 0m20.381s
user 0m20.328s
sys 0m0.053s
The real time is what interests us. Our code is taking around 20 seconds from the moment it is called at the command line until it returns.
Sometimes, however, it can be useful to measure the execution time of specific parts of the code. This can be achieved by modifying the code to output the information that we need. This process is called instrumentation of the code.
An easy way to instrument our code in Chapel is by using the `Time` module. Modules in Chapel are libraries of useful functions and methods that can be used once the module is loaded. To load a module we use the keyword `use` followed by the name of the module. Once the `Time` module is loaded, we can create a variable of the type `stopwatch`, and use the methods `start`, `stop` and `elapsed` to instrument our code.
use Time;
var watch: stopwatch;
watch.start();
//this is the main loop of the simulation
delta=tolerance;
while (c < niter && delta >= tolerance)
{
...
}
watch.stop();
//print final information
writeln('\nThe simulation took ',watch.elapsed(),' seconds');
writeln('Final temperature at the desired position after ',c,' iterations is: ',temp[x,y]);
writeln('The greatest difference in temperatures between the last two iterations was: ',delta,'\n');
BASH
chpl base_solution.chpl
./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000
OUTPUT
The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the
position (200,300) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
The simulation took 20.1621 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
Key Points
- “To measure performance, instrument your Chapel code using a stopwatch from the `Time` module.”
Content from Intro to parallel computing
Last updated on 2024-10-08
Overview
Questions
- “How does parallel processing work?”
Objectives
- “Discuss some common concepts in parallel computing.”
The basic concept of parallel computing is simple to understand: we divide our job into tasks that can be executed at the same time, so that we finish the job in a fraction of the time that it would have taken if the tasks were executed one by one. Implementing parallel computations, however, is not always easy, nor possible…
Consider the following analogy:
Suppose that we want to paint the four walls in a room. We'll call this the problem. We can divide our problem into 4 different tasks: painting each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don't need to finish one to start another. We say that we have 4 concurrent tasks; the tasks can be executed within the same time frame. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks. If there is only one painter, they could work for a while on one wall, then start painting another one, then work for a little bit on the third one, and so on. The tasks are being executed concurrently but not in parallel. If we have two painters for the job, then more parallelism can be introduced. Four painters could execute the tasks truly in parallel.
Callout
Think of the CPU cores as the painters or workers that will execute your concurrent tasks.
Now imagine that all workers have to obtain their paint from a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser: one must wait while the other is being serviced.
Callout
Think of the shared memory in your computer as the central dispenser for all your workers.
Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete their task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, in different houses in the city, and different cities in the country. We need, however, a communication system in place. Suppose that worker A, for some reason, needs a colour that is only available in the dispenser of worker B, they must then synchronise: worker A must request the paint of worker B and worker B must respond by sending the required colour.
Callout
Think of the memory on each node of a cluster as a separate dispenser for your workers.
A fine-grained parallel code needs lots of communication or synchronisation between tasks, in contrast with a coarse-grained one. An embarrassingly parallel problem is one where all tasks can be executed completely independent from each other (no communications required).
Parallel programming in Chapel
Chapel provides high-level abstractions for parallel programming no matter the grain size of your tasks, whether they run in shared memory on one node or use memory distributed across multiple compute nodes, and whether they are executed concurrently or truly in parallel. As a programmer you can focus on the algorithm: how to divide the problem into tasks that make sense in the context of the problem, and be sure that the high-level implementation will run on any hardware configuration. Then you can consider the details of the specific system you are going to use (whether it is shared or distributed, the number of cores, etc.) and tune your code/algorithm to obtain better performance.
Callout
To this effect, concurrency (the creation and execution of multiple tasks), and locality (in which set of resources these tasks are executed) are orthogonal concepts in Chapel.
In summary, we can have a set of several tasks; these tasks could be running:
- concurrently by the same processor in a single compute node,
- in parallel by several processors in a single compute node,
- in parallel by several processors distributed in different compute nodes, or
- serially (one by one) by several processors distributed in different compute nodes.
Similarly, each of these tasks could be using variables
- located in the local memory on the compute node where it is running, or
- stored on other compute nodes.
And again, Chapel can take care of all the details required to run our algorithm in most of these scenarios, but we can always add more specific detail to gain performance when targeting a particular scenario.
Key Points
- “Concurrency and locality are orthogonal concepts in Chapel: where the tasks are running may not be indicative of when they run, and you can control both in Chapel.”
- “Problems with a lot of communication between tasks, or so called fine-grained parallel problems, are typically more difficult to parallelize. As we will see later in these lessons, Chapel simplifies writing fine-grained parallel codes by hiding a lot of communication complexity under the hood.”
Content from Fire-and-forget tasks
Last updated on 2024-11-18
Overview
Questions
- “How do we execute work in parallel?”
Objectives
- “Launching multiple threads to execute tasks in parallel.”
- “Learn how to use `begin`, `cobegin`, and `coforall` to spawn new tasks.”
Callout
In the very first chapter we showed how to run single-node Chapel codes. As a refresher, let's go over this again. If you are running Chapel on your own computer, then you are all set, and you can simply compile and run Chapel codes. If you are on a cluster, you will need to run Chapel codes inside interactive jobs. So far we are covering only single-locale Chapel, so, from the login node, you can submit an interactive job to the scheduler with a command like this one:
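BASH
salloc --time=2:0:0 --ntasks=1 --cpus-per-task=3 --mem-per-cpu=1000M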
The details may vary depending on your cluster, e.g. different scheduler, requirement to specify an account or reservation, etc, but the general idea remains the same: on a cluster you need to ask for resources before you can run calculations. In this case we are asking for 2 hours maximum runtime, single MPI task (sufficient for our parallelism in this chapter), 3 CPU cores inside that task, and 1000M maximum memory per core. The core count means that we can run 3 threads in parallel, each on its own CPU core. Once your interactive job starts, you can compile and run the Chapel codes below. Inside your Chapel code, when new threads start, they will be able to utilize our 3 allocated CPU cores.
A Chapel program always starts as a single main thread. You can then start concurrent tasks with the `begin` statement. A task spawned by the `begin` statement will run in a different thread while the main thread continues its normal execution. Consider the following example:
var x = 0;
writeln("This is the main thread starting first task");
begin
{
var c = 0;
while c < 10
{
c += 1;
writeln('thread 1: ', x+c);
}
}
writeln("This is the main thread starting second task");
begin
{
var c = 0;
while c < 10
{
c += 1;
writeln('thread 2: ', x+c);
}
}
writeln('this is main thread, I am done...');
OUTPUT
This is the main thread starting first task
This is the main thread starting second task
this is main thread, I am done...
thread 1: 1
thread 1: 2
thread 1: 3
thread 1: 4
thread 1: 5
thread 1: 6
thread 1: 7
thread 1: 8
thread 1: 9
thread 1: 10
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10
As you can see, the order of the output is not what we would have expected; in fact, it is completely unpredictable. This is a well-known effect of concurrent tasks accessing the same shared resource at the same time (in this case the screen): the system decides in which order the tasks get to write to the screen.
Challenge 1: what if c is defined globally?
What would happen if in the last code we move the definition of c into the main thread, but try to assign it from threads 1 and 2? Select one answer from these:
- The code will fail to compile.
- The code will compile and run, but c will be updated by both threads at the same time (a race condition), so that its final value will vary from one run to another.
- The code will compile and run, and the two threads will be taking turns updating c, so that its final value will always be the same.
We’ll get an error at compilation (“cannot assign to const variable”), since c would then be defined within the scope of the main thread, and we could modify its value only in the main thread. Any attempt to modify its value inside threads 1 or 2 will produce a compilation error.
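For example, a minimal sketch of the failing case (the assignment inside begin is what triggers the error):
var c = 0;   // c is declared in the main thread
begin {
  c += 1;    // compilation error: cannot assign to const variable
}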
Challenge 2: what if we have a second, local definition of x?
What would happen if we try to insert a second definition var x = 10; inside the first begin statement? Select one answer from these:
- The code will fail to compile.
- The code will compile and run, and inside the first begin statement the value x = 10 will be used, whereas inside the second begin statement the value x = 0 will be used.
- The new value x = 10 will overwrite the global value x = 0 in both threads 1 and 2.
The code will compile and run, and you will see the following output:
OUTPUT
This is the main thread starting first task
This is the main thread starting second task
this is main thread, I am done...
thread 1: 11
thread 1: 12
thread 1: 13
thread 1: 14
thread 1: 15
thread 1: 16
thread 1: 17
thread 1: 18
thread 1: 19
thread 1: 20
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10
Callout
All variables have a scope in which they can be used. The variables declared inside a concurrent task are accessible only by that task. The variables declared in the main task can be read everywhere, but Chapel won’t allow other concurrent tasks to modify them.
Try this …
Are the concurrent tasks spawned by the last code truly running in parallel?
The answer is: it depends on the number of cores available to your Chapel code. To verify this, let's modify the code to put both threads 1 and 2 into an infinite loop:
begin
{
var c=0;
while c > -1
{
c += 1;
// the rest of the code in the thread
}
}
Compile and run the code:
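For example, assuming the file is called begin_example.chpl (use your actual file name):
BASH
chpl begin_example.chpl -o begin_example
./begin_example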
If you are running this on your own computer, you can run the top, htop, or ps commands in another terminal to check Chapel's CPU usage. If you are running inside an interactive job on a cluster, you can open a different terminal, log in to the cluster, and open a bash shell on the node that is running your job (if your cluster setup allows this):
SH
squeue -u $USER # check the jobID number
srun --jobid=<jobID> --pty bash # put your jobID here
htop -u $USER -s PERCENT_CPU # display CPU usage and other information
In the output of htop you will see a table with the list of your processes, and in the "CPU%" column you will see the percentage consumed by each process. Find the Chapel process, and if it shows that your CPU usage is close to 300%, you are using 3 CPU cores. What do you see?
Now exit htop by pressing Q. Also exit your interactive run by pressing Ctrl-C.
Callout
To maximise performance, start as many tasks as there are cores available.
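You can query that number from inside your code: the built-in here.maxTaskPar (which we will meet again later) reports the maximum parallelism on the current locale:
writeln("this locale supports ", here.maxTaskPar, " parallel tasks");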
A slightly more structured way to start concurrent tasks in Chapel is by using the cobegin statement. Here you can start a block of concurrent tasks, one for each statement inside the curly brackets. The main difference between the begin and cobegin statements is that with cobegin all the spawned tasks are synchronised at the end of the statement, i.e. the main thread won't continue its execution until all tasks are done.
var x=0;
writeln("This is the main thread, my value of x is ",x);
cobegin
{
{
var x=5;
writeln("this is task 1, my value of x is ",x);
}
writeln("this is task 2, my value of x is ",x);
}
writeln("this message won't appear until all tasks are done...");
OUTPUT
This is the main thread, my value of x is 0
this is task 2, my value of x is 0
this is task 1, my value of x is 5
this message won't appear until all tasks are done...
As you may have concluded from the Discussion exercise above, the variables declared inside a task are accessible only by that task, while the variables declared in the main task are accessible to all tasks.
The last, and most useful, way to start concurrent/parallel tasks in Chapel is the coforall loop. This is a combination of the for loop and the cobegin statement. The general syntax is:
coforall index in iterand {instructions}
This will start a new task for each iteration. Each task will then perform all the instructions inside the curly brackets. Each task will also get its own copy of the variable index with the corresponding value yielded by the iterand; this index allows us to customise the set of instructions for each particular task.
var x=1;
config var numoftasks=2;
writeln("This is the main task: x = ",x);
coforall taskid in 1..numoftasks
{
var c=taskid+1;
writeln("this is task ",taskid,": x + ",taskid," = ",x+taskid,". My value of c is: ",c);
}
writeln("this message won't appear until all tasks are done...");
OUTPUT
This is the main task: x = 1
this is task 5: x + 5 = 6. My value of c is: 6
this is task 2: x + 2 = 3. My value of c is: 3
this is task 4: x + 4 = 5. My value of c is: 5
this is task 3: x + 3 = 4. My value of c is: 4
this is task 1: x + 1 = 2. My value of c is: 2
this message won't appear until all tasks are done...
Notice how we are able to customise the instructions inside the coforall to give different results depending on the task executing them. Also notice how, once again, the variables declared outside the coforall can be read by all tasks, while the variables declared inside are available only to the particular task.
Challenge 3: Can you do it?
Would it be possible to print all the messages in the right order? Modify the code in the last example as required.
Hint: you can use an array of strings declared in the main task, where all the concurrent tasks could write their messages in the corresponding position. Then, at the end, have the main task print all elements of the array in order.
The following code is a possible solution:
var x = 1;
config var numoftasks = 2;
var messages: [1..numoftasks] string;
writeln("This is the main task: x = ", x);
coforall taskid in 1..numoftasks {
var c = taskid + 1;
messages[taskid] = 'this is task ' + taskid:string +
': my value of c is ' + c:string + ' and x is ' + x:string;
}
for i in 1..numoftasks do writeln(messages[i]);
writeln("this message won't appear until all tasks are done...");
OUTPUT
This is the main task: x = 1
this is task 1: my value of c is 2 and x is 1
this is task 2: my value of c is 3 and x is 1
this is task 3: my value of c is 4 and x is 1
this is task 4: my value of c is 5 and x is 1
this is task 5: my value of c is 6 and x is 1
this message won't appear until all tasks are done...
Note that we need to convert integers to strings first (taskid:string converts the taskid integer variable to a string) before we can concatenate them with other strings to form the message stored inside each messages element.
Challenge 4: Can you do it?
Consider the following code:
use Random;
config const nelem = 100_000_000;
var x: [1..nelem] int;
fillRandom(x); //fill array with random numbers
var mymax = 0;
// here put your code to find mymax
writeln("the maximum value in x is: ", mymax);
Write a parallel code to find the maximum value in the array x.
config const numtasks = 12;
const n = nelem/numtasks;     // number of elements per task
const r = nelem - n*numtasks; // remaining elements that do not fit into the even split
var d: [1..numtasks] int;     // local maxima for each task
coforall taskid in 1..numtasks {
  var i, f: int;
  i = (taskid-1)*n + 1;              // first array index for this task
  f = (taskid-1)*n + n;              // last array index for this task
  if taskid == numtasks then f += r; // append the r leftover elements to the last task
  for j in i..f do
    if x[j] > d[taskid] then d[taskid] = x[j];
}
for i in 1..numtasks do
  if d[i] > mymax then mymax = d[i];
OUTPUT
the maximum value in x is: 9223372034161572255 # large random integer
We use the coforall loop to spawn tasks that work concurrently, each on a fraction of the array. The trick here is to determine, based on the taskid, the initial and final indices that each task will use. Each task obtains the maximum in its fraction of the array and, finally, after the coforall is done, the main task obtains the maximum of the array from the maxima of all tasks.
Try this …
Substitute the code to find mymax in the last exercise with:
mymax = max reduce x;
Time the execution of the original code and this new one. How do they compare?
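One way to time each version is Chapel's Time module; here is a minimal sketch, assuming the array x and the variable mymax from the exercise above:
use Time;
var watch: stopwatch;
watch.start();
mymax = max reduce x;   // or the coforall-based code from the exercise
watch.stop();
writeln("this took ", watch.elapsed(), " seconds");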
Callout
It is always a good idea to check whether the language offers built-in functions or methods that do what we want as efficiently as (or better than) our home-made code. In this case, the reduce statement reduces the given array to a single number using the given operation (here max), and it is parallelised and optimised to deliver very good performance.
The code in these last exercises somehow synchronises the tasks to obtain the desired result. In addition, Chapel has specific mechanisms for task synchronisation that can help us achieve fine-grained parallelisation.
Key Points
- “Use begin, cobegin, or coforall to spawn new tasks.”
- “You can run more than one task per core, as the number of cores on a node is limited.”
Content from Synchronising tasks
Last updated on 2024-11-18 | Edit this page
Overview
Questions
- “How should I access my data in parallel?”
Objectives
- “Learn how to synchronize multiple threads using one of three mechanisms: sync statements, sync variables, and atomic variables.”
- “Learn that with shared memory access from multiple threads you can run into race conditions and deadlocks, and learn how to recognize and solve these problems.”
In Chapel the keyword sync can be either a statement or a type qualifier, providing two different synchronization mechanisms for threads. Let's start with using sync as a statement.
As we saw in the previous section, the begin statement will start a concurrent (or child) task that will run in a different thread while the main (or parent) thread continues its normal execution. In this sense the begin statement is non-blocking. If you want to pause the execution of the main thread and wait until the child thread ends, you can prepend the begin statement with the sync statement. Consider the following code; running this code, after the initial output line, you will first see all output from thread 1 and only then the line "The first task is done…" and the rest of the output:
var x=0;
writeln("This is the main thread starting a synchronous task");
sync
{
begin
{
var c=0;
while c<10
{
c+=1;
writeln('thread 1: ',x+c);
}
}
}
writeln("The first task is done...");
writeln("This is the main thread starting an asynchronous task");
begin
{
var c=0;
while c<10
{
c+=1;
writeln('thread 2: ',x+c);
}
}
writeln('this is main thread, I am done...');
OUTPUT
This is the main thread starting a synchronous task
thread 1: 1
thread 1: 2
thread 1: 3
thread 1: 4
thread 1: 5
thread 1: 6
thread 1: 7
thread 1: 8
thread 1: 9
thread 1: 10
The first task is done...
This is the main thread starting an asynchronous task
this is main thread, I am done...
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10
Discussion
What would happen if we instead write:
begin
{
sync
{
var c=0;
while c<10
{
c+=1;
writeln('thread 1: ',x+c);
}
}
}
writeln("The first task is done...");
Challenge 3: Can you do it?
Use begin and sync statements to reproduce the functionality of cobegin in cobegin_example.chpl.
var x=0;
writeln("This is the main thread, my value of x is ",x);
sync
{
begin
{
var x=5;
writeln("this is task 1, my value of x is ",x);
}
begin writeln("this is task 2, my value of x is ",x);
}
writeln("this message won't appear until all tasks are done...");
A more elaborate and powerful use of sync is as a type qualifier for variables. When a variable is declared as sync, a state that can be full or empty is associated with it.
To assign a new value to a sync variable, its state must be empty (after the assignment operation is completed, the state is set to full). Conversely, to read a value from a sync variable, its state must be full (after the read operation is completed, the state is set back to empty).
Starting from Chapel 2.x, you must use the functions writeEF and readFE to perform blocking write and read on sync variables. Below is an example to demonstrate the use of sync variables. Here we launch a new task that is busy for a short time executing the loop. While this loop is running, the main task continues printing the message "this is main task after launching new task… I will wait until it is done". As it takes time to spawn a new thread, it is very likely that you will see this message before the output from the loop. Next, the main task will attempt to read x and assign it to a, which it can only do when x is full. We write into x after the loop, so you will see the final message "and now it is done" only after the message "New task finished". In other words, by reading x we pause the execution of the main thread.
var x: sync int, a: int;
writeln("this is main task launching a new task");
begin {
for i in 1..10 do writeln("this is new task working: ",i);
x.writeEF(2); // assign 2 to x
writeln("New task finished");
}
writeln("this is main task after launching new task... I will wait until it is done");
a = x.readFE(); // don't run this line until the variable x is written in the other task
writeln("and now it is done");
OUTPUT
this is main task launching a new task
this is main task after launching new task... I will wait until it is done
this is new task working: 1
this is new task working: 2
this is new task working: 3
this is new task working: 4
this is new task working: 5
this is new task working: 6
this is new task working: 7
this is new task working: 8
this is new task working: 9
this is new task working: 10
New task finished
and now it is done
Discussion
What would happen if we try to read x inside the new task as well, i.e. we have the following begin statement, without changing the rest of the code:
begin {
for i in 1..10 do writeln("this is new task working: ",i);
x.writeEF(2);
writeln("New task finished");
x.readFE();
}
The code will block (run forever), and you would need to press Ctrl-C to halt its execution. In this example we try to read x in two places: the main task and the new task. When we read a sync variable with readFE, the state of the sync variable is set to empty when this method completes. In other words, one of the two readFE calls will succeed (which one depends on the runtime) and will mark the variable as empty. The other readFE will then attempt to read it, but it will block waiting for x to become full again (which will never happen). In the end, the execution of either the main thread or the child thread will block, hanging the entire code.
There are a number of methods defined for sync variables. If x is a sync variable of a given type, you can use the following functions:
// non-blocking methods
x.reset()  // set the state to empty and the value to the default of x's type
x.isFull   // return true if the state of x is full, false if it is empty

// blocking read and write methods
x.writeEF(value) // block until the state of x is empty,
                 // then assign the value and set the state to full
x.writeFF(value) // block until the state of x is full,
                 // then assign the value and leave the state as full
x.readFE()       // block until the state of x is full,
                 // then return x's value and set the state to empty
x.readFF()       // block until the state of x is full,
                 // then return x's value and leave the state as full

// non-blocking read and write methods
x.writeXF(value) // assign the value regardless of the state of x, then set the state to full
x.readXX()       // return the value of x regardless of its state; the state remains unchanged
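A minimal sketch showing a few of these methods in action:
var flag: sync int;      // a sync variable starts in the empty state
writeln(flag.isFull);    // false
flag.writeXF(5);         // write 5 regardless of the state, and leave it full
writeln(flag.readXX());  // peek at the value (5) without changing the state
flag.reset();            // back to empty, with the value reset to the default (0)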
Chapel also implements atomic operations with variables declared as atomic, and this provides another option to synchronise tasks. Atomic operations run completely independently of any other thread or process. This means that when several tasks try to write an atomic variable, only one will succeed at a given moment, providing implicit synchronisation between them. There is a number of methods defined for atomic variables; among them sub(), add(), write(), read(), and waitFor() are very useful to establish explicit synchronisation between tasks, as shown in the next code:
var lock: atomic int;
const numtasks=5;
lock.write(0); //the main task sets lock to zero
coforall id in 1..numtasks
{
writeln("greetings from task ",id,"... I am waiting for all tasks to say hello");
lock.add(1); //task id says hello and atomically adds 1 to lock
lock.waitFor(numtasks); //then it waits for lock to be equal numtasks (which will happen when all tasks say hello)
writeln("task ",id," is done...");
}
OUTPUT
greetings from task 4... I am waiting for all tasks to say hello
greetings from task 5... I am waiting for all tasks to say hello
greetings from task 2... I am waiting for all tasks to say hello
greetings from task 3... I am waiting for all tasks to say hello
greetings from task 1... I am waiting for all tasks to say hello
task 1 is done...
task 5 is done...
task 2 is done...
task 3 is done...
task 4 is done...
Try this…
Comment out the line lock.waitFor(numtasks) in the code above to clearly observe the effect of the task synchronisation.
Finally, with all the material studied so far, we should be ready to parallelize our code for the simulation of the heat transfer equation.
Key Points
- “You can explicitly synchronise tasks with the sync statement.”
- “You can also use sync and atomic variables to synchronise tasks.”
Content from Task parallelism with Chapel
Last updated on 2024-11-18 | Edit this page
Overview
Questions
- “How do I write parallel code for a real use case?”
Objectives
- “First objective.”
Here is our plan to task-parallelize the heat transfer equation:
- divide the entire grid of points into blocks and assign blocks to individual tasks,
- each task should compute the new temperature of its assigned points,
- perform a reduction over the whole grid, to update the greatest temperature difference between temp_new and temp.
For the reduction of the grid we can simply use the max reduce statement, which is already parallelized. Now, let's divide the grid into rowtasks x coltasks sub-grids, and assign each sub-grid to a task using the coforall loop (we will have rowtasks*coltasks tasks in total).
config const rowtasks = 2;
config const coltasks = 2;
// this is the main loop of the simulation
delta = tolerance;
while (c<niter && delta>=tolerance) {
c += 1;
coforall taskid in 0..coltasks*rowtasks-1 {
for i in rowi..rowf {
for j in coli..colf {
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
}
}
}
delta = max reduce (temp_new-temp);
temp = temp_new;
if c%outputFrequency == 0 then writeln('Temperature at iteration ',c,': ',temp[x,y]);
}
Note that now the nested for loops run from rowi to rowf and from coli to colf, which are, respectively, the initial and final row and column of the sub-grid associated with the task taskid. To compute these limits based on taskid, we need to compute the number of rows and columns per task (nr and nc, respectively) and account for possible non-zero remainders (rr and rc) that we should add to the last row and column:
config const rowtasks = 2;
config const coltasks = 2;
const nr = rows/rowtasks;
const rr = rows-nr*rowtasks;
const nc = cols/coltasks;
const rc = cols-nc*coltasks;
// this is the main loop of the simulation
delta = tolerance;
while (c<niter && delta>=tolerance) {
c+=1;
coforall taskid in 0..coltasks*rowtasks-1 {
var rowi, coli, rowf, colf: int;
var taskr, taskc: int;
taskr = taskid/coltasks;
taskc = taskid%coltasks;
if taskr<rr {
rowi=(taskr*nr)+1+taskr;
rowf=(taskr*nr)+nr+taskr+1;
}
else {
rowi = (taskr*nr)+1+rr;
rowf = (taskr*nr)+nr+rr;
}
if taskc<rc {
coli = (taskc*nc)+1+taskc;
colf = (taskc*nc)+nc+taskc+1;
}
else {
coli = (taskc*nc)+1+rc;
colf = (taskc*nc)+nc+rc;
}
for i in rowi..rowf {
for j in coli..colf {
...
}
As you can see, dividing a data set (the array temp in this case) between concurrent tasks can be cumbersome. Chapel provides high-level abstractions for data parallelism that take care of all the data distribution for us. We will study data parallelism in the following lessons, but for now, let's compare the benchmark solution with our coforall parallelization to see how the performance improved.
BASH
chpl --fast parallel1.chpl
./parallel1 --rows=650 --cols=650 --x=200 --y=300 --niter=10000 --tolerance=0.002 --outputFrequency=1000
OUTPUT
The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the position (200,300) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
The simulation took 17.0193 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
This parallel solution, using 4 parallel tasks, took around 17 seconds to finish. Compared with the ~20 seconds needed by the benchmark solution, this is not very impressive. To understand why, let's analyse the code's flow. When the program starts, the main thread does all the declarations and initialisations, and then it enters the main loop of the simulation (the while loop).
Inside this loop, the parallel tasks are launched for the first time. When these tasks finish their computations, the main task resumes its execution, updates delta, and everything is repeated again. So, in essence, parallel tasks are launched and resumed 7750 times, which introduces a significant amount of overhead (the time the system needs to effectively start and destroy threads on the specific hardware, at each iteration of the while loop).
Clearly, a better approach would be to launch the parallel tasks just once, and have them execute the entire simulation, before resuming the main task to print the final results.
config const rowtasks = 2;
config const coltasks = 2;
const nr = rows/rowtasks;
const rr = rows-nr*rowtasks;
const nc = cols/coltasks;
const rc = cols-nc*coltasks;
// this is the main loop of the simulation
delta = tolerance;
coforall taskid in 0..coltasks*rowtasks-1 {
var rowi, coli, rowf, colf: int;
var taskr, taskc: int;
var c = 0;
taskr = taskid/coltasks;
taskc = taskid%coltasks;
if taskr<rr {
rowi = (taskr*nr)+1+taskr;
rowf = (taskr*nr)+nr+taskr+1;
}
else {
rowi = (taskr*nr)+1+rr;
rowf = (taskr*nr)+nr+rr;
}
if taskc<rc {
coli = (taskc*nc)+1+taskc;
colf = (taskc*nc)+nc+taskc+1;
}
else {
coli = (taskc*nc)+1+rc;
colf = (taskc*nc)+nc+rc;
}
while (c<niter && delta>=tolerance) {
c = c+1;
for i in rowi..rowf {
for j in coli..colf {
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
}
}
//update delta
//update temp
//print temperature in desired position
}
}
The problem with this approach is that now we have to explicitly synchronise the tasks. Before, delta and temp were updated only by the main task at each iteration; similarly, only the main task was printing results. Now, all these operations must be carried out inside the coforall loop, which imposes the need for synchronisation between tasks.
The synchronisation must happen at two points:
- We need to be sure that all tasks have finished with the computations of their part of the grid temp, before updating delta and temp safely.
- We need to be sure that all tasks use the updated value of delta to evaluate the condition of the while loop for the next iteration.
To update delta we could have each task compute the greatest difference in temperature in its associated sub-grid, and then, after the synchronisation, have only one task reduce all the sub-grids' maxima.
var delta: atomic real;
var myd: [0..coltasks*rowtasks-1] real;
...
//this is the main loop of the simulation
delta.write(tolerance);
coforall taskid in 0..coltasks*rowtasks-1
{
var myd2: real;
...
while (c<niter && delta.read() >= tolerance) {
c = c+1;
...
for i in rowi..rowf {
for j in coli..colf {
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
myd2 = max(abs(temp_new[i,j]-temp[i,j]),myd2);
}
}
myd[taskid] = myd2;
// here comes the synchronisation of tasks
temp[rowi..rowf,coli..colf] = temp_new[rowi..rowf,coli..colf];
if taskid==0 {
delta.write(max reduce myd);
if c%outputFrequency==0 then writeln('Temperature at iteration ',c,': ',temp[x,y]);
}
// here comes the synchronisation of tasks again
}
}
Challenge 4: Can you do it?
Use sync or atomic variables to implement the synchronisation required in the code above.
One possible solution is to use an atomic variable as a lock that opens (using the waitFor method) when all the tasks complete the required instructions:
var lock: atomic int;
lock.write(0);
...
//this is the main loop of the simulation
delta.write(tolerance);
coforall taskid in 0..coltasks*rowtasks-1
{
...
while (c<niter && delta.read() >= tolerance)
{
...
myd[taskid] = myd2;
//here comes the synchronisation of tasks
lock.add(1);
lock.waitFor(coltasks*rowtasks);
temp[rowi..rowf,coli..colf] = temp_new[rowi..rowf,coli..colf];
...
//here comes the synchronisation of tasks again
lock.sub(1);
lock.waitFor(0);
}
}
Using the solution from Exercise 4, we can now compare the performance with the benchmark solution:
BASH
chpl --fast parallel2.chpl
./parallel2 --rows=650 --cols=650 --x=200 --y=300 --niter=10000 --tolerance=0.002 --outputFrequency=1000
OUTPUT
The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the position (200,300) of the matrix...
and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
The simulation took 4.2733 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
We now have a code that performs about 5x faster than the benchmark.
We finish this section by providing another, elegant version of the 2D heat transfer solver (without time stepping) using data parallelism on a single locale:
use Math; /* for exp() */
const n = 100, stride = 20;
var temp: [0..n+1, 0..n+1] real;
var temp_new: [1..n,1..n] real;
var x, y: real;
for (i,j) in {1..n,1..n} { // serial iteration
x = ((i:real)-0.5)/n;
y = ((j:real)-0.5)/n;
temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2)/0.01); // narrow Gaussian peak
}
coforall (i,j) in {1..n,1..n} by (stride,stride) { // 5x5 decomposition into 20x20 blocks => 25 tasks
for k in i..i+stride-1 { // serial loop inside each block
for l in j..j+stride-1 {
temp_new[k,l] = (temp[k-1,l] + temp[k+1,l] + temp[k,l-1] + temp[k,l+1]) / 4;
}
}
}
We will study data parallelism in more detail in the next section.
Key Points
- “To parallelize the diffusion solver with tasks, you divide the 2D domain into blocks and assign each block to a task.”
- “To get the maximum performance, you need to launch the parallel tasks only once, and run the temporal loop of the simulation with the same set of tasks, resuming the main task only to print the final results.”
- “Parallelizing with tasks is more laborious than parallelizing with data (covered in the next section).”
Content from Running code on multiple machines
Last updated on 2024-11-18 | Edit this page
Overview
Questions
- “What is a locale?”
Objectives
- “First objective.”
So far we have been working with single-locale Chapel codes that may run on one or many cores on a single compute node, making use of the shared memory space and accelerating computations by launching concurrent tasks on individual cores in parallel. Chapel codes can also run on multiple nodes on a compute cluster. In Chapel this is referred to as multi-locale execution.
If you work inside a Chapel Docker container, e.g., chapel/chapel-gasnet, the container environment simulates a multi-locale cluster, so you would compile and launch multi-locale Chapel codes directly, specifying the number of locales with the -nl flag:
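For example, to compile and then run on four locales (a sketch using the file and binary names introduced below):
BASH
chpl --fast mycode.chpl -o mybinary
./mybinary -nl 4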
Inside the Docker container on multiple locales your code will not run any faster than on a single locale, since you are emulating a virtual cluster, and all tasks run on the same physical node. To achieve actual speedup, you need to run your parallel multi-locale Chapel code on a real physical cluster which we hope you have access to for this session.
On a real HPC cluster you would need to submit either an interactive or a batch job asking for several nodes and then run a multi-locale Chapel code inside that job. In practice, the exact commands depend on how the multi-locale Chapel was built on the cluster.
When you compile a Chapel code with the multi-locale Chapel compiler, two binaries will be produced. One is called mybinary and is a launcher binary used to submit the real executable mybinary_real. If the Chapel environment is configured properly with the launcher for the cluster's physical interconnect (which might not always be possible due to a number of factors), then you would simply compile the code and use the launcher binary mybinary to submit the job to the queue:
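A sketch of this workflow (the launcher takes care of submitting the run):
BASH
chpl --fast mycode.chpl -o mybinary
./mybinary -nl 4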
The exact parameters of the job, such as the maximum runtime and the requested memory, can be specified with Chapel environment variables. One possible drawback of this launching method is that, depending on your cluster setup, Chapel might have access to all physical cores on each node participating in the run – this will present problems if you are scheduling jobs by-core and not by-node, since part of a node might be allocated to someone else's job.
Note that on Compute Canada clusters this launching method works without problems. On these clusters multi-locale Chapel is provided by the chapel-ofi (for the OmniPath interconnect on Cedar) and chapel-ucx (for the InfiniBand interconnect on Graham, Béluga, Narval) modules, so – depending on the cluster – you will load Chapel using one of the two lines below:
BASH
module load gcc chapel-ofi # for the OmniPath interconnect on Cedar cluster
module load gcc chapel-ucx # for the InfiniBand interconnect on Graham, Béluga, Narval clusters
We can also launch multi-locale Chapel codes using the real executable mybinary_real. For example, for an interactive job you would type:
BASH
salloc --time=0:30:0 --nodes=4 --cpus-per-task=3 --mem-per-cpu=1000 --account=def-guest
chpl --fast mycode.chpl -o mybinary
srun ./mybinary_real -nl 4 # will run on four locales with max 3 cores per locale
Production jobs would be launched with the sbatch command and a Slurm launch script as usual.
For the rest of this class we assume that you have a working multi-locale Chapel environment, whether provided by a Docker container or by multi-locale Chapel on a physical HPC cluster. We will run all examples on four nodes with three cores per node.
Intro to multi-locale code
Let us test our multi-locale Chapel environment by launching the following code:
writeln(Locales);
This code will print the built-in global array Locales. Running it on four locales will produce:
OUTPUT
LOCALE0 LOCALE1 LOCALE2 LOCALE3
We want to run some code on each locale (node). For that, we can cycle through locales:
for loc in Locales do // this is still a serial program
on loc do // run the next line on locale `loc`
writeln("this locale is named ", here.name);
This will produce
OUTPUT
this locale is named cdr544
this locale is named cdr552
this locale is named cdr556
this locale is named cdr692
Here the built-in variable here refers to the locale on which the code is running, and here.name is its hostname. We started a serial for loop cycling through all locales, and on each locale we printed its name, i.e., the hostname of each node. This program ran in serial, starting a task on each locale only after completing the same task on the previous locale. Note the order in which locales were listed.
To run this code in parallel, starting four simultaneous tasks, one per locale, we simply need to replace for with forall:
forall loc in Locales do // now this is a parallel loop
on loc do
writeln("this locale is named ", here.name);
This starts four tasks in parallel, and the order in which the print statement is executed depends on the runtime conditions and can change from run to run:
OUTPUT
this locale is named cdr544
this locale is named cdr692
this locale is named cdr556
this locale is named cdr552
We can print a few other attributes of each locale. Here it is actually useful to revert to the serial loop for so that the print statements appear in order:
use MemDiagnostics;
for loc in Locales do
on loc {
writeln("locale #", here.id, "...");
writeln(" ...is named: ", here.name);
writeln(" ...has ", here.numPUs(), " processor cores");
writeln(" ...has ", here.physicalMemory(unit=MemUnits.GB, retType=real), " GB of memory");
writeln(" ...has ", here.maxTaskPar, " maximum parallelism");
}
OUTPUT
locale #0...
...is named: cdr544
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #1...
...is named: cdr552
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #2...
...is named: cdr556
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #3...
...is named: cdr692
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
Note that while Chapel correctly determines the number of cores available inside our job on each node, as well as the maximum parallelism (which is the same as the number of available cores!), it lists the total physical memory on each node, which is available to all running jobs and is not the same as the memory allocated to our job.
Key Points
- “Locale in Chapel is a shared-memory node on a cluster.”
- “We can cycle in serial or parallel through all locales.”
Content from Domains and data parallelism
Last updated on 2024-11-18 | Edit this page
Overview
Questions
- “How do I store and manipulate data across multiple locales?”
Objectives
- “First objective.”
Domains and single-locale data parallelism
We start this section by recalling the definition of a range in Chapel. A range is a 1D set of integer indices that can be bounded or infinite:
var oneToTen: range = 1..10; // 1, 2, 3, ..., 10
var a = 1234, b = 5678;
var aToB: range = a..b; // using variables
var twoToTenByTwo: range(strides=strideKind.positive) = 2..10 by 2; // 2, 4, 6, 8, 10
var oneToInf = 1.. ; // unbounded range
On the other hand, domains are multi-dimensional (including 1D) sets of integer indices that are always bounded. To stress the difference between ranges and domains, domain definitions always enclose their indices in curly brackets. Ranges can be used to define a specific dimension of a domain:
var domain1to10: domain(1) = {1..10}; // 1D domain from 1 to 10 defined using the range 1..10
var twoDimensions: domain(2) = {-2..2,0..2}; // 2D domain over a product of two ranges
var thirdDim: range = 1..16; // a range
var threeDims: domain(3) = {thirdDim, 1..10, 5..10}; // 3D domain over a product of three ranges
for idx in twoDimensions do // cycle through all points in a 2D domain
write(idx, ", ");
writeln();
for (x,y) in twoDimensions { // can also cycle using explicit tuples (x,y)
write("(", x, ", ", y, ")", ", ");
}
Let us define an n^2 domain called mesh. It is defined by the single task in our code and is therefore defined in memory on the same node (locale 0) where this task is running. For each of the n^2 mesh points, let us print out:
- m.locale.id, the ID of the locale holding that mesh point (should be 0)
- here.id, the ID of the locale on which the code is running (should be 0)
- here.maxTaskPar, the number of cores (max parallelism with 1 task/core) (should be 3)
Note: We already saw some of these variables/functions: numLocales, Locales, here.id, here.name, here.numPUs(), here.physicalMemory(), here.maxTaskPar.
config const n = 8;
const mesh: domain(2) = {1..n, 1..n}; // a 2D domain defined in shared memory on a single locale
forall m in mesh { // go in parallel through all n^2 mesh points
writeln((m, m.locale.id, here.id, here.maxTaskPar));
}
OUTPUT
((7, 1), 0, 0, 3)
((1, 1), 0, 0, 3)
((7, 2), 0, 0, 3)
((1, 2), 0, 0, 3)
...
((6, 6), 0, 0, 3)
((6, 7), 0, 0, 3)
((6, 8), 0, 0, 3)
Now we are going to learn two very important properties of Chapel domains. First, domains can be used to define arrays of variables of any type on top of them. For example, let us define an n^2 array of real numbers on top of mesh:
config const n = 8;
const mesh: domain(2) = {1..n, 1..n}; // a 2D domain defined in shared memory on a single locale
var T: [mesh] real; // a 2D array of reals defined in shared memory on a single locale (mapped onto this domain)
forall t in T { // go in parallel through all n^2 elements of T
writeln((t, t.locale.id));
}
OUTPUT
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
...
(0.0, 0)
(0.0, 0)
(0.0, 0)
By default, all n^2 array elements are set to zero, and all of them are defined on the same locale as the underlying mesh. We can also cycle through all indices of T by accessing its domain:
forall idx in T.domain {
writeln(idx, ' ', T(idx)); // idx is a tuple (i,j); also print the corresponding array element
}
OUTPUT
(7, 1) 0.0
(1, 1) 0.0
(7, 2) 0.0
(1, 2) 0.0
...
(6, 6) 0.0
(6, 7) 0.0
(6, 8) 0.0
Since we use a parallel forall loop, the print statements appear in a random runtime order.
We can also define multiple arrays on the same domain:
const grid = {1..100}; // 1D domain
const alpha = 5; // some number
var A, B, C: [grid] real; // local real-type arrays on this 1D domain
B = 2; C = 3;
forall (a,b,c) in zip(A,B,C) do // parallel loop
a = b + alpha*c; // simple example of data parallelism on a single locale
writeln(A);
The second important property of Chapel domains is that they can span multiple locales (nodes).
Distributed domains
Domains are a fundamental Chapel concept for distributed-memory data parallelism.
Let us now define an n^2 distributed (over several locales) domain distributedMesh mapped to locales in blocks. On top of this domain we define a 2D block-distributed array A of strings mapped to locales in exactly the same pattern as the underlying domain. Let us print out:
- a.locale.id, the ID of the locale holding the element a of A
- here.name, the name of the locale on which the code is running
- here.maxTaskPar, the number of cores on the locale on which the code is running
Instead of printing these values to the screen, we will store this output inside each element of A as the string a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string, adding a separator ' ' at the end of each element.
use BlockDist; // use standard block distribution module to partition the domain into blocks
config const n = 8;
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;
var A: [distributedMesh] string; // block-distributed array mapped to locales
forall a in A { // go in parallel through all n^2 elements in A
// assign each array element on the locale that stores that index/element
a = a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string + ' ';
}
writeln(A);
The syntax boundingBox=mesh tells the compiler that the outer edge of our decomposition coincides exactly with the outer edge of our domain. Alternatively, the outer decomposition layer could include an additional perimeter of ghost points if we specify
const mesh: domain(2) = {1..n, 1..n};
const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1,0..n+1};
but let us not worry about this for now.
Running our code on four locales with three cores per locale produces the following output:
OUTPUT
0-cdr544-3 0-cdr544-3 0-cdr544-3 0-cdr544-3 1-cdr552-3 1-cdr552-3 1-cdr552-3 1-cdr552-3
0-cdr544-3 0-cdr544-3 0-cdr544-3 0-cdr544-3 1-cdr552-3 1-cdr552-3 1-cdr552-3 1-cdr552-3
0-cdr544-3 0-cdr544-3 0-cdr544-3 0-cdr544-3 1-cdr552-3 1-cdr552-3 1-cdr552-3 1-cdr552-3
0-cdr544-3 0-cdr544-3 0-cdr544-3 0-cdr544-3 1-cdr552-3 1-cdr552-3 1-cdr552-3 1-cdr552-3
2-cdr556-3 2-cdr556-3 2-cdr556-3 2-cdr556-3 3-cdr692-3 3-cdr692-3 3-cdr692-3 3-cdr692-3
2-cdr556-3 2-cdr556-3 2-cdr556-3 2-cdr556-3 3-cdr692-3 3-cdr692-3 3-cdr692-3 3-cdr692-3
2-cdr556-3 2-cdr556-3 2-cdr556-3 2-cdr556-3 3-cdr692-3 3-cdr692-3 3-cdr692-3 3-cdr692-3
2-cdr556-3 2-cdr556-3 2-cdr556-3 2-cdr556-3 3-cdr692-3 3-cdr692-3 3-cdr692-3 3-cdr692-3
As we see, the domain distributedMesh (along with the string array A on top of it) was decomposed into 2x2 blocks stored on the four nodes, respectively. Equally important, for each element a of the array, the line of code filling in that element ran on the same locale where that element was stored. In other words, this code ran in parallel (forall loop) on four nodes, using up to three cores on each node to fill in the corresponding array elements. Once the parallel loop is finished, the writeln command runs on locale 0, gathering remote elements from other locales and printing them to standard output.
Now we can print the range of indices for each sub-domain by adding the following to our code:
for loc in Locales {
on loc {
writeln(A.localSubdomain());
}
}
On 4 locales we should get:
OUTPUT
{1..4, 1..4}
{1..4, 5..8}
{5..8, 1..4}
{5..8, 5..8}
Let us count the number of threads by adding the following to our code:
var counter = 0;
forall a in A with (+ reduce counter) { // go in parallel through all n^2 elements
counter = 1;
}
writeln("actual number of threads = ", counter);
If n=8 in our code is sufficiently large, there are enough array elements per node (8*8/4 = 16 in our case) to fully utilise all three available cores on each node, so our output should be
OUTPUT
actual number of threads = 12
Try reducing the array size n to see if that changes the output (fewer tasks per locale), e.g., setting n=3. Also try increasing the array size to n=20 and study the output. Does the output make sense?
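Since n is a config constant, you can change it at run time without recompiling; a sketch (with a hypothetical binary name):
BASH
./mybinary -nl 4 --n=3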
So far we have looked at the block distribution BlockDist. It will distribute a 2D domain among nodes using either 1D or 2D decomposition (in our example it was a 2D decomposition, 2x2), depending on the domain size and the number of nodes.
Let us take a look at another standard module for domain partitioning onto locales, called CyclicDist. For each element of the array we will print out again
- a.locale.id, the ID of the locale holding the element a of A
- here.name, the name of the locale on which the code is running
- here.maxTaskPar, the number of cores on the locale on which the code is running
use CyclicDist; // elements are sent to locales in a round-robin pattern
config const n = 8;
const mesh: domain(2) = {1..n, 1..n}; // a 2D domain defined in shared memory on a single locale
const m2: domain(2) dmapped new cyclicDist(startIdx=mesh.low) = mesh; // mesh.low is the first index (1,1)
var A2: [m2] string;
forall a in A2 {
a = a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string + ' ';
}
writeln(A2);
OUTPUT
0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3
2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3
0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3
2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3
0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3
2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3
0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3 0-cdr544-3 1-cdr552-3
2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3 2-cdr556-3 3-cdr692-3
As the name CyclicDist suggests, the domain was mapped to locales in a cyclic, round-robin pattern. We can also print the range of indices for each sub-domain by adding the following to our code:
for loc in Locales {
on loc {
writeln(A2.localSubdomain());
}
}
OUTPUT
{1..7 by 2, 1..7 by 2}
{1..7 by 2, 2..8 by 2}
{2..8 by 2, 1..7 by 2}
{2..8 by 2, 2..8 by 2}
In addition to BlockDist and CyclicDist, Chapel has several other predefined distributions: BlockCycDist, ReplicatedDist, DimensionalDist2D, ReplicatedDim, BlockCycDim — for details please see https://chapel-lang.org/docs/primers/distributions.html.
Diffusion solver on distributed domains
Now let us use distributed domains to write a parallel version of our original diffusion solver code:
use BlockDist;
use Math;
config const n = 8;
const mesh: domain(2) = {1..n, 1..n}; // local 2D n^2 domain
We will add a larger (n+2)^2 block-distributed domain largerMesh with a layer of ghost points on perimeter locales, and define a temperature array temp on top of it, by adding the following to our code:
const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1, 0..n+1};
var temp: [largerMesh] real; // a block-distributed array of temperatures
forall (i,j) in temp.domain[1..n,1..n] {
var x = ((i:real)-0.5)/(n:real); // x, y are local to each task
var y = ((j:real)-0.5)/(n:real);
temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2) / 0.01); // narrow Gaussian peak
}
writeln(temp);
Here we set up an initial Gaussian temperature peak in the middle of the mesh. As we evolve our solution in time, this peak should diffuse slowly over the rest of the domain.
Question
Why do we have forall (i,j) in temp.domain[1..n,1..n] and not forall (i,j) in mesh?
Answer
The first one will run on multiple locales in parallel, whereas the second will run in parallel via multiple threads on locale 0 only, since “mesh” is defined on locale 0.
The code above will print the initial temperature distribution:
OUTPUT
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 2.36954e-17 2.79367e-13 1.44716e-10 3.29371e-09 3.29371e-09 1.44716e-10 2.79367e-13 2.36954e-17 0.0
0.0 2.79367e-13 3.29371e-09 1.70619e-06 3.88326e-05 3.88326e-05 1.70619e-06 3.29371e-09 2.79367e-13 0.0
0.0 1.44716e-10 1.70619e-06 0.000883826 0.0201158 0.0201158 0.000883826 1.70619e-06 1.44716e-10 0.0
0.0 3.29371e-09 3.88326e-05 0.0201158 0.457833 0.457833 0.0201158 3.88326e-05 3.29371e-09 0.0
0.0 3.29371e-09 3.88326e-05 0.0201158 0.457833 0.457833 0.0201158 3.88326e-05 3.29371e-09 0.0
0.0 1.44716e-10 1.70619e-06 0.000883826 0.0201158 0.0201158 0.000883826 1.70619e-06 1.44716e-10 0.0
0.0 2.79367e-13 3.29371e-09 1.70619e-06 3.88326e-05 3.88326e-05 1.70619e-06 3.29371e-09 2.79367e-13 0.0
0.0 2.36954e-17 2.79367e-13 1.44716e-10 3.29371e-09 3.29371e-09 1.44716e-10 2.79367e-13 2.36954e-17 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Let us define an array of strings nodeID with the same distribution over locales as temp, by adding the following to our code:
var nodeID: [largerMesh] string;
forall m in nodeID do
m = here.id:string;
writeln(nodeID);
The outer perimeter in the partition below corresponds to the ghost points:
OUTPUT
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
Challenge 3: Can you do it?
In addition to here.id, also print the ID of the locale holding that value. Is it the same or different from here.id?
Something along these lines:
m = here.id:string + '-' + m.locale.id:string
Now we implement the parallel solver, by adding the following to our code (contains a mistake on purpose!):
var temp_new: [largerMesh] real;
for step in 1..5 { // time-stepping
forall (i,j) in mesh do
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
temp[mesh] = temp_new[mesh]; // uses parallel forall underneath
}
Challenge 4: Can you do it?
Can anyone spot a mistake in the last code?
It should be forall (i,j) in temp_new.domain[1..n,1..n] do instead of forall (i,j) in mesh do, as the latter will likely run in parallel via threads only on locale 0, whereas the former will run on multiple locales in parallel.
Here is the final version of the entire code:
use BlockDist;
use Math;
config const n = 8;
const mesh: domain(2) = {1..n,1..n};
const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1,0..n+1};
var temp, temp_new: [largerMesh] real;
forall (i,j) in temp.domain[1..n,1..n] {
var x = ((i:real)-0.5)/(n:real);
var y = ((j:real)-0.5)/(n:real);
temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2) / 0.01);
}
for step in 1..5 {
forall (i,j) in temp_new.domain[1..n,1..n] {
temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4.0;
}
temp = temp_new;
writeln((step, " ", temp[n/2,n/2], " ", temp[1,1]));
}
This is the entire parallel solver! Note that we implemented an open boundary: temp on the ghost points is always 0. Let us add some printout and also compute the total energy on the mesh, by adding the following to our code:
writeln((step, " ", temp[n/2,n/2], " ", temp[2,2]));
var total: real = 0;
forall (i,j) in mesh with (+ reduce total) do
total += temp[i,j];
writeln("total = ", total);
Notice how the total energy decreases in time with the open boundary conditions, as the energy is leaving the system.
Challenge 5: Can you do it?
Write a code to print how the finite-difference stencil [i,j], [i-1,j], [i+1,j], [i,j-1], [i,j+1] is distributed among nodes, and compare that to the ID of the node where temp[i,j] is computed.
Here is one possible solution examining the locality of the finite-difference stencil:
var nodeID: [largerMesh] string = 'empty';
forall (i,j) in nodeID.domain[1..n,1..n] do
nodeID[i,j] = here.id:string + nodeID[i,j].locale.id:string + nodeID[i-1,j].locale.id:string +
nodeID[i+1,j].locale.id:string + nodeID[i,j-1].locale.id:string + nodeID[i,j+1].locale.id:string + ' ';
writeln(nodeID);
This produced the following output clearly showing the ghost points and the stencil distribution for each mesh point:
OUTPUT
empty empty empty empty empty empty empty empty empty empty
empty 000000 000000 000000 000001 111101 111111 111111 111111 empty
empty 000000 000000 000000 000001 111101 111111 111111 111111 empty
empty 000000 000000 000000 000001 111101 111111 111111 111111 empty
empty 000200 000200 000200 000201 111301 111311 111311 111311 empty
empty 220222 220222 220222 220223 331323 331333 331333 331333 empty
empty 222222 222222 222222 222223 333323 333333 333333 333333 empty
empty 222222 222222 222222 222223 333323 333333 333333 333333 empty
empty 222222 222222 222222 222223 333323 333333 333333 333333 empty
empty empty empty empty empty empty empty empty empty empty
Note that temp[i,j] is always computed on the same node where that element is stored, which makes sense.
Periodic boundary conditions
Now let us modify the previous parallel solver to include periodic BCs. At the beginning of each time step we need to set elements on the ghost points to their respective values on the opposite ends, by adding the following to our code:
temp[0,1..n] = temp[n,1..n]; // periodic boundaries on all four sides; these will run via parallel forall
temp[n+1,1..n] = temp[1,1..n];
temp[1..n,0] = temp[1..n,n];
temp[1..n,n+1] = temp[1..n,1];
Now total energy should be conserved, as nothing leaves the domain.
I/O
Let us write the final solution to disk. There are several caveats:
- works only with ASCII
- Chapel can also write binary data, but nothing else can read it back (we checked: it is not an endianness problem!)
- would love to write NetCDF and HDF5, probably can do this by calling C/C++ functions from Chapel
We’ll add the following to our code to write ASCII:
use IO;
var myFile = open("output.dat", ioMode.cw); // open the file for writing
var myWritingChannel = myFile.writer(); // create a writing channel starting at file offset 0
myWritingChannel.write(temp); // write the array
myWritingChannel.close(); // close the channel
Run the code and check the file output.dat: it should contain the array temp after 5 steps in ASCII.
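As a check, here is a minimal sketch of reading the data back, assuming the same largerMesh domain is still in scope:
use IO;
var inFile = open("output.dat", ioMode.r);  // open the file for reading
var myReadingChannel = inFile.reader();     // create a reading channel
var temp2: [largerMesh] real;
myReadingChannel.read(temp2);               // read the ASCII values back into an array
myReadingChannel.close();                   // close the channel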
Key Points
- “Domains are multi-dimensional sets of integer indices.”
- “A domain can be defined on a single locale or distributed across many locales.”
- “There are many predefined distribution methods: block, cyclic, etc.”
- “Arrays are defined on top of domains and inherit their distribution model.”