Content from Introduction to Chapel


Last updated on 2024-11-18

Estimated time: 30 minutes

Overview

Questions

  • “What is Chapel and why is it useful?”

Objectives

  • “Write and execute our first Chapel program.”

Chapel is a modern, open-source programming language that supports HPC via high-level abstractions for data parallelism and task parallelism. These abstractions allow users to express parallel code in a natural, almost intuitive, manner. In contrast with other high-level parallel languages, however, Chapel was designed around a multi-resolution philosophy. This means that users can incrementally add more detail to their original code prototype, optimising it for a particular computer as closely as required.

In a nutshell, with Chapel we can write parallel code with the simplicity and readability of scripting languages such as Python or MATLAB, while achieving performance comparable to compiled languages like C or Fortran combined with traditional parallel libraries such as MPI or OpenMP.

In this lesson we will learn the basic elements and syntax of the language; then we will study task parallelism, the first level of parallelism in Chapel; and finally we will use parallel data structures and data parallelism, the highest level of abstraction in parallel programming offered by Chapel.

Getting started


Chapel is a compiled language, which means that we must compile our source code to generate a binary or executable that we can then run on the computer.

Chapel source code must be written in text files with the extension .chpl. Let’s write a simple “hello world”-type program to demonstrate how we write Chapel code! Using your favourite text editor, create the file hello.chpl with the following content:

writeln('If we can see this, everything works!');

This program can then be compiled with the following bash command:

BASH

chpl --fast hello.chpl

The --fast flag tells the compiler to optimise the binary to run as fast as possible on the given architecture. By default, the compiler will produce a program with the same name as the source file. In our case, the program will be called hello. The -o option can be used to change the name of the generated binary.
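For example, we could give the binary a different name like this (greetings is just a name chosen for this illustration):

BASH

chpl --fast hello.chpl -o greetings
./greetings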

To run the code, you execute it as you would any other program:

BASH

./hello

OUTPUT

If we can see this, everything works!

Running on a cluster


Depending on the code, it might utilise several or even all cores on the current node. The command above implies that you are allowed to utilise all cores. This might not be the case on an HPC cluster, where a login node is shared by many people at the same time, and where it might not be a good idea to occupy all cores on a login node with CPU-intensive tasks. Instead, you will need to submit your Chapel run as a job to the scheduler asking for a specific number of CPU cores.

Use module avail chapel to list Chapel modules on your HPC cluster, and select the best fit, e.g. the single-locale Chapel module:

BASH

module load chapel-multicore

Then, to run a test code on a cluster, you would submit an interactive job to the queue

BASH

salloc --time=0:30:0 --ntasks=1 --cpus-per-task=3 --mem-per-cpu=1000 --account=def-guest

and then inside that job compile and run the test code

BASH

chpl --fast hello.chpl
./hello

For production jobs, you would compile the code and then submit a batch script to the queue:

BASH

chpl --fast hello.chpl
sbatch script.sh

where the script script.sh would set all Slurm variables and call the executable (hello in our example).
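A minimal script.sh might look something like the sketch below; the resource requests and the account name are placeholders that you would adapt to your cluster, and here we call our hello binary:

BASH

#!/bin/bash
#SBATCH --time=0:30:0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=1000
#SBATCH --account=def-guest
./hello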

Case study


Throughout the Chapel lessons we will be using the following case study as the leading thread of the discussion. Essentially, we will be building, step by step, a Chapel code to solve the heat transfer problem described below. Then we will parallelize the code to improve its performance.

Suppose that we have a square metallic plate with some initial heat distribution or initial conditions. We want to simulate the evolution of the temperature across the plate when its border is in contact with a different heat distribution that we call the boundary conditions.

The Laplace equation is the mathematical model for the evolution of the temperature in the plate. To solve this equation numerically, we need to discretise it, i.e. to consider the plate as a grid, or matrix of points, and to evaluate the temperature on each point at each iteration, according to the following difference equation:

temp_new[i,j] = 0.25 * (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1])

Here temp_new stands for the new temperature at the current iteration, while temp contains the temperature calculated at the past iteration (or the initial conditions in case we are at the first iteration). The indices i and j indicate that we are working on the point of the grid located at the ith row and the jth column.

So, our objective is to:

Goals

  1. Write a code to implement the difference equation above. The code should have the following requirements:

    • It should work for any given number of rows and columns in the grid.
    • It should run for a given number of iterations, or until the difference between temp_new and temp is smaller than a given tolerance value.
    • It should output the temperature at a desired position on the grid every given number of iterations.
  2. Use task parallelism to improve the performance of the code and run it on the cluster.

  3. Use data parallelism to improve the performance of the code and run it on the cluster.

Key Points

  • “Chapel is a compiled language - any programs we make must be compiled with chpl.”
  • “The --fast flag instructs the Chapel compiler to optimise our code.”
  • “The -o flag tells the compiler what to name our output (otherwise it gets named after the source file)”

Content from Basic syntax and variables


Last updated on 2024-11-18

Estimated time: 30 minutes

Overview

Questions

  • “How do I write basic Chapel code?”

Objectives

  • “Perform basic maths in Chapel.”
  • “Understand Chapel’s basic data types.”
  • “Understand how to read and fix errors.”
  • “Know how to define and use data stored as variables.”

Using basic maths in Chapel is fairly intuitive. Try compiling the following code to see how the different mathematical operators work.

writeln(4 + 5);
writeln(4 - 5);
writeln(4 * 5);
writeln(4 / 5);   // integer division
writeln(4.0 / 5.0); // floating-point division
writeln(4 ** 5);  // exponentiation

In this example, our code is called operators.chpl. You can compile and run it with the following commands:

BASH

chpl operators.chpl --fast
./operators

You should see output that looks something like the following:

OUTPUT

9
-1
20
0
0.8
1024

Code beginning with // is interpreted as a comment — it does not get run. Comments are very valuable when writing code, because they allow us to write notes to ourselves about what each piece of code does. You can also create block comments with /* and */:

/* This is a block comment.
It can span as many lines as you want!
(like this) */

Variables


Granted, we probably want to do more than basic maths with Chapel. We will need to store the results of complex operations using variables. Variables in programming are not the same as the mathematical concept. In programming, a variable represents (or references) a location in the memory of the computer where we can store information or data while executing a program. A variable has three elements:

  1. a name or label, to identify the variable
  2. a type, that indicates the kind of data that we can store in it, and
  3. a value, the actual information or data stored in the variable.

Variables in Chapel are declared with the var or const keywords. When a variable declared as const is initialised, its value cannot be modified anymore during the execution of the program. What happens if we try to modify a constant variable like test below?

const test = 100;
test = 200;
writeln('The value of test is: ', test);
writeln(test / 4);

BASH

chpl variables.chpl

ERROR

variables.chpl:2: error: cannot assign to const variable

The compiler threw an error, and did not compile our program. This is a feature of compiled languages - if there is something wrong, we will typically see an error at compile-time, instead of while running it. Although we already kind of know why the error was caused (we tried to reassign the value of a const variable, which by definition cannot be changed), let’s walk through the error as an example of how to troubleshoot our programs.

  • variables.chpl:2: indicates that the error was caused on line 2 of our variables.chpl file.

  • error: indicates that the issue was an error, and blocks compilation. Sometimes the compiler will just give us warning or information, not necessarily errors. When we see something that is not an error, we should carefully read the output and consider if it necessitates changing our code. Errors must be fixed, as they will block the code from compiling.

  • cannot assign to const variable indicates that we were trying to reassign a const variable, which is explicitly not allowed in Chapel.

To fix this error, we can change const to var when declaring our test variable. var indicates a variable that can be reassigned.

var test = 100;
test = 200;
writeln('The value of test is: ', test);
writeln(test / 4);

BASH

chpl variables.chpl

OUTPUT

The value of test is: 200
50

In Chapel, to declare a variable we must specify its type, or initialise it in place with some value (from which its type will be inferred). The common variable types in Chapel are:

  • integer int (positive or negative whole numbers)
  • floating-point number real (decimal values)
  • Boolean bool (true or false)
  • string string (any type of text)

The two variables below are declared with a type but no initial value. If no initial value is given, Chapel will initialise the variable with a default value depending on the declared type, for example 0 for integers and 0.0 for real variables.

var counter: int;
var delta: real;
writeln("counter is ", counter, " and delta is ", delta);

BASH

chpl variables.chpl
./variables

OUTPUT

counter is 0 and delta is 0.0

If a variable is initialised with a value but without a type, Chapel will infer its type from the given initial value:

const test = 100;
writeln('The value of test is ', test, ' and its type is ', test.type:string);

BASH

chpl variables.chpl
./variables

OUTPUT

The value of test is 100 and its type is int(64)

When initialising a variable, we can also specify its type in addition to its value:

const tolerance: real = 0.0001;
const outputFrequency: int = 20;

Callout

Note that these two notations below are different, but produce the same result in the end:

var a: real = 10.0;   // we specify both the type and the value
var a = 10: real;     // we specify only the value (10 converted to real)

Callout

In the following code (saved as variables.chpl) we have not initialised the variable test before trying to use it in line 2:

const test;  // declare 'test' variable
writeln('The value of test is: ', test);

ERROR

variables.chpl:1: error: 'test' is not initialized and has no type
variables.chpl:1: note: cannot find initialization point to split-init this variable
variables.chpl:2: note: 'test' is used here before it is initialized

Now we know how to set, use, and change a variable, as well as the implications of using var and const. We also know how to read and interpret errors.

Let's practice defining variables and use this as the starting point of our simulation code. The code will be stored in the file base_solution.chpl. We will be solving the heat transfer problem introduced in the previous section, starting with some initial temperature and computing a new temperature at each iteration. We will then compute the greatest difference between the old and the new temperature and check whether it is smaller than a preset tolerance. If not, we will continue iterating. If it is, we will stop iterating and print the final temperature. We will also stop if we reach the maximum number of iterations niter.

Our grid will be of size rows by cols, and every outputFrequency iterations we will print the temperature at the coordinates (x, y).

The variable delta will store the greatest difference in temperature from one iteration to another. The variable tmp will store some temporary results when computing the temperatures.

Let’s define our variables:

const rows = 100;               // number of rows in the grid
const cols = 100;               // number of columns in the grid
const niter = 500;              // maximum number of iterations
const x = 50;                   // row number for a printout
const y = 50;                   // column number for a printout
var delta: real;                // greatest difference in temperature from one iteration to another
var tmp: real;                  // for temporary results
const tolerance: real = 0.0001; // smallest difference in temperature that would be accepted before stopping
const outputFrequency: int = 20;   // the temperature will be printed every outputFrequency iterations

Key Points

  • “A comment is preceded with // or surrounded by /* and */.”
  • “All variables in Chapel have a type, whether assigned explicitly by the user, or chosen by the Chapel compiler based on its value.”
  • “Reassigning a new value to a const variable will produce an error during compilation. If you want to assign a new value to a variable, declare that variable with the var keyword.”

Content from Ranges and arrays


Last updated on 2024-11-18

Estimated time: 90 minutes

Overview

Questions

  • “How do I store and manipulate multiple values in Chapel?”

Objectives

  • “Learn to define and use ranges and arrays.”

Ranges and Arrays


A series of integers (1, 2, 3, 4, 5, for example) is called a range. Ranges are generated with the .. operator. Let's examine what a range looks like; we store the following code as ranges.chpl. Here we introduce a very simple for loop, cycling through all elements of the range and printing their values (we will study for loops in a separate section):

var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
for x in example_range do writeln(x);

BASH

chpl ranges.chpl
./ranges

OUTPUT

Our example range was set to: 0..10
0
1
...
9
10

Among other uses, ranges can be used to declare arrays of variables. An array is a collection of values of the same type, and it can have one or more dimensions. Arrays can be of any size. Let's define a 1-dimensional array over the range example_range and see what it looks like. Notice how the size of an array is included with its type.

var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);

We can reassign the values in our example array the same way we would reassign a variable. An array can either be set all to a single value, or to a sequence of values.

var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);

BASH

chpl ranges.chpl
./ranges

OUTPUT

Our example range was set to: 0..10
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0

Notice how ranges are inclusive on the right: the last number of a range is included in the range. This is different from languages like Python, where the upper bound is excluded.

Indexing elements


We can retrieve and reset specific values of an array using [] notation. Note that we use the same square bracket notation in two different contexts: (1) to declare an array, with the square brackets containing the array’s full index range [example_range], and (2) to access specific array elements, as we will see below. Let’s try retrieving and setting a specific value in our example so far:

var example_range = 0..10;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);
// retrieve the 5th index
writeln(example_array[5]);
// set index 5 to a new value
example_array[5] = 99999;
writeln(example_array);

BASH

chpl ranges.chpl
./ranges

OUTPUT

Our example range was set to: 0..10
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
6.0
1.0 2.0 3.0 4.0 5.0 99999.0 7.0 8.0 9.0 10.0 11.0

One very important thing to note - in this case, index 5 was actually the 6th element. This was caused by how we set up our array. When we defined our array using a range starting at 0, element 5 corresponds to the 6th element. Unlike most other programming languages, arrays in Chapel do not start at a fixed value - they can start at any number depending on how we define them! For instance, let’s redefine example_range to start at 5:

var example_range = 5..15;
writeln('Our example range was set to: ', example_range);
var example_array: [example_range] real;
writeln('Our example array is now: ', example_array);
example_array = 5;
writeln('When set to 5: ', example_array);
example_array = 1..11;
writeln('When set to a range: ', example_array);
// retrieve the 5th index
writeln(example_array[5]);
// set index 5 to a new value
example_array[5] = 99999;
writeln(example_array);

BASH

chpl ranges.chpl
./ranges

OUTPUT

Our example range was set to: 5..15
Our example array is now: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
When set to 5: 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
When set to a range: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
1.0
99999.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0

Back to our simulation


Let’s define a two-dimensional array for use in our simulation and set its initial values:

// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real;
temp[1..rows,1..cols] = 25;     // set the initial temperature on the internal grid

This is a matrix (2D array) with (rows + 2) rows and (cols + 2) columns of real numbers. The ranges 0..rows+1 and 0..cols+1 used here not only define the size and shape of the array, they also serve as the indices with which we can access particular elements of the array using the [ , ] notation. For example, temp[0,0] is the real variable located at the first row and first column of the array temp, while temp[3,7] is the one at the 4th row and 8th column; temp[2,3..15] accesses the 4th to 16th columns of the 3rd row of temp, and temp[0..3,4] corresponds to the first 4 rows of the 5th column of temp.
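Here is a small, self-contained sketch of this slicing notation; the array small and its values are made up purely for illustration:

var small: [1..3, 1..4] real;
for i in 1..3 do
  for j in 1..4 do
    small[i,j] = i*10 + j;    // fill with easily recognisable values
writeln(small[2, 2..4]);      // part of the 2nd row: 22.0 23.0 24.0
writeln(small[1..2, 3]);      // part of the 3rd column: 13.0 23.0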

We divide our “plate” into two parts: (1) the internal grid 1..rows,1..cols on which we set the initial temperature at 25.0, and (2) the surrounding layer of ghost points with row indices equal to 0 or rows+1 and column indices equal to 0 or cols+1. The temperature in the ghost layer is equal to 0.0 by default, as we do not assign a value there.

We should now be ready to start coding our simulation. Let's print some information about the initial configuration, compile the code, and execute it to see if everything is working as expected.

const rows = 100;
const cols = 100;
const niter = 500;
const x = 50;                   // row number of the desired position
const y = 50;                   // column number of the desired position
const tolerance = 0.0001;       // smallest difference in temperature that would be accepted before stopping
const outputFrequency: int = 20;   // the temperature will be printed every outputFrequency iterations

// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real;
temp[1..rows,1..cols] = 25;     // set the initial temperature on the internal grid

writeln('This simulation will consider a matrix of ', rows, ' by ', cols, ' elements.');
writeln('Temperature at start is: ', temp[x, y]);

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

This simulation will consider a matrix of 100 by 100 elements.
Temperature at start is: 25.0

Key Points

  • “A range is a sequence of integers.”
  • “An array holds zero or more values of the same type.”
  • “Chapel arrays can start at any index, not just 0 or 1.”
  • “You can index arrays with the [] brackets.”

Content from Conditional statements


Last updated on 2024-11-18

Estimated time: 90 minutes

Overview

Questions

  • “How do I add conditional logic to my code?”

Objectives

  • “You can use the ==, >, >=, etc. operators to make a comparison that returns true or false.”

Chapel, like most high-level programming languages, has different statements to control the flow of the program. The conditional statements we will use are the if statement and the while statement. Both rely on comparisons between values. Let's try a few comparisons to see how they work (conditionals.chpl):

writeln(1 == 2);
writeln(1 != 2);
writeln(1 > 2);
writeln(1 >= 2);
writeln(1 < 2);
writeln(1 <= 2);

BASH

chpl conditionals.chpl
./conditionals

OUTPUT

false
true
false
false
true
true

You can combine comparisons with the && (AND) and || (OR) operators. && only returns true if both conditions are true, while || returns true if either condition is true.

writeln(1 == 2);
writeln(1 != 2);
writeln(1 > 2);
writeln(1 >= 2);
writeln(1 < 2);
writeln(1 <= 2);
writeln(true && true);
writeln(true && false);
writeln(true || false);

BASH

chpl conditionals.chpl
./conditionals

OUTPUT

false
true
false
false
true
true
true
false
true

Control flow


The general syntax of a while statement is:

// single-statement form
while condition do
  instruction

// multi-statement form
while condition
{
  instructions
}

The code flows as follows: first, the condition is evaluated, and then, if it is satisfied, all the instructions within the curly brackets (or following do) are executed one by one. This is repeated over and over again until the condition does not hold anymore.

The main loop in our simulation can be programmed using a while statement like this

//this is the main loop of the simulation
var c = 0;
delta = tolerance;
while (c < niter && delta >= tolerance)
{
  c += 1;
  // actual simulation calculations will go here
}

Essentially, what we want is to repeat all the code inside the curly brackets until the number of iterations reaches niter, or until the difference in temperature between iterations is less than tolerance. (Note that in our case, as delta was not initialised when declared – and thus Chapel assigned it the default real value 0.0 – we need to assign it a value greater than or equal to tolerance, otherwise the condition of the while statement will never be satisfied. A good starting point is simply to set delta equal to tolerance.)

To count iterations we just need to keep adding 1 to the counter variable c. We could do this with c = c + 1, or with the compound assignment += as in the code above. To program the rest of the logic inside the curly brackets, on the other hand, we will need more elaborate instructions.

Let's focus, first, on printing the temperature every outputFrequency = 20 iterations. To achieve this, we only need to check whether c is a multiple of outputFrequency, and in that case print the temperature at the desired position. This is the type of control that an if statement gives us. The general syntax is:

// single-statement form
if condition then
  instruction A
else
  instruction B

// multi-statement form
if condition
{instructions A}
else
{instructions B}

The set of instructions A is executed once if the condition is satisfied; the set of instructions B is executed otherwise (the else part of the if statement is optional).

So, in our case this would do the trick:

if (c % outputFrequency == 0)
{
  writeln('Temperature at iteration ', c, ': ', temp[x, y]);
}

Note that when only one instruction will be executed, there is no need to use curly brackets. % is the modulo operator; it returns the remainder after division (i.e. it returns zero when c is a multiple of outputFrequency).
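As a quick illustration of the modulo operator (the numbers are arbitrary):

writeln(40 % 20);   // 0, so iteration 40 triggers a printout
writeln(37 % 20);   // 17, so iteration 37 does not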

Let's compile and execute our code to see what we get so far:

const rows = 100;
const cols = 100;
const niter = 500;
const x = 50;                   // row number of the desired position
const y = 50;                   // column number of the desired position
const tolerance = 0.0001;       // smallest difference in temperature that
                                // would be accepted before stopping
const outputFrequency: int = 20;   // the temperature will be printed every outputFrequency iterations
var delta: real;                // greatest difference in temperature from one iteration to another
var tmp: real;                  // for temporary results

// this is our "plate"
var temp: [0..rows+1, 0..cols+1] real = 25;

writeln('This simulation will consider a matrix of ', rows, ' by ', cols, ' elements.');
writeln('Temperature at start is: ', temp[x, y]);

//this is the main loop of the simulation
var c = 0;
delta = tolerance;
while (c < niter && delta >= tolerance)
{
  c += 1;
  if (c % outputFrequency == 0)
  {
    writeln('Temperature at iteration ', c, ': ', temp[x, y]);
  }
}

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

This simulation will consider a matrix of 100 by 100 elements.
Temperature at start is: 25.0
Temperature at iteration 20: 25.0
Temperature at iteration 40: 25.0
Temperature at iteration 60: 25.0
Temperature at iteration 80: 25.0
Temperature at iteration 100: 25.0
Temperature at iteration 120: 25.0
Temperature at iteration 140: 25.0
Temperature at iteration 160: 25.0
Temperature at iteration 180: 25.0
Temperature at iteration 200: 25.0
Temperature at iteration 220: 25.0
Temperature at iteration 240: 25.0
Temperature at iteration 260: 25.0
Temperature at iteration 280: 25.0
Temperature at iteration 300: 25.0
Temperature at iteration 320: 25.0
Temperature at iteration 340: 25.0
Temperature at iteration 360: 25.0
Temperature at iteration 380: 25.0
Temperature at iteration 400: 25.0
Temperature at iteration 420: 25.0
Temperature at iteration 440: 25.0
Temperature at iteration 460: 25.0
Temperature at iteration 480: 25.0
Temperature at iteration 500: 25.0

Of course the temperature is 25.0 at every iteration, as we haven't done any computation yet.

Key Points

  • “Use if <condition> {instructions A} else {instructions B} syntax to execute one set of instructions if the condition is satisfied, and the other set of instructions if the condition is not satisfied.”
  • This syntax can be simplified to if <condition> {instructions} if we only want to execute the instructions within the curly brackets if the condition is satisfied.
  • “Use while <condition> {instructions} to repeatedly execute the instructions within the curly brackets while the condition is satisfied. The instructions will be executed over and over again until the condition does not hold anymore.”

Content from Getting started with loops


Last updated on 2024-11-18

Estimated time: 90 minutes

Overview

Questions

  • “How do I run the same piece of code repeatedly?”

Objectives

  • “Learn to use for loops to run over every element of an iterand.”
  • “Learn the difference between using for loops and using a while statement to repeatedly execute a code block.”

To compute the new temperature, i.e. each element of temp_new, we need to add all the surrounding elements in temp and divide the result by 4. And, essentially, we need to repeat this process for all the elements of temp_new, or, in other words, we need to iterate over the elements of temp_new. When it comes to iterating over a given number of elements, the for-loop is what we want to use. The for-loop has the following general syntax:

// single-statement version
for index in iterand do
  instruction;

// multi-statement version
for index in iterand
{instructions}

The iterand is a function or statement that expresses an iteration; it could be the range 1..15, for example. index is a variable that exists only in the context of the for-loop, and that will be taking the different values yielded by the iterand. The code flows as follows: index takes the first value yielded by the iterand, and keeps it until all the instructions inside the curly brackets are executed one by one; then, index takes the second value yielded by the iterand, and keeps it until all the instructions are executed again. This pattern is repeated until index takes all the different values expressed by the iterand.

This for loop, for example

// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
  // do this for every row
}

will allow us to iterate over the rows of temp_new. Now, for each row we also need to iterate over all the columns in order to access every single element of temp_new. This can be done with nested for loops like this:

// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
  // do this for every row
  for j in 1..cols
  {
    // and this for every column in the row i
  }
}

Now, inside the inner loop, we can use the indices i and j to perform the required computations as follows:

// calculate the new temperatures (temp_new) using the past temperatures (temp)
for i in 1..rows
{
  // do this for every row
  for j in 1..cols
  {
    // and this for every column in the row i
    temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
  }
}
temp=temp_new;

Note that at the end of the outer for loop, when all the elements in temp_new are already calculated, we update temp with the values of temp_new; this way everything is set up for the next iteration of the main while statement.

We’re ready to execute our code, but the conditions we have initially set up will not produce interesting output, because the plate has a temperature value of 25 everywhere. We can change the boundaries to have temperature 0 so that the middle will start cooling down. To do this, we should change the declaration of temp to:

var temp: [0..rows+1, 0..cols+1] real = 0; // the whole plate starts at 0
temp[1..rows,1..cols] = 25;                // set the non-boundary coordinates to 25

Now let’s compile and execute our code again:

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (50,50) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 25.0
Temperature at iteration 40: 25.0
Temperature at iteration 60: 25.0
Temperature at iteration 80: 25.0
Temperature at iteration 100: 25.0
Temperature at iteration 120: 25.0
Temperature at iteration 140: 25.0
Temperature at iteration 160: 25.0
Temperature at iteration 180: 25.0
Temperature at iteration 200: 25.0
Temperature at iteration 220: 24.9999
Temperature at iteration 240: 24.9996
Temperature at iteration 260: 24.9991
Temperature at iteration 280: 24.9981
Temperature at iteration 300: 24.9963
Temperature at iteration 320: 24.9935
Temperature at iteration 340: 24.9893
Temperature at iteration 360: 24.9833
Temperature at iteration 380: 24.9752
Temperature at iteration 400: 24.9644
Temperature at iteration 420: 24.9507
Temperature at iteration 440: 24.9337
Temperature at iteration 460: 24.913
Temperature at iteration 480: 24.8883
Temperature at iteration 500: 24.8595

As we can see, the temperature in the middle of the plate (position 50,50) is slowly decreasing as the plate is cooling down.

Challenge 1: Can you do it?

What would be the temperature at the top right corner of the plate? In our current setup we have a layer of ghost points around the internal grid. While the temperature on the internal grid was initially set to 25.0, the temperature at the ghost points was set to 0.0. Note that during our iterations we do not compute the temperature at the ghost points – it is permanently set to 0.0. Consequently, any point close to the ghost layer will be influenced by this zero temperature, so we expect the temperature near the border of the plate to decrease faster. Modify the code to see the temperature at the top right corner.

To see the evolution of the temperature at the top right corner of the plate, we just need to modify x and y. This corner corresponds to the first row (x=1) and the last column (y=cols) of the plate.
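In the code this amounts to changing just these two constants:

const x = 1;      // top row
const y = cols;   // last column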

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the position (1,100) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 1.48171
Temperature at iteration 40: 0.767179
...
Temperature at iteration 460: 0.068973
Temperature at iteration 480: 0.0661081
Temperature at iteration 500: 0.0634717

Challenge 2: Can you do it?

Now let's have some more interesting boundary conditions. Suppose that the plate is heated by a source of 80 degrees located at the bottom right corner, and that the temperature on the rest of the border decreases linearly as one gets farther from the corner (see the image below). Utilise for loops to set up the described boundary conditions. Compile and run your code to see how the temperature is changing now.

To get the linear distribution, the 80 degrees must be divided by the number of rows or columns in our plate. So, the following couple of for loops, inserted before the start of the time iterations, will give us what we want:

// set the boundary conditions
for i in 1..rows do
  temp[i,cols+1] = i*80.0/rows;   // right side
for j in 1..cols do
  temp[rows+1,j] = j*80.0/cols;   // bottom side

Note that 80 degrees is written as the real number 80.0. The division of two integers in Chapel returns an integer, so, as rows and cols are integers, we must write 80 as a real number so that the result is not truncated.
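A quick comparison shows the difference (using i = 1 in the right-side formula above):

writeln(1*80/100);     // integer division: prints 0
writeln(1*80.0/100);   // floating-point division: prints 0.8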

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

The simulation will consider a matrix of 100 by 100 elements, it will run
up to 500 iterations, or until the largest difference in temperature
between iterations is less than 0.0001. You are interested in the evolution
of the temperature at the position (1,100) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152

Challenge 3: Can you do it?

Let us increase the maximum number of iterations to niter = 10_000. The code now does 10_000 iterations:

OUTPUT

...
Temperature at iteration 9960: 0.79214
Temperature at iteration 9980: 0.792139
Temperature at iteration 10000: 0.792139

So far, delta has always been equal to tolerance, which means that our main while loop will always run niter iterations. So let's update delta after each iteration. Use what we have studied so far to write the required piece of code.

The idea is simple, after each iteration of the while loop, we must compare all elements of temp_new and temp, find the greatest difference, and update delta with that value. The next nested for loops do the job:

// update delta, the greatest difference between temp_new and temp
delta=0;
for i in 1..rows
{
  for j in 1..cols
  {
    tmp = abs(temp_new[i,j]-temp[i,j]);
    if tmp > delta then delta = tmp;
  }
}

Clearly there is no need to keep the difference at every single position in the array; we just need to update delta whenever we find a greater difference.

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

The simulation will consider a matrix of 100 by 100 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 7460: 0.792283
Temperature at iteration 7480: 0.792281
Temperature at iteration 7500: 0.792279

Final temperature at the desired position after 7505 iterations is: 0.792279
The difference in temperatures between the last two iterations was: 9.99834e-05

Now, after Challenge 3 we should have a working program to simulate our heat transfer equation. Let's just print some additional useful information,

// print final information
writeln('\nFinal temperature at the desired position after ',c,' iterations is: ',temp[x,y]);
writeln('The difference in temperatures between the last two iterations was: ',delta,'\n');

and compile and execute our final code,

BASH

chpl base_solution.chpl
./base_solution

OUTPUT

The simulation will consider a matrix of 100 by 100 elements,
it will run up to 500 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
Temperature at iteration 60: 1.20229
Temperature at iteration 80: 1.09044
Temperature at iteration 100: 1.02391
Temperature at iteration 120: 0.980011
Temperature at iteration 140: 0.949004
Temperature at iteration 160: 0.926011
Temperature at iteration 180: 0.908328
Temperature at iteration 200: 0.894339
Temperature at iteration 220: 0.88302
Temperature at iteration 240: 0.873688
Temperature at iteration 260: 0.865876
Temperature at iteration 280: 0.85925
Temperature at iteration 300: 0.853567
Temperature at iteration 320: 0.848644
Temperature at iteration 340: 0.844343
Temperature at iteration 360: 0.840559
Temperature at iteration 380: 0.837205
Temperature at iteration 400: 0.834216
Temperature at iteration 420: 0.831537
Temperature at iteration 440: 0.829124
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152

Final temperature at the desired position after 500 iterations is: 0.823152
The greatest difference in temperatures between the last two iterations was: 0.0258874

Key Points

  • “You can organize loops with for and while statements. Use a for loop to run over every element of the iterand, e.g. for i in 1..rows { ...} will run over all integers from 1 to rows. Use a while statement to repeatedly execute a code block until the condition does not hold anymore, e.g. while (c < niter && delta >= tolerance) {...} will repeatedly execute the commands in curly braces until one of the two conditions turns false.”

Content from Procedures


Last updated on 2024-11-18

Estimated time: 15 minutes

Overview

Questions

  • “How do I write functions?”

Objectives

  • “Be able to write our own procedures.”

Similar to other programming languages, Chapel lets you define your own functions. These are called ‘procedures’ in Chapel and have an easy-to-understand syntax:

proc addOne(n) { // n is an input parameter
  return n + 1;
}

To call this procedure, you would use its name:

writeln(addOne(10));

Procedures can be recursive, as demonstrated below. In this example the procedure takes an integer number as a parameter and returns an integer number – more on this below. If the input parameter is 1 or 0, fibonacci will return the same input parameter. If the input parameter is 2 or larger, fibonacci will call itself recursively.

proc fibonacci(n: int): int { // input parameter type and procedure return type, respectively
  if n <= 1 then return n;
  return fibonacci(n-1) + fibonacci(n-2);
}
writeln(fibonacci(10));

The input parameter type n: int is enforced at compilation time. For example, if you try to pass a real-type number to the procedure with fibonacci(10.2), you will get the error “error: unresolved call”. Similarly, the return variable type is also enforced at compilation time. For example, replacing return n with return 1.0 in line 2 will result in “error: cannot initialize return value of type ‘int(64)’”. While specifying these types might be optional (see the callout below), we highly recommend doing so in your code, as it adds additional checks to your program.
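For example, with the fibonacci procedure above, the first call below compiles and runs, while the second (left commented out) would trigger the unresolved-call error:

writeln(fibonacci(10));      // fine: the argument 10 is an int
// writeln(fibonacci(10.2)); // error: unresolved call -- 10.2 is a real, not an int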

Callout

If not specified, the procedure return type is inferred from the type of the return variable. This might not be possible with a recursive procedure, where the return type depends on the procedure itself and is therefore not yet known to the compiler, so in this case (as in the fibonacci example above) we need to specify the procedure return type explicitly.
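For a simple non-recursive procedure the return type can safely be omitted; here is a minimal sketch (the procedure addPi is made up for this example):

proc addPi(x: real) {  // return type inferred as real from the return statement
  return x + 3.1415926;
}
writeln(addPi(1.0));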

Procedures can take a varying number of parameters. In this example the procedure maxOf takes two or more parameters of the same type. This group of parameters is referred to as a tuple and is named x inside the procedure. The number of elements k in this tuple is inferred from the number of parameters passed to the procedure and is used to organize the calculations inside the procedure:

proc maxOf(x ...?k) { // take a tuple of one type with k elements
  var maximum = x[0];
  for i in 1..<k do maximum = if maximum < x[i] then x[i] else maximum;
  return maximum;
}
writeln(maxOf(1, -5, 123, 85, -17, 3));
writeln(maxOf(1.12, 0.85, 2.35));

OUTPUT

123
2.35

Procedures can have default parameter values. If a parameter with the default value (like y in the example below) is not passed to the procedure, it takes the default value inside the procedure. If it is passed with another value, then this new value is used inside the procedure.

In Chapel a procedure always returns a single value or a single data structure. In this example the procedure returns a tuple (a structure) with two numbers inside, one integer and one real:

proc returnTuple(x: int, y: real = 3.1415926): (int,real) {
  return (x,y);
}
writeln(returnTuple(1));
writeln(returnTuple(x=2));
writeln(returnTuple(x=-10, y=10));
writeln(returnTuple(y=-1, x=3)); // the parameters can be named out of order

Chapel procedures have many other useful features; however, they are not essential for learning task and data parallelism, so we refer the interested reader to the official Chapel documentation.

Key Points

  • “Functions in Chapel are called procedures.”
  • “Procedures can take a varying number of parameters.”
  • “Optionally, you can specify input parameter types and the return variable type.”
  • “Procedures can have default parameter values.”
  • “Procedures can be recursive. Recursive procedures require specifying the return variable type.”

Content from Using command-line arguments


Last updated on 2024-10-08

Estimated time: 90 minutes

Overview

Questions

  • “How do I use the same program for multiple use-cases?”

Objectives

  • “Modify a code's constant parameters without re-compiling the code.”

From the last run of our code, we can see that 500 iterations is not enough to reach a steady state (a state where the difference in temperature does not vary too much, i.e. delta<tolerance). Now, if we want to change the number of iterations we would need to modify niter in the code and compile it again. What if we want to change the number of rows and columns in our grid to have more precision, or if we want to see the evolution of the temperature at a different point (x,y)? The answer would be the same: modify the code and compile it again!

Needless to say, this would be very tedious and inefficient. A better scenario would be to pass the desired configuration values to our binary when it is called at the command line. The Chapel mechanism for this is config variables. When a variable is declared with the config keyword, in addition to var or const, like this:

config const niter = 500;    //number of iterations

it can be initialised with a specific value, when executing the code at the command line, using the syntax:

BASH

chpl base_solution.chpl
./base_solution --niter=3000

OUTPUT

The simulation will consider a matrix of 100 by 100 elements,
it will run up to 3000 iterations, or until the largest difference
in temperature between iterations is less than 0.0001.
You are interested in the evolution of the temperature at the
position (1,100) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 2980: 0.793969
Temperature at iteration 3000: 0.793947

Final temperature at the desired position after 3000 iterations is: 0.793947
The greatest difference in temperatures between the last two iterations was: 0.000350086

Challenge 4: Can you do it?

Make outputFrequency, x, y, tolerance, rows and cols configurable variables, and test the code simulating different configurations. What can you conclude about the performance of the code?

Let’s prepend config to the following lines in our code:

config const rows = 100;               // number of rows in the grid
config const cols = 100;               // number of columns in the grid
config const niter = 10_000;           // maximum number of iterations
config const x = 1;                    // row number for a printout
config const y = cols;                 // column number for a printout
config const tolerance: real = 0.0001; // smallest difference in temperature that would be accepted before stopping
config const outputFrequency: int = 20;   // the temperature will be printed every outputFrequency iterations

We can then recompile the code and try modifying some of these parameters from the command line. For example, let’s use a 650 x 650 grid and observe the evolution of the temperature at the position (200,300) for 10,000 iterations or until the difference of temperature between iterations is less than 0.002; also, let’s print the temperature every 1000 iterations.

BASH

chpl base_solution.chpl
./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000

OUTPUT

The simulation will consider a matrix of 650 by 650 elements, it will run up to 10000 iterations, or until
the largest difference in temperature between iterations is less than 0.002.  You are interested in the
evolution of the temperature at the position (200,300) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819

Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985

Key Points

  • “Config variables accept values from the command line at runtime, without you having to recompile the code.”

Content from Measuring code performance


Last updated on 2024-10-08

Estimated time: 90 minutes

Overview

Questions

  • “How do I know how fast my code is?”

Objectives

  • “Measuring code performance by instrumenting the code.”

The code generated after Exercise 4 is the basic implementation of our simulation. We will use it as a benchmark, to see how much we can improve the performance when introducing the parallel programming features of the language in the following lessons.

But first, we need a quantitative way to measure the performance of our code. The easiest way to do it is to see how long it takes to finish a simulation. The UNIX command time could be used to this effect

BASH

time ./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000

OUTPUT

The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the
position (200,300) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819

Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985

real	0m20.381s
user	0m20.328s
sys	0m0.053s

The real time is what interests us. Our code is taking around 20 seconds from the moment it is called at the command line until it returns.

Sometimes, however, it could be useful to measure the execution time of specific parts of the code. This can be achieved by modifying the code to output the information that we need. This process is called instrumentation of code.

An easy way to instrument our code with Chapel is by using the module Time. Modules in Chapel are libraries of useful functions and methods that can be used once the module is loaded. To load a module we use the keyword use followed by the name of the module. Once the Time module is loaded we can create a variable of the type stopwatch, and use the methods start, stop and elapsed to instrument our code.

use Time;
var watch: stopwatch;
watch.start();

//this is the main loop of the simulation
delta=tolerance;
while (c<niter && delta>=tolerance) do
{
...
}

watch.stop();

//print final information
writeln('\nThe simulation took ',watch.elapsed(),' seconds');
writeln('Final temperature at the desired position after ',c,' iterations is: ',temp[x,y]);
writeln('The greatest difference in temperatures between the last two iterations was: ',delta,'\n');

BASH

chpl base_solution.chpl
./base_solution --rows=650 --cols=650 --x=200 --y=300 --tolerance=0.002 --outputFrequency=1000

OUTPUT

The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the
position (200,300) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819

The simulation took 20.1621 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985

Key Points

  • “To measure performance, instrument your Chapel code using a stopwatch from the Time module.”

Content from Intro to parallel computing


Last updated on 2024-10-08

Estimated time: 90 minutes

Overview

Questions

  • “How does parallel processing work?”

Objectives

  • “Discuss some common concepts in parallel computing.”

The basic concept of parallel computing is simple to understand: we divide our job into tasks that can be executed at the same time, so that we finish the job in a fraction of the time that it would have taken if the tasks were executed one by one. Implementing parallel computations, however, is not always easy, nor possible…

Consider the following analogy:

Suppose that we want to paint the four walls in a room. We'll call this the problem. We can divide our problem into 4 different tasks: paint each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don't need to finish one to start another. We say that we have 4 concurrent tasks; the tasks can be executed within the same time frame. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks. If there is only one painter, they could work for a while on one wall, then start painting another one, then work for a little bit on the third one, and so on. The tasks are being executed concurrently but not in parallel. If we have two painters for the job, then more parallelism can be introduced. Four painters could execute the tasks truly in parallel.

Callout

Think of the CPU cores as the painters or workers that will execute your concurrent tasks.

Now imagine that all workers have to obtain their paint from a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously, however, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser: One must wait while the other is being serviced.

Callout

Think of the shared memory in your computer as the central dispenser for all your workers.

Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete their task totally on their own. They don't even have to be in the same room; they could be painting walls of different rooms in the house, in different houses in the city, or even in different cities in the country. We need, however, a communication system in place. Suppose that worker A, for some reason, needs a colour that is only available in the dispenser of worker B; they must then synchronise: worker A must request the paint from worker B, and worker B must respond by sending the required colour.

Callout

Think of the memory on each node of a cluster as a separate dispenser for your workers.

A fine-grained parallel code needs lots of communication or synchronisation between tasks, in contrast with a coarse-grained one. An embarrassingly parallel problem is one where all tasks can be executed completely independent from each other (no communications required).

Parallel programming in Chapel


Chapel provides high-level abstractions for parallel programming no matter the grain size of your tasks, whether they run in shared memory on one node or use memory distributed across multiple compute nodes, or whether they are executed concurrently or truly in parallel. As a programmer you can focus on the algorithm: how to divide the problem into tasks that make sense in the context of the problem, and be sure that the high-level implementation will run on any hardware configuration. Then you could consider the details of the specific system you are going to use (whether it is shared or distributed, the number of cores, etc.) and tune your code/algorithm to obtain better performance.

Callout

To this effect, concurrency (the creation and execution of multiple tasks), and locality (in which set of resources these tasks are executed) are orthogonal concepts in Chapel.

In summary, we can have a set of several tasks; these tasks could be running:

  1. concurrently by the same processor in a single compute node,
  2. in parallel by several processors in a single compute node,
  3. in parallel by several processors distributed in different compute nodes, or
  4. serially (one by one) by several processors distributed in different compute nodes.

Similarly, each of these tasks could be using variables

  1. located in the local memory on the compute node where it is running, or
  2. stored on other compute nodes.

And again, Chapel can take care of all the details required to run our algorithm in most of these scenarios, but we can always add more specific detail to gain performance when targeting a particular scenario.

Key Points

  • “Concurrency and locality are orthogonal concepts in Chapel: where the tasks are running may not be indicative of when they run, and you can control both in Chapel.”
  • “Problems with a lot of communication between tasks, or so called fine-grained parallel problems, are typically more difficult to parallelize. As we will see later in these lessons, Chapel simplifies writing fine-grained parallel codes by hiding a lot of communication complexity under the hood.”

Content from Fire-and-forget tasks


Last updated on 2024-11-18

Estimated time: 90 minutes

Overview

Questions

  • “How do we execute work in parallel?”

Objectives

  • “Launching multiple threads to execute tasks in parallel.”
  • “Learn how to use begin, cobegin, and coforall to spawn new tasks.”

Callout

In the very first chapter we showed how to run single-node Chapel codes. As a refresher, let’s go over this again. If you are running Chapel on your own computer, then you are all set, and you can simply compile and run Chapel codes. If you are on a cluster, you will need to run Chapel codes inside interactive jobs. So far we are covering only single-locale Chapel, so, from the login node, you can submit an interactive job to the scheduler with a command like this one:

SH

salloc --time=2:0:0 --ntasks=1 --cpus-per-task=3 --mem-per-cpu=1000

The details may vary depending on your cluster, e.g. different scheduler, requirement to specify an account or reservation, etc, but the general idea remains the same: on a cluster you need to ask for resources before you can run calculations. In this case we are asking for 2 hours maximum runtime, single MPI task (sufficient for our parallelism in this chapter), 3 CPU cores inside that task, and 1000M maximum memory per core. The core count means that we can run 3 threads in parallel, each on its own CPU core. Once your interactive job starts, you can compile and run the Chapel codes below. Inside your Chapel code, when new threads start, they will be able to utilize our 3 allocated CPU cores.

A Chapel program always starts as a single main thread. You can then start concurrent tasks with the begin statement. A task spawned by the begin statement will run in a different thread while the main thread continues its normal execution. Consider the following example:

var x = 0;

writeln("This is the main thread starting first task");
begin
{
  var c = 0;
  while c < 10
  {
    c += 1;
    writeln('thread 1: ', x+c);
  }
}

writeln("This is the main thread starting second task");
begin
{
  var c = 0;
  while c < 10
  {
    c += 1;
    writeln('thread 2: ', x+c);
  }
}

writeln('this is main thread, I am done...');

BASH

chpl begin_example.chpl
./begin_example

OUTPUT

This is the main thread starting first task
This is the main thread starting second task
this is main thread, I am done...
thread 1: 1
thread 1: 2
thread 1: 3
thread 1: 4
thread 1: 5
thread 1: 6
thread 1: 7
thread 1: 8
thread 1: 9
thread 1: 10
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10

As you can see, the order of the output is not what we would expect; in fact, it is completely unpredictable. This is a well-known effect of concurrent tasks accessing the same shared resource at the same time (in this case the screen): the system decides in which order the tasks write to the screen.

Challenge 1: what if c is defined globally?

What would happen if in the last code we move the definition of c into the main thread, but try to assign it from threads 1 and 2? Select one answer from these:

  1. The code will fail to compile.
  2. The code will compile and run, but c will be updated by both threads at the same time (a race condition), so that its final value will vary from one run to another.
  3. The code will compile and run, and the two threads will be taking turns updating c, so that its final value will always be the same.

We’ll get an error at compilation (“cannot assign to const variable”): c would then be defined within the scope of the main thread, and we could modify its value only in the main thread. Any attempt to modify its value inside threads 1 or 2 produces a compilation error.
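
As a sketch, the failing variant of the first task would look something like this (only the relevant part of the code is shown):

var x = 0;
var c = 0;   // c is now declared in the main thread

writeln("This is the main thread starting first task");
begin
{
  while c < 10
  {
    c += 1;   // compile-time error: c is captured with a 'const' task intent inside begin
    writeln('thread 1: ', x+c);
  }
}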

Challenge 2: what if we have a second, local definition of x?

What would happen if we try to insert a second definition var x = 10; inside the first begin statement? Select one answer from these:

  1. The code will fail to compile.
  2. The code will compile and run, and inside the first begin statement the value x = 10 will be used, whereas inside the second begin statement the value x = 0 will be used.
  3. The new value x = 10 will overwrite the global value x = 0 in both threads 1 and 2.

The code will compile and run, and you will see the output below.
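
For reference, here is a sketch of the modified first begin block (the rest of the code is unchanged):

begin
{
  var x = 10;   // this local x shadows the outer x = 0 inside this task only
  var c = 0;
  while c < 10
  {
    c += 1;
    writeln('thread 1: ', x+c);
  }
}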

OUTPUT

This is the main thread starting first task
This is the main thread starting second task
this is main thread, I am done...
thread 1: 11
thread 1: 12
thread 1: 13
thread 1: 14
thread 1: 15
thread 1: 16
thread 1: 17
thread 1: 18
thread 1: 19
thread 1: 20
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10

Callout

All variables have a scope in which they can be used. The variables declared inside a concurrent task are accessible only by that task. The variables declared in the main task can be read everywhere, but Chapel won’t allow other concurrent tasks to modify them.

Try this …

Are the concurrent tasks, spawned by the last code, running truly in parallel?

The answer is: it depends on the number of cores available to your Chapel code. To verify this, let’s modify the code to get both threads 1 and 2 into an infinite loop:

begin
{
  var c = 0;
  while c > -1
  {
    c += 1;
    // the rest of the code in the thread
  }
}

Compile and run the code:

SH

chpl begin_example.chpl
./begin_example

If you are running this on your own computer, you can run top or htop or ps commands in another terminal to check Chapel’s CPU usage. If you are running inside an interactive job on a cluster, you can open a different terminal, log in to the cluster, and open a bash shell on the node that is running your job (if your cluster setup allows this):

SH

squeue -u $USER                   # check the jobID number
srun --jobid=<jobID> --pty bash   # put your jobID here
htop -u $USER -s PERCENT_CPU      # display CPU usage and other information

In the output of htop you will see a table with the list of your processes, and in the “CPU%” column you will see the percentage consumed by each process. Find the Chapel process, and if it shows that your CPU usage is close to 300%, you are using 3 CPU cores. What do you see?

Now exit htop by pressing Q. Also exit your interactive run by pressing Ctrl-C.

Callout

To maximise performance, start as many tasks as there are cores available.
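
For example, rather than hard-coding the number of tasks, you could query the number of cores available to your program with here.maxTaskPar (we will see more of here later) and spawn exactly that many tasks; a minimal sketch:

coforall taskid in 1..here.maxTaskPar do
  writeln("task ", taskid, " of ", here.maxTaskPar, " runs on its own core");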

A slightly more structured way to start concurrent tasks in Chapel is by using the cobegin statement. Here you can start a block of concurrent tasks, one for each statement inside the curly brackets. The main difference between the begin and cobegin statements is that with cobegin, all the spawned tasks are synchronised at the end of the statement, i.e. the main thread won’t continue its execution until all tasks are done.

var x=0;
writeln("This is the main thread, my value of x is ",x);

cobegin
{
  {
    var x=5;
    writeln("this is task 1, my value of x is ",x);
  }
  writeln("this is task 2, my value of x is ",x);
}

writeln("this message won't appear until all tasks are done...");

BASH

chpl cobegin_example.chpl
./cobegin_example

OUTPUT

This is the main thread, my value of x is 0
this is task 2, my value of x is 0
this is task 1, my value of x is 5
this message won't appear until all tasks are done...

As you may have concluded from the Discussion exercise above, the variables declared inside a task are accessible only by that task, while the variables declared in the main task are accessible to all tasks.

The last, and most useful, way to start concurrent/parallel tasks in Chapel is the coforall loop. This is a combination of the for loop and the cobegin statement. The general syntax is:

coforall index in iterand
{instructions}

This will start a new task for each iteration of the loop. Each task will then perform all the instructions inside the curly brackets. Each task gets its own copy of the variable index with the corresponding value yielded by the iterand; this index allows us to customise the set of instructions for each particular task.

var x=1;
config var numoftasks=2;

writeln("This is the main task: x = ",x);

coforall taskid in 1..numoftasks
{
  var c=taskid+1;
  writeln("this is task ",taskid,": x + ",taskid," = ",x+taskid,". My value of c is: ",c);
}

writeln("this message won't appear until all tasks are done...");

BASH

chpl coforall_example.chpl
./coforall_example --numoftasks=5

OUTPUT

This is the main task: x = 1
this is task 5: x + 5 = 6. My value of c is: 6
this is task 2: x + 2 = 3. My value of c is: 3
this is task 4: x + 4 = 5. My value of c is: 5
this is task 3: x + 3 = 4. My value of c is: 4
this is task 1: x + 1 = 2. My value of c is: 2
this message won't appear until all tasks are done...

Notice how we are able to customise the instructions inside the coforall to give different results depending on the task that is executing them. Also notice how, once again, the variables declared outside the coforall can be read by all tasks, while the variables declared inside are available only to the particular task.

Challenge 3: Can you do it?

Would it be possible to print all the messages in the right order? Modify the code in the last example as required.

Hint: you can use an array of strings declared in the main task, where all the concurrent tasks could write their messages in the corresponding position. Then, at the end, have the main task printing all elements of the array in order.

The following code is a possible solution:

var x = 1;
config var numoftasks = 2;
var messages: [1..numoftasks] string;

writeln("This is the main task: x = ", x);

coforall taskid in 1..numoftasks {
  var c = taskid + 1;
  messages[taskid] = 'this is task ' + taskid:string +
    ': my value of c is ' + c:string + ' and x is ' + x:string;
}

for i in 1..numoftasks do writeln(messages[i]);
writeln("this message won't appear until all tasks are done...");

BASH

chpl exercise_coforall.chpl
./exercise_coforall --numoftasks=5

OUTPUT

This is the main task: x = 1
this is task 1: my value of c is 2 and x is 1
this is task 2: my value of c is 3 and x is 1
this is task 3: my value of c is 4 and x is 1
this is task 4: my value of c is 5 and x is 1
this is task 5: my value of c is 6 and x is 1
this message won't appear until all tasks are done...

Note that we need to convert integers to strings first (taskid:string converts the integer variable taskid to a string) before we can concatenate them with other strings to form the message stored in each element of messages.

Challenge 4: Can you do it?

Consider the following code:

use Random;
config const nelem = 100_000_000;
var x: [1..nelem] int;
fillRandom(x);	//fill array with random numbers
var mymax = 0;

// here put your code to find mymax

writeln("the maximum value in x is: ", mymax);

Write a parallel code to find the maximum value in the array x.

config const numtasks = 12;
const n = nelem/numtasks;     // number of elements per task
const r = nelem - n*numtasks; // remainder; these elements will be added to the last task

var d: [1..numtasks] int;  // local maxima for each thread

coforall taskid in 1..numtasks {
  var i, f: int;
  i  = (taskid-1)*n + 1;
  f = (taskid-1)*n + n;
  if taskid == numtasks then f += r; // add r elements to the last thread
  for j in i..f do
    if x[j] > d[taskid] then d[taskid] = x[j];
}
for i in 1..numtasks do
  if d[i] > mymax then mymax = d[i];

BASH

chpl --fast exercise_coforall_2.chpl
./exercise_coforall_2

OUTPUT

the maximum value in x is: 9223372034161572255   # large random integer

We use the coforall loop to spawn tasks that work concurrently, each on a fraction of the array. The trick here is to determine, based on the taskid, the initial and final indices that each task will use. Each task obtains the maximum in its fraction of the array and, finally, after the coforall is done, the main task computes the overall maximum from the per-task maxima.

Try this …

Substitute the code to find mymax in the last exercise with:

mymax = max reduce x;

Time the execution of the original code and this new one. How do they compare?
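
To time both versions, one option is Chapel’s standard Time module and its stopwatch type; a minimal sketch (to be placed around the part of the previous exercise you want to time):

use Time;
var watch: stopwatch;
watch.start();
mymax = max reduce x;   // or the coforall-based search from the previous exercise
watch.stop();
writeln("finding the maximum took ", watch.elapsed(), " seconds");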

Callout

It is always a good idea to check whether there are built-in functions or methods in the language that can do what we want as efficiently as (or better than) our hand-written code. In this case, the reduce statement reduces the given array to a single value using the given operation (max in this case), and it is parallelised and optimised to achieve very good performance.
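
The same reduce syntax works with other operators; a small sketch:

var a = [1, 5, 2, 8, 3];
writeln(+ reduce a);    // 19, the sum of all elements
writeln(min reduce a);  // 1
writeln(max reduce a);  // 8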

The code in these last exercises implicitly synchronises the tasks to obtain the desired result. In addition, Chapel has specific mechanisms for task synchronisation that can help us achieve fine-grained parallelisation.

Key Points

  • “Use begin or cobegin or coforall to spawn new tasks.”
  • “You can run more than one task per core, as the number of cores on a node is limited.”

Content from Synchronising tasks


Last updated on 2024-11-18 | Edit this page

Estimated time: 90 minutes

Overview

Questions

  • “How should I access my data in parallel?”

Objectives

  • “Learn how to synchronize multiple threads using one of three mechanisms: sync statements, sync variables, and atomic variables.”
  • “Learn that with shared memory access from multiple threads you can run into race conditions and deadlocks, and learn how to recognize and solve these problems.”

In Chapel the keyword sync can be either a statement or a type qualifier, providing two different synchronization mechanisms for threads. Let’s start with using sync as a statement.

As we saw in the previous section, the begin statement will start a concurrent (or child) task that will run in a different thread while the main (or parent) thread continues its normal execution. In this sense the begin statement is non-blocking. If you want to pause the execution of the main thread and wait until the child thread ends, you can prepend the begin statement with the sync statement. Consider the following code; running this code, after the initial output line, you will first see all output from thread 1 and only then the line “The first task is done…” and the rest of the output:

var x=0;
writeln("This is the main thread starting a synchronous task");

sync
{
  begin
  {
    var c=0;
    while c<10
    {
      c+=1;
      writeln('thread 1: ',x+c);
    }
  }
}
writeln("The first task is done...");

writeln("This is the main thread starting an asynchronous task");
begin
{
  var c=0;
  while c<10
  {
    c+=1;
    writeln('thread 2: ',x+c);
  }
}

writeln('this is main thread, I am done...');

BASH

chpl sync_example_1.chpl
./sync_example_1 

OUTPUT

This is the main thread starting a synchronous task
thread 1: 1
thread 1: 2
thread 1: 3
thread 1: 4
thread 1: 5
thread 1: 6
thread 1: 7
thread 1: 8
thread 1: 9
thread 1: 10
The first task is done...
This is the main thread starting an asynchronous task
this is main thread, I am done...
thread 2: 1
thread 2: 2
thread 2: 3
thread 2: 4
thread 2: 5
thread 2: 6
thread 2: 7
thread 2: 8
thread 2: 9
thread 2: 10

Discussion

What would happen if we write instead

begin
{
  sync
  {
    var c=0;
    while c<10
    {
      c+=1;
      writeln('thread 1: ',x+c);
    }
  }
}
writeln("The first task is done...");

Challenge 3: Can you do it?

Use begin and sync statements to reproduce the functionality of cobegin in cobegin_example.chpl.

var x=0;
writeln("This is the main thread, my value of x is ",x);

sync
{
    begin
    {
       var x=5;
       writeln("this is task 1, my value of x is ",x);
    }
    begin writeln("this is task 2, my value of x is ",x);
 }

writeln("this message won't appear until all tasks are done...");

A more elaborate and powerful use of sync is as a type qualifier for variables. When a variable is declared as sync, a state that can be full or empty is associated with it.

To assign a new value to a sync variable, its state must be empty (after the assignment operation is completed, the state will be set to full). Conversely, to read a value from a sync variable, its state must be full (after the read operation is completed, the state will be set to empty again).

Starting from Chapel 2.x, you must use the methods writeEF and readFE to perform blocking writes and reads with sync variables. Below is an example to demonstrate the use of sync variables. Here we launch a new task that is busy for a short time executing the loop. While this loop is running, the main task continues printing the message “this is main task after launching new task… I will wait until it is done”. As it takes time to spawn a new thread, it is very likely that you will see this message before the output from the loop. Next, the main task attempts to read x and assign it to a, which it can only do when x is full. We write into x after the loop, so you will see the final message “and now it is done” only after the message “New task finished”. In other words, by reading x we pause the execution of the main thread until x becomes full.

var x: sync int, a: int;
writeln("this is main task launching a new task");
begin {
  for i in 1..10 do writeln("this is new task working: ",i);
  x.writeEF(2);   // assign 2 to x
  writeln("New task finished");
}

writeln("this is main task after launching new task... I will wait until it is done");
a = x.readFE();   // don't run this line until the variable x is written in the other task
writeln("and now it is done");

BASH

chpl sync_example_2.chpl
./sync_example_2

OUTPUT

this is main task launching a new task
this is main task after launching new task... I will wait until it is done
this is new task working: 1
this is new task working: 2
this is new task working: 3
this is new task working: 4
this is new task working: 5
this is new task working: 6
this is new task working: 7
this is new task working: 8
this is new task working: 9
this is new task working: 10
New task finished
and now it is done

Discussion

What would happen if we try to read x inside the new task as well, i.e. we have the following begin statement, without changing the rest of the code:

begin {
  for i in 1..10 do writeln("this is new task working: ",i);
  x.writeEF(2);
  writeln("New task finished");
  x.readFE();
}

The code will block (run forever), and you will need to press Ctrl-C to halt its execution. In this example we try to read x in two places: in the main task and in the new task. When we read a sync variable with readFE, the state of the variable is set to empty once this method completes. In other words, one of the two readFE calls will succeed (which one depends on the runtime) and will mark the variable as empty. The other readFE will then attempt to read it, but it will block waiting for x to become full again (which will never happen). In the end, the execution of either the main thread or the child thread will block, hanging the entire code.
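
One way to avoid the deadlock in this toy example would be to use the non-consuming readFF inside the new task, so that x stays full for the readFE in the main task; a sketch:

begin {
  for i in 1..10 do writeln("this is new task working: ",i);
  x.writeEF(2);
  writeln("New task finished");
  writeln("the new task reads x = ", x.readFF());   // readFF leaves the state full
}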

There are a number of methods defined for sync variables. If x is a sync variable of a given type, you can use the following functions:

// non-blocking methods
x.reset()	//will set the state as empty and the value as the default of x's type
x.isFull  	//will return true if the state of x is full, false if it is empty

//blocking read and write methods
x.writeEF(value)	//will block until the state of x is empty,
			//then will assign the value,  and set the state to full
x.writeFF(value)	//will block until the state of x is full,
			//then will assign the value, and leave the state as full
x.readFE()		//will block until the state of x is full,
			//then will return x's value, and set the state to empty
x.readFF()		//will block until the state of x is full,
			//then will return x's value, and leave the state as full

//non-blocking read and write methods
x.writeXF(value)	//will assign the value no matter the state of x, and then set the state as full
x.readXX()		//will return the value of x regardless of its state; the state will remain unchanged
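
Here is a small sketch (not part of the lesson code) illustrating a few of these methods:

var flag: sync int;
flag.writeXF(5);          // set the value regardless of the state; the state becomes full
writeln(flag.isFull);     // true
writeln(flag.readXX());   // 5, and the state stays full
writeln(flag.readFE());   // 5, and the state becomes empty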

Chapel also implements atomic operations with variables declared as atomic, and this provides another option to synchronise tasks. Atomic operations run completely independently of any other thread or process. This means that when several tasks try to write to an atomic variable, only one will succeed at any given moment, providing implicit synchronisation between them. There are a number of methods defined for atomic variables; among them, sub(), add(), write(), read(), and waitFor() are very useful to establish explicit synchronisation between tasks, as shown in the next code:

var lock: atomic int;
const numtasks=5;

lock.write(0);  //the main task sets lock to zero

coforall id in 1..numtasks
{
    writeln("greetings from task ",id,"... I am waiting for all tasks to say hello");
    lock.add(1);                //task id says hello and atomically adds 1 to lock
    lock.waitFor(numtasks);     //then it waits for lock to be equal numtasks (which will happen when all tasks say hello)
    writeln("task ",id," is done...");
}

BASH

chpl atomic_example.chpl
./atomic_example

OUTPUT

greetings from task 4... I am waiting for all tasks to say hello
greetings from task 5... I am waiting for all tasks to say hello
greetings from task 2... I am waiting for all tasks to say hello
greetings from task 3... I am waiting for all tasks to say hello
greetings from task 1... I am waiting for all tasks to say hello
task 1 is done...
task 5 is done...
task 2 is done...
task 3 is done...
task 4 is done...

Try this…

Comment out the line lock.waitFor(numtasks) in the code above to clearly observe the effect of the task synchronisation.

Finally, with all the material studied so far, we should be ready to parallelize our code for the simulation of the heat transfer equation.

Key Points

  • “You can explicitly synchronise tasks with sync statement.”
  • “You can also use sync and atomic variables to synchronise tasks.”

Content from Task parallelism with Chapel


Last updated on 2024-11-18 | Edit this page

Estimated time: 90 minutes

Overview

Questions

  • “How do I write parallel code for a real use case?”

Objectives

  • “Write a task-parallel version of the heat transfer solver.”

Here is our plan to task-parallelize the heat transfer equation:

  1. divide the entire grid of points into blocks and assign blocks to individual tasks,
  2. each task should compute the new temperature of its assigned points,
  3. perform a reduction over the whole grid, to update the greatest temperature difference between temp_new and temp.

For the reduction of the grid we can simply use the max reduce statement, which is already parallelized. Now, let’s divide the grid into rowtasks x coltasks sub-grids, and assign each sub-grid to a task using the coforall loop (we will have rowtasks*coltasks tasks in total).

config const rowtasks = 2;
config const coltasks = 2;

// this is the main loop of the simulation
delta = tolerance;
while (c<niter && delta>=tolerance) {
  c += 1;

  coforall taskid in 0..coltasks*rowtasks-1 {
    for i in rowi..rowf {
      for j in coli..colf {
        temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
      }
    }
  }

  delta = max reduce (temp_new-temp);
  temp = temp_new;

  if c%outputFrequency == 0 then writeln('Temperature at iteration ',c,': ',temp[x,y]);
}

Note that now the nested for loops run from rowi to rowf and from coli to colf which are, respectively, the initial and final row and column of the sub-grid associated to the task taskid. To compute these limits, based on taskid, we need to compute the number of rows and columns per task (nr and nc, respectively) and account for possible non-zero remainders (rr and rc) that we should add to the last row and column:

config const rowtasks = 2;
config const coltasks = 2;

const nr = rows/rowtasks;
const rr = rows-nr*rowtasks;
const nc = cols/coltasks;
const rc = cols-nc*coltasks;

// this is the main loop of the simulation
delta = tolerance;
while (c<niter && delta>=tolerance) {
  c+=1;

  coforall taskid in 0..coltasks*rowtasks-1 {
    var rowi, coli, rowf, colf: int;
    var taskr, taskc: int;

    taskr = taskid/coltasks;
    taskc = taskid%coltasks;

    if taskr<rr {
      rowi=(taskr*nr)+1+taskr;
      rowf=(taskr*nr)+nr+taskr+1;
    }
    else {
      rowi = (taskr*nr)+1+rr;
      rowf = (taskr*nr)+nr+rr;
    }

    if taskc<rc {
      coli = (taskc*nc)+1+taskc;
      colf = (taskc*nc)+nc+taskc+1;
    }
    else {
      coli = (taskc*nc)+1+rc;
      colf = (taskc*nc)+nc+rc;
    }

    for i in rowi..rowf {
      for j in coli..colf {
      ...
}

As you can see, dividing a data set (the array temp in this case) between concurrent tasks can be cumbersome. Chapel provides high-level abstractions for data parallelism that take care of all the data distribution for us. We will study data parallelism in the following lessons, but for now let’s compare the benchmark solution with our coforall parallelisation to see how much the performance improved.

BASH

chpl --fast parallel1.chpl
./parallel1 --rows=650 --cols=650 --x=200 --y=300 --niter=10000 --tolerance=0.002 --outputFrequency=1000

OUTPUT

The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the position (200,300) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819

The simulation took 17.0193 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985

This parallel solution, using 4 parallel tasks, took around 17 seconds to finish. Compared with the ~20 seconds needed by the benchmark solution, this is not very impressive. To understand why, let’s analyse the code’s flow. When the program starts, the main thread does all the declarations and initialisations and then enters the main loop of the simulation (the while loop). Inside this loop the parallel tasks are launched for the first time. When these tasks finish their computations, the main task resumes its execution, updates delta, and everything is repeated again. So, in essence, the parallel tasks are launched and destroyed 7750 times, which introduces a significant amount of overhead (the time the system needs to start and destroy threads on the specific hardware at each iteration of the while loop).

Clearly, a better approach would be to launch the parallel tasks just once, and have them executing all the simulations, before resuming the main task to print the final results.

config const rowtasks = 2;
config const coltasks = 2;

const nr = rows/rowtasks;
const rr = rows-nr*rowtasks;
const nc = cols/coltasks;
const rc = cols-nc*coltasks;

// this is the main loop of the simulation
delta = tolerance;
coforall taskid in 0..coltasks*rowtasks-1 {
  var rowi, coli, rowf, colf: int;
  var taskr, taskc: int;
  var c = 0;

  taskr = taskid/coltasks;
  taskc = taskid%coltasks;

  if taskr<rr {
    rowi = (taskr*nr)+1+taskr;
    rowf = (taskr*nr)+nr+taskr+1;
  }
  else {
    rowi = (taskr*nr)+1+rr;
    rowf = (taskr*nr)+nr+rr;
  }

  if taskc<rc {
    coli = (taskc*nc)+1+taskc;
    colf = (taskc*nc)+nc+taskc+1;
  }
  else {
    coli = (taskc*nc)+1+rc;
    colf = (taskc*nc)+nc+rc;
  }

  while (c<niter && delta>=tolerance) {
    c = c+1;

    for i in rowi..rowf {
      for j in coli..colf {
        temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
      }
    }

    //update delta
    //update temp
    //print temperature in desired position
  }
}

The problem with this approach is that now we have to explicitly synchronise the tasks. Before, delta and temp were updated only by the main task at each iteration; similarly, only the main task was printing results. Now, all these operations must be carried out inside the coforall loop, which imposes the need for synchronisation between tasks.

The synchronisation must happen at two points:

  1. We need to be sure that all tasks have finished with the computations of their part of the grid temp, before updating delta and temp safely.
  2. We need to be sure that all tasks use the updated value of delta to evaluate the condition of the while loop for the next iteration.

To update delta we could have each task computing the greatest difference in temperature in its associated sub-grid, and then, after the synchronisation, have only one task reducing all the sub-grids’ maximums.

var delta: atomic real;
var myd: [0..coltasks*rowtasks-1] real;
...
//this is the main loop of the simulation
delta.write(tolerance);
coforall taskid in 0..coltasks*rowtasks-1
{
  var myd2: real;
  ...

  while (c<niter && delta.read() >= tolerance) {
    c = c+1;
    ...

    for i in rowi..rowf {
      for j in coli..colf {
        temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
        myd2 = max(abs(temp_new[i,j]-temp[i,j]),myd2);
      }
    }
    myd[taskid] = myd2;

    // here comes the synchronisation of tasks

    temp[rowi..rowf,coli..colf] = temp_new[rowi..rowf,coli..colf];
    if taskid==0 {
      delta.write(max reduce myd);
      if c%outputFrequency==0 then writeln('Temperature at iteration ',c,': ',temp[x,y]);
    }

    // here comes the synchronisation of tasks again
  }
}

Challenge 4: Can you do it?

Use sync or atomic variables to implement the synchronisation required in the code above.

One possible solution is to use an atomic variable as a lock that opens (using the waitFor method) when all the tasks complete the required instructions

var lock: atomic int;
lock.write(0);
...
//this is the main loop of the simulation
delta.write(tolerance);
coforall taskid in 0..coltasks*rowtasks-1
{
   ...
   while (c<niter && delta.read() >= tolerance)
   {
      ...
      myd[taskid] = myd2;

      //here comes the synchronisation of tasks
      lock.add(1);
      lock.waitFor(coltasks*rowtasks);

      temp[rowi..rowf,coli..colf] = temp_new[rowi..rowf,coli..colf];
      ...

      //here comes the synchronisation of tasks again
      lock.sub(1);
      lock.waitFor(0);
   }
}

Using the solution in Exercise 4, we can now compare the performance with the benchmark solution:

BASH

chpl --fast parallel2.chpl
./parallel2 --rows=650 --cols=650 --x=200 --y=300 --niter=10000 --tolerance=0.002 --outputFrequency=1000

OUTPUT

The simulation will consider a matrix of 650 by 650 elements,
it will run up to 10000 iterations, or until the largest difference
in temperature between iterations is less than 0.002.
You are interested in the evolution of the temperature at the position (200,300) of the matrix...

and here we go...
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819

The simulation took 4.2733 seconds
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985

to see that we now have a code that runs roughly 5x faster than the serial benchmark.

We finish this section by providing another, more elegant version of the 2D heat transfer solver (without time stepping), in which a coforall loop over a strided domain decomposes the mesh into blocks on a single locale:

use Math; /* for exp() */

const n = 100, stride = 20;
var temp: [0..n+1, 0..n+1] real;
var temp_new: [1..n,1..n] real;
var x, y: real;
for (i,j) in {1..n,1..n} { // serial iteration
  x = ((i:real)-0.5)/n;
  y = ((j:real)-0.5)/n;
  temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2)/0.01); // narrow Gaussian peak
}
coforall (i,j) in {1..n,1..n} by (stride,stride) { // 5x5 decomposition into 20x20 blocks => 25 tasks
  for k in i..i+stride-1 { // serial loop inside each block
    for l in j..j+stride-1 {
      temp_new[k,l] = (temp[k-1,l] + temp[k+1,l] + temp[k,l-1] + temp[k,l+1]) / 4;
    }
  }
}

We will study data parallelism in more detail in the next section.

Key Points

  • “To parallelize the diffusion solver with tasks, you divide the 2D domain into blocks and assign each block to a task.”
  • “To get the maximum performance, you need to launch the parallel tasks only once, and run the temporal loop of the simulation with the same set of tasks, resuming the main task only to print the final results.”
  • “Parallelizing with tasks is more laborious than parallelizing with data (covered in the next section).”

Content from Running code on multiple machines


Last updated on 2024-11-18 | Edit this page

Estimated time: 180 minutes

Overview

Questions

  • “What is a locale?”

Objectives

  • “Run multi-locale Chapel codes on several nodes of a cluster.”

So far we have been working with single-locale Chapel codes that may run on one or many cores on a single compute node, making use of the shared memory space and accelerating computations by launching concurrent tasks on individual cores in parallel. Chapel codes can also run on multiple nodes on a compute cluster. In Chapel this is referred to as multi-locale execution.

If you work inside a Chapel Docker container, e.g., chapel/chapel-gasnet, the container environment simulates a multi-locale cluster, so you can compile and launch multi-locale Chapel codes directly, specifying the number of locales with the -nl flag:

BASH

chpl --fast mycode.chpl -o mybinary
./mybinary -nl 4

Inside the Docker container on multiple locales your code will not run any faster than on a single locale, since you are emulating a virtual cluster, and all tasks run on the same physical node. To achieve actual speedup, you need to run your parallel multi-locale Chapel code on a real physical cluster which we hope you have access to for this session.

On a real HPC cluster you would need to submit either an interactive or a batch job asking for several nodes and then run a multi-locale Chapel code inside that job. In practice, the exact commands depend on how the multi-locale Chapel was built on the cluster.

When you compile a Chapel code with the multi-locale Chapel compiler, two binaries will be produced. One is called mybinary and is a launcher binary used to submit the real executable mybinary_real. If the Chapel environment is configured properly with the launcher for the cluster’s physical interconnect (which might not be always possible due to a number of factors), then you would simply compile the code and use the launcher binary mybinary to submit the job to the queue:

BASH

chpl --fast mycode.chpl -o mybinary
./mybinary -nl 2

The exact parameters of the job, such as the maximum runtime and the requested memory, can be specified with Chapel environment variables. One possible drawback of this launching method is that, depending on your cluster setup, Chapel might have access to all physical cores on each node participating in the run; this will present problems if you are scheduling jobs by core and not by node, since part of a node could be allocated to someone else’s job.

Note that on Compute Canada clusters this launching method works without problem. On these clusters multi-locale Chapel is provided by chapel-ofi (for the OmniPath interconnect on Cedar) and chapel-ucx (for the InfiniBand interconnect on Graham, Béluga, Narval) modules, so – depending on the cluster – you will load Chapel using one of the two lines below:

BASH

module load gcc chapel-ofi   # for the OmniPath interconnect on Cedar cluster
module load gcc chapel-ucx   # for the InfiniBand interconnect on Graham, Béluga, Narval clusters

We can also launch multi-locale Chapel codes using the real executable mybinary_real. For example, for an interactive job you would type:

BASH

salloc --time=0:30:0 --nodes=4 --cpus-per-task=3 --mem-per-cpu=1000 --account=def-guest
chpl --fast mycode.chpl -o mybinary
srun ./mybinary_real -nl 4   # will run on four locales with max 3 cores per locale

Production jobs would be launched with sbatch command and a Slurm launch script as usual.

For the rest of this class we assume that you have a working multi-locale Chapel environment, whether provided by a Docker container or by multi-locale Chapel on a physical HPC cluster. We will run all examples on four nodes with three cores per node.

Intro to multi-locale code

Let us test our multi-locale Chapel environment by launching the following code:

writeln(Locales);

This code will print the built-in global array Locales. Running it on four locales will produce

OUTPUT

LOCALE0 LOCALE1 LOCALE2 LOCALE3

We want to run some code on each locale (node). For that, we can cycle through locales:

for loc in Locales do   // this is still a serial program
  on loc do             // run the next line on locale `loc`
    writeln("this locale is named ", here.name);

This will produce

OUTPUT

this locale is named cdr544
this locale is named cdr552
this locale is named cdr556
this locale is named cdr692

Here the built-in variable here (an instance of the locale class) refers to the locale on which the code is currently running, and here.name is its hostname. We started a serial for loop cycling through all locales, and on each locale we printed its name, i.e., the hostname of each node. This program ran in serial, starting a task on each locale only after completing the same task on the previous locale. Note the order in which locales were listed.

To run this code in parallel, starting four simultaneous tasks, one per locale, we simply need to replace for with forall:

forall loc in Locales do   // now this is a parallel loop
  on loc do
    writeln("this locale is named ", here.name);

This starts four tasks in parallel, and the order in which the print statement is executed depends on the runtime conditions and can change from run to run:

OUTPUT

this locale is named cdr544
this locale is named cdr692
this locale is named cdr556
this locale is named cdr552

We can print a few other attributes of each locale. Here it is actually useful to revert to the serial for loop so that the print statements appear in order:

use MemDiagnostics;
for loc in Locales do
  on loc {
    writeln("locale #", here.id, "...");
    writeln("  ...is named: ", here.name);
    writeln("  ...has ", here.numPUs(), " processor cores");
    writeln("  ...has ", here.physicalMemory(unit=MemUnits.GB, retType=real), " GB of memory");
    writeln("  ...has ", here.maxTaskPar, " maximum parallelism");
  }

OUTPUT

locale #0...
  ...is named: cdr544
  ...has 3 processor cores
  ...has 125.804 GB of memory
  ...has 3 maximum parallelism
locale #1...
  ...is named: cdr552
  ...has 3 processor cores
  ...has 125.804 GB of memory
  ...has 3 maximum parallelism
locale #2...
  ...is named: cdr556
  ...has 3 processor cores
  ...has 125.804 GB of memory
  ...has 3 maximum parallelism
locale #3...
  ...is named: cdr692
  ...has 3 processor cores
  ...has 125.804 GB of memory
  ...has 3 maximum parallelism

Note that while Chapel correctly determines the number of cores available inside our job on each node, and the maximum parallelism (which is the same as the number of cores available!), it lists the total physical memory on each node available to all running jobs which is not the same as the total memory per node allocated to our job.

Key Points

  • “Locale in Chapel is a shared-memory node on a cluster.”
  • “We can cycle in serial or parallel through all locales.”

Content from Domains and data parallelism


Last updated on 2024-11-18 | Edit this page

Estimated time: 180 minutes

Overview

Questions

  • “How do I store and manipulate data across multiple locales?”

Objectives

  • “Define distributed domains and arrays, and use them for data-parallel computations across locales.”

Domains and single-locale data parallelism

We start this section by recalling the definition of a range in Chapel. A range is a 1D set of integer indices that can be bounded or infinite:

var oneToTen: range = 1..10; // 1, 2, 3, ..., 10
var a = 1234, b = 5678;
var aToB: range = a..b; // using variables
var twoToTenByTwo: range(strides=strideKind.positive) = 2..10 by 2; // 2, 4, 6, 8, 10
var oneToInf = 1.. ; // unbounded range

On the other hand, domains are multi-dimensional (including 1D) sets of integer indices that are always bounded. To stress the difference between ranges and domains, domain definitions always enclose their indices in curly brackets. Ranges can be used to define a specific dimension of a domain:

var domain1to10: domain(1) = {1..10};        // 1D domain from 1 to 10 defined using the range 1..10
var twoDimensions: domain(2) = {-2..2,0..2}; // 2D domain over a product of two ranges
var thirdDim: range = 1..16; // a range
var threeDims: domain(3) = {thirdDim, 1..10, 5..10}; // 3D domain over a product of three ranges
for idx in twoDimensions do // cycle through all points in a 2D domain
  write(idx, ", ");
writeln();
for (x,y) in twoDimensions { // can also cycle using explicit tuples (x,y)
  write("(", x, ", ", y, ")", ", ");
}

Let us define an n^2 domain called mesh. It is defined by the single task in our code and is therefore defined in memory on the same node (locale 0) where this task is running. For each of n^2 mesh points, let us print out

  1. m.locale.id, the ID of the locale holding that mesh point (should be 0)
  2. here.id, the ID of the locale on which the code is running (should be 0)
  3. here.maxTaskPar, the number of cores (max parallelism with 1 task/core) (should be 3)

Note: We already saw some of these variables/functions: numLocales, Locales, here.id, here.name, here.numPUs(), here.physicalMemory(), here.maxTaskPar.

config const n = 8;
const mesh: domain(2) = {1..n, 1..n};  // a 2D domain defined in shared memory on a single locale
forall m in mesh { // go in parallel through all n^2 mesh points
  writeln((m, m.locale.id, here.id, here.maxTaskPar));
}

OUTPUT

((7, 1), 0, 0, 3)
((1, 1), 0, 0, 3)
((7, 2), 0, 0, 3)
((1, 2), 0, 0, 3)
...
((6, 6), 0, 0, 3)
((6, 7), 0, 0, 3)
((6, 8), 0, 0, 3)

Now we are going to learn two very important properties of Chapel domains. First, domains can be used to define arrays of variables of any type on top of them. For example, let us define an n^2 array of real numbers on top of mesh:

config const n = 8;
const mesh: domain(2) = {1..n, 1..n};  // a 2D domain defined in shared memory on a single locale
var T: [mesh] real; // a 2D array of reals defined in shared memory on a single locale (mapped onto this domain)
forall t in T { // go in parallel through all n^2 elements of T
  writeln((t, t.locale.id));
}

OUTPUT

(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
...
(0.0, 0)
(0.0, 0)
(0.0, 0)

By default, all n^2 array elements are set to zero, and all of them are defined on the same locale as the underlying mesh. We can also cycle through all indices of T by accessing its domain:

forall idx in T.domain {
  writeln(idx, ' ', T(idx));   // idx is a tuple (i,j); also print the corresponding array element
}

OUTPUT

(7, 1) 0.0
(1, 1) 0.0
(7, 2) 0.0
(1, 2) 0.0
...
(6, 6) 0.0
(6, 7) 0.0
(6, 8) 0.0

Since we use a parallel forall loop, the print statements appear in a random runtime order.

We can also define multiple arrays on the same domain:

const grid = {1..100}; // 1D domain
const alpha = 5; // some number
var A, B, C: [grid] real; // local real-type arrays on this 1D domain
B = 2; C = 3;
forall (a,b,c) in zip(A,B,C) do // parallel loop
  a = b + alpha*c;   // simple example of data parallelism on a single locale
writeln(A);

The second important property of Chapel domains is that they can span multiple locales (nodes).

Distributed domains

Domains are a fundamental Chapel concept for distributed-memory data parallelism.

Let us now define an n^2 distributed (over several locales) domain distributedMesh mapped to locales in blocks. On top of this domain we define a 2D block-distributed array A of strings mapped to locales in exactly the same pattern as the underlying domain. Let us print out

  1. a.locale.id, the ID of the locale holding the element a of A
  2. here.name, the name of the locale on which the code is running
  3. here.maxTaskPar, the number of cores on the locale on which the code is running

Instead of printing these values to the screen, we will store this output inside each element of A as a string a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string, adding a separator ' ' at the end of each element.

use BlockDist; // use standard block distribution module to partition the domain into blocks
config const n = 8;
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;
var A: [distributedMesh] string; // block-distributed array mapped to locales
forall a in A { // go in parallel through all n^2 elements in A
  // assign each array element on the locale that stores that index/element
  a = a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string + '  ';
}
writeln(A);

The syntax boundingBox=mesh tells the compiler that the outer edge of our decomposition coincides exactly with the outer edge of our domain. Alternatively, the outer decomposition layer could include an additional perimeter of ghost points if we specify

const mesh: domain(2) = {1..n, 1..n};
const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1,0..n+1};

but let us not worry about this for now.

Running our code on four locales with three cores per locale produces the following output:

OUTPUT

0-cdr544-3   0-cdr544-3   0-cdr544-3   0-cdr544-3   1-cdr552-3   1-cdr552-3   1-cdr552-3   1-cdr552-3
0-cdr544-3   0-cdr544-3   0-cdr544-3   0-cdr544-3   1-cdr552-3   1-cdr552-3   1-cdr552-3   1-cdr552-3
0-cdr544-3   0-cdr544-3   0-cdr544-3   0-cdr544-3   1-cdr552-3   1-cdr552-3   1-cdr552-3   1-cdr552-3
0-cdr544-3   0-cdr544-3   0-cdr544-3   0-cdr544-3   1-cdr552-3   1-cdr552-3   1-cdr552-3   1-cdr552-3
2-cdr556-3   2-cdr556-3   2-cdr556-3   2-cdr556-3   3-cdr692-3   3-cdr692-3   3-cdr692-3   3-cdr692-3
2-cdr556-3   2-cdr556-3   2-cdr556-3   2-cdr556-3   3-cdr692-3   3-cdr692-3   3-cdr692-3   3-cdr692-3
2-cdr556-3   2-cdr556-3   2-cdr556-3   2-cdr556-3   3-cdr692-3   3-cdr692-3   3-cdr692-3   3-cdr692-3
2-cdr556-3   2-cdr556-3   2-cdr556-3   2-cdr556-3   3-cdr692-3   3-cdr692-3   3-cdr692-3   3-cdr692-3  

As we see, the domain distributedMesh (along with the string array A on top of it) was decomposed into 2x2 blocks stored on the four nodes, respectively. Equally important, for each element a of the array, the line of code filling in that element ran on the same locale where that element was stored. In other words, this code ran in parallel (forall loop) on four nodes, using up to three cores on each node to fill in the corresponding array elements. Once the parallel loop is finished, the writeln command runs on locale 0 gathering remote elements from other locales and printing them to standard output.

Now we can print the range of indices for each sub-domain by adding the following to our code:

for loc in Locales {
  on loc {
    writeln(A.localSubdomain());
  }
}

On 4 locales we should get:

OUTPUT

{1..4, 1..4}
{1..4, 5..8}
{5..8, 1..4}
{5..8, 5..8}  

Let us count the number of threads by adding the following to our code:

var counter = 0;
forall a in A with (+ reduce counter) { // go in parallel through all n^2 elements
  counter = 1;
}
writeln("actual number of threads = ", counter);

If n=8 in our code is sufficiently large, there are enough array elements per node (8*8/4 = 16 in our case) to fully utilise all three available cores on each node, so our output should be

OUTPUT

actual number of threads = 12

Try reducing the array size n to see if that changes the output (fewer tasks per locale), e.g., setting n=3. Also try increasing the array size to n=20 and study the output. Does the output make sense?

So far we have looked at the block distribution BlockDist. It will distribute a 2D domain among nodes using either a 1D or a 2D decomposition (in our example it was a 2D decomposition, 2x2), depending on the domain size and the number of nodes.

Let us take a look at another standard module for domain partitioning onto locales, called CyclicDist. For each element of the array we will print out again

  1. a.locale.id, the ID of the locale holding the element a of A
  2. here.name, the name of the locale on which the code is running
  3. here.maxTaskPar, the number of cores on the locale on which the code is running

use CyclicDist; // elements are sent to locales in a round-robin pattern
config const n = 8;
const mesh: domain(2) = {1..n, 1..n};  // a 2D domain defined in shared memory on a single locale
const m2: domain(2) dmapped new cyclicDist(startIdx=mesh.low) = mesh; // mesh.low is the first index (1,1)
var A2: [m2] string;
forall a in A2 {
  a = a.locale.id:string + '-' + here.name + '-' + here.maxTaskPar:string + '  ';
}
writeln(A2);

OUTPUT

0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3
2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3
0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3
2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3
0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3
2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3
0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3   0-cdr544-3   1-cdr552-3
2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3   2-cdr556-3   3-cdr692-3  

As the name CyclicDist suggests, the domain was mapped to locales in a cyclic, round-robin pattern. We can also print the range of indices for each sub-domain by adding the following to our code:

for loc in Locales {
  on loc {
    writeln(A2.localSubdomain());
  }
}

OUTPUT

{1..7 by 2, 1..7 by 2}
{1..7 by 2, 2..8 by 2}
{2..8 by 2, 1..7 by 2}
{2..8 by 2, 2..8 by 2}

In addition to BlockDist and CyclicDist, Chapel has several other predefined distributions: BlockCycDist, ReplicatedDist, DimensionalDist2D, ReplicatedDim, BlockCycDim — for details please see https://chapel-lang.org/docs/primers/distributions.html.

Diffusion solver on distributed domains

Now let us use distributed domains to write a parallel version of our original diffusion solver code:

use BlockDist;
use Math;
config const n = 8;
const mesh: domain(2) = {1..n, 1..n};  // local 2D n^2 domain

We will add a larger (n+2)^2 block-distributed domain largerMesh with a layer of ghost points on perimeter locales, and define a temperature array temp on top of it, by adding the following to our code:

const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1, 0..n+1};
var temp: [largerMesh] real; // a block-distributed array of temperatures
forall (i,j) in temp.domain[1..n,1..n] {
  var x = ((i:real)-0.5)/(n:real); // x, y are local to each task
  var y = ((j:real)-0.5)/(n:real);
  temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2) / 0.01); // narrow Gaussian peak
}
writeln(temp);

Here we initialised an initial Gaussian temperature peak in the middle of the mesh. As we evolve our solution in time, this peak should diffuse slowly over the rest of the domain.

Question

Why do we have forall (i,j) in temp.domain[1..n,1..n] and not forall (i,j) in mesh?

Answer

The first one will run on multiple locales in parallel, whereas the second will run in parallel via multiple threads on locale 0 only, since “mesh” is defined on locale 0.

The code above will print the initial temperature distribution:

OUTPUT

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 2.36954e-17 2.79367e-13 1.44716e-10 3.29371e-09 3.29371e-09 1.44716e-10 2.79367e-13 2.36954e-17 0.0
0.0 2.79367e-13 3.29371e-09 1.70619e-06 3.88326e-05 3.88326e-05 1.70619e-06 3.29371e-09 2.79367e-13 0.0
0.0 1.44716e-10 1.70619e-06 0.000883826 0.0201158 0.0201158 0.000883826 1.70619e-06 1.44716e-10 0.0
0.0 3.29371e-09 3.88326e-05 0.0201158 0.457833 0.457833 0.0201158 3.88326e-05 3.29371e-09 0.0
0.0 3.29371e-09 3.88326e-05 0.0201158 0.457833 0.457833 0.0201158 3.88326e-05 3.29371e-09 0.0
0.0 1.44716e-10 1.70619e-06 0.000883826 0.0201158 0.0201158 0.000883826 1.70619e-06 1.44716e-10 0.0
0.0 2.79367e-13 3.29371e-09 1.70619e-06 3.88326e-05 3.88326e-05 1.70619e-06 3.29371e-09 2.79367e-13 0.0
0.0 2.36954e-17 2.79367e-13 1.44716e-10 3.29371e-09 3.29371e-09 1.44716e-10 2.79367e-13 2.36954e-17 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0  

Let us define an array of strings nodeID with the same distribution over locales as temp, by adding the following to our code:

var nodeID: [largerMesh] string;
forall m in nodeID do
  m = here.id:string;
writeln(nodeID);

The outer perimeter in the partition below are the ghost points:

OUTPUT

0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3  

Challenge 3: Can you do it?

In addition to here.id, also print the ID of the locale holding that value. Is it the same or different from here.id?

Something along the lines of: m = here.id:string + '-' + m.locale.id:string
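
Expanded into a full snippet (using a hypothetical array name nodeID2 to avoid clashing with the nodeID defined earlier), that hint could look like this:

var nodeID2: [largerMesh] string;
forall m in nodeID2 do
  m = here.id:string + '-' + m.locale.id:string + '  ';
writeln(nodeID2);

Since a forall over a block-distributed array runs each iteration on the locale that owns that element, the two IDs should always match.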

Now we implement the parallel solver, by adding the following to our code (contains a mistake on purpose!):

var temp_new: [largerMesh] real;
for step in 1..5 { // time-stepping
  forall (i,j) in mesh do
    temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4;
  temp[mesh] = temp_new[mesh]; // uses parallel forall underneath
}

Challenge 4: Can you do it?

Can anyone spot a mistake in the last code?

It should be

forall (i,j) in temp_new.domain[1..n,1..n] do

instead of

forall (i,j) in mesh do

as the last one will likely run in parallel via threads only on locale 0, whereas the former will run on multiple locales in parallel.

Here is the final version of the entire code:

use BlockDist;
use Math;
config const n = 8;
const mesh: domain(2) = {1..n,1..n};
const largerMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = {0..n+1,0..n+1};
var temp, temp_new: [largerMesh] real;
forall (i,j) in temp.domain[1..n,1..n] {
  var x = ((i:real)-0.5)/(n:real);
  var y = ((j:real)-0.5)/(n:real);
  temp[i,j] = exp(-((x-0.5)**2 + (y-0.5)**2) / 0.01);
}
for step in 1..5 {
  forall (i,j) in temp_new.domain[1..n,1..n] {
    temp_new[i,j] = (temp[i-1,j] + temp[i+1,j] + temp[i,j-1] + temp[i,j+1]) / 4.0;
  }
  temp = temp_new;
  writeln((step, " ", temp[n/2,n/2], " ", temp[1,1]));
}

This is the entire parallel solver! Note that we implemented an open boundary: temp on the ghost points is always 0. Let us add some printout and also compute the total energy on the mesh, by adding the following to our code:

  writeln((step, " ", temp[n/2,n/2], " ", temp[2,2]));
  var total: real = 0;
  forall (i,j) in mesh with (+ reduce total) do
    total += temp[i,j];
  writeln("total = ", total);

Notice how the total energy decreases in time with the open boundary conditions, as the energy is leaving the system.

Challenge 5: Can you do it?

Write a code to print how the finite-difference stencil [i,j], [i-1,j], [i+1,j], [i,j-1], [i,j+1] is distributed among nodes, and compare that to the ID of the node where temp[i,j] is computed.

Here is one possible solution examining the locality of the finite-difference stencil:

var nodeID: [largerMesh] string = 'empty';
forall (i,j) in nodeID.domain[1..n,1..n] do
  nodeID[i,j] = here.id:string + nodeID[i,j].locale.id:string + nodeID[i-1,j].locale.id:string +
    nodeID[i+1,j].locale.id:string + nodeID[i,j-1].locale.id:string + nodeID[i,j+1].locale.id:string + '  ';
writeln(nodeID);

This produced the following output clearly showing the ghost points and the stencil distribution for each mesh point:

OUTPUT

empty empty empty empty empty empty empty empty empty empty
empty 000000   000000   000000   000001   111101   111111   111111   111111   empty
empty 000000   000000   000000   000001   111101   111111   111111   111111   empty
empty 000000   000000   000000   000001   111101   111111   111111   111111   empty
empty 000200   000200   000200   000201   111301   111311   111311   111311   empty
empty 220222   220222   220222   220223   331323   331333   331333   331333   empty
empty 222222   222222   222222   222223   333323   333333   333333   333333   empty
empty 222222   222222   222222   222223   333323   333333   333333   333333   empty
empty 222222   222222   222222   222223   333323   333333   333333   333333   empty
empty empty empty empty empty empty empty empty empty empty

Note that temp[i,j] is always computed on the same node where that element is stored, which makes sense.

Periodic boundary conditions

Now let us modify the previous parallel solver to include periodic BCs. At the beginning of each time step we need to set elements on the ghost points to their respective values on the opposite ends, by adding the following to our code:

  temp[0,1..n] = temp[n,1..n]; // periodic boundaries on all four sides; these will run via parallel forall
  temp[n+1,1..n] = temp[1,1..n];
  temp[1..n,0] = temp[1..n,n];
  temp[1..n,n+1] = temp[1..n,1];

Now total energy should be conserved, as nothing leaves the domain.

I/O

Let us write the final solution to disk. There are several caveats:

  • works only with ASCII
  • Chapel can also write binary data, but nothing can read it back (checked: it is not an endianness problem!)
  • would love to write NetCDF and HDF5, probably can do this by calling C/C++ functions from Chapel

We’ll add the following to our code to write ASCII:

use IO;
var myFile = open("output.dat", ioMode.cw); // open the file for writing
var myWritingChannel = myFile.writer(); // create a writing channel starting at file offset 0
myWritingChannel.write(temp); // write the array
myWritingChannel.close(); // close the channel

Run the code and check the file output.dat: it should contain the array T after 5 steps in ASCII.

Key Points

  • “Domains are multi-dimensional sets of integer indices.”
  • “A domain can be defined on a single locale or distributed across many locales.”
  • “There are many predefined distribution methods: block, cyclic, etc.”
  • “Arrays are defined on top of domains and inherit their distribution model.”