Single-locale data parallelism




As we mentioned in the previous section, data parallelism is a style of parallel programming in which parallelism is driven by computations over collections of data elements or their indices. The main tool for this in Chapel is the forall loop: it creates an appropriate number of threads to execute the loop, dividing the loop's iterations between them.

forall index in iterand   // iterating over all elements of an array or over a range of indices
{instructions}

What is the appropriate number of tasks/threads?

  • on a single core: single thread
  • on multiple cores on the same node: all cores, up to the number of elements or iterations (a quick way to check the core count is sketched after this list)
  • on multiple cores on multiple nodes: all cores, up to the problem size, given the data distribution
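
A quick way to see how many parallel tasks the current locale will use is to query here.maxTaskPar (a minimal sketch; the value typically matches the number of cores available to the job):

writeln('this locale can run up to ', here.maxTaskPar, ' tasks in parallel');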

Consider a simple code test.chpl:

const n = 1e6: int;
var A: [1..n] real;
forall a in A do
  a += 1;

In this code we update all elements of the array A. The code will run on a single node, launching as many threads as there are available cores. It is thread-safe, meaning that no two threads write into the same array element at the same time.

  • if we replace forall with for, we’ll get a serial loop on a single core (both the for and foreach variants are sketched after this list)
  • if we replace forall with coforall (we’ll study it later), we’ll create 10^6 threads – likely overkill!
  • there is also foreach, which declares the loop’s iterations to be order-independent without creating new threads; we’ll use it later on a GPU
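
For comparison, here is how the same update looks with the serial for loop and with foreach (a sketch; the foreach version becomes relevant once we move to GPUs):

for a in A do       // serial loop: one thread on one core
  a += 1;

foreach a in A do   // order-independent loop: no new threads, but may be vectorized or offloaded
  a += 1;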

Reduction

Consider a simple code forall.chpl that we’ll run inside a 4-core interactive job. We have a range of indices 1..1000, and they get broken into 4 groups that are processed by individual threads:

var count = 0;
forall i in 1..1000 with (+ reduce count) {   // parallel loop
  count += i;
}
writeln('count = ', count);
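
The with-clause can carry more than one reduce intent. For example, here is a sketch (not part of forall.chpl) that computes the sum and the maximum in a single pass:

var total = 0, biggest = 0;
forall i in 1..1000 with (+ reduce total, max reduce biggest) {
  total += i;                   // per-thread partial sums, added together at the end
  biggest = max(biggest, i);    // per-thread maxima, combined with max at the end
}
writeln('total = ', total, ', biggest = ', biggest);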

If we have not done so already, let’s write a script shared.sh for submitting single-locale, four-core Chapel jobs:

#!/bin/bash
#SBATCH --time=0:5:0         # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=3600   # in MB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --output=solution.out
./forall

Now compile the code and submit the job:

$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
count = 500500

Number of cores at runtime

We computed the sum of integers from 1 to 1000 in parallel. How many cores did the code run on? Looking at the code or its output, we don’t know. Most likely, on all 4 cores available to us inside the job. But we can actually check that! Do this:

  1. replace count += i; with count = 1;
  2. change the last line to writeln('actual number of threads = ', count);
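
With these two changes, forall.chpl becomes:

var count = 0;
forall i in 1..1000 with (+ reduce count) {      // parallel loop
  count = 1;                                     // each thread sets its private copy to 1
}
writeln('actual number of threads = ', count);   // the reduction adds one per thread
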
$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
actual number of threads = 4

If you see one thread, try running this code as a batch multi-core job.

Alternative syntax

We can also do a parallel reduction over a loop expression (here we sum the squares of the integers):

var count = (+ reduce forall i in 1..1000 do i**2);
writeln('count = ', count);

We can also initialize an array and do a parallel reduction over all of its elements:

var A = (for i in 1..1000 do i);
var count = (+ reduce A);   // multiple threads
writeln('count = ', count);

Or we could do it this way if we want to do some processing on individual elements:

var A = (for i in 1..1000 do i);
var count = (+ reduce forall a in A do a**2);
writeln('count = ', count);
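
Reductions are not limited to +: other operators such as max and min work the same way (a quick sketch):

var A = (for i in 1..1000 do i);
var largest = (max reduce A);    // largest element of A
var smallest = (min reduce A);   // smallest element of A
writeln('largest = ', largest, ', smallest = ', smallest);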

 

Question: Parallel π

Using the first version of forall.chpl (where we computed the sum of integers 1..1000) as a template, write a Chapel code to compute π by numerically integrating 4/(1+x²) from 0 to 1 as a sum over many small intervals, using forall parallelism. Implement the number of intervals as a config variable.

To get you started, here is a serial version of this code pi.chpl:

config const n = 1000;
var h, total: real;
h = 1.0 / n;                          // interval width
for i in 1..n {
  var x = h * ( i - 0.5 );
  total += 4.0 / ( 1.0 + x**2);
}
writef('pi is %3.10r\n', total*h);    // C-style formatted write, r stands for real
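
One possible forall-based version, following the reduction pattern from earlier in this section (a sketch; your solution may differ):

config const n = 1000;
const h = 1.0 / n;                    // interval width
var total: real;
forall i in 1..n with (+ reduce total) {
  const x = h * (i - 0.5);            // midpoint of the i-th interval
  total += 4.0 / (1.0 + x**2);
}
writef('pi is %3.10r\n', total*h);    // same formatted output as the serial version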