Single-locale data parallelism




As we mentioned in the previous section, data parallelism is a style of parallel programming in which parallelism is driven by computations over collections of data elements or their indices. The main tool for this in Chapel is the forall loop: it creates an appropriate number of threads to execute the loop, dividing the loop's iterations between them.

forall index in iterand   // iterating over all elements of an array or over a range of indices
{instructions}

What is the appropriate number of tasks/threads?

  • on a single core: single thread
  • on multiple cores on the same node: all cores, up to the number of elements or iterations (a quick way to check the core count is sketched after this list)
  • on multiple cores on multiple nodes: all cores, up to the problem size, given the data distribution
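
A quick way to see how many parallel tasks the current locale will use is to query here.maxTaskPar (a minimal sketch; the value typically matches the number of cores available to the job):

writeln('this locale can run up to ', here.maxTaskPar, ' tasks in parallel');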

Consider a simple code test.chpl:

const n = 1e6: int;
var A: [1..n] real;
forall a in A do
  a += 1;

In this code we update all elements of the array A. The code will run on a single node, launching as many threads as there are available cores. It is thread-safe, meaning that no two threads write into the same array element at the same time.

  • if we replace forall with for, we’ll get a serial loop on a single core (both the for and foreach variants are sketched after this list)
  • if we replace forall with coforall (we’ll study it later), we’ll create 10^6 threads – likely overkill!
  • there is also foreach, which declares the loop’s iterations to be order-independent without creating new threads; we’ll use it later on a GPU
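
For comparison, here is how the same update looks with the serial for loop and with foreach (a sketch; the foreach version becomes relevant once we move to GPUs):

for a in A do       // serial loop: one thread on one core
  a += 1;

foreach a in A do   // order-independent loop: no new threads, but may be vectorized or offloaded
  a += 1;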

Reduction

Consider a simple code forall.chpl that we’ll run inside a 4-core interactive job. We have a range of indices 1..1000, and they get broken into 4 groups that are processed by individual threads:

var count = 0;
forall i in 1..1000 with (+ reduce count) {   // parallel loop
  count += i;
}
writeln('count = ', count);
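
The with-clause can carry more than one reduce intent. For example, here is a sketch (not part of forall.chpl) that computes the sum and the maximum in a single pass:

var total = 0, biggest = 0;
forall i in 1..1000 with (+ reduce total, max reduce biggest) {
  total += i;                   // per-thread partial sums, added together at the end
  biggest = max(biggest, i);    // per-thread maxima, combined with max at the end
}
writeln('total = ', total, ', biggest = ', biggest);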

If we have not done so already, let’s write a script shared.sh for submitting single-locale, four-core Chapel jobs:

#!/bin/bash
#SBATCH --time=0:5:0         # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=3600   # in MB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --output=solution.out
./forall

Now compile the code and submit the job:

$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
count = 500500

Number of cores at runtime

We computed the sum of integers from 1 to 1000 in parallel. How many cores did the code run on? Looking at the code or its output, we don’t know. Most likely, on all 4 cores available to us inside the job. But we can actually check that! Do this:

  1. replace count += i; with count = 1;
  2. change the last line to writeln('actual number of threads = ', count);
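
With these two changes, forall.chpl becomes:

var count = 0;
forall i in 1..1000 with (+ reduce count) {      // parallel loop
  count = 1;                                     // each thread sets its private copy to 1
}
writeln('actual number of threads = ', count);   // the reduction adds one per thread
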
$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
actual number of threads = 4

If you see one thread, try running this code as a batch multi-core job.

Alternative syntax

We can also do a parallel reduction over a loop expression (here we sum the squares of the integers):

var count = (+ reduce forall i in 1..1000 do i**2);
writeln('count = ', count);

We can also initialize an array and do a parallel reduction over all of its elements:

var A = (for i in 1..1000 do i);
var count = (+ reduce A);   // multiple threads
writeln('count = ', count);

Or we could do it this way if we want to do some processing on individual elements:

var A = (for i in 1..1000 do i);
var count = (+ reduce forall a in A do a**2);
writeln('count = ', count);
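
Reductions are not limited to +: other operators such as max and min work the same way (a quick sketch):

var A = (for i in 1..1000 do i);
var largest = (max reduce A);    // largest element of A
var smallest = (min reduce A);   // smallest element of A
writeln('largest = ', largest, ', smallest = ', smallest);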

 

Question: Parallel π

Using the first version of forall.chpl (where we computed the sum of integers 1..1000) as a template, write a Chapel code to compute π by numerically integrating 4/(1+x²) from 0 to 1 as a sum over many small intervals, using forall parallelism. Implement the number of intervals as a config variable.

To get you started, here is a serial version of this code pi.chpl:

config const n = 1000;
var h, total: real;
h = 1.0 / n;                          // interval width
for i in 1..n {
  var x = h * ( i - 0.5 );
  total += 4.0 / ( 1.0 + x**2);
}
writef('pi is %3.10r\n', total*h);    // C-style formatted write, r stands for real
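
One possible forall-based version, following the reduction pattern from earlier in this section (a sketch; your solution may differ):

config const n = 1000;
const h = 1.0 / n;                    // interval width
var total: real;
forall i in 1..n with (+ reduce total) {
  const x = h * (i - 0.5);            // midpoint of the i-th interval
  total += 4.0 / (1.0 + x**2);
}
writef('pi is %3.10r\n', total*h);    // same formatted output as the serial version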