Single-locale data parallelism
As we mentioned in the previous section, data parallelism is a style of parallel programming in which
parallelism is driven by computations over collections of data elements or their indices. The main tool for
this in Chapel is a forall loop: it creates an appropriate number of threads to execute a loop,
dividing the loop's iterations between them.
forall index in iterand // iterating over all elements of an array or over a range of indices
{instructions}
What is the appropriate number of tasks/threads?
- on a single core: single thread
- on multiple cores on the same node: all cores, up to the number of elements or iterations
- on multiple cores on multiple nodes: all cores, up to the problem size, given the data distribution
Consider a simple code test.chpl:
const n = 1e6: int;
var A: [1..n] real;
forall a in A do
  a += 1;
In this code we update all elements of the array A. The code will run on a single node, launching as many
threads as the number of available cores. It is thread-safe, meaning that no two threads write to the
same variable at the same time.
- if we replace forall with for, we'll get a serial loop on a single core
- if we replace forall with coforall (we'll study it later), we'll create one thread per array element, a million threads in total: likely overkill!
- there is also foreach, which does not create new CPU tasks but marks the iterations as order-independent, so they can be vectorized or spread across GPU threads; we'll use it later on a GPU
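To make the comparison above concrete, here is a minimal sketch that runs the same update with each of the three loop types. The small config constant n is our own choice for illustration (it keeps the coforall cheap), and the coforall iterates over indices with an explicit ref intent on A, which recent Chapel versions expect when a task modifies an outer array.

```chapel
config const n = 8;          // hypothetical small size, so that coforall stays cheap
var A: [1..n] real;          // all elements start at 0.0

// serial loop: one task walks through all elements in order
for a in A do
  a += 1;

// data-parallel loop: iterations are divided among a few tasks
forall a in A do
  a += 1;

// task-parallel loop: one task per iteration, n tasks in total
coforall i in 1..n with (ref A) do
  A[i] += 1;

writeln(A);                  // every element has been incremented three times
```

All three loops are safe here because each iteration touches only its own element of A.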
Reduction
Consider a simple code forall.chpl that we'll run inside a 4-core interactive job. We have a range of
indices 1..1000, and they get broken into 4 groups that are processed by individual threads:
var count = 0;
forall i in 1..1000 with (+ reduce count) { // parallel loop
  count += i;
}
writeln('count = ', count);
If we have not done so, let's write a script shared.sh for submitting single-locale, four-core Chapel jobs:
#!/bin/bash
#SBATCH --time=0:5:0 # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=3600 # in MB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --output=solution.out
./forall
$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
count = 500500
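It may help to spell out what the reduce intent does: each thread works on its own private copy of the variable, initialized to the identity of the operator (0 for +), and the private copies are combined with the original value once the loop finishes. The sketch below is our own illustration, not part of forall.chpl; it uses two reduce intents in one loop, and the variable names total and largest are hypothetical.

```chapel
var total = 0;               // combined with +
var largest = min(int);      // combined with max; start from the smallest int

forall i in 1..1000 with (+ reduce total, max reduce largest) {
  total += i;                      // updates this thread's private copy
  largest = max(largest, i);       // same for the running maximum
}

writeln('total = ', total);        // sum of 1..1000, i.e. 500500
writeln('largest = ', largest);    // 1000
```

Without the reduce intents, all threads would update the same outer variables at the same time, which is exactly the race condition the with clause avoids.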
Number of cores at runtime
We computed the sum of integers from 1 to 1000 in parallel. How many cores did the code run on? Looking at the code or its output, we don’t know. Most likely, on all 4 cores available to us inside the job. But we can actually check that! Do this:
- replace count += i; with count = 1;
- change the last line to writeln('actual number of threads = ', count);
$ chpl forall.chpl -o forall
$ sbatch shared.sh
$ cat solution.out
actual number of threads = 4
If you see one thread, try running this code as a batch multi-core job.
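As a cross-check, we can also ask the Chapel runtime directly what it sees. The sketch below assumes the standard locale queries here.numPUs() and here.maxTaskPar; the exact numbers they report depend on how the job was launched, so no output is shown.

```chapel
// what the runtime reports for the current locale (node)
writeln('cores seen by the runtime = ', here.numPUs());
writeln('max tasks per parallel loop = ', here.maxTaskPar);

var count = 0;
forall i in 1..1000 with (+ reduce count) do
  count = 1;     // each private copy ends at 1; the + reduction adds them up
writeln('actual number of threads = ', count);
```

To experiment with fewer threads, the built-in config constant dataParTasksPerLocale can be set at launch time, e.g. ./forall --dataParTasksPerLocale=2, which should limit the forall loop over the range to two threads.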
Alternative syntax
We can also do parallel reduction over a loop in this way:
var count = (+ reduce forall i in 1..1000 do i**2);
writeln('count = ', count);
We can also initialize an array and do parallel reduction over all array elements:
var A = (for i in 1..1000 do i);
var count = (+ reduce A); // multiple threads
writeln('count = ', count);
Or we could do it this way if we want to do some processing on individual elements:
var A = (for i in 1..1000 do i);
var count = (+ reduce forall a in A do a**2);
writeln('count = ', count);
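The same expression form works with reduction operators other than +. Here is a short sketch of our own (not from the original examples) that also uses the bracket shorthand [a in A] for a forall expression:

```chapel
var A = (for i in 1..1000 do i);

// bracket form of a forall expression, reduced with +
var sumSquares = (+ reduce [a in A] a**2);
writeln('sum of squares = ', sumSquares);     // 333833500

// other built-in reduction operators
writeln('largest = ', (max reduce A));        // 1000
writeln('smallest = ', (min reduce A));       // 1

// A > 0 is promoted element-wise, then reduced with logical and
writeln('all positive? ', (&& reduce (A > 0)));   // true
```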
Question Parallel π
Using the first version of forall.chpl (where we computed the sum of integers 1..1000) as a template, write
a Chapel code to compute π by numerically integrating 4/(1+x^2) from 0 to 1 with forall
parallelism. Implement the number of intervals as a config variable.
To get you started, here is a serial version of this code pi.chpl:
config const n = 1000;
var h, total: real;
h = 1.0 / n; // interval width
for i in 1..n {
  var x = h * ( i - 0.5 );   // midpoint of the i-th interval
  total += 4.0 / ( 1.0 + x**2);
}
writef('pi is %3.10r\n', total*h); // C-style formatted write, r stands for real