Multi-locale Chapel
Setup
So far we have been working with single-locale Chapel codes that may run on one or many cores on a single compute node, making use of the shared memory space and accelerating computations by launching parallel threads on individual cores. Chapel codes can also run on multiple nodes on a compute cluster. In Chapel this is referred to as multi-locale execution.
Docker side note
If you work inside a Chapel Docker container, e.g., chapel/chapel-gasnet, the container environment simulates a multi-locale cluster, so you would compile and launch multi-locale Chapel codes directly, specifying the number of locales with the -nl flag:
$ chpl --fast mycode.chpl -o mybinary
$ ./mybinary -nl 3
Inside the Docker container, your code will not run any faster on multiple locales than on a single locale, since you are emulating a virtual cluster and all tasks run on the same physical node. To achieve actual speedup, you need to run your parallel multi-locale Chapel code on a real HPC cluster.
On an HPC cluster you would need to submit either an interactive or a batch job asking for several nodes and then run a multi-locale Chapel code inside that job. In practice, the exact commands to run multi-locale Chapel codes depend on how Chapel was built on the cluster.
When you compile a Chapel code with the multi-locale Chapel compiler, two binaries will be produced. One is called mybinary and is a launcher binary used to submit the real executable mybinary_real. If the Chapel environment is configured properly with the launcher for the cluster’s physical interconnect, then you would simply compile the code and use the launcher binary mybinary to run a multi-locale code.
For the rest of this class we assume that you have a working multi-locale Chapel environment, whether provided by a Docker container or by multi-locale Chapel on a physical HPC cluster. We will run all examples on three nodes with two cores per node.
Let’s write a job submission script distributed.sh:
#!/bin/bash
#SBATCH --time=0:5:0 # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=1000 # in MB
#SBATCH --nodes=3
#SBATCH --cpus-per-task=2
#SBATCH --output=solution.out
./test -nl 3 # in this case the 'srun' launcher is already configured for our interconnect
Simple multi-locale codes
Let us test our multi-locale Chapel environment by launching the following code:
writeln(Locales);
$ module load arch/avx2 # not necessary, unless you land on an avx512 node
$ module load gcc/9.3.0 chapel-ofi/1.31.0
$ chpl test.chpl -o test
$ sbatch distributed.sh
$ cat solution.out
This code will print the built-in global array Locales. Running it on three locales will produce
LOCALE0 LOCALE1 LOCALE2
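Since Locales is an ordinary array of locale values, it can also be inspected element by element from the initial locale. The following is a minimal sketch rather than part of the original lesson code; it uses only the built-ins numLocales and Locales together with the id and name of each locale, and it does not move execution off locale 0:

```chapel
// Minimal sketch: inspect the locales from locale 0, without any `on` clause.
// `numLocales` and `Locales` are Chapel built-ins; nothing here is cluster-specific.
writeln("running on ", numLocales, " locale(s)");
for loc in Locales do
  writeln("locale ", loc.id, " has hostname ", loc.name);
```

It is compiled and launched the same way as the one-line test above, with the number of locales given by -nl.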
We want to run some code on each locale (node). For that, we can cycle through locales:
for loc in Locales do   // this is still a serial program
  on loc do             // run the next line on locale `loc`
    writeln("this locale is named ", here.name[0..4]);   // `here` is the locale on which the code is running
This will produce
this locale is named node1
this locale is named node2
this locale is named node3
Here the built-in variable here refers to the locale on which the code is running, and here.name is its hostname. We started a serial for loop cycling through all locales, and on each locale we printed its name, i.e., the hostname of each node. This program ran in serial, starting a task on each locale only after completing the same task on the previous locale. Note the order in which locales were listed.
To run this code in parallel, so that the locales are visited concurrently rather than one after another, we simply need to replace for with forall:
forall loc in Locales do   // now this is a parallel loop
  on loc do
    writeln("this locale is named ", here.name[0..4]);
The loop iterations now run in parallel, so the order in which the print statements execute depends on runtime conditions and can change from run to run:
this locale is named node1
this locale is named node3
this locale is named node2
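Chapel also provides a coforall loop, which creates exactly one task per iteration, whereas the number of tasks a forall loop uses is decided by the loop's iterator. The sketch below is an alternative shown for comparison, not the approach used in this lesson; as with forall, the order of the printed lines is not deterministic:

```chapel
// Sketch: one task per locale with `coforall` (an alternative to the `forall` loop above).
coforall loc in Locales do   // creates one task for each element of Locales
  on loc do                  // move that task to locale `loc`
    writeln("task running on locale ", here.id, " of ", numLocales);
```

Each task reports the id of the locale it ended up on, so a run on three locales prints three lines in some order.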
We can print a few other attributes of each locale. Here it is actually useful to revert to the serial for loop so that the print statements appear in order:
use MemDiagnostics;   // provides MemUnits, used by physicalMemory() below
for loc in Locales do
  on loc {
    writeln("locale #", here.id, "...");
    writeln("  ...is named: ", here.name);
    writeln("  ...has ", here.numPUs(), " processor cores");
    writeln("  ...has ", here.physicalMemory(unit=MemUnits.GB, retType=real), " GB of memory");
    writeln("  ...has ", here.maxTaskPar, " maximum parallelism");
  }
$ chpl test.chpl -o test
$ sbatch distributed.sh
$ cat solution.out
locale #0...
...is named: node1
...has 2 processor cores
...has 2.77974 GB of memory
...has 2 maximum parallelism
locale #1...
...is named: node2
...has 2 processor cores
...has 2.77974 GB of memory
...has 2 maximum parallelism
locale #2...
...is named: node3
...has 2 processor cores
...has 2.77974 GB of memory
...has 2 maximum parallelism
Note that while Chapel correctly determines the number of physical cores on each node and the number of cores available inside our job on each node (maximum parallelism), it reports the total physical memory on each node, which is shared by all running jobs and is not the same as the memory per node allocated to our job.
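The per-locale attributes can also be combined into a single figure for the whole job. The following is a minimal sketch, assuming that maxTaskPar can be read for each element of Locales directly from locale 0; the variable name totalPar is introduced here purely for illustration:

```chapel
// Sketch: add up the maximum parallelism reported by every locale in the job.
// `totalPar` is a hypothetical name, not part of the lesson code.
var totalPar = 0;
for loc in Locales do
  totalPar += loc.maxTaskPar;   // reads each locale's value from locale 0
writeln("total parallelism across ", numLocales, " locale(s): ", totalPar);
```

The serial loop keeps the accumulation free of race conditions, since only one task ever updates totalPar.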