Parallelizing the Julia set
The Julia set problem was described in one of the earlier sections.
Parallelizing
How would we parallelize this problem with multi-processing? We have a large array, so we can use DistributedArrays and compute it in parallel. Here are the steps:
- Load Distributed on the control process.
- Load DistributedArrays on all processes. The stability array should be distributed:

  stability = dzeros(Int32, height, width);   # distributed 2D array of 0's

- Define the function pixel() on all processes.
- Create fillLocalBlock(stability) to compute the local pieces stability.localpart on each worker in parallel. If you don't know where to start, begin by checking the complete example with fillLocalBlock() from the previous section. This function will cycle through all local indices localindices(stability), and it needs to be defined on all processes.
- Replace the loop

  @btime for i in 1:height, j in 1:width
      point = (2*(j-0.5)/width-1) + (2*(i-0.5)/height-1)im
      stability[i,j] = pixel(point)
  end

  with

  @btime @sync for w in workers()
      @spawnat w fillLocalBlock(stability)
  end

- Why do we need @sync in this for block? (The short sketch after this list illustrates the answer.)
- To the best of my knowledge, both Plots' heatmap() and NetCDF's ncwrite() are serial in Julia, and they cannot take distributed arrays. How do we convert a distributed array to a local array to pass to one of these functions?
- Is your parallel code faster?
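If the role of @sync is unclear, here is a small standalone sketch (separate from the exercise script; it assumes you start Julia with at least 2 workers, e.g. julia -p 2). It shows how a distributed array is split across workers, and why @sync matters: @spawnat returns a Future immediately, so without @sync the loop (and any timing wrapped around it) would finish before the workers complete their tasks.

using Distributed
@everywhere using DistributedArrays

a = dzeros(Int32, 8, 8)   # distributed 8x8 array of zeros

# each worker owns one contiguous block; ask each for its global index ranges
for w in workers()
    println("worker $w owns ", fetch(@spawnat w localindices(a)))
end

# @spawnat returns a Future right away; @sync blocks until all spawned
# tasks have finished, so any timing around this loop is meaningful
@sync for w in workers()
    @spawnat w fill!(localpart(a), myid())   # tag each block with its owner's id
end
println(Array(a))   # gather the distributed array into a regular local array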
To get the full script, click on “Solution” below.
Solution
using Distributed, BenchmarkTools
@everywhere using DistributedArrays
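# pixel(z): number of iterations for z to escape (|z| >= 4), capped at 255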
@everywhere function pixel(z)
c = 0.355 + 0.355im
z *= 1.2 # zoom out
for i = 1:255
z = z^2 + c
if abs(z) >= 4
return i
end
end
return 255
end
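# fillLocalBlock(stability): each worker fills its own local block of the distributed array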
@everywhere function fillLocalBlock(stability)
height, width = size(stability)
h, w = localindices(stability)
for iGlobal in h
iLocal = iGlobal - h.start + 1
y = 2*(iGlobal-0.5)/height - 1
for jGlobal in w
jLocal = jGlobal - w.start + 1
point = (2*(jGlobal-0.5)/width-1) + y*im # rescale to -1:1 in the complex plane
@inbounds stability.localpart[iLocal,jLocal] = pixel(point)
end
end
end
height, width = 2_000, 2_000 # 2000^2 image
println("Computing Julia set ...")
stability = dzeros(Int32, height, width); # distributed 2D array of 0's
@btime @sync for w in workers()
@spawnat w fillLocalBlock(stability)
end
println("Plotting to PNG ...")
using Plots
gr() # initialize the gr backend
ENV["GKSwstype"] = "100" # operate in headless mode
fname = "$(height)x$(width)"
nonDistributed = zeros(Int32, height, width);
nonDistributed[:,:] = stability[:,:]; # heatmap() does not accept the DArray type
png(heatmap(nonDistributed, size=(width,height), color=:gist_ncar), fname)
println("Writing NetCDF ...")
using NetCDF
filename = "test.nc"
isfile(filename) && rm(filename)
nccreate(filename, "stability", "x", collect(1:height), "y", collect(1:width), t=NC_INT, mode=NC_NETCDF4, compress=9);
# reuse the local copy nonDistributed from the plotting step; ncwrite() does not accept the DArray type
ncwrite(nonDistributed, filename, "stability");
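Note: copying with stability[:,:] is one way to localize the data. To the best of my knowledge, the generic Array constructor also gathers a DArray into a regular array:

nonDistributed = Array(stability)   # gather the distributed array into a local Array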
Results for 1000^2
Finally, here are my timings on (some old iteration of) the training cluster:
| Code | Time on login node (p-flavour vCPUs) | Time on compute node (c-flavour vCPUs) |
|---|---|---|
| julia juliaSetSerial.jl (serial runtime) | 147.214 ms | 123.195 ms |
| julia -p 1 juliaSetDistributedArrays.jl (on 1 worker) | 157.043 ms | 128.601 ms |
| julia -p 2 juliaSetDistributedArrays.jl (on 2 workers) | 80.198 ms | 66.449 ms |
| julia -p 4 juliaSetDistributedArrays.jl (on 4 workers) | 42.965 ms | 66.849 ms |
| julia -p 8 juliaSetDistributedArrays.jl (on 8 workers) | 36.067 ms | 67.644 ms |
Lots of things here to discuss!
One could modify our parallel code to offload some of the computation to the control process (rather than computing only on the workers, as we do now), so that you would see a speedup even when running on 2 CPUs (control process + 1 worker).
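Here is a minimal sketch of that idea (assuming dzeros forwards an explicit process list and chunk layout to the underlying DArray constructor, as recent DistributedArrays versions do): distribute the array over all processes, including the control process, and spawn fillLocalBlock() on procs() instead of workers().

# distribute over ALL processes (control + workers), splitting the array along rows
stability = dzeros(Int32, (height, width), procs(), [nprocs(), 1])
@btime @sync for p in procs()
    @spawnat p fillLocalBlock(stability)
end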