Parallelization
Parallelization is available either through distributed memory parallelism based on MPI or through multithreading with OpenMP.
OpenMP
It is straightforward to enable some multithreading support. Note that this will speed up the assembly and evaluation of the spatial operators but, in general, not the linear algebra backend, so the overall speedup of the code might be lower than expected. Since we rely mostly on external packages for the linear algebra, the speedup in this step depends on the multithreading support available in the chosen linear algebra backend - see the discussion on how to switch between linear solver backends to, for example, use the thread-parallel solvers from scipy.
By default only a single thread is used. To enable multithreading simply add
[1]:
from dune.fem import threading
print("Using",threading.use,"threads")
threading.use = 4 # use 4 threads
print("Using",threading.use,"threads")
Using 1 threads
Using 4 threads
At startup a maximum number of threads is selected based on the hardware concurrency. This number can be changed by setting the environment variable DUNE_NUM_THREADS,
which sets both the maximum and the number of threads to use. To query this maximum, or to set the number of threads to use to the maximum, use
[2]:
print("Maximum number of threads available:",threading.max)
threading.useMax()
print("Using",threading.use,"threads")
Maximum number of threads available: 32
Using 32 threads
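Since the maximum is fixed at startup, the environment variable has to be in place before dune.fem is first imported. The following is a minimal sketch of setting it from within the script itself; that DUNE_NUM_THREADS is honored when set this way (rather than in the shell before launching Python) is an assumption here.
import os
# assumption: DUNE_NUM_THREADS is read when dune.fem starts up,
# so it must be set before the first dune import
os.environ["DUNE_NUM_THREADS"] = "8"
from dune.fem import threading
print("Maximum number of threads available:", threading.max)
print("Using", threading.use, "threads")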
MPI
It is straightforward to use MPI for parallelization. This requires a parallel grid; in fact, most of the DUNE grids work in parallel except albertaGrid or polyGrid. Most iterative solvers in DUNE work for parallel runs. Some of the preconditioning methods also work in parallel; a complete list is found at the bottom of the solver discussion.
Running a parallel job can be done using mpirun
mpirun -np 4 python script.py
to use 4 MPI processes. An example of a script that runs in parallel is the Re-entrant Corner Problem.
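Within the script, the rank of the current process and the total number of processes can be queried through the communicator provided by dune.common. A minimal sketch is shown below; comm.rank is also used in the tip that follows, while the availability of comm.size is an assumption based on the usual communicator interface.
from dune.common import comm
# report the size of the parallel run from the root process only;
# comm.size is assumed to give the number of MPI processes
if comm.rank == 0:
    print("running with", comm.size, "MPI processes")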
Tip
When running a parallel program, output to the console should only be produced on a dedicated process, e.g. the process with rank 0. This can be achieved by simply overloading the print function as shown below.
[3]:
from functools import partial
from dune.common import comm
# print can be used as before but will only produce output on rank 0
print = partial(print, flush=True) if comm.rank == 0 else lambda *args, **kwargs: None
Tip
On a cluster where multiple parallel jobs are run simultaneously, it is advisable to use a separate cache per job. This can easily be done by copying an existing cache with pre-compiled modules. See this SLURM batch script for an example of how to do this.
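A possible approach is sketched below: copy the pre-compiled cache into a job-local directory and point the JIT cache there before the first dune import. The paths are hypothetical, and the use of the DUNE_PY_DIR environment variable to select the cache directory is an assumption; the linked batch script shows the actual setup.
import os, shutil
# hypothetical paths for illustration
sharedCache = "/home/user/dune-py-cache"   # existing cache with pre-compiled modules
jobCache = os.path.join("/tmp", "dune-py-" + os.environ.get("SLURM_JOB_ID", "local"))
# copy the existing cache once per job
if not os.path.isdir(jobCache):
    shutil.copytree(sharedCache, jobCache)
# assumption: DUNE_PY_DIR selects the cache directory and is read on the first dune import
os.environ["DUNE_PY_DIR"] = jobCache
import dune.fem  # import only after the environment has been set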
Load balancing
When running distributed memory jobs, load balancing is an issue. The specific load balancing method used depends on the grid implementation. There are two ways to ensure a balanced work load. For computations without dynamic adaptation this only has to be done once, at the beginning of the run.
Tip
Most grid managers will read the grid onto rank zero on construction. While many grids will then perform a load balancing step, it is good practice to add a call to loadBalance
right after the grid construction to guarantee that the grid is distributed between the available processes.
[4]:
from dune.alugrid import aluCubeGrid
from dune.grid import cartesianDomain
from dune.fem.function import partitionFunction
domain = cartesianDomain([0,0], [1,1], [40,40], overlap=1)
gridView = aluCubeGrid( domain, lbMethod=9 )
# distribute work load, no user data is adjusted
gridView.hierarchicalGrid.loadBalance()
# print possible load balancing methods
help( aluCubeGrid )
Help on function aluCubeGrid in module dune.alugrid._grids:
aluCubeGrid(*args, **kwargs)
Create an ALUGrid instance.
Note: This functions has to be called on all cores and the parameters passed should be the same.
Otherwise unexpected behavior will occur.
Parameters:
-----------
constructor means of constructing the grid, i.e. a grid reader or a
dictionary holding macro grid information
dimgrid dimension of grid, i.e. 2 or 3
dimworld dimension of world, i.e. 2 or 3 and >= dimension
comm MPI communication (not yet implemented)
serial creates a grid without MPI support (default False)
verbose adds some verbosity output (default False)
lbMethod load balancing algorithm. Possible choices are (default is 9):
0 None
1 Collect (to rank 0)
4 ALUGRID_SpaceFillingCurveLinkage (assuming the macro
elements are ordering along a space filling curve)
5 ALUGRID_SpaceFillingCurveSerialLinkage (serial version
of 4 which requires the entire graph to fit to one core)
9 ALUGRID_SpaceFillingCurve (like 4 without linkage
storage), this is the default option.
10 ALUGRID_SpaceFillingCurveSerial (serial version
of 10 which requires the entire graph to fit to one core)
11 METIS_PartGraphKway, METIS method PartGraphKway, see
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
12 METIS_PartGraphRecursive, METIS method
PartGraphRecursive, see
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
13 ZOLTAN_LB_HSFC, Zoltan's geometric load balancing based
on a Hilbert space filling curve, see https://sandialabs.github.io/Zoltan/
14 ZOLTAN_LB_GraphPartitioning, Zoltan's load balancing
method based on graph partitioning, see https://sandialabs.github.io/Zoltan/
15 ZOLTAN_LB_PARMETIS, using ParMETIS through Zoltan, see
https://sandialabs.github.io/Zoltan/
lbUnder value between 0.0 and 1.0 (default 0.0)
lbOver value between 1.0 and 2.0 (default 1.2)
Returns:
--------
An ALUGrid instance with given refinement (conforming or nonconforming) and element type (simplex or cube).
Run this code snippet in parallel and then inspect the different domain decompositions. For plotting parallel data, VTK output has to be used. In this case we display the rank information for each partition.
[5]:
vtk = gridView.sequencedVTK("data", celldata=[partitionFunction(gridView)])
vtk()
For load balancing with re-distribution of user data the function dune.fem.loadBalance should be used. This method is similar to the dune.fem.adapt method discussed in the section on adaptivity. See also the crystal growth or the Re-entrant Corner Problem examples.
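The following is a minimal sketch of such a call, assuming that, like dune.fem.adapt, dune.fem.loadBalance accepts a list of discrete functions whose data should follow the redistribution; the space and function names are chosen for illustration only.
from dune.fem import loadBalance
from dune.fem.space import lagrange
# a discrete function holding user data that should be redistributed with the grid
space = lagrange(gridView, order=1)
uh = space.interpolate(0, name="uh")
# redistribute the grid together with the attached data; the list argument
# mirrors the dune.fem.adapt interface and is an assumption here
loadBalance([uh])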