Parallelization is available using either distributed memory based on MPI or multithreading using OpenMP.
It is straightforward to enable some multithreading support. Note that this will speed up assembly and evaluation of the spatial operators but not, in general, the linear algebra backend, so the overall speedup of the code might not be as high as expected. Since we rely mostly on external packages for the linear algebra, the speedup in this step depends on the multithreading support available in the chosen linear algebra backend; see the discussion on how to switch between linear solver backends to, for example, use the thread-parallel solvers from scipy.
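The limited overall speedup can be estimated with Amdahl's law. The following is a rough sketch, not a measurement: the function name `amdahl_speedup` and the runtime shares are assumptions chosen for illustration, where the parallel fraction stands for the share of the runtime spent in the threaded assembly and operator evaluation.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Estimate the overall speedup when only part of the runtime is threaded."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical runtime shares spent in assembly/evaluation (not measured).
for fraction in (0.5, 0.9):
    print(f"threaded share {fraction:.0%}: speedup on 4 threads "
          f"{amdahl_speedup(fraction, 4):.2f}")
```

If, for instance, half of the runtime is spent in a solver that does not use threads, four threads can give at most a factor of 1.6 overall.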
By default only a single thread is used. To enable multithreading simply add
from dune.fem import threading
print("Using",threading.use,"threads")
threading.use = 4 # use 4 threads
print("Using",threading.use,"threads")

Using 1 threads
Using 4 threads
At startup a maximum number of threads is selected based on the hardware concurrency. This number can be changed by setting the environment variable DUNE_NUM_THREADS, which sets both the maximum and the number of threads to use. To get this number, or to set the number of threads to use to the maximum, use
print("Maximum number of threads available:",threading.max)
threading.useMax()
print("Using",threading.use,"threads")

Maximum number of threads available: 12
Using 12 threads
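The selection rule described above can be mimicked in plain Python. This is only a sketch of the described behaviour, not the dune.fem implementation: the helper `max_threads` is hypothetical, and using `os.cpu_count()` as a stand-in for the hardware concurrency is an assumption.

```python
import os


def max_threads(environ=None):
    """Return the thread limit: DUNE_NUM_THREADS if set, else the core count."""
    environ = os.environ if environ is None else environ
    value = environ.get("DUNE_NUM_THREADS")
    if value is not None:
        threads = int(value)
        if threads < 1:
            raise ValueError("DUNE_NUM_THREADS must be a positive integer")
        return threads
    # Assumption: os.cpu_count() approximates the hardware concurrency.
    # It may return None on unusual platforms, so fall back to 1.
    return os.cpu_count() or 1


print("limit without override:", max_threads({}))
print("limit with DUNE_NUM_THREADS=4:", max_threads({"DUNE_NUM_THREADS": "4"}))
```

Since the maximum is selected at startup, the variable has to be set in the environment before the Python process is started.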
It is straightforward to use MPI for parallelization. It requires a parallel grid; in fact, most of the DUNE grids work in parallel except polyGrid. Most iterative solvers in DUNE work for parallel runs. Some of the preconditioning methods also work in parallel; a complete list is found at the bottom of the solver discussion.
Running a parallel job can be done by using
mpirun -np 4 python script.py
in order to use 4 MPI processes. Example scripts that run in parallel include the Re-entrant Corner Problem.
On a cluster where multiple parallel jobs are run simultaneously, it’s advisable to use a separate cache for each job. This is easily done by copying an existing cache with pre-compiled modules. See this SLURM batch script for an example of how to do this.
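A per-job cache can be prepared along the following lines. This is a minimal sketch, not the referenced batch script: the helper `prepare_job_cache` and the directory names are hypothetical, and it assumes that the cache location is selected through the `DUNE_PY_DIR` environment variable and that the scheduler provides `SLURM_JOB_ID`. Both assumptions, and the directory layout expected below `DUNE_PY_DIR`, should be checked against the installation at hand.

```python
import os
import shutil
import tempfile
from pathlib import Path


def prepare_job_cache(source, scratch, job_id):
    """Copy a pre-compiled cache to a job-specific directory and return it."""
    target = Path(scratch) / f"dune-cache-{job_id}"
    # dirs_exist_ok lets a restarted job reuse its earlier copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Stand-in directories so the sketch runs anywhere; on a cluster `source`
# would be an existing cache and `scratch` a job-local file system.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "shared-cache"
    source.mkdir()
    (source / "module.txt").write_text("placeholder for a pre-compiled module")

    job_id = os.environ.get("SLURM_JOB_ID", "local")
    cache = prepare_job_cache(source, Path(tmp) / "scratch", job_id)

    # Assumption: the cache location is read from DUNE_PY_DIR. The environment
    # built here would be passed to the process that runs the simulation.
    job_env = dict(os.environ, DUNE_PY_DIR=str(cache))
    print("job cache:", job_env["DUNE_PY_DIR"])
```

Copying instead of sharing avoids several jobs compiling into the same directory at the same time, while still reusing the modules that were already built.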
When running distributed-memory jobs, load balancing becomes an issue. The specific load balancing method used depends on the grid implementation. There are two ways to ensure a balanced work load. For computations without dynamic adaptation this only has to be done once at the beginning of the run. In fact, most grids do this automatically. For adaptation and load balancing with data see the crystal growth or Re-entrant Corner Problem examples.
from dune.alugrid import aluCubeGrid
from dune.grid import cartesianDomain
from dune.fem.function import partitionFunction
domain = cartesianDomain([0,0], [1,1], [40,40], overlap=1)
gridView = aluCubeGrid( domain, lbMethod=9 )
# distribute work load, no user data is adjusted
gridView.hierarchicalGrid.loadBalance()
# print possible load balancing methods
help( aluCubeGrid )
Help on function aluCubeGrid in module dune.alugrid._grids:

aluCubeGrid(*args, **kwargs)
    Create an ALUGrid instance.

    Note: This functions has to be called on all cores and the parameters passed should be the same.
          Otherwise unexpected behavior will occur.

    Parameters:
    -----------

        constructor  means of constructing the grid, i.e. a grid reader or a
                     dictionary holding macro grid information
        dimgrid      dimension of grid, i.e. 2 or 3
        dimworld     dimension of world, i.e. 2 or 3 and >= dimension
        comm         MPI communication (not yet implemented)
        serial       creates a grid without MPI support (default False)
        verbose      adds some verbosity output (default False)
        lbMethod     load balancing algorithm. Possible choices are (default is 9):
                         0  None
                         1  Collect (to rank 0)
                         4  ALUGRID_SpaceFillingCurveLinkage (assuming the macro
                            elements are ordering along a space filling curve)
                         5  ALUGRID_SpaceFillingCurveSerialLinkage (serial version
                            of 4 which requires the entire graph to fit to one core)
                         9  ALUGRID_SpaceFillingCurve (like 4 without linkage
                            storage), this is the default option.
                         10 ALUGRID_SpaceFillingCurveSerial (serial version
                            of 10 which requires the entire graph to fit to one core)
                         11 METIS_PartGraphKway, METIS method PartGraphKway, see
                            http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
                         12 METIS_PartGraphRecursive, METIS method
                            PartGraphRecursive, see
                            http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
                         13 ZOLTAN_LB_HSFC, Zoltan's geometric load balancing based
                            on a Hilbert space filling curve, see https://sandialabs.github.io/Zoltan/
                         14 ZOLTAN_LB_GraphPartitioning, Zoltan's load balancing
                            method based on graph partitioning, see https://sandialabs.github.io/Zoltan/
                         15 ZOLTAN_LB_PARMETIS, using ParMETIS through Zoltan, see
                            https://sandialabs.github.io/Zoltan/
        lbUnder      value between 0.0 and 1.0 (default 0.0)
        lbOver       value between 1.0 and 2.0 (default 1.2)

    Returns:
    --------

        An ALUGrid instance with given refinement (conforming or nonconforming) and element type (simplex or cube).
Run this code snippet in parallel and then inspect the different domain decompositions. To plot parallel data, VTK output has to be used. Here we write the rank information for each partition as cell data.
vtk = gridView.sequencedVTK("data", celldata=[partitionFunction(gridView)])
vtk()
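To get a feeling for what a decomposition based on a space filling curve looks like without running in parallel, the idea can be imitated in a few lines of Python. This is an illustrative sketch only and not ALUGrid's algorithm: it assumes a Morton (Z-order) curve over the cells of the 40x40 Cartesian domain, and the helpers `morton_index` and `partition_by_curve` are hypothetical names.

```python
def morton_index(i, j, bits=8):
    """Interleave the bits of the cell coordinates (i, j) into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)
        key |= ((j >> b) & 1) << (2 * b + 1)
    return key


def partition_by_curve(nx, ny, ranks):
    """Assign each cell of an nx-by-ny grid to a rank along the Z-order curve."""
    cells = sorted(((i, j) for i in range(nx) for j in range(ny)),
                   key=lambda cell: morton_index(*cell))
    owner = {}
    for position, cell in enumerate(cells):
        # Contiguous curve segments of (almost) equal length go to each rank.
        owner[cell] = position * ranks // len(cells)
    return owner


owner = partition_by_curve(40, 40, 4)
for rank in range(4):
    count = sum(1 for r in owner.values() if r == rank)
    print("rank", rank, "owns", count, "cells")
```

Cells that are close along the curve tend to be close in space, so cutting the curve into equally long pieces yields partitions of equal size that are reasonably compact.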