Parallelization
Parallelization is available either through distributed memory parallelism based on MPI or through multithreading with OpenMP.
OpenMP
It is straightforward to enable some multithreading support. Note that this will speed up the assembly and evaluation of the spatial operators but, in general, not the linear algebra backend, so the overall speedup of the code might be lower than expected. Since we rely mostly on external packages for the linear algebra, the speedup in this step depends on the multithreading support available in the chosen linear algebra backend - see the discussion on how to switch between linear solver backends to, for example, use the thread-parallel solvers from scipy.
By default only a single thread is used. To enable multithreading simply add
[1]:
from dune.fem import threading
print("Using",threading.use,"threads")
threading.use = 4 # use 4 threads
print("Using",threading.use,"threads")
Using 1 threads
Using 4 threads
At startup a maximum number of threads is selected based on the hardware concurrency. This number can be changed by setting the environment variable DUNE_NUM_THREADS,
which sets both the maximum and the number of threads to use. To query this maximum, or to set the number of threads to use to the maximum, use
[2]:
print("Maximum number of threads available:",threading.max)
threading.useMax()
print("Using",threading.use,"threads")
Maximum number of threads available: 32
Using 32 threads
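Since the maximum is fixed at startup, the environment variable has to be in place before dune.fem is first imported. The following is a minimal sketch of setting it from within the script itself; that DUNE_NUM_THREADS is honored when set this way (rather than in the shell before launching Python) is an assumption here.
import os
# assumption: DUNE_NUM_THREADS is read when dune.fem starts up,
# so it must be set before the first dune import
os.environ["DUNE_NUM_THREADS"] = "8"
from dune.fem import threading
print("Maximum number of threads available:", threading.max)
print("Using", threading.use, "threads")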
MPI
It is straightforward to use MPI for parallelization. This requires a parallel grid; in fact, most of the DUNE grids work in parallel except albertaGrid or polyGrid. Most iterative solvers in DUNE work for parallel runs. Some of the preconditioning methods also work in parallel; a complete list is found at the bottom of the solver discussion.
Running a parallel job can be done using mpirun
mpirun -np 4 python script.py
to use 4 MPI processes. An example of a script that runs in parallel is the Re-entrant Corner Problem.
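Within the script, the rank of the current process and the total number of processes can be queried through the communicator provided by dune.common. A minimal sketch is shown below; comm.rank is also used in the tip that follows, while the availability of comm.size is an assumption based on the usual communicator interface.
from dune.common import comm
# report the size of the parallel run from the root process only;
# comm.size is assumed to give the number of MPI processes
if comm.rank == 0:
    print("running with", comm.size, "MPI processes")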
Tip
When running a parallel program, output to the console should only be produced on a dedicated process, e.g. the process with rank 0. This can be achieved by simply overloading the print function as shown below.
[3]:
from functools import partial
from dune.common import comm
# print can be used as before but will only produce output on rank 0
print = partial(print, flush=True) if comm.rank == 0 else lambda *args, **kwargs: None
Tip
On a cluster where multiple parallel jobs are run simultaneously, it is advisable to use a separate cache per job. This can easily be done by copying an existing cache with pre-compiled modules. See this SLURM batch script for an example of how to do this.
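A possible approach is sketched below: copy the pre-compiled cache into a job-local directory and point the JIT cache there before the first dune import. The paths are hypothetical, and the use of the DUNE_PY_DIR environment variable to select the cache directory is an assumption; the linked batch script shows the actual setup.
import os, shutil
# hypothetical paths for illustration
sharedCache = "/home/user/dune-py-cache"   # existing cache with pre-compiled modules
jobCache = os.path.join("/tmp", "dune-py-" + os.environ.get("SLURM_JOB_ID", "local"))
# copy the existing cache once per job
if not os.path.isdir(jobCache):
    shutil.copytree(sharedCache, jobCache)
# assumption: DUNE_PY_DIR selects the cache directory and is read on the first dune import
os.environ["DUNE_PY_DIR"] = jobCache
import dune.fem  # import only after the environment has been set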
Load balancing
When running distributed memory jobs, load balancing is an issue. The specific load balancing method used depends on the grid implementation. There are two ways to ensure a balanced work load. For computations without dynamic adaptation this only has to be done once, at the beginning of the run.
Tip
Most grid managers will read the grid onto rank zero on construction. While many grids will then perform a load balancing step, it is good practice to add a call to loadBalance
right after the grid construction to guarantee that the grid is distributed between the available processes.
[4]:
from dune.alugrid import aluCubeGrid
from dune.grid import cartesianDomain
from dune.fem.function import partitionFunction
domain = cartesianDomain([0,0], [1,1], [40,40], overlap=1)
gridView = aluCubeGrid( domain, lbMethod=9 )
# distribute work load, no user data is adjusted
gridView.hierarchicalGrid.loadBalance()
# print possible load balancing methods
help( aluCubeGrid )
Help on function aluCubeGrid in module dune.alugrid._grids:
aluCubeGrid(*args, **kwargs)
Create an ALUGrid instance.
Note: This functions has to be called on all cores and the parameters passed should be the same.
Otherwise unexpected behavior will occur.
Parameters:
-----------
constructor means of constructing the grid, i.e. a grid reader or a
dictionary holding macro grid information
dimgrid dimension of grid, i.e. 2 or 3
dimworld dimension of world, i.e. 2 or 3 and >= dimension
comm MPI communication (not yet implemented)
serial creates a grid without MPI support (default False)
verbose adds some verbosity output (default False)
lbMethod load balancing algorithm. Possible choices are (default is 9):
0 None
1 Collect (to rank 0)
4 ALUGRID_SpaceFillingCurveLinkage (assuming the macro
elements are ordering along a space filling curve)
5 ALUGRID_SpaceFillingCurveSerialLinkage (serial version
of 4 which requires the entire graph to fit to one core)
9 ALUGRID_SpaceFillingCurve (like 4 without linkage
storage), this is the default option.
10 ALUGRID_SpaceFillingCurveSerial (serial version
of 10 which requires the entire graph to fit to one core)
11 METIS_PartGraphKway, METIS method PartGraphKway, see
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
12 METIS_PartGraphRecursive, METIS method
PartGraphRecursive, see
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
13 ZOLTAN_LB_HSFC, Zoltan's geometric load balancing based
on a Hilbert space filling curve, see https://sandialabs.github.io/Zoltan/
14 ZOLTAN_LB_GraphPartitioning, Zoltan's load balancing
method based on graph partitioning, see https://sandialabs.github.io/Zoltan/
15 ZOLTAN_LB_PARMETIS, using ParMETIS through Zoltan, see
https://sandialabs.github.io/Zoltan/
lbUnder value between 0.0 and 1.0 (default 0.0)
lbOver value between 1.0 and 2.0 (default 1.2)
Returns:
--------
An ALUGrid instance with given refinement (conforming or nonconforming) and element type (simplex or cube).
Run this code snippet in parallel and then inspect the different domain decompositions. For plotting parallel data, VTK output has to be used. In this case we display the rank information for each partition.
[5]:
vtk = gridView.sequencedVTK("data", celldata=[partitionFunction(gridView)])
vtk()
For load balancing with re-distribution of user data the function dune.fem.loadBalance should be used. This method is similar to the dune.fem.adapt method discussed in the section on adaptivity. See also the crystal growth or the Re-entrant Corner Problem examples.
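The following is a minimal sketch of such a call, assuming that, like dune.fem.adapt, dune.fem.loadBalance accepts a list of discrete functions whose data should follow the redistribution; the space and function names are chosen for illustration only.
from dune.fem import loadBalance
from dune.fem.space import lagrange
# a discrete function holding user data that should be redistributed with the grid
space = lagrange(gridView, order=1)
uh = space.interpolate(0, name="uh")
# redistribute the grid together with the attached data; the list argument
# mirrors the dune.fem.adapt interface and is an assumption here
loadBalance([uh])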