Domain Decomposition
Lattice Problems
Partial Differential Equations (PDEs) are encountered in virtually all sciences and engineering disciplines, and even in the world of business. We now want to devise a way to implement the solution of these equations on parallel computers. This leads to domain decomposition.
The numerical solution of PDEs almost always implies a numerical lattice describing the domain over which we seek a solution. A natural (and often the only) way to parallelize the task of solving PDEs is to parcel the global domain into sub-domains and to let each processor of the parallel computer handle an individual sub-domain; this is domain decomposition.
Note that in solving so-called trivially parallelizable problems on parallel computers, e.g., the Mandelbrot set, the difficulty lies in ensuring load balancing; there is typically little node-to-node message passing. In domain decomposition, just the opposite occurs: the minimization of communication is the major hurdle to achieving an efficient parallel code. Load balancing issues arise mostly from complicated geometry.
We will next review (again) the classification of PDEs and mention the appropriate domain decomposition in each case. We will then describe briefly the implementation of domain decomposition on parallel computers.
Domain Decomposition
PDEs are traditionally classified into three broad categories: elliptic, parabolic, and hyperbolic equations. However, from the numerical point of view, PDEs divide into static equations (elliptic) and time-dependent equations (parabolic and hyperbolic).
The prototypical elliptic equation is the Poisson equation
\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = \rho(x,y,z) \]
where $u(x,y,z)$ is the field (potential) we look for and $\rho(x,y,z)$ represents a source term. This equation is encountered in electrostatics, gravitation, etc. The solution is static (time independent) and requires the specification of $u(x,y,z)$ over a 3-dimensional domain $\{x,y,z\}$, subject to boundary conditions.
The simplest domain decomposition consists of decomposing the cubic domain into smaller cubes equal in number to the number of processors. The difficulty, as far as load balancing is concerned, arises if the domain is not a cube. Think of what you would do if the domain were rectangular, or worse, of arbitrary shape! The amount of computing needed to handle each sub-domain obviously depends on the number of lattice points it contains, so an equal number of lattice points in each sub-domain is necessary for load balance.
The hyperbolic and parabolic equations are quite different in that they represent initial value problems. Namely, the wave equation (hyperbolic)
\[ \frac{\partial^2 u}{\partial t^2} = v^2 \frac{\partial^2 u}{\partial x^2}, \]
where $v$ is the constant velocity of wave propagation, and the diffusion equation (parabolic)
\[ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x} \left( D \frac{\partial u}{\partial x} \right), \]
where $D$ is the diffusion coefficient, are such that we ask for the value of the field $u(x,t)$ at a later time $t$, knowing the field at an initial time $t=0$ and subject to specific boundary conditions at all times.
In this case, the domain decomposition applies to the space variables. The time evolution is essentially sequential: $u(t+dt, x)$ is obtained from $u(t,x)$, and this stepping cannot be parallelized. The same is true in obtaining the solution of ODEs.
Parallel Implementation
Domain decomposition is implemented on a parallel computer by letting each processor handle a sub-domain (or at most a few sub-domains). Therefore the choice of these sub-domains becomes crucial in obtaining good performance. The strategy in choosing the sub-domains must be guided by the following requirements:
- Equal work done in each sub-domain, for load balancing.
- Minimal node-to-node communication.
Most of the time, geometrical and/or physical considerations are what will guide the choice of the sub-domains.
Load balancing is often easy to achieve if the geometry of the domain is simple, e.g., cubic or rectangular. Such domains naturally lead to the same number of lattice points in each sub-domain. If the domain is irregular, the solution is often to pad the domain into a regular one, or sometimes to linearize the mesh-point accessing scheme (as done in finite-element approaches). This leads to complications and inefficiencies in the implementation; we will not discuss this problem further here and will assume a simple geometry in what follows.
The minimization of node-to-node communication is the major bottleneck in domain decomposition. The problem stems from the calculation of derivatives in finite-differencing schemes (the same applies to finite-element approaches). Consider the 3-point forms for the first and second derivatives of a function:
\[ \frac{\partial f}{\partial x} = \frac{f_{i+1} - f_{i-1}}{2 h} \]
and
\[ \frac{\partial^2 f}{\partial x^2} = \frac{f_{i+1} - 2 f_i + f_{i-1}}{h^2} \]
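As a concrete illustration, these stencils translate into C as follows; this is a minimal sketch, with f the sampled function values, h the uniform grid spacing, and i an interior index (all names chosen for illustration):

/* 3-point stencils for the first and second derivatives of a
   function f sampled with uniform spacing h, at interior index i. */
double deriv1(const double *f, int i, double h)
{
    return (f[i+1] - f[i-1]) / (2.0 * h);
}

double deriv2(const double *f, int i, double h)
{
    return (f[i+1] - 2.0 * f[i] + f[i-1]) / (h * h);
}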
In domain decomposition, each sub-domain is handled by a different processor. At the boundary of a sub-domain, the values of the field at the neighboring lattice points reside in the local memory of the processors handling the adjacent sub-domains; the calculation of the derivatives there therefore requires communication.
Each sub-domain (at each time step) therefore becomes a mini-problem to solve; this mini-problem is of the same type as the larger problem, but with boundary conditions defined by the values of the field in the neighboring sub-domains. A two-dimensional problem looks schematically as follows:
The implementation of the domain decomposition algorithm only requires the MPI calls we have seen in the previous sections, yet the book-keeping can become quite complicated in higher-dimensional problems.
Elliptic Equation — 1D Poisson Equation
In this section we solve the Poisson equation in 1D as a simple example of domain decomposition. The distribution of sub-tasks among the processes is typical of what is done in most parallel implementations of PDE solvers.
The Model
We will solve the equation
\[ \frac{\partial^2 j(x)}{\partial x^2} = s(x) \]
The function $j(x)$ could represent the electric potential due to a charge distribution $s(x)$, or the steady state temperature profile in the presence of a heat source $s(x)$.
The domain is 1D, and $x \in [0,1]$.
The boundary conditions are assumed to be of Dirichlet type, namely $j(0)$ and $j(1)$ are given. In addition, the function $j(x)$ can also be specified at specific points in the interior of the domain. The source is also given; in what follows it is taken to be a simple Gaussian centered at the mid-point of the domain.
The domain is discretized on a 1D numerical lattice, with $x_i = (i-1) h$ and $h = 1/(N-1)$, where $N$ is the number of grid points. The Poisson equation is re-written in finite-difference form
\[ \frac{j_{i+1} - 2 j_i + j_{i-1}}{h^2} = s_i \]
This leads to a large set of linear equations for $j_i$, the field values on the interior grid points, with $j_i$ kept fixed at the boundary. This set of equations can be solved by an iterative method, whereby
\[ j_i = \frac{1}{2} \left( j_{i+1} + j_{i-1} - h^2 s_i \right) \]
is iterated "in place" until the changes in $j_i$ become smaller than some predefined tolerance criterion. An arbitrary guess for $j_i$ is used to start. The method is stable in that the solution will converge. The old and new values of $j_i$ are mixed to accelerate convergence, leading to the "over-relaxation" scheme. The accuracy of the solution is second order in $h$.
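As a concrete illustration, here is a minimal C sketch of this iteration loop. The names (relax, chi for the mixing parameter, eps for the tolerance) are patterned on the description above but are not necessarily those used in poisson_1d.c; chi = 1 gives the plain scheme, and 1 < chi < 2 over-relaxes.

#include <math.h>

/* Over-relaxed, in-place iteration (a sketch).
   j[0..n-1]: field, with j[0] and j[n-1] held fixed (Dirichlet);
   s[0..n-1]: source; h: grid spacing; chi: relaxation parameter;
   eps: tolerance; max_iter: iteration cap.
   Returns the number of iterations performed. */
int relax(double *j, const double *s, int n, double h,
          double chi, double eps, int max_iter)
{
    double change = 2.0 * eps;
    int iter;
    for (iter = 0; iter < max_iter && change > eps; iter++) {
        change = 0.0;
        for (int i = 1; i < n - 1; i++) {
            double jnew = 0.5 * (j[i+1] + j[i-1] - h * h * s[i]);
            double diff = jnew - j[i];
            j[i] += chi * diff;            /* mix old and new values */
            if (fabs(diff) > change)
                change = fabs(diff);
        }
    }
    return iter;
}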
Serial Implementation
The code poisson_1d.c implements the above scheme. The steps in the code are:
- Initialize the numerical grid.
- Provide an initial guess for the solution.
- Set the boundary values & source term.
- Iterate the solution until convergence.
- Output the solution for plotting.
Compile and execute poisson_1d.c with

gcc poisson_1d.c -lm -o poisson_1d
./poisson_1d 5000 100 > solution_1d.dat

where 5000 and 100 are the (arbitrary) maximum number of iterations and the number of grid points, respectively. The code stops once good convergence (parameter EPS) is obtained. Use Gnuplot to graph the solution.
Note:
- The chi parameter which implements the over-relaxation approach is chosen arbitrarily for simplicity (see Numerical Recipes for an optimal choice).
- The exact solution of the 1D Poisson equation with no source is a straight line. This is almost what is obtained at small $x$, where the source term is small.
- The use of malloc for memory management is encouraged here to prepare for the parallel version.
- The code is written in the simplest possible way, with no functions, to make it easy to read. Functions will be introduced in the parallel implementation.
Parallel Implementation
Domain decomposition consists in dividing the original grid among the processes, with an equal number of grid points in each process for good load balance. If the number of grid points does not divide equally among the processes, the extra grid points are distributed, one each, to a subset of the processes, as indicated schematically in the following figure.
This results in the following arrangement on each processor
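In code, this distribution can be computed with a standard balanced-partition idiom, sketched below; the names (partition, imin, imax) are illustrative and not necessarily those of the course code. Grid indices run from 1 to N, as in the serial version.

/* Balanced 1D partition (a sketch): each process gets N/nproc grid
   points, and the first N%nproc processes get one extra point each.
   On return, [*imin, *imax] is the local index range of this rank. */
void partition(int N, int nproc, int rank, int *imin, int *imax)
{
    int base  = N / nproc;    /* points every process gets        */
    int extra = N % nproc;    /* leftover points, one per process */
    if (rank < extra) {
        *imin = 1 + rank * (base + 1);
        *imax = *imin + base;          /* base + 1 points */
    } else {
        *imin = 1 + extra * (base + 1) + (rank - extra) * base;
        *imax = *imin + base - 1;      /* base points     */
    }
}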
The calculation of the derivative terms in the finite-difference scheme requires the field values at neighboring grid points to be known. Therefore, the values of the field at the artificial boundaries between sub-domains must be transmitted between neighboring processes, as indicated in the figure. This must be done at each iteration of the calculation, leading to a large amount of communication.
There are many issues involved in setting up a parallel code to solve PDEs. Here are some of the most important ones:
- Minimize communication.
- Done by choosing the geometry well: for instance, in 2D a stripe or a "square" geometry can be chosen. The stripe case is definitely easier to program and can make use of an arbitrary number of processors. The square topology for the sub-domains is restricted to $n_1 \times n_2$ processors. Questions left for the reader to ponder: which topology leads to fewer perimeter values to transmit at each iteration? What about transmission startup (latency) overhead costs?
- The parallel implementation puts all processes on the same footing: no master-slave concepts here!
- If parts of the calculation need repeating on different processes to save communication time, let it be so. This is done in the code for the grid and sub-domain calculations, whose results are fast to compute and are needed in all processes.
- Communication blocking must be avoided.
  - Blocking may result if all processes are requested to send the field variables for the GHOST positions at the same time, and then subsequently requested to receive the values. Think about it: the sends will block until the messages are received, yet the processes won't receive until the sends are completed.
  - This problem is solved by using an odd-even transmission scheme, shown schematically as follows (an MPI sketch of the exchange appears after the schematic):
- even processes
  - send to right (top) neighbor
  - recv from left (bottom) neighbor
  - send to left (bottom) neighbor
  - recv from right (top) neighbor
- odd processes
  - recv from left (bottom) neighbor
  - send to right (top) neighbor
  - recv from right (top) neighbor
  - send to left (bottom) neighbor
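In MPI, this odd-even exchange might look as follows for the 1D case. This is a sketch: phi[imin-1..imax+1] is the local field including the GHOST cells, and left/right are the neighbor ranks (or MPI_PROC_NULL at the domain ends, which turns the corresponding send/recv into a no-op).

#include <mpi.h>

/* Odd-even exchange of GHOST values in 1D (a sketch). */
void exchange_ghosts(double *phi, int imin, int imax,
                     int rank, int left, int right, MPI_Comm comm)
{
    MPI_Status st;
    if (rank % 2 == 0) {          /* even processes: send first   */
        MPI_Send(&phi[imax],   1, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(&phi[imin-1], 1, MPI_DOUBLE, left,  0, comm, &st);
        MPI_Send(&phi[imin],   1, MPI_DOUBLE, left,  1, comm);
        MPI_Recv(&phi[imax+1], 1, MPI_DOUBLE, right, 1, comm, &st);
    } else {                      /* odd processes: receive first */
        MPI_Recv(&phi[imin-1], 1, MPI_DOUBLE, left,  0, comm, &st);
        MPI_Send(&phi[imax],   1, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(&phi[imax+1], 1, MPI_DOUBLE, right, 1, comm, &st);
        MPI_Send(&phi[imin],   1, MPI_DOUBLE, left,  1, comm);
    }
}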
- Keeping track of the grid indices in each sub-domain, and of which processes contain neighboring sub-domains, can be a difficult task, especially in higher-dimensional problems and when complicated geometries are involved. Be careful!
- The arrays containing the field variables in the sub-domains must include extra space for the GHOST values.
  - The 1D parallel code uses "large" array space for the fields in the sub-domains. This choice makes it possible for the index range in a sub-domain to follow that of the global domain, i.e., imin to imax, and also easily allows the inclusion of the GHOST values in the array, i.e., dimension imin-1 to imax+1; but it does waste memory space. A leaner allocation that keeps the same indexing is sketched below.
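For completeness, here is a sketch of the leaner allocation just mentioned: it stores only the local points plus the two GHOST cells while preserving the global index notation, using a Numerical Recipes-style pointer offset. The function name is illustrative; the course code instead allocates the full global range.

#include <stdlib.h>

/* Allocate phi[imin-1 .. imax+1] only, yet allow global indexing,
   by shifting the base pointer (a sketch). */
double *alloc_field(int imin, int imax)
{
    double *buf = calloc((size_t)(imax - imin + 3), sizeof *buf);
    return buf ? buf - (imin - 1) : NULL;  /* phi[imin-1] is buf[0] */
}

/* The matching deallocation is free(phi + imin - 1). */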
- Calculating the changes in the fields over the entire domain from one iteration to the next requires global operations.
  - For this, MPI provides global operation routines. We use MPI_Allreduce(), which computes the overall maximum (MPI_MAX) among the values of a single variable across the different processes and makes this value known to all of them, as sketched below.
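A minimal sketch of such a convergence test; each process passes in the maximum change it measured over its own sub-domain during the sweep:

#include <mpi.h>

/* Global convergence test (a sketch): MPI_Allreduce with MPI_MAX
   returns the overall maximum change to every process, so that all
   processes take the same stopping decision. */
int converged(double local_change, double eps)
{
    double global_change;
    MPI_Allreduce(&local_change, &global_change, 1,
                  MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_change < eps;
}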
- The real time and memory savings in using parallel computers to solve PDEs only become evident in higher dimensions, where the amount of calculation per iteration is large compared to the transmission overhead introduced by the parallel implementation. For instance, do not be surprised by the fact that the 1D serial code runs faster than the parallel code.
- Will the parallel code give exactly the same answer as the serial code? In the current implementation, the code gives the same answer only to the accuracy specified for convergence, i.e., EPS in the code. Namely, as written, the serial and parallel codes do not give exactly the same solution to within machine accuracy. This is due to the order in which the calculation is done, the GHOST values being the "old" field values instead of the "updated" ones. The solution to this problem requires a change in the algorithm. Watch out: not all algorithms transcribe exactly to a parallel environment.
The flowchart of the parallel code is given below:
Compile and execute the parallel code, poisson_parallel_1d.c, with
mpicc poisson_parallel_1d.c -lm -o poisson_parallel_1d
mpiexec -np 4 ./poisson_parallel_1d 5000 100 > solution_parallel_1d.dat