Global Operations
Global operations are those that affect all processes at once. This section covers some of the operations provided by MPI. These routines are highly optimized for the hardware in use and are likely to be more efficient than equivalent "hand-coded" operations.
Synchronization
int MPI_Barrier ( MPI_Comm comm )
A call that causes each process in the communicator comm to block until all processes have called it. It is useful for timing codes, and is also used in preparation for transmission operations. This routine is time consuming and should be avoided under most conditions; recall that synchronous message-sending routines provide synchronization implicitly.
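A common use is timing: calling a barrier immediately before reading the clock ensures that all processes start and stop the timer around the same region of code. A minimal sketch (requires an MPI runtime, e.g. mpirun with several processes):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* line up all processes */
    t0 = MPI_Wtime();

    /* ... code section to be timed ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed time: %g seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}
```

The second barrier makes the measured interval reflect the slowest process, which is usually what matters in a parallel code.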
Broadcast
int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
The process specified by root sends the content of buffer, count elements of type datatype, to all processes in the communicator comm. This is a very efficient way to send information from any process to all other processes in a communicator. The time used by this routine is much shorter than the time needed for sequential calls to MPI_Ssend/MPI_Recv, especially for a large number of processors.
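A sketch of typical usage (the array size 4 and the choice of process 0 as root are arbitrary; the buffer must exist on every process, and every process makes the same call):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double data[4];          /* buffer must be allocated on every process */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {         /* only the root fills the buffer */
        for (int i = 0; i < 4; i++)
            data[i] = 1.5 * i;
    }

    /* all processes, root included, call MPI_Bcast */
    MPI_Bcast(data, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* after the call, every process holds the root's data */
    printf("process %d received data[3] = %g\n", rank, data[3]);

    MPI_Finalize();
    return 0;
}
```

Note that MPI_Bcast appears on both the sending and the receiving processes; the root argument, not the call site, distinguishes sender from receivers.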
Reduce
The need to find a global maximum, minimum, sum, product, or other "global" operation on a set of numbers distributed across processes arises often. MPI provides general-purpose reduce commands to perform these tasks.
int MPI_Reduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, int target_process, MPI_Comm comm )
int MPI_Allreduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, MPI_Comm comm )
These commands perform the operation specified by operator on count elements of the local array local_value and store the results in the array results.
The difference between the MPI_Allreduce() and MPI_Reduce() routines is the disposition of the final result; MPI_Reduce() leaves the final result only on target_process, while MPI_Allreduce() broadcasts the result to all the nodes in the communicator.
The pre-defined operations are
MPI_MAX - Maximum
MPI_MIN - Minimum
MPI_SUM - Sum
MPI_PROD - Product
MPI_LAND - Logical and
MPI_BAND - Bitwise and
MPI_LOR - Logical or
MPI_BOR - Bitwise or
MPI_LXOR - Logical exclusive or
MPI_BXOR - Bitwise exclusive or
MPI_MAXLOC - Maximum and location
MPI_MINLOC - Minimum and location
Sample Code
The code global_mpi_operations.c illustrates the use of the MPI global operation routines. Note the syntax of the calls.
Binary Tree Algorithm — Broadcast
The MPI global operations routines are based on clever algorithms for efficiency. As a learning tool, we look here at a possible implementation of a broadcast algorithm. It is based on the fact that a binary tree ordering of the nodes leads to an efficient broadcasting strategy, which can significantly reduce the time needed to send a message to all the nodes in a parallel system. Such a tree, rooted at node 0 and written for a total of 8 nodes, is illustrated as follows
Node zero sends the info to node 1. Then nodes 0 and 1 send the info to nodes 2 and 3, respectively. Then nodes 0, 1, 2, and 3 send the info to nodes 4, 5, 6, and 7, respectively. This strategy saves a large factor in time compared to a sequential series of sends. This is illustrated as follows:
Each time unit in this diagram includes the overhead time (handshaking protocol) as well as the time necessary to send the info. The larger the number of processors, the more significant the saving: a sequential broadcast to P processes takes P-1 time units, while the tree needs only log2(P) (3 steps instead of 7 for 8 nodes).
A skeleton implementation of this algorithm is given in the code print_tree.c.
The transmission tree translates into the following rules (generation refers to the time sequence of receive/send in the tree — generation ranges over 1 to 3 for a tree with up to 8 nodes, nodes 0 to 7).

| if | direction | partner |
|---|---|---|
| 2^(generation-1) <= my_rank < 2^generation | receive from | my_rank - 2^(generation-1) |
| my_rank < 2^(generation-1) | send to | my_rank + 2^(generation-1) |
The code print_tree.c implements these rules.
The example code from this section is bundled in global_operations.tar.gz.