Global Operations
Global operations are those that affect all processes at once. This section covers some of the operations provided by MPI. These routines are highly optimized for the hardware in use and are likely to be more efficient than equivalent "hand-coded" operations.
Synchronization
int MPI_Barrier ( MPI_Comm comm )
A call that causes each process in the communicator comm to block until all processes have called it. It is useful for timing codes, and is also used in preparation for transmission operations. This routine is time consuming and should be avoided under most conditions; recall that synchronous message-sending routines provide synchronization implicitly.
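A common use is timing: calling a barrier immediately before reading the clock ensures that all processes start and stop the timer around the same region of code. A minimal sketch (requires an MPI runtime, e.g. mpirun with several processes):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* line up all processes */
    t0 = MPI_Wtime();

    /* ... code section to be timed ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed time: %g seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}
```

The second barrier makes the measured interval reflect the slowest process, which is usually what matters in a parallel code.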
Broadcast
int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
The process specified by root sends the content of buffer, count elements of type datatype, to all processes in the communicator comm. This is a very efficient way to send information from any process to all other processes in a communicator. The time used by this routine is much shorter than the time needed for sequential calls to MPI_Ssend/MPI_Recv, especially for a large number of processors.
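A sketch of typical usage (the array size 4 and the choice of process 0 as root are arbitrary; the buffer must exist on every process, and every process makes the same call):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double data[4];          /* buffer must be allocated on every process */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {         /* only the root fills the buffer */
        for (int i = 0; i < 4; i++)
            data[i] = 1.5 * i;
    }

    /* all processes, root included, call MPI_Bcast */
    MPI_Bcast(data, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* after the call, every process holds the root's data */
    printf("process %d received data[3] = %g\n", rank, data[3]);

    MPI_Finalize();
    return 0;
}
```

Note that MPI_Bcast appears on both the sending and the receiving processes; the root argument, not the call site, distinguishes sender from receivers.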
Reduce
The need to find a global maximum, minimum, sum, product, or other "global" operation on a set of numbers distributed across processes arises often. MPI provides general-purpose reduce commands to perform these tasks.
int MPI_Reduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, int target_process, MPI_Comm comm )
int MPI_Allreduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, MPI_Comm comm )
These commands perform the operation specified by operator on count elements of the local array local_value and store the results in the array results.
The difference between the MPI_Allreduce() and MPI_Reduce() routines is the disposition of the final result; MPI_Reduce() leaves the final result only on target_process, while MPI_Allreduce() broadcasts the result to all the nodes in the communicator.
The pre-defined operations are
MPI_MAX - Maximum
MPI_MIN - Minimum
MPI_SUM - Sum
MPI_PROD - Product
MPI_LAND - Logical and
MPI_BAND - Bitwise and
MPI_LOR - Logical or
MPI_BOR - Bitwise or
MPI_LXOR - Logical exclusive or
MPI_BXOR - Bitwise exclusive or
MPI_MAXLOC - Maximum and location
MPI_MINLOC - Minimum and location
Sample Code
The code global_mpi_operations.c illustrates the use of the MPI global operation routines. Note the syntax of the calls.
Binary Tree Algorithm — Broadcast
The MPI global operations routines are based on clever algorithms for efficiency. As a learning tool, we look here at a possible implementation of a broadcast algorithm. It is based on the fact that a binary tree ordering of the nodes leads to an efficient broadcasting strategy, which can significantly reduce the time needed to send a message to all the nodes in a parallel system. Such a tree, rooted at node 0 and written for a total of 8 nodes, is illustrated as follows
Node zero sends the info to node 1. Then nodes 0 and 1 send the info to nodes 2 and 3, respectively. Then nodes 0, 1, 2, and 3 send the info to nodes 4, 5, 6, and 7, respectively. This strategy saves a large factor in time compared to a sequential series of sends. This is illustrated as follows:
Each time unit in this diagram includes the overhead time (handshaking protocol) as well as the time necessary to send the info. The larger the number of processors, the more significant the saving: a sequential broadcast to P processes takes P-1 time units, while the tree needs only log2(P) (3 steps instead of 7 for 8 nodes).
A skeleton implementation of this algorithm is given in the code print_tree.c.
The transmission tree translates into the following rules (generation refers to the time sequence of receive/send in the tree — generation ranges over 1 to 3 for a tree with up to 8 nodes, nodes 0 to 7).

| if | direction | partner |
|---|---|---|
| 2^(generation-1) <= my_rank < 2^generation | receive from | my_rank - 2^(generation-1) |
| my_rank < 2^(generation-1) | send to | my_rank + 2^(generation-1) |
The code print_tree.c implements these rules.
The example code from this section is bundled in global_operations.tar.gz.