Point to Point Communication
- Send modes
- Receive mode
- Communication Envelope
- Deadlock
- Timing
- The Cost of Communication
- Simple Examples
- Average Example
Send modes
Type | Description | Function |
---|---|---|
Synchronous | completes only when the matching receive has completed | MPI_Ssend |
Buffered | completes once the message is copied into the user-supplied buffer (immediate) | MPI_Bsend |
Standard | either synchronous or buffered | MPI_Send |
Ready | always completes (immediate); assumes the matching receive is already posted | MPI_Rsend |
Synchronous
The sending process expects a handshake from the receiving process acknowledging receipt of the message. This is the safest of the send modes, but it can be wasteful in time. The sending and receiving nodes are synchronized.
Buffered
The sending node writes to a user-defined MPI buffer for temporary storage until the message is delivered by MPI. The sending and receiving nodes are NOT synchronized. The analogy is dropping a message in a post office mailbox to be picked up later. The user must define buffer space via calls similar to the C routines malloc and free.
MPI_Buffer_attach( void *buffer, int size )
MPI_Buffer_detach( void *buffer, int *size )
The size must include an overhead of MPI_BSEND_OVERHEAD bytes. Note also that if many sends occur in a row, the buffer must be large enough to store all the messages, since nothing guarantees that the receives complete while the different messages are being sent.
Standard
The standard MPI send is either of synchronous or buffered type. The issue here is the capacity of the network (i.e., of the MPI daemons) to store the message content temporarily. Large messages may imply a buffered scenario, whereby MPI creates and frees a buffer area large enough to contain the message.
Ready
A dangerous mode in which NO handshaking takes place; MPI assumes that the receiving process is ready to receive the message sent by the sending node (i.e., that the matching receive has already been posted). This mode is of course the fastest, but it should only be used with special precautions.
Syntax
MPI_Send( void *buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm com )
- buffer: message address
- count: number of elements in the message
- datatype: one of the MPI datatypes
- dest: target process to receive the message
- tag: a message identification tag
- com: the communicator to which both send and receive processes belong
buffer refers to the address of any region in memory. An array name (the starting address of the array), or a single variable (its address, &variable), is what is expected. count elements of type datatype starting at buffer will be sent by the MPI send routine.
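For example, to send ten doubles from an array a to process 1 (the tag value 99 is arbitrary), one could write:

double a[10];

MPI_Send( a, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD );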
Receive mode
There is only one MPI receive mode, MPI_Recv. The routine is blocking, i.e., it will block until a message is received. The user may specify a source node or a tag value for the expected message, or accept any message.
Syntax
MPI_Recv( void *buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm com, MPI_Status *status )
buffer, count, datatype, source, tag, and com have the same meaning as in the send routine, except that source specifies the sending process and buffer is the location where the MPI receive routine is to store the incoming information. status contains information about the message, the communication envelope.

MPI_ANY_SOURCE and MPI_ANY_TAG are used when messages are to be received from any source or with any tag.
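For example, to receive up to ten doubles from any source and with any tag (variable names are arbitrary):

double a[10];
MPI_Status status;

MPI_Recv( a, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );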
Communication Envelope
The message envelope contains at least:
- the rank of the receiver process
- the rank of the sender process
- the message tag
- the communicator under which the message was sent
The message envelope is constructed by the send routines. A message is therefore the data + an envelope.
MPI can provide information about the size, origin, and tag of any received message through access to the message envelope. The variable status provides this information. The mpi.h file defines the structure of this variable as follows.
/* Status object. It is the only user-visible MPI data structure.
   The "count" field is PRIVATE; use MPI_Get_count to access it. */
typedef struct {
    int count;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    int private_count;
} MPI_Status;
This variable is visible from the user code. For instance, status.MPI_SOURCE tells you the rank of the sending process.

Note that the size of the data, status.count, is private. It can therefore only be obtained via the MPI query routine
MPI_Get_count( MPI_Status *status, MPI_Datatype datatype, int *count )
The number of items received can be less than the count given in the MPI_Recv call; the latter only specifies the maximum size of the receiving buffer.
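A short sketch of querying the envelope after a receive (the buffer size of 100 is arbitrary):

int buffer[100], nreceived;
MPI_Status status;

MPI_Recv( buffer, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
MPI_Get_count( &status, MPI_INT, &nreceived );   /* actual number of ints received (<= 100) */
printf( "got %d ints from process %d with tag %d\n",
        nreceived, status.MPI_SOURCE, status.MPI_TAG );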
Deadlock
It is recommended that the synchronous send routine, MPI_Ssend, be used to send messages. The advantage of using this routine is that it does not depend on the MPI buffer or on pre-synchronization of the sending and receiving processes.
A danger in writing a parallel code is to create a deadlock in the communication channels: a node is blocked from going to the next statement in the code by an MPI_Ssend because the receiving node never reaches its matching MPI_Recv statement, being itself blocked in its own MPI_Ssend.
ith process:
MPI_Ssend( ..., j, ... );
MPI_Recv( ..., j, ... );
jth process:
MPI_Ssend( ..., i, ... );
MPI_Recv( ..., i, ... );
A regular MPI_Send() may avoid this problem if MPI buffers the messages. But the safe way to handle it is to use a Red-Black or Odd-Even scheme, whereby the order of send/receive is inverted for the odd and even processes. Schematically:
even processes:
MPI_Ssend( ... );  // sends to odd processes
MPI_Recv( ... );
odd processes:
MPI_Recv( ... );   // receives from even processes
MPI_Ssend( ... );
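A sketch of this ordering for a ring in which every process sends one double to its right neighbor and receives one from its left neighbor (rank and size come from MPI_Comm_rank and MPI_Comm_size; the pairing works out cleanly when the number of processes is even):

double out = 0.0, in;
MPI_Status status;
int right = ( rank + 1 ) % size;
int left  = ( rank - 1 + size ) % size;

if ( rank % 2 == 0 )
{
    MPI_Ssend( &out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD );
    MPI_Recv ( &in,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &status );
}
else
{
    MPI_Recv ( &in,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &status );
    MPI_Ssend( &out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD );
}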
The use of the non-blocking send and receive routines might allow better efficiency (codes computing during communication time) but presents other dangers that will be mentioned later.
Timing
double MPI_Wtime()
returns a double precision number giving the clock time in seconds since some arbitrary time origin in the past. This origin is guaranteed not to change during the lifetime of a process, so calling MPI_Wtime twice and subtracting the results gives the elapsed time between the calls.
double time1, time2, elapsed_time;

time1 = MPI_Wtime();
...
time2 = MPI_Wtime();
elapsed_time = time2 - time1;
The Cost of Communication
Communication is significantly more expensive than calculation. The cost of communication comes from the two major phases in sending a message: the start-up phase and the data transmission phase. The total time to send K units of data for a given system can be written as
t_total = t_s + K t_c
t_s is sometimes referred to as the latency; it is the time to perform the handshake protocol that starts a point-to-point communication. t_c is the time to transmit one unit of information; the reciprocal of t_c is the bandwidth. The latency can be as high as 500 microseconds over TCP/IP and as low as a few microseconds on a CRAY T3E. The bandwidth of our CYBORG system (fast Ethernet) peaks at 100 Mb/sec (the real throughput might be ~50 Mb/sec).
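With illustrative numbers only (not measurements from CYBORG): take t_s = 500 microseconds and a bandwidth of 50 Mb/sec, i.e., t_c = 0.02 microseconds per bit. A 1 kB message (8000 bits) then costs 500 + 160 = 660 microseconds, so the latency dominates; a 1 MB message needs about 0.16 seconds of transmission time, and the latency is negligible.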
You can measure these communication costs via simple codes that time the send/receive of messages. The program ping_pong.c (see the Edinburgh notes) does precisely this via the MPI_Wtime() routine. It sends messages of various lengths back and forth between two processes and measures the round trip times. Running this program produces the following figures.
The effective bandwidth (transmission rate) is plotted versus the length of the messages. Small messages are costly because of the start-up cost and produce a low effective bandwidth. The latency becomes insignificant for large messages, and the effective bandwidth saturates at the network bandwidth.
Read the code, understand it, and run it to reproduce the figures above.
Simple Examples
The following message passing examples are bundled in MPI2_message_passing.tar.gz.
Communication
- simplest_message.c: simplest message passing example; node 0 sends a message to the other nodes.
- simple_message.c: next simplest message passing example; each node (rank != 0) sends a message to node 0.
- still_simple.c: node 0 sends a message to all other nodes; all other nodes process the message and send the result back to node 0. The nodes form a "ring", each node communicating with its neighbors.
- hop.c: each node sends a message to its right neighbor in a ring. Note: the blocking version of the routines produces a deadlock in this process.
- hop_again_again.c: each node sends a message to its left and right neighbors in a ring fashion, again and again.
- ring.c: non-blocking communication example.
Communication time
- ping_pong.c: Edinburgh example illustrating the scaling of communication time.
Average Example
Computing the average and standard deviation of a set of numbers is a simple task. Yet doing it in parallel provides a good introduction to message passing programming using MPI.
The following averaging example is bundled in average.tar.gz. Consider a noisy data set data generated by generate_random.c. The file contains the x and y coordinates of points that you can display with gnuplot. We seek the average and standard deviation of the y column.
The code average.c solves this problem using a serial algorithm. Note the use of malloc and free to reserve and release memory space. This allows for flexible and general code.
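Schematically (the array name and size below are arbitrary):

int N = 1000;     /* number of data points */
double *y;

y = (double *) malloc( N * sizeof(double) );   /* reserve space for N values */
/* ... read the data into y and process it ... */
free( y );                                     /* release the memory */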
A parallel implementation of the solution could use an algorithm in which the data is divided among all the processes. This illustrates one of the advantages of using a parallel architecture: the memory requirement per node is much less than in a serial implementation of the code, and the aggregate memory of the parallel computer is typically much larger than the memory of a single computer, allowing larger problems to be solved.
The algorithm of a parallel implementation is illustrated in a simplified flowchart. Note the following:
- All the processes contain a chunk of data.
- The number of data values in each process is adjusted dynamically according to the size of the data set and the number of processes.
- Each process uses malloc to reserve an appropriately sized memory chunk for the local data.
- The wall time saving comes from performing the local sums in the different processes (not an issue in this very simple/fast calculation); see the sketch below.
- A parallel implementation of an algorithm typically requires thinking in terms of real time programming, i.e., what each process does as a function of time.
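A minimal sketch of the local-sum step described above, assuming each process already holds its n_local values in a malloc'ed array y, that rank, size, and N (the total number of points) are known, and that process 0 collects the partial sums (this is not the bundled code):

double sum = 0.0, sum2 = 0.0, average, std_dev, s, s2;
MPI_Status status;
int i, p;

for ( i = 0; i < n_local; i++ )        /* local partial sums */
{
    sum  += y[i];
    sum2 += y[i] * y[i];
}

if ( rank != 0 )                       /* workers send their partial sums to node 0 */
{
    MPI_Ssend( &sum,  1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD );
    MPI_Ssend( &sum2, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD );
}
else                                   /* node 0 combines them */
{
    for ( p = 1; p < size; p++ )
    {
        MPI_Recv( &s,  1, MPI_DOUBLE, p, 1, MPI_COMM_WORLD, &status );
        MPI_Recv( &s2, 1, MPI_DOUBLE, p, 2, MPI_COMM_WORLD, &status );
        sum  += s;
        sum2 += s2;
    }
    average = sum / N;
    std_dev = sqrt( sum2 / N - average * average );   /* needs math.h */
}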
The reader is encouraged to write their own parallel implementation. Be careful and think things through!