Point to Point Communication
- Send modes
- Receive mode
- Communication Envelope
- Deadlock
- Timing
- The Cost of Communication
- Simple Examples
- Average Example
Send modes
Type | Description | Function |
---|---|---|
Synchronous | completes only when the matching receive has completed | MPI_Ssend |
Buffered | completes once the message is copied into the user-supplied buffer (immediate) | MPI_Bsend |
Standard | either synchronous or buffered | MPI_Send |
Ready | always completes (immediate); assumes the matching receive is already posted | MPI_Rsend |
Synchronous
The sending process expects a handshake from the receiving process acknowledging receipt of the message. This is the safest of the send modes, but it can be wasteful in time. The sending and receiving nodes are synchronized.
Buffered
The sending node writes to a user-defined MPI buffer for temporary storage until the message is delivered by MPI. The sending and receiving nodes are NOT synchronized. The analogy is dropping a message in a post office mailbox to be picked up later. The user must define buffer space via calls similar to the C routines malloc and free.
MPI_Buffer_attach( void *buffer, int size )
MPI_Buffer_detach( void *buffer, int *size )
The size must include an overhead of MPI_BSEND_OVERHEAD bytes. Note also that if many sends occur in a row, the buffer must be large enough to store all the messages, since nothing guarantees that the receives complete while the different messages are being sent.
Standard
The standard MPI send is either of synchronous or buffered type. The issue here is the capacity of the network (i.e., of the MPI daemons) to store the message content temporarily. Large messages may imply a buffered scenario, whereby MPI creates and frees a buffer area large enough to contain the message.
Ready
A dangerous mode in which NO handshaking takes place; MPI assumes that the receiving process is ready to receive the message sent by the sending node (i.e., that the matching receive has already been posted). This mode is of course the fastest, but it should only be used with special precautions.
Syntax
MPI_Send( void *buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm com )
- buffer: message address
- count: number of elements in the message
- datatype: one of the MPI datatypes
- dest: target process to receive the message
- tag: a message identification tag
- com: the communicator to which both send and receive processes belong
buffer refers to the address of any region in memory. An array name (the starting address of the array), or a single variable (its address, &variable), is what is expected. count elements of type datatype starting at buffer will be sent by the MPI send routine.
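For example, to send ten doubles from an array a to process 1 (the tag value 99 is arbitrary), one could write:

double a[10];

MPI_Send( a, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD );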
Receive mode
There is only one MPI receive mode, MPI_Recv. The routine is blocking, i.e., it will block until a message is received. The user may specify a source node or a tag value for the expected message, or accept any message.
Syntax
MPI_Recv( void *buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm com, MPI_Status *status )
buffer, count, datatype, source, tag, and com have the same meaning as in the send routine, except that source specifies the sending process and buffer is the location where the MPI receive routine is to store the incoming information. status contains information about the message, the communication envelope.

MPI_ANY_SOURCE and MPI_ANY_TAG are used when messages are to be received from any source or with any tag.
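For example, to receive up to ten doubles from any source and with any tag (variable names are arbitrary):

double a[10];
MPI_Status status;

MPI_Recv( a, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );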
Communication Envelope
The message envelope contains at least:
- the rank of the receiver process
- the rank of the sender process
- the message tag
- the communicator under which the message was sent
The message envelope is constructed by the send routines. A message is therefore the data + an envelope.
MPI can provide information about the size, origin, and tag of any received message through access to the message envelope. The variable status provides this information. The mpi.h file defines the structure of this variable as follows.
/* Status object. It is the only user-visible MPI data structure.
   The "count" field is PRIVATE; use MPI_Get_count to access it. */
typedef struct {
    int count;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    int private_count;
} MPI_Status;
This variable is visible from the user code. For instance, status.MPI_SOURCE tells you the rank of the sending process.

Note that the size of the data, status.count, is private. It can therefore only be obtained via the MPI query routine
MPI_Get_count( MPI_Status *status, MPI_Datatype datatype, int *count )
The number of items received can be less than the count given in the MPI_Recv call; the latter only specifies the maximum size of the receiving buffer.
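A short sketch of querying the envelope after a receive (the buffer size of 100 is arbitrary):

int buffer[100], nreceived;
MPI_Status status;

MPI_Recv( buffer, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
MPI_Get_count( &status, MPI_INT, &nreceived );   /* actual number of ints received (<= 100) */
printf( "got %d ints from process %d with tag %d\n",
        nreceived, status.MPI_SOURCE, status.MPI_TAG );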
Deadlock
It is recommended that the synchronous send routine, MPI_Ssend, be used to send messages. The advantage of using this routine is that it does not depend on the MPI buffer or on pre-synchronization of the sending and receiving processes.
A danger in writing a parallel code is to create a deadlock in the communication channels: a node is blocked from going to the next statement in the code by an MPI_Ssend because the receiving node never reaches its matching MPI_Recv statement, being itself blocked in its own MPI_Ssend.
ith process:
MPI_Ssend( ..., j, ... );
MPI_Recv( ..., j, ... );
jth process:
MPI_Ssend( ..., i, ... );
MPI_Recv( ..., i, ... );
A regular MPI_Send() may avoid this problem if MPI buffers the messages. But the safe way to handle it is to use a Red-Black or Odd-Even scheme, whereby the order of send/receive is inverted for the odd and even processes. Schematically:
even processes:
MPI_Ssend( ... );  // sends to odd processes
MPI_Recv( ... );
odd processes:
MPI_Recv( ... );   // receives from even processes
MPI_Ssend( ... );
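A sketch of this ordering for a ring in which every process sends one double to its right neighbor and receives one from its left neighbor (rank and size come from MPI_Comm_rank and MPI_Comm_size; the pairing works out cleanly when the number of processes is even):

double out = 0.0, in;
MPI_Status status;
int right = ( rank + 1 ) % size;
int left  = ( rank - 1 + size ) % size;

if ( rank % 2 == 0 )
{
    MPI_Ssend( &out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD );
    MPI_Recv ( &in,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &status );
}
else
{
    MPI_Recv ( &in,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &status );
    MPI_Ssend( &out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD );
}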
The use of the non-blocking send and receive routines might allow better efficiency (codes computing during communication time) but presents other dangers that will be mentioned later.
Timing
double MPI_Wtime()
returns a double precision number giving the clock time in seconds since some arbitrary time origin in the past. This origin is guaranteed not to change during the lifetime of a process, so calling MPI_Wtime twice and subtracting the results gives the elapsed time between the calls.
double time1, time2, elapsed_time;

time1 = MPI_Wtime();
...
time2 = MPI_Wtime();
elapsed_time = time2 - time1;
The Cost of Communication
Communication is significantly more expensive than calculation. The cost of communication comes from the two major phases in sending a message: the start-up phase and the data transmission phase. The total time to send K units of data for a given system can be written as
t_total = t_s + K t_c
t_s is sometimes referred to as the latency; it is the time to perform the handshake protocol that starts a point-to-point communication. t_c is the time to transmit one unit of information; the reciprocal of t_c is the bandwidth. The latency can be as high as 500 microseconds over TCP/IP and as low as a few microseconds on a CRAY T3E. The bandwidth of our CYBORG system (fast Ethernet) peaks at 100 Mb/sec (the real throughput might be ~50 Mb/sec).
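With illustrative numbers only (not measurements from CYBORG): take t_s = 500 microseconds and a bandwidth of 50 Mb/sec, i.e., t_c = 0.02 microseconds per bit. A 1 kB message (8000 bits) then costs 500 + 160 = 660 microseconds, so the latency dominates; a 1 MB message needs about 0.16 seconds of transmission time, and the latency is negligible.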
You can measure these communication costs via simple codes that time the send/receive of messages. The program ping_pong.c (see the Edinburgh notes) does precisely this via the MPI_Wtime() routine. It sends messages of various lengths back and forth between two processes and measures the round trip times. Running this program produces the following figures.
The effective bandwidth (transmission rate) is plotted versus the length of the messages. Small messages are costly because of the start-up cost and produce a low effective bandwidth. The latency becomes insignificant for large messages, and the effective bandwidth saturates at the network bandwidth.
Read the code, understand it, and run it to reproduce the figures above.
Simple Examples
The following message passing examples are bundled in MPI2_message_passing.tar.gz.
Communication
- simplest_message.c: simplest message passing example; node 0 sends a message to the other nodes.
- simple_message.c: next simplest message passing example; each node (rank != 0) sends a message to node 0.
- still_simple.c: node 0 sends a message to all other nodes; all other nodes process the message and send the result back to node 0. The nodes form a "ring", each node communicating with its neighbors.
- hop.c: each node sends a message to its right neighbor in a ring. Note: the blocking version of the routines produces a deadlock in this process.
- hop_again_again.c: each node sends a message to its left and right neighbors in a ring fashion, again and again.
- ring.c: non-blocking communication example.
Communication time
- ping_pong.c: Edinburgh example illustrating the scaling of communication time.
Average Example
Computing the average and standard deviation of a set of numbers is a simple task. Yet doing it in parallel provides a good introduction to message passing programming using MPI.
The following averaging example is bundled in average.tar.gz. Consider a noisy data set data generated by generate_random.c. The file contains the x and y coordinates of points that you can display with gnuplot. We seek the average and standard deviation of the y column.
The code average.c solves this problem using a serial algorithm. Note the use of malloc and free to reserve and release memory space. This allows for flexible and general code.
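Schematically (the array name and size below are arbitrary):

int N = 1000;     /* number of data points */
double *y;

y = (double *) malloc( N * sizeof(double) );   /* reserve space for N values */
/* ... read the data into y and process it ... */
free( y );                                     /* release the memory */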
A parallel implementation of the solution could use an algorithm in which the data is divided among all the processes. This illustrates one of the advantages of using a parallel architecture: the memory requirement per node is much less than in a serial implementation of the code, and the aggregate memory of the parallel computer is typically much larger than the memory of a single computer, allowing larger problems to be solved.
The algorithm of a parallel implementation is illustrated in a simplified flowchart. Note the following:
- All the processes contain a chunk of data.
- The number of data values in each process is adjusted dynamically according to the size of the data set and the number of processes.
- Each process uses malloc to reserve an appropriately sized memory chunk for the local data.
- The wall time saving comes from performing the local sums in the different processes (not an issue in this very simple/fast calculation); see the sketch below.
- A parallel implementation of an algorithm typically requires thinking in terms of real time programming, i.e., what each process does as a function of time.
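A minimal sketch of the local-sum step described above, assuming each process already holds its n_local values in a malloc'ed array y, that rank, size, and N (the total number of points) are known, and that process 0 collects the partial sums (this is not the bundled code):

double sum = 0.0, sum2 = 0.0, average, std_dev, s, s2;
MPI_Status status;
int i, p;

for ( i = 0; i < n_local; i++ )        /* local partial sums */
{
    sum  += y[i];
    sum2 += y[i] * y[i];
}

if ( rank != 0 )                       /* workers send their partial sums to node 0 */
{
    MPI_Ssend( &sum,  1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD );
    MPI_Ssend( &sum2, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD );
}
else                                   /* node 0 combines them */
{
    for ( p = 1; p < size; p++ )
    {
        MPI_Recv( &s,  1, MPI_DOUBLE, p, 1, MPI_COMM_WORLD, &status );
        MPI_Recv( &s2, 1, MPI_DOUBLE, p, 2, MPI_COMM_WORLD, &status );
        sum  += s;
        sum2 += s2;
    }
    average = sum / N;
    std_dev = sqrt( sum2 / N - average * average );   /* needs math.h */
}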
The reader is encouraged to write their own parallel implementation. Be careful and think things through!