MPI3 Standard


   Back to course contents

Today's Parallel Systems

Today's parallel systems are almost exclusively of the Multiple-Instruction Multiple-Data (MIMD) type. Think of a Beowulf system, which consists of many off-the-shelf PCs or Macs connected via a local network. Each computer in the system constitutes in itself a unit capable of executing multiple instructions and using multiple data. Of course, there could be some subtleties about the organization of such systems. For instance, one may ask what the best connection topology among the PCs is. Nowadays, with the advance of gigabit Ethernet, the easiest way is to use a gigabit switch to connect all of the PCs in a private internal network. One of the nodes in the system ought to perform double duty, handling communication with the other internal nodes as well as communication with an outside Ethernet to provide the users with easy access. Of course, this node should be configured as a firewall to protect the parallel computer.

Commercial parallel servers do not differ much from the abstract architecture that we just described. Servers by Dell, IBM, HP, ... may look very different but are conceptually very similar. The nodes often come in rack-mounted units of 1U or 2U height. This provides for efficient use of floor space and efficient cooling. Another subtlety might be that the units come internally with multiple CPUs or multi-core chips, which behave like multi-CPU units.


   Back to top of page

Parallel Computing

Parallel computing consists of subdividing large computing tasks into sub-tasks. It is easier to think in terms of a coarse-grained approach in which the sub-tasks are significant in size. This computing paradigm is driven by the architecture we described above, in that the individual nodes are intrinsically powerful and can handle large sub-tasks. An advantage of this approach is that it often minimizes the need for communication between the sub-tasks. This will be emphasized in the following sections.

The hardware model that a MIMD computer exhibits is therefore a series of nodes which are complete computers by themselves, with a CPU, memory, local disk space, and the means for these to be integrated via a bus. Software-wise, each of those nodes runs its own Operating System (OS), which in our case will be a Linux implementation of some sort.

A software environment must reside in the nodes on top of the local Operating System for the users to submit tasks in parallel. This software environment will have two very important functions: the administration of the sub-tasks and the facilitation of task-to-task communication. The sub-tasks to be executed must be submitted to a subset of the nodes of the parallel system, monitored, handled appropriately when they misbehave, and terminated when done. The software should also provide the means for the sub-tasks to communicate among themselves since the latter are part of a global task.


   Back to top of page

Message Passing Standards -- PVM and MPI

Parallel Virtual Machine (PVM)

Developed at Oak Ridge National Laboratory. Public domain software. Runs on most parallel systems. C- and Fortran-binding library of routines to spawn processes, administer processes, and facilitate message passing between processes.

Message-Passing Interface (MPI)

Developed at Argonne National Laboratory. Public domain software. Runs on most parallel systems. C-, C++-, and Fortran-binding library of routines to administer processes and facilitate message passing between processes. Launching tasks is performed from within the MPI run-time environment.
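
To give a flavor of the C bindings, a minimal MPI program might look like the following sketch (for illustration only; the variable names and the printed message are arbitrary):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* join the MPI run-time environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's id within the global set  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of tasks                 */
    MPI_Get_processor_name(name, &len);      /* name of the node running this task    */

    printf("task %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();                          /* leave the MPI environment             */
    return 0;
}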

The MPI approach to message passing seems to have become the de-facto standard among the builders of parallel machines. This course will use the modern version of this standard.

Message-Passing Interface 3 (MPI3)

The current MPI standard, MPI3, differs from the original standard in a few aspects. First, it clearly separates the administration of the nodes from that of the computing tasks. Second, it implements new parallel I/O capabilities as well as one-sided communication. It also supports two-way communication.
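
As a sketch of the one-sided model (illustrative only; the window layout and the value transferred are arbitrary, and the code assumes at least two tasks), rank 0 below writes a value directly into a memory window exposed by rank 1, without rank 1 posting a matching receive:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each task exposes one integer as remotely accessible memory. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch                 */
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);   /* write into rank 1 */
    }
    MPI_Win_fence(0, win);                 /* close the epoch; data is now visible */

    if (rank == 1)
        printf("task 1 received %d through its window\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}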

In this course we are going to use the MPICH 3.0 implementation of the MPI3 standard, developed at Argonne National Laboratory.


   Back to top of page

The MPI3 Software Model

The MPI3 Software Model is that of a large demanding computing task which is divided into sub-tasks running under the umbrella of the MPI3 system. MPI3 serves to support the communication between the sub-tasks as well as the overall administration and monitoring of the sub-tasks themselves. Therefore, the first action of MPI3 is to establish a virtual machine, a communication ring, within a subset (or all) of the physical nodes of the parallel computer. Then, MPI3 distributes the parallel jobs among the nodes and establishes communication handles within that virtual computer to handle the communication needs of the individual sub-tasks.
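
Once this is done, each sub-task can address any other sub-task through its rank in the global communicator. The sketch below (an illustration only; the payload value is arbitrary, and at least two tasks are assumed) passes a token around the ring of tasks using the basic point-to-point routines:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                /* a ring needs at least two tasks */
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 100;               /* arbitrary payload */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &status);
        printf("token is back at task 0 after visiting %d tasks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}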

The MPI3 standard establishes a clear distinction between the administrative tasks of establishing and maintaining the communication ring and the administration of the parallel jobs, very much like in the PVM standard. This is made evident by having an independent software suite, the Hydra Process Manager, which handles the establishment and administration of the virtual machine.


   Back to top of page

The Hydra Process Manager

The Hydra Process Manager is the default process management framework; it uses existing launchers and daemons on the nodes (e.g., ssh (Secure Shell), pbs (Portable Batch System), slurm (Simple Linux Utility for Resource Management), sge (Sun Grid Engine), etc.) to start MPI processes. A comprehensive description of this command can be found at Using the Hydra Process Manager.
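
For instance, the launching mechanism can be selected explicitly through Hydra's -launcher option (shown here with a hypothetical executable my_code, not part of the course material):

 mpiexec -launcher ssh -n 4 ./my_code 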

From a practical point of view, a single and simple statement is sufficient for the user to establish an MPI3 communication umbrella and submit parallel tasks. For instance

 mpiexec -n 6 ./fun 

will start the MPI3 communication umbrella on your local machine, then deploy 6 instances of your code fun and monitor their execution. The code fun needs to be an executable and must be specified with an explicit location ( ./fun ), as required by the Ubuntu OS for security reasons.
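
Assuming fun is written in C against the MPI library (say, in a file fun.c), it would typically be built with the MPICH compiler wrapper before being submitted:

 mpicc -o fun fun.c 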

The mpiexec command is very versatile. It serves to submit anything from the simplest to the most complex parallel MPI codes. In its simplest form, it works by submitting the command with flags specifying values for run-time or environment variables. A comprehensive list of possible flags is obtained via the help option of mpiexec, i.e., mpiexec -help to get a list of options and mpiexec -some-flag -help to get help on a particular option.

For instance, let us submit the following very simple shell-script called test-script:


echo "-----------------------  Michel was here..."
date
echo $user
hostname
pwd
echo "------"
ls
echo "----------------------"

This script does not need compilation since it is written in simple shell commands. Make it executable via chmod +x test-script .

Now, submit 6 copies of the script on your station via:

 mpiexec -n 6 ./test-script 

Can you explain the output of this command?

Next, submit the script on two particular stations of your choice in 12-704 via:

 
mpiexec -hosts xphy11.physics.xterm.net:3,xphy12.physics.xterm.net:3 -n 6 ./test-script 

It is obviously cumbersome to explicitly type the host names in mpiexec. To avoid this, a list of available host names can be incorporated in a file, hosts_list (arbitrary name), with the following syntax:

xphy10.physics.xterm.net:1
xphy11.physics.xterm.net:1
xphy12.physics.xterm.net:1
xphy13.physics.xterm.net:1
xphy14.physics.xterm.net:1
xphy15.physics.xterm.net:1

The submission of the script could be

 mpiexec -f hosts_list -n 6 ./test-script 

This will run test-script on 6 stations, 1 task per station.

The hosts in hosts_list are listed one node per line following the syntax:

host-name:no-of-processors

where host-name stands for any allowed host on which to launch the tasks and no-of-processors is the maximum number of processes to place on that node. Normally, the number of tasks should be limited by the number of CPUs or the number of cores in each node.

The names of the hosts can be simplified as long as the hostnames can be resolved. Likewise, the number of processors ( :1 ) can be omitted since all the hosts (in 12-704) are homogeneous, with one processor each; this yields

xphy10
xphy11
xphy12
xphy13
xphy14
xphy15

    Back to top of page

Subtleties in Submitting an MPI3 Program

There are further subtleties to be considered in launching MPI tasks. For instance, many aspects of the micro-architecture of the parallel computer come into play in establishing the communication ring daemons that support the MPI3 standard. A major item is the way that the communication links have to be established by mpiexec to take into account the actual configuration of the nodes.

The MPI3 standard allows communication between tasks by various means. The default is via Unix sockets, which imply the capability of communicating over Ethernet. Communications can also proceed via shared-memory channels. For instance, shared-memory channels could be used between tasks residing in the same physical node.

No matter what mpiexec elects to implement in terms of communication, it is safe to assume that the user's directory should be visible and readable by the tasks on all nodes of the virtual machine. Short of this, the various tasks will repeatedly require you to type passwords.

The MPI daemons (unless otherwise specified) work through ssh. It is convenient to install an ssh key on the hosts so that ssh does not require a password when invoked.
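
A typical way to do this is to generate a key pair with ssh-keygen and install the public key on each host with ssh-copy-id; for instance (using one of the host names above as an example):

 ssh-keygen -t rsa 
 ssh-copy-id xphy11.physics.xterm.net 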

It is important to know whether the physical parallel computer is homogeneous (all nodes are alike) or inhomogeneous (the cluster could be a hodge-podge of computers). This matters because the communication channels that MPI3 opens between the different nodes might need to translate from one data representation to another if the cluster is inhomogeneous. The MPICH3 implementation supports the Linux and Windows Operating Systems. The Linux environment is the most tested and supported environment.

Another issue derives from the possibility that the nodes of the parallel cluster might be multi-CPU nodes. In such a case, multiple tasks can be loaded on each node for better efficiency, with the local Operating System left to administer those tasks. This must be taken into account when the process manager sets up its daemons, one per node, while allowing more than one task to run on each node. Tasks on the same node could communicate via shared memory instead of Unix sockets. This also influences the way the tasks are loaded on the system. For instance, if six tasks need to be loaded on a four-node system, the tasks are by default loaded in a round-robin fashion, namely task 0 on node A, 1 on B, 2 on C, 3 on D, 4 on A, and 5 on B. On the other hand, if two CPUs exist on each node, it might be advantageous to load tasks 0 and 1 on A, 2 and 3 on B, and 4 and 5 on C.
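
With the Hydra mpiexec, this kind of placement can typically be controlled with the -ppn (processes per node) option; for instance, with the simplified hosts_list shown above (no per-host counts), something like the following would place the six tasks two per node on the first three hosts:

 mpiexec -f hosts_list -ppn 2 -n 6 ./test-script 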

Consideration needs to be given as well to the existence of dual- and quad-core CPU chips which may populate the nodes of the parallel cluster. These chips are equivalent to having two or four CPUs working independently within a single node. Therefore mpiexec ought to be told that each node effectively contains multiple CPUs.

Finally, MPICH3 allows the use of different process managers in the administration of the parallel system. Hydra is the default. There exist other process managers, such as SMPD, which is used mostly on Windows machines, and gforker, which starts all processes on the local machine and is used mostly for development and debugging within MPICH3 itself.


In the following sections we are going to assume that the mpiexec command is understood and that a communication ring of arbitrary size and physical configuration can be created by the users.


   Back to top of page

   Back to course contents