MPI2 Standard
- Today's Parallel Systems
- Parallel Computing
- Message Passing Standards — PVM and MPI
- References
- MPI2 Examples
Today's Parallel Systems
Today's parallel systems are almost exclusively of the Multiple-Instruction Multiple-Data (MIMD) type. Think of a Beowulf system, which consists of many off-the-shelf PCs or Macs connected via a network. Each computer in the system constitutes in itself a unit capable of executing multiple instructions and using multiple data. Of course there can be subtleties in the organization of such systems. For instance, one may ask what the best connection topology among the PCs is. Nowadays, with the advent of gigabit Ethernet, the easiest way is to use a gigabit switch to connect all of the PCs on a hidden internal network. One of the nodes in the system ought to perform double duty, communicating with the other internal nodes as well as with an outside Ethernet to provide the users with easy access. Of course this node should be configured as a firewall to protect the parallel computer.
Commercial parallel servers do not differ much from the abstract architecture we just described. Servers by Dell, IBM, SGI, HP, ... may look very different but are conceptually very similar. The nodes often come in rack-mounted units of 1U or 2U height. This provides for efficient use of floor space and efficient cooling. Another subtlety is that the units may come with multiple CPUs or multi-core chips, which behave like multi-CPU units.
Parallel Computing
Parallel computing consists of subdividing large computing tasks into sub-tasks. It is easiest to think in terms of a coarse-grained approach in which the sub-tasks are significant in size. This computing paradigm is driven by the architecture described above, in that the individual nodes are intrinsically powerful and can handle large sub-tasks. An advantage of this approach is that it often minimizes the need for communication between the sub-tasks. This will be emphasized in the following sections.
The hardware model that a MIMD computer exhibits is therefore a series of nodes which are complete computers by themselves, each with a CPU, memory, local disk space, and the means to integrate these via a local bus. Software-wise, each of those nodes runs its own operating system (OS), which in our case will be a Linux implementation of some sort.
A software environment must reside on the nodes, on top of the local operating system, for the users to submit tasks in parallel. This software environment has two very important functions: the administration of the sub-tasks and the facilitation of task-to-task communication. The sub-tasks to be executed must be submitted to a subset of the nodes of the parallel system, monitored, handled appropriately when they misbehave, and terminated when done. The software should also provide the means for the sub-tasks to communicate among themselves, since they are parts of a global task.
Message Passing Standards — PVM and MPI
Parallel Virtual Machine
Developed at Oak Ridge National Laboratory, PVM is public domain software. It runs on most parallel systems. It provides a C- and Fortran-binding library of routines to spawn processes, administer processes, and facilitate message passing between processes.
Message-Passing Interface
Developed at Argonne National Laboratory, MPI is public domain software. It runs on most parallel systems. It provides a C-, C++-, and Fortran-binding library of routines to administer processes and facilitate message passing between processes. Launching tasks is performed from within the MPI run-time environment.
The MPI approach to message passing has become the de facto standard among the builders of parallel machines. This course will use the modern version of this standard.
Message-Passing Interface 2
MPI2 is a newer MPI standard and differs from the original standard in a few respects. First, it separates the administration of the tasks from the responsibility for communication between the tasks. Second, it implements new parallel I/O capabilities as well as one-sided communication.
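As an illustration of one-sided communication, here is a minimal sketch in C using only standard MPI2 calls (the buffer contents and the value written are arbitrary choices for the example): rank 1 writes directly into a window exposed by rank 0, without rank 0 posting a matching receive. It must be run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one integer in a window. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 1) {
        int value = 42;
        /* One-sided: write into rank 0's buffer; no receive is posted. */
        MPI_Put(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("rank 0 received %d via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}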
In this course we are going to use the MPICH2 implementation of the MPI2 standard, developed at Argonne National Laboratory.
The MPI2 Software Model
The MPI2 software model consists of first establishing a virtual machine, the communication ring, within a subset of the physical nodes of the parallel computer, and then running the parallel jobs via communication handles within that virtual computer. Contrary to the MPI standard, the MPI2 standard separates the administrative tasks of establishing and maintaining the communication ring from the administration of the parallel jobs, very much like in the PVM standard.
The concept is that a computing task which has been divided into sub-tasks is run under the umbrella of the MPI2 system, which facilitates the communication between the sub-tasks as well as the overall administration and monitoring of the tasks themselves. In practice, this is done by establishing daemons on each node which are themselves linked within the MPI2 umbrella by a point-to-point communication protocol. The users' tasks on any node talk to the local MPI2 daemon; the daemons talk to each other and can therefore establish communication links from any sub-task to any other sub-task.
In MPICH2, the multi-purpose daemon (MPD) allows the establishment of the communication ring, the virtual machine. Once the communication ring is established, specific MPI commands allow the users to load the sub-tasks, monitor them, signal them, and possibly kill them.
The MPD subset of commands of MPICH2 allows the user to start the communication ring, monitor it, specify its attributes, repair the communication links, and kill single daemons or the entire set of daemons in a ring. The communication links are implemented via socket programming among the daemons. The users' codes communicate with the daemons via socket programming as well, hidden behind MPI calls.
Multi-purpose Daemon (MPD)
The MPD subsystem in MPICH2 refers to either the daemons themselves or the set of commands that allow the administration of those daemons. The best reference to MPD is the user's guide on the MPICH2 website.
The following MPD commands are available:
mpd
- Starts a message passing daemon
mpdtrace
- Shows all MPDs in a communication ring
mpdboot
- Starts the ring of daemons all at once
mpdringtest
- Tests how long it takes for a message to circle the ring
mpdallexit
- Takes down all daemons and the ring
mpdcleanup
- Repairs the local UNIX socket if the ring crashed badly
mpdlistjobs
- Lists the processes of jobs
mpdkilljob
- Kills all processes of a single job
mpdsigjob
- Delivers a specific signal to the application processes of a job
Each one of those commands can be invoked with
the --help
argument, which prints information for the
command without running it.
It is clear from reading the description of the set of commands within the MPD subset that all administrative needs of the MPI daemon ring can be handled via those commands. MPD provides much flexibility by separating the communication ring administration from the process of running the sub-tasks themselves.
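As a sketch of a typical session (the host file name mpd.hosts is just a conventional example, and the details of mpdboot are described in the next section), one might bring up a ring, inspect it, and take it down as follows:

mpdboot -n 5 -f mpd.hosts   # start a ring of 5 daemons
mpdtrace                    # list the hosts in the ring
mpdringtest 100             # time 100 loops of a message around the ring
mpdlistjobs                 # show the jobs currently running under the ring
mpdallexit                  # take down all the daemons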
Communication Ring
The command mpdboot creates a virtual machine, an MPI2 communication ring.
mpdboot -n <number-of-nodes> -f <hosts-file>
creates a communication ring with <number-of-nodes> nodes, using a subset of or all of the nodes listed in the file <hosts-file>. The node on which the command is launched is automatically included in the communication ring. Repeated host names in <hosts-file> are removed.
The hosts are listed one per line in <hosts-file>. The syntax is
host-name:no-of-processors
where host-name stands for a host on which to launch an MPD daemon and no-of-processors is the number of processors on that node. For instance, a <hosts-file> describing some of the stations in 12-704 might read
xphys11.physics.xterm.net:1
xphys12.physics.xterm.net:1
xphys13.physics.xterm.net:1
xphys14.physics.xterm.net:1
xphys15.physics.xterm.net:1
The names of the hosts can be simplified as long as the host names can be resolved. Likewise, the number of processors (:1) can be omitted, since all the hosts are homogeneous with one processor each.
xphys11
xphys12
xphys13
xphys14
xphys15
There are further constraints to be considered. The user's home directory must be visible and readable by all the hosts in the virtual machine. MPICH2 also requires a .mpd.conf file in the user's home directory that contains
MPD_SECRETWORD=<anything>
The .mpd.conf file must be private:
chmod 600 .mpd.conf
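For example (the secret word below is only a placeholder), the file can be created and protected as follows:

echo "MPD_SECRETWORD=some-secret-word" > ~/.mpd.conf
chmod 600 ~/.mpd.conf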
The MPD daemons work through ssh. It is convenient to install an ssh key on the hosts so that ssh does not ask for a password when invoked.
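One common way to do this, assuming OpenSSH and the shared home directory required above, is to generate a key with an empty passphrase and authorize it; because the home directory is visible on every host, this authorizes the key everywhere at once:

ssh-keygen -t rsa                                  # accept the defaults, empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize the new key
chmod 600 ~/.ssh/authorized_keys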
Running Jobs Under MPI2
There are two commands that allow running jobs under the MPI2 protocols. The first command, mpiexec, allows maximum flexibility in the submission of jobs. The minimum syntax to submit an executable (a Unix command) as N processes is
mpiexec -np N executable
The command mpirun
, with the same syntax, is the
second command to submit a job. This command is maintained under MPI2
for compatibility with the original MPI standard.
Using mpiexec
is highly recommended to submit parallel
jobs.
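As a minimal end-to-end check (the file names here are hypothetical), the standard MPI "hello" program below can be compiled with the mpicc wrapper that MPICH2 supplies and submitted with mpiexec:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* join the MPI run-time */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("Hello from task %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compile and submit it, for instance, with

mpicc -o hello hello.c
mpiexec -np 4 ./hello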
Subtleties in Establishing the Communication Ring Daemons
Many aspects of the micro-architecture of the parallel computer come into play in establishing the communication ring daemons that support the MPI2 standard. The communication links must be established in a way that takes the actual configuration of the nodes into account.
The MPI2 standard can establish communication between tasks by various means. The default is via Unix sockets, which implies the capability of communicating over Ethernet. Communication can also proceed via shared memory channels. MPD can be built to use Unix sockets only, or be given the option to use either Unix sockets or shared memory in building a communication ring. Tasks must use sockets to communicate between nodes. Within a given node, tasks might communicate via the shared memory associated with the different processes or via Unix sockets.
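For reference, MPICH2 chooses the communication device when the library itself is configured; a sketch of that choice, assuming the ch3 device names documented for MPICH2 (worth verifying against the installed version):

./configure --with-device=ch3:sock   # Unix/TCP sockets only
./configure --with-device=ch3:ssm    # sockets between nodes, shared memory within a node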
It is important to know whether the parallel computer is homogeneous (all nodes are alike) or inhomogeneous (the cluster could be a hodge-podge of computers): the communication channels that MPI2 opens between the different nodes might need to translate from one data representation to another if the cluster is inhomogeneous. The MPICH2 implementation supports the Linux and Windows operating systems; the Linux environment is the most tested and supported.
Another issue derives from the possibility that the nodes of the parallel cluster might be multi-CPU nodes. In such a case, multiple tasks can be loaded on each node for better efficiency, with the operating system left to administer those tasks locally. This must be specified when setting up the MPD daemons: one daemon runs per node, while more than one task is allowed to exist there. Tasks on the same node can then communicate via shared memory instead of Unix sockets. This also influences the way that the tasks are loaded on the system. For instance, if six tasks need to be loaded on a four-node system, the tasks are by default loaded in a round-robin fashion, namely task 0 on node A, 1 on B, 2 on C, 3 on D, 4 on A, and 5 on B. On the other hand, if two CPUs exist on each node, it might be advantageous to load tasks 0 and 1 on A, 2 and 3 on B, and 4 and 5 on C.
Consideration also needs to be given to the dual- and quad-core CPU chips which may populate the nodes of the parallel cluster. These chips are equivalent to two or four CPUs working independently in a single node. The MPD commands therefore ought to be told that there are effectively multiple CPUs on each node.
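Following the <hosts-file> syntax given earlier, a cluster of dual-CPU or dual-core nodes would simply list two processors per host (the host names are the hypothetical ones used above), so that two tasks land on each node before the placement wraps around:

xphys11:2
xphys12:2
xphys13:2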
MPICH2 allows the use of different daemon handlers in the administration of the parallel system. MPD itself is one such administration tool, the default of the system. There exist two others: SMPD, which is used mostly on Windows machines, and GFORKER, which is used for developing new concepts within MPICH2 itself.
References
The standard MPICH2 website contains a wealth of information about the MPI2 implementation. In particular, the user's guide presents a fine description of the MPD subset of commands and their functions. MPI execution per se is described in the user's guide document on the same site.
In the following sections we are going to assume that the MPD
subset of commands is understood and that a communication ring of
arbitrary size can be formed by the users. It will also be assumed
that the users can submit parallel tasks using the
mpiexec
command. The use of the MPI2 standard will be
described as needed in the following sections.