Commercial parallel servers do not differ much from the abstract architecture just described. Servers from Dell, IBM, SGI, HP, and others may look very different but are conceptually very similar. The nodes often come in rack-mounted units 1U or 2U in height, which makes efficient use of floor space and allows efficient cooling. A further subtlety is that a unit may contain multiple CPUs or multi-core chips, which behave like multi-CPU units.
The hardware model that a MIMD computer exhibits is therefore a series of nodes, each a complete computer in itself with a CPU, memory, local disk space, and a bus that ties these components together. Software-wise, each node runs its own operating system (OS), which in our case will be a Linux implementation of some sort.
A software environment must reside in the nodes on top of the local Operating System for the users to submit tasks in parallel. This software environment will have two very important functions: the administration of the sub-tasks and the facilitation of task-to-task communication. The sub-tasks to be executed must be submitted to a subset of the nodes of the parallel system, monitored, handled appropriately when they misbehave, and terminated when done. The software should also provide the means for the sub-tasks to communicate among themselves since the latter are part of a global task.
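The administrative half of that job (submit, monitor, handle misbehavior, terminate) can be illustrated on a single machine. The following is only a minimal sketch using Python's standard subprocess module; the function name run_subtasks and the fixed timeout policy are assumptions made for illustration, and a real parallel environment does this across many nodes rather than locally.

```python
import subprocess
import sys


def run_subtasks(commands, timeout):
    """Launch every command, wait up to `timeout` seconds for each, kill stragglers.

    Returns the exit codes in submission order.
    """
    # Start all sub-tasks first so that they run concurrently.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    codes = []
    for proc in procs:
        try:
            # Monitor: block until the sub-task finishes or the timeout expires.
            codes.append(proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            # Misbehaving sub-task: terminate it and collect its exit code.
            proc.kill()
            codes.append(proc.wait())
    return codes


# Hypothetical sub-tasks: one finishes at once, one would run far too long.
quick_task = [sys.executable, "-c", "pass"]
stuck_task = [sys.executable, "-c", "import time; time.sleep(30)"]
```

Calling run_subtasks([quick_task, stuck_task], timeout=3.0) returns one exit code per sub-task; the stuck one is killed and reports a non-zero code.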
Parallel Virtual Machine (PVM)
Developed at Oak Ridge National Laboratory. Public domain software. Runs on most parallel systems. A library of routines, with C and Fortran bindings, to spawn processes, administer processes, and facilitate message passing between processes.
Message-Passing Interface (MPI)
A standard defined by the MPI Forum; the widely used MPICH implementation was developed at Argonne National Laboratory. Freely available implementations run on most parallel systems. A library of routines, with C, C++, and Fortran bindings, to administer processes and facilitate message passing between processes. Launching tasks is performed from within the MPI run-time environment.
The MPI approach to message passing has become the de facto standard among the builders of parallel machines. This course will use the modern version of this standard.
Message-Passing Interface 2 (MPI2)
The newer MPI standard, MPI2, differs from the original standard in a few aspects. First, it separates the administration of the tasks from the responsibility for communication between the tasks. Second, it adds parallel I/O capabilities as well as one-sided communication. In this course we are going to use MPICH2, the implementation of the MPI2 standard developed at Argonne National Laboratory.
The concept is that a computing task that has been divided into sub-tasks runs under the umbrella of the MPI2 system, which facilitates communication between the sub-tasks as well as the overall administration and monitoring of the tasks themselves. In practice, this is done by establishing daemons on each node, which are themselves linked within the MPI2 umbrella by a point-to-point communication protocol. The users' tasks on any node talk to the local MPI2 daemon; the daemons talk to each other and can therefore establish communication links from any sub-task to any other sub-task.
In MPICH2, the multi-purpose daemon (MPD) establishes the communication ring, or virtual machine. Once the communication ring is established, specific MPI commands allow the users to load the sub-tasks, monitor them, signal them, and possibly kill them.
The MPD subset of MPICH2 commands allows users to start the communication ring, monitor it, specify its attributes, repair the communication links, and kill single daemons or the entire set of daemons in a ring. The communication links among the daemons are implemented with socket programming. The users' codes communicate with the daemons via sockets as well, hidden behind MPI calls.
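To make the idea of a daemon ring concrete, here is a toy model in which each daemon knows only its right-hand neighbor and a message is handed along the ring until it reaches its target. This is an illustrative sketch, not MPD's actual protocol; the function name ring_path and the one-directional forwarding rule are assumptions.

```python
def ring_path(hosts, source, target):
    """Return the hosts a message visits going from source to target.

    `hosts` lists the daemons in ring order; the last one links back to the first.
    """
    if source not in hosts or target not in hosts:
        raise ValueError("both source and target must belong to the ring")
    position = hosts.index(source)
    path = [source]
    # Hand the message to the right-hand neighbor until the target is reached.
    while hosts[position] != target:
        position = (position + 1) % len(hosts)
        path.append(hosts[position])
    return path


# Hypothetical four-daemon ring.
ring = ["nodeA", "nodeB", "nodeC", "nodeD"]
route = ring_path(ring, "nodeC", "nodeA")
```

Here route lists nodeC, nodeD, nodeA: the message wraps around the end of the list, which is what closes the ring.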
The following MPD commands are available:
Each one of those commands can be invoked with the --help argument, which prints information for the command without running it.
It is clear from reading the description of the set of commands within the MPD subset that all administrative needs of the MPI daemon ring can be handled via those commands. MPD provides much flexibility by separating the communication ring administration from the process of running the sub-tasks themselves.
mpdboot -n <number-of-nodes> -f <hosts-file>
creates a communication ring of <number-of-nodes> nodes using a subset or all of the nodes listed in the file <hosts-file>. The node on which the command is launched is automatically included in the communication ring. Repeated host names in <hosts-file> are removed.
The hosts are listed one node per line in <hosts-file>. The syntax is
host-name:no-of-processors
where host-name is a host on which to launch an MPD daemon and no-of-processors is the number of processors on that node. For instance, a <hosts-file> describing some of the stations in 12-704 might read
xphys11.physics.xterm.net:1
xphys12.physics.xterm.net:1
xphys13.physics.xterm.net:1
xphys14.physics.xterm.net:1
xphys15.physics.xterm.net:1
The host names can be shortened as long as they can still be resolved.
Likewise, the number of processors (:1) can be omitted since every host here has a single processor. The file then reads
xphys11
xphys12
xphys13
xphys14
xphys15
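The rules for the hosts file (one host per line, an optional processor count after a colon, a default of one processor, repeated names dropped) can be summarized in a short parser. This is a sketch of the format as described here, not the parser MPD itself uses; the name parse_hosts is hypothetical.

```python
def parse_hosts(text):
    """Return an ordered list of (host, processors) pairs from hosts-file text."""
    hosts = []
    seen = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split "host:count"; without a colon, `count` is the empty string.
        name, _, count = line.partition(":")
        name = name.strip()
        processors = int(count) if count.strip() else 1  # default: one processor
        if name in seen:
            continue  # repeated host names are dropped
        seen.add(name)
        hosts.append((name, processors))
    return hosts


sample = "xphys11:1\nxphys12\nxphys12\nxphys13:2\n"
parsed = parse_hosts(sample)
```

With the sample text above, parsed holds three entries: xphys11 and xphys12 with one processor each, and xphys13 with two.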
There are further constraints to consider. The user's directory must be visible and readable by all the hosts in the virtual machine. MPICH2 also requires a .mpd.conf file in the user's home directory that contains
MPD_SECRETWORD=anything
The .mpd.conf file must be readable and writable by its owner only, i.e., chmod 600 .mpd.conf
The MPD daemons are started through ssh. It is convenient to install an ssh key on the hosts so that ssh does not ask for a password when invoked.
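The permission requirement on the secret-word file can also be met programmatically. The sketch below creates such a file with mode 600 and checks that no group or other permission bits are set; the helper names and the scratch file name secret.conf are hypothetical, and the demonstration writes to a temporary directory rather than to a real home directory. It assumes a Unix-like file system.

```python
import os
import stat
import tempfile


def write_secret_file(path, secret):
    """Create `path` holding the secret-word line, accessible to the owner only."""
    # Create the file with mode 0o600 so it is never readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"MPD_SECRETWORD={secret}\n")
    # Enforce the mode in case the file already existed with looser permissions.
    os.chmod(path, 0o600)


def is_owner_only(path):
    """True when neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


# Demonstration in a scratch directory (hypothetical file name).
with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "secret.conf")
    write_secret_file(demo_path, "anything")
    demo_is_private = is_owner_only(demo_path)
    with open(demo_path) as handle:
        demo_content = handle.read()
```

Passing the mode to os.open at creation time, rather than only calling chmod afterwards, avoids a short window in which the file would exist with default permissions.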
mpiexec -np N executable
The command mpirun, with the same syntax, is a second way to submit a job. It is maintained under MPI2 for compatibility with the original MPI standard.
Using mpiexec to submit parallel jobs is highly recommended.
The MPI2 standard can establish communication between tasks by various means. The default is via Unix sockets, which allow communication over Ethernet. Communication can also proceed via shared-memory channels. MPD can be set up to use sockets only, or to use either sockets or shared memory, when building a communication ring. Tasks on different nodes must use sockets to communicate. Within a given node, tasks may communicate either via memory shared between the processes or via Unix sockets.
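The choice of channel just described reduces to a simple rule, sketched here. The function name pick_channel and the string labels are assumptions for illustration only; they do not correspond to MPICH2 option names.

```python
def pick_channel(node_a, node_b, shared_memory_enabled=True):
    """Return the channel two tasks would use, given the nodes they run on."""
    if node_a != node_b:
        # Tasks on different nodes can only reach each other through sockets.
        return "socket"
    # Tasks on the same node may use shared memory when it is enabled.
    return "shared-memory" if shared_memory_enabled else "socket"
```

For example, pick_channel("A", "B") gives "socket", while pick_channel("A", "A") gives "shared-memory" unless shared memory has been disabled.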
It is important to know whether the parallel computer is homogeneous (all nodes are alike) or inhomogeneous (the cluster is a hodge-podge of computers). If the cluster is inhomogeneous, the communication channels that MPI2 opens between nodes might need to translate from one data representation to another. The MPICH2 implementation supports the Linux and Windows operating systems; the Linux environment is the most tested and supported.
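Byte order is a common example of a difference in data representation. The sketch below uses Python's standard struct module to show that the same 32-bit integer has different byte layouts on little-endian and big-endian machines, and that reading one layout as the other yields a wrong value. It illustrates the problem only; it is not how MPICH2 performs its conversions, and the helper names are hypothetical.

```python
import struct


def encode_big_endian(value):
    """Pack a 32-bit signed integer in big-endian byte order."""
    return struct.pack(">i", value)


def decode_big_endian(payload):
    """Unpack a 32-bit signed integer that was packed in big-endian order."""
    return struct.unpack(">i", payload)[0]


little = struct.pack("<i", 1)   # layout on a little-endian node
big = encode_big_endian(1)      # layout on a big-endian node

# Reading big-endian bytes as if they were little-endian gives the wrong number.
misread = struct.unpack("<i", big)[0]

# Agreeing on one byte order for the exchange recovers the value on any node.
recovered = decode_big_endian(big)
```

The two layouts hold the same four bytes in reverse order, which is why a translation step is needed when unlike nodes exchange binary data.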
Another issue arises when the nodes of the parallel cluster are multi-CPU nodes. In such a case, multiple tasks can be loaded on each node for better efficiency, with the operating system left to administer those tasks locally. This must be specified when setting up the MPD daemons: there is one daemon per node, but more than one task is allowed to exist on that node. Tasks on the same node can communicate via shared memory instead of Unix sockets. The number of CPUs also influences the way the tasks are loaded on the system. For instance, if six tasks need to be loaded on a four-node system, they are loaded by default in round-robin fashion: task 0 on node A, 1 on B, 2 on C, 3 on D, 4 on A, and 5 on B. On the other hand, if each node has two CPUs, it might be advantageous to load tasks 0 and 1 on A, 2 and 3 on B, and 4 and 5 on C.
Consideration must also be given to the dual- and quad-core CPU chips that may populate the nodes of the parallel cluster. Such a chip is equivalent to having two or four CPUs working independently in a single node. The MPD commands should therefore be told that each node effectively has multiple CPUs.
MPICH2 allows the use of different process managers to administer the parallel system. MPD is one such tool and is the system default. Two other process managers exist: SMPD, which is used mostly on Windows machines, and GFORKER, which is used for developing new concepts within MPICH2 itself.
In the following sections we are going to assume that the MPD subset of commands is understood and that a communication ring of arbitrary size can be formed by the users. It will also be assumed that the users can submit parallel tasks using the mpiexec command. The use of the MPI2 standard will be described as needed in the following sections.
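The two placements in the six-task example can be reproduced with a few lines of code. This is a sketch of the two mapping rules only, not of how MPD assigns tasks internally; the function names place_round_robin and place_packed are hypothetical.

```python
def place_round_robin(n_tasks, nodes):
    """Assign task i to node i modulo the number of nodes."""
    return [nodes[i % len(nodes)] for i in range(n_tasks)]


def place_packed(n_tasks, nodes, cpus_per_node):
    """Fill every CPU slot of a node before moving on to the next node."""
    # One slot per CPU, listed node by node: A, A, B, B, ... for two CPUs each.
    slots = [node for node in nodes for _ in range(cpus_per_node)]
    return [slots[i % len(slots)] for i in range(n_tasks)]


nodes = ["A", "B", "C", "D"]
default_layout = place_round_robin(6, nodes)   # A, B, C, D, A, B
packed_layout = place_packed(6, nodes, 2)      # A, A, B, B, C, C
```

In the packed layout node D stays idle, while tasks 0 and 1, 2 and 3, and 4 and 5 share a node pairwise and could therefore communicate through shared memory.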