Introduction to Parallel Computing
- Computer Architecture
- Basic Computer
- High Performance Computing – Parallelism
- Traditional Computer Classification
- MIMD - Memory Organization
Computer Architecture
High performance computing generally requires some understanding of computer architecture in order to optimize programs written in high level languages. The algorithms for solving complicated problems must also be adapted to the hardware in order to take advantage of the speed offered by the latest advances in supercomputers.
Computer architecture refers to the understanding of the components that make up the computer and the way they are interconnected. Computer organization (micro-architecture) is concerned with the implementation of a computer architecture; for instance, two Dell servers (same architecture - both multiprocessors) may have different numbers of processors, different memory data transfer schemes or different speeds. Computer engineering is concerned with the actual construction of the machine, i.e., construction of the chips, cooling and electrical requirements, ...
Computational scientists use computers to implement their theoretical models. Often they require enormous amounts of computing to produce results. These scientists need to be aware of advances in computer architecture and computer organization in order to optimize their codes.
It is the purpose of this section to survey general concepts of computer architecture and to introduce some notions about various types of computers.
Compilers for high level languages (e.g., C++, C and FORTRAN) can achieve a very high degree of efficiency in taking advantage of particular hardware platforms. This is particularly true of the various scalar architectures discussed below. However, it is fair to say that this is by far not yet the case with parallel architectures. In the latter case a careful study of the algorithms in view of the hardware is still very necessary.
Wikipedia has an interesting page on Computer Architecture.
Basic Computer
The basic components of a computer are:
- processor
- workhorse of the computer which performs logical control and arithmetical operations
- memory
- passive component which accepts information for storage and retrieves it as needed
- input/output devices
- in charge of long term storage (e.g., disk, solid state storage) and human interface
- communication channels
- to link above three components (e.g., bus, switch, ...)
They can be represented in a PMS diagram (Processor, Memory, and Switch) as in the following figure:
Processors
Compilers for high level languages (e.g., C) translate the programmers' instructions into fetch-decode-execute cycles: fetch the instruction whose address is held in the program counter, decode it, and execute it (fetch data from memory into registers, operate on it and store results in memory). The speed at which codes execute depends on the efficiency of the compiler in minimizing the instructions to execute at the low level, and on the intrinsic speed of the processor. The latter is driven by the clock speed as well as the number of clock cycles needed to implement the operations. A Reduced Instruction Set Computer (RISC) chip is one in which the number of elementary instructions is kept to a minimum for better optimization, the higher functions being assembled by the compiler. The units of processor speed are:
- MIPS:
- Million Instructions Per Second
- MFLOPS (GFLOPS, TFLOPS):
- Million Floating Point Operations per Second (on the order of 1000-3000 MFLOPS, i.e., 1-3 GFLOPS, for a personal computer)
- Clock speed:
- 2 GHz to 4 GHz for Pentium CPU chips.
Memory
Random Access Memory (RAM) is where the information is stored, both instructions and data. The smallest unit of information is a bit (0-1), followed by a byte (8 bits - an ASCII character) and by a word (a C float or double - either 4 or 8 bytes depending on the computer architecture). Buses are nowadays mostly capable of carrying 64 bit wide words (8 bytes). DDR (Double Data Rate) memory doubles the data transfer rate by transferring on both edges of the clock. For instance a DDR-333 memory chip running at a 166 MHz x 2 rate can carry (8 bytes x 333 MHz) 2656 MB/sec on a 64 bit bus.
Performance depends on two further factors: access time and cycle time. Access time (latency or response time) measures how fast the memory chips respond to a fetch or a write from the CPU. The range varies from 80 ns (PC and Macintosh) down to 10 ns (specialized cache memories - see below).
Memory cycle time measures the minimum period between successive requests. An 80 ns memory chip cannot in general respond to a new request every 80 ns; the cycle time is typically longer than the access time.
Cache
Many of the memory requests in large computations come in order; for instance, instructions are fetched in the order the compiler produced them, while the elements of an array are often defined or used in index order, i.e., x[0], x[1], x[2], ... This leads to the idea of a cache memory, a very fast memory chip which temporarily stores information that might be needed next. For example, if x[0] is needed, the processor might request that x[0], x[1], ... be brought from main memory into the cache, in case the following values are also needed. The same occurs in the instruction buffer memory. A typical CPU chip comes with 512 KB to 1 or 2 MB of cache, and many chips are equipped with both primary and secondary caches.
The advantage of this arrangement is to allow the processor to calculate at its potential speed, since fetching instructions and fetching/storing data is faster in cache memory (expensive) than in regular memory (cheaper). But it also relies on the cache designer providing algorithms that fetch instructions and data from main memory so as to minimize misses. It likewise asks the programmer to access data in index order as much as possible for efficiency! If the elements of an array are used at random, the probability of cache misses increases dramatically.
Much work has gone into designing efficient cache algorithms and hardware. See the Wikipedia page for further details.
Virtual Memory
In most large multi-user multi-tasking systems, as well as PCs, the main memory can be supplemented by a swapping mechanism onto disk. Namely, when the total aggregate memory required by all the processes running on the machine exceeds the physical memory, the overflow (either entire processes or the least recently used parts of processes) is written to a swap space on disk.
Using virtual memory reduces performance considerably! The CRAY supercomputers in the past did not implement virtual memory for the sake of efficiency.
Inter-leaved Memory
Memory comes in banks whose sizes are dictated by the computer designer or the chip technology. For instance, memory modules come in 32 MB, 64 MB, 128 MB, ... sizes. The vector CRAYs were built to exploit this fact in order to improve memory access time by hiding the cycle time. This technology is widely used nowadays.
Suppose we have 4 banks each containing 256 bytes. The block-oriented scheme would assign the virtual addresses 0...255 in bank 1, 256...511 in bank 2, etc. The inter-leaved scheme would assign the virtual addresses 0,4,8... in bank 1, then 1,5,9,... in bank 2, etc. The processor can receive the content of memory locations in different banks on successive clock cycles. Theoretically the processor can receive one word per processor cycle independently of the memory cycle time if the information is fetched with an appropriate stride.
Bus and I/O bandwidth
Large computational tasks often require huge amounts of data to be moved around, so the speed of the bus, or inter-component communication speed, is of prime importance. In a simple configuration the communication speed is characterized by a bandwidth, in units of MB of data transferred per second. This is in general a complicated topic which will be the subject of much discussion in the section below on parallel architecture.
High Performance Computing – Parallelism
The interest of the computational scientist is to have more calculations performed per unit time, or to have a higher throughput. Parallelism is often the way to achieve this.
- Job level parallelism
most cheaply achieved by having multiple computers and distributing users among them. This applies better to large numbers of smaller calculations than to large applications.
Any multi-tasking system automatically implements this kind of parallelism by handling various tasks at once; the I/O sub-systems also function in parallel with the other components of the computer.
- Program level parallelism
- when a single task is divided among multiple CPUs. This is the subject of this course.
- Instruction level parallelism
- Pipelined and vector architectures are the way to implement this approach. Cray vector computers were very important in the early days of high performance computing but have since been abandoned due to cost. Pipelined CPU chips are described on a Wikipedia page. See also the pipelined description.
- Arithmetic and bit level parallelism
- i.e., how does one handle the 64 bits of the two words in an add operation; of concern to the chip makers.
We are here concerned primarily with program level parallelism in our quest to use massively parallel computers. Instruction level parallelism was important on vector computers.
Granularity
The concept of granularity is important to understand at the outset. A coarse grain system is one in which the operations that run in parallel are fairly large, on the order of entire programs. An example could be that of a fluid dynamicist who sub-divides a large volume into a few large parcels and lets different processors handle the fluid problem in each parcel. Fine granularity is when a task is sub-divided into very many very small tasks. For example, a sum of integers
sum = 0; for ( i = 1; i <= N; i++ ) sum = sum + i;
could be divided into sub-sums on each processor, resulting in very few operations to be performed in each individual processors, and a subsequent global sum of the individual answers.
The granularity in computer parallelism is reflected in, or dictated by, the hardware. Historically the CM-1 from Thinking Machines Corporation contained up to 65,536 very simple one-bit processors. The philosophy here was that these processors were cheap to produce and could eventually add up to large MFLOPS ratings.
On the other hand the Beowulf clusters, made out of off-the-shelf computers, have more powerful nodes. The design philosophy here is that fewer nodes are easier to interconnect and that memory organization is easier.
It turns out that from the computational scientist's point of view, the coarse grain model is often easier to adapt to particular problems. This is borne out by the market, where most fine grain massively parallel machines have become obsolete, often driving their parent companies into bankruptcy.
Traditional Computer Classification
Flynn's Taxonomy
- SISD
(single instruction stream, single data stream)
Conventional computer architecture with 1 processor; instructions are done sequentially. Your PC or Macintosh is of this type.
- SIMD
(single instruction stream, multiple data stream)
a single instruction processing unit, called a controller, and several data processing units (PEs). The Thinking Machines CM-1 and MasPar MP-1 were examples of this architecture. These systems worked by having the instruction processing unit fetch an instruction and broadcast it to all PEs, which then performed exactly the same operation. The only freedom was that a PE could be de-activated in a given cycle. It was hoped that this architecture would reduce the time devoted to handling program logic. The difficulty, however, was that typical problems could not easily be adapted to this approach. Nevertheless, these ideas are coming back in full force in the context of Graphics Processing Units (GPUs), which follow from the realization that graphics boards contain many processors that can be tapped for computation. Unfortunately these come in an architecture suited for graphics, not for general computing.
- MISD
(multiple instruction streams, single data stream)
This category, in which various computing elements work on a single data stream, has never led to a successful computer.
- MIMD
(multiple instruction streams, multiple data stream)
This is the widest and most successful category of parallel computers. The Cray C90, Cray T3D, Cray T3E, Intel Paragon, IBM SP2, ... all fell under this category. Much smaller systems, like the Sun SPARCservers, also implemented this architecture. Nowadays, Dell, IBM and HP servers are of this type. In fact, a group of independent workstations on a network can be made to work as a MIMD computer. This is the type of computer we will concern ourselves with in the following.
Fast Processor Architectures
Processors in coarse grain parallel computers must be fast in order to achieve high global performance. Several major approaches exist to achieve fast CPU speeds (besides the use of RISC): pipelined, vector, super-scalar and hyper-threaded designs.
Pipelined Processors
The analogy is the assembly line of a factory; the results of operations are immediately passed to the next operations. The goal is to increase the number of instructions performed per second.
Consider adding two floating point numbers (of the form m x 2^e). The operation can be divided into stages:
- If e1 < e2 swap operands. Find difference ed = e1-e2.
- Shift m2 to the right by ed bits.
- Compute mantissa of the sum, m1+m2. Exponent of sum is e1.
- Normalize the sum.
A pipelined processor is capable of performing the above stages in parallel in the case of an addition of vectors. This achieves very high efficiency in manipulating vectors.
The peak CPU speed of different chips can be found by standard benchmarks and changes with each new release. The user must however write programs which avoid loop interdependencies in order to achieve these speeds, and the results are highly dependent on the compiler used.
Vector Processors
Vector processors operate on entire vectors with one instruction. Consider the following adding loop:
for ( i = 0; i < N; i++ ) c[i] = a[i] + b[i];
Fast computing can be achieved if the processor is capable of handling NV additions (or other arithmetic operations) at once, hence the term vector processor. If NV >= N, the sum above can be calculated in "one" clock cycle instead of N cycles (i.e., fetch a[], fetch b[], do the sum and store the results in c[]). If NV is smaller than N, only a few (far fewer than N) clock cycles are needed. The advantages are that fewer instructions are performed, producing lower overhead, and that the various elements of the arrays are worked on in parallel (simultaneously).
The Cray computers achieved very high theoretical speeds using their custom vector processors. Unfortunately vector processors have fallen out of use; their production cost became prohibitive by today's standards. They were certainly the easiest to program, since very efficient compilers were written to take advantage of the architecture.
Super-Scalar Processors
This approach uses instructions level parallelism to achieve speed. These processors have
- an instruction fetch unit capable of obtaining multiple instructions at once
- instruction decoding units capable of detecting independence of instructions
- multiple execution units to execute instructions in parallel when possible
- the execution units might also be pipelined for increased efficiency.
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first superscalar design. DEC (Alpha), HP (HP-PA), IBM (RS/6000), SGI (Power Challenge) and Sun (SuperSPARC) all offered superscalar processors. Except for CPUs used in some battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar.
Wikipedia describes this architecture in a very informative page.
Multitasking, Simultaneous Multithreading (SMT), Hyper-Threading Technology (HTT).
Wikipedia describes this architecture in a very informative page. This page describes HTT as Intel's trademark for their implementation of the simultaneous multithreading technology on the Pentium 4 micro-architecture. It is a more advanced form of Super-threading that debuted on the Intel Xeon processors and was later added to Pentium 4 processors. The technology improves processor performance under certain workloads by providing useful work for execution units that would otherwise be idle, for example during a cache miss. A Pentium 4 with Hyper-Threading enabled is treated by the operating system as two processors instead of one.
Hyper-Threading works by duplicating certain sections of the processor, those that store the architectural state, but not duplicating the main execution resources. This allows a Hyper-Threading equipped processor to pretend to be two "logical" processors to the host operating system, allowing the operating system to schedule two threads or processes simultaneously. Where execution resources in a non-Hyper-Threading capable processor are not used by the current task, and especially when the processor is stalled, a Hyper-Threading equipped processor may use those execution resources to execute another scheduled task.
Except for its performance implications, this innovation is transparent to operating systems and programs. All that is required to take advantage of Hyper-Threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.
MIMD - Memory Organization
There are two major ways of organizing the memory in MIMD multiprocessors architectures: the shared memory and distributed memory approaches.
Shared memory
In this architecture the various processors all have access to the entire main memory. This is the simplest system to use since any piece of data is always accessible to all PEs. Pentium servers now exist in quad-processor format. In its more elementary mode of operation, the operating system distributes the various processes among the processors in order to keep them equally busy. On the other hand, a single task may be distributed over the processors to achieve true parallelism.
The major drawback of this configuration is memory contention. This occurs when two or more processors require the same data, or worse, when two processors need to modify the same data. This problem prevents the implementation of such systems with a large number of processors.
Distributed memory
Think of a group of independent computers, each with its own memory, all coupled together with very fast communication channels and you have a fairly good representation of a distributed memory MIMD computer.
This configuration eliminates all memory contention problems since each processor has its own local memory. The disadvantage is that each processor may have to fetch the data it needs from the other processors via the communication channels (slow compared to the bus speed). This creates a great potential for critical bottlenecks.
The topology of the communication layout between the nodes is of extreme importance in establishing fast connections between the nodes. Beowulf systems tend to have the nodes share a common private Ethernet network.