2.2 Microprocessor

2.2 Microprocessor

A microprocessor (also referred to as the CPU or processor) is at the heart of any computer. It is the single component that implements instruction execution. Processors vary in a number of ways; we focus on the more important characteristics. The lowest-level binary encoding of the instructions and the actions they perform are dictated by the microprocessor instruction set architecture (ISA). The most common ISA used for cluster node CPU is IA32, or X86. This family of processors includes all generations of the Pentium processor and the Athlon family. A shared ISA doesn't imply an identical instruction set; newer processors have extra features that old processors do not. For example, SSE and SSE2 are numerical instruction sets that were added in Pentium III and Pentium 4 processors, respectively. The earliest clusters were composed of 486 processors, which implement this ISA.

A processor runs at a particular clock rate. That is, it can execute instructions at a particular frequency, measured in terms of megahertz or gigahertz. For example, a 2.4 GHz processor can execute a rate of 2.4 billion instructions per second. Note that a processor's clock rate is not a direct measure of performance. Frequently, processors with different clock rates can perform equivalently for some tasks; likewise, two processors with the same clock rate can perform quite differently for some tasks. Current clock rates range from 1 GHz to slightly over 3 GHz.

Any processor has a theoretical peak speed. Theoretical peak is the maximum rate of instruction execution a processor can achieve. This is determined by the clock rate, ISA, and components included in the processor itself. This rate is measured in floating-point operations per second, or flops. A current generation processor will have a theoretical peak of 3–5 gigaflops. As one might guess from the name, theoretical peak is just that, theoretical. A processor rarely, if ever, runs at that rate while executing a real user application.

Both the instructions and the data upon which they act are stored in and loaded from the node's random access memory (RAM). The speed of a processor is often measured in megahertz, indicating that its clock ticks so many million times per second. RAM runs at a much slower clock rate, usually measured in hundreds of megahertz. Thus, the processor often waits for memory, and the overall rate at which programs run is usually governed as much by the memory system as by the processor's clock speed.

The slow rate at which data can be copied from RAM is mitigated by a processor's cache. The cache is a small amount of fast memory usually co-located on the CPU. When data is copied from main memory, it is also stored in cache. If the same data is accessed again, it can be read from cache. This is highly advantageous: applications can be optimized to access memory in patterns that take the best possible advantage of cache speed. The quicker access to memory in cache leads to better processor utilization; the processor spends less time waiting for data from memory. Processor caches vary in size from kilobytes on some processors to upwards of four to eight megabytes on processors specified to provide good floating point performance. Obviously, the larger the cache is, the easier it is to reuse entries stored in it.

2.2.1 IA32

IA32 is the most common ISA used in clusters today, and for the foreseeable future. This is caused by the enormous economies of scale at work. Processors implementing this ISA are used in the majority of desktop PCs sold. IA32 is a 32-bit instruction set. It is treated as a binary compatibility specification. Multiple processors, implemented in vastly different ways, all implement the same instruction set to allow for application portability. The three most common processors used in clusters today are the Pentium III and 4 processors, manufactured by Intel, and AMD's Athlon processor. Recent additions to the IA32 ISA include SSE and its successor SSE2. (Streaming SIMD Extensions) SSE and SSE2 are instruction set extensions that define instructions that can be performed in parallel on multiple data elements; these are not necessarily implemented in all instances of IA32 processors. These features can yield substantially improved performance, so care should be taken when choosing the processor for a new system. Hyperthreading is another feature recently added to the IA32 ISA. It allows multiple threads of execution per physical CPU. This feature typically impacts application performance negatively and can be disabled, so it really isn't a decision point when choosing a CPU, as SSE and SSE2 are.

Pentium 4. The Pentium 4 implements the IA32 instruction set but uses an internal architecture that diverges substantially from the old P6 architecture. The internal architecture is geared for high clock speeds; it produces less computing power per clock cycle but is capable of extremely high frequencies. This architecture is also the only IA32 processor family that implements the SSE2 instruction set, providing a substantial performance benefit for some applications. This is also the only architecture that implements hyperthreading, but (as was mentioned previously) this feature is not terribly important for computational applications typically run on clusters.

Pentium III. The Pentium III is based on the older Pentium Pro architecture. It is a minor upgrade from the Pentium II; it includes SSE for three-dimensional instructions and has moved the L2 cache onto the chip, making it synchronized with the processor's clock. The Pentium III can be used within an SMP node with two processors; a more expensive variant, the Pentium III Xeon, can be used in four-processor SMP nodes.

Athlon. The AMD Athlon platform is similar to the Pentium III in its processor architecture but similar to the Compaq Alpha in its bus architecture. It has two large 64 KByte L1 caches and a 256 KByte L2 cache that run at the processor's clock speed. The performance is a little better than that of the Pentium III and Pentium 4 in general at similar clock rates, but either can be faster depending on the application. The Athlon supports dual-processor SMP nodes. Newer Athlon processors support SSE, but not SSE2.

2.2.2 Other Processor Types

HP Alpha 21264. The Compaq (now HP and originally DEC) Alpha processor is a true 64-bit architecture. For many years, the Alpha held the lead in many benchmarks, including the SPEC benchmarks, and was used in many of the fastest supercomputers, including the Cray T3D and T3E, as well as the Compaq SC family. Alpha are still popular with some users, but since the Alpha processor line is no longer being developed and the current Alpha processor will be the last, Alphas are rarely chosen for new systems. However, a few large clusters make use of Alphas, including the ASCI Q system at Los Alamos National Laboratory; ASCI Q is one of the fastest systems in the world, according to the Top500 list.

The Alpha uses a Reduced Instruction Set Computer (RISC) architecture, distinguishing it from Intel's Pentium processors. RISC designs, which have dominated the workstation market of the past decade, eschew complex instructions and addressing modes, resulting in simpler processors running at higher clock rates, but executing somewhat more instructions to complete the same task.

PowerPC G5. The IBM PowerPC is an processor architecture used in products from IBM and from Apple. The newest processor is the G5, a sophisticated 64-bit processor capable of running at speeds of up to 2GHz. Other features include a 1GHz frontside bus and multiple functional units, allowing the G5 to perform multiple operations in each clock cycle. Apple sells Macs with the G5 processor, and a number of groups have built clusters using Macs, running Mac OS X (a Unix-like operating system).

IA64. The IA64 is Intel's first 64-bit architecture. This is an all-new design, with a new instruction set, new cache design, and new floating-point processor design. With clock rates approaching 1 GHz and multiway floating-point instruction issue, Itanium should be the first implementation to provide between 1 and 2 Gflops peak performance. The first systems with the Itanium processor were released in the middle of 2001 and have delivered impressive results. For example, the HP Server rx4610, using a single 800 MHz Itanium, delivered a SPECfp2000 of 701, comparable to recent Alpha-based systems. More recent results with a 1.5 GHz Itanium 2 in an HP rx2600 server gave a SPECfp2000 of 2119. The IA64 architecture does, however, require significant help from the compiler to exploit what Intel calls EPIC (explicitly parallel instruction computing).

Opteron. Another 64-bit architecture is AMD's Opteron. Unlike the Intel IA64 architecture, the Opteron supports both the IA32 instruction set as well as a new 64-bit extension, allowing users to continue to use their existing 32-bit applications while taking advantage of a 64-bit instruction set for applications that require easy access to more than 4 GB of memory. The Opteron includes an integrated DDR memory controller and a high-performance interconnect called "HyperTransport" that provides up to 6.4 GB/sec bandwidth per link; each Opteron may have three HyperTransport links. Early Opterons have delivered a SPECfp2000 of 1154. The AMD Opteron is used in the Cray "Red Storm," that will use over 10,000 processors and have a peak performance of over 40 Teraflops.

Part III: Managing Clusters