Multiprocessing: Functions and Features

21.05.2001, by Christian Vilsbeck
A PC with two or more processors does not offer any advantage by itself. To exploit the combined processing power, the operating system and the applications have to play along. tecChannel.de sheds light on the special features of multiprocessing.

When it comes to their high-end CPU models, processor manufacturers charge heavily for every additional megahertz. For example, moving from a Pentium 4 at 1.4 GHz to one at 1.7 GHz buys 20 per cent more clock speed at roughly 80 per cent extra cost (as of May 2001). A Xeon with 2 MByte of cache and 30 per cent more clock speed is priced almost 90 per cent higher. At first glance, this unfavorable price/performance ratio can be avoided with a multiprocessor system: a supposed performance increase of 100 per cent would only cost about twice as much. However, the chipsets and mainboards required for multiprocessing are more complex and produced in small quantities, which results in significantly higher system prices.

Anyone considering a multiprocessor system should know that many applications are not accelerated in such an environment and in some cases even slow down. This article explains what the performance gain depends on and clarifies who profits from a multiprocessor system.

SMP, Multitasking and Multithreading

SMP computers integrate multiple processors with equal rights. Each CPU can access all system resources such as RAM, graphics card, controllers and other peripheral devices. In addition, SMP systems use mechanisms to synchronize exclusive resources, for example the contents of the cache integrated in each individual processor.

The operating system has to recognize the system's SMP capability and be able to make use of it. Among other things, this requires support for multitasking and multithreading. Multitasking describes the ability of the operating system to run several independent applications at the same time. Multithreading allows a single application to be processed in parallel by more than one CPU: one thread can prepare print data, for example, while another lets the user continue working in the foreground.

All tasks and threads share the computer's resources. The operating system assigns each of them fixed processor time slices, which are worked through in turn. As a result, even single-processor systems appear to run different applications in parallel. On multiprocessor systems, the operating system can distribute the tasks or threads across all installed CPUs. If software is programmed multithreaded, parts of the application really do run in parallel, which increases execution speed. However, controlling the timing of the processing and synchronizing the threads make multithreaded programming a very complex task. Moreover, not every application or subtask can be split into units that can be processed simultaneously. Debugging is also difficult, since the execution times of individual threads can differ from system to system. For these reasons, most software is programmed single-threaded.
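As a minimal illustration of multithreading (not from the original article, and assuming a POSIX system with pthreads), the following C sketch starts a background thread that stands in for a print job while the main thread stays free for foreground work; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical background job, e.g. preparing print data. */
    static void *print_job(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 5; i++) {
            printf("print thread: processing page %d\n", i + 1);
            sleep(1);                 /* simulate work */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t worker;

        /* On an SMP system the scheduler may place this thread on a second CPU. */
        pthread_create(&worker, NULL, print_job, NULL);

        /* The main thread remains available for foreground work. */
        printf("main thread: user keeps working\n");

        pthread_join(worker, NULL);
        return 0;
    }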

Performance Traps

Using two processors for a multithreaded application does not mean that the application runs at twice the speed. Gene Amdahl's rule of 1967 describes how the performance gain of multiple processors is limited by the parts of a program that cannot be parallelized. According to Amdahl, a multithreaded application therefore never scales linearly with the number of processors involved. Further losses come from limited system resources such as the bandwidth of the main memory shared by all CPUs in an SMP system.
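To make the limit concrete: Amdahl's law gives the maximum speedup with n processors as 1 / ((1 - p) + p/n), where p is the fraction of the program that can be parallelized. A small C sketch with hypothetical numbers (not from the article):

    #include <stdio.h>

    /* Amdahl's law: maximum speedup with n processors when a
       fraction p of the work can be parallelized. */
    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        /* Even with 90 per cent parallel code, two CPUs yield only
           about 1.82x, and no number of CPUs exceeds 10x. */
        printf("p=0.9, n=2: %.2f\n", amdahl(0.9, 2));          /* ~1.82 */
        printf("p=0.9, n=4: %.2f\n", amdahl(0.9, 4));          /* ~3.08 */
        printf("p=0.9, n->inf: %.2f\n", 1.0 / (1.0 - 0.9));    /* 10.00 */
        return 0;
    }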

Programs in which synchronizing threads and data costs more time than parallel processing saves are not suited for multithreading. Such software would run slower than a single-threaded version even on multiple CPUs. Multithreading is especially simple and efficient, on the other hand, when large amounts of data can be split into independent segments. One example is image editing with complex filters: the image is divided into several regions, and each processor works on a different region. Similar scenarios are found in numerous scientific and engineering applications (structural analysis, field calculations, fluid mechanics, etc.) that use the finite element method. Here, each processor is assigned certain areas of the mesh to be calculated. Once the individual calculations are complete, the partial results are combined according to the rules of the framework.
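A minimal sketch of this region-splitting approach in C with POSIX threads (an illustration, not from the article; the image is a flat grayscale array and the filter is a trivial brightness adjustment standing in for a complex one):

    #include <pthread.h>
    #include <string.h>

    #define WIDTH   1024
    #define HEIGHT  1024
    #define NCPUS   2                  /* assumed number of processors */

    static unsigned char image[WIDTH * HEIGHT];

    struct region { int start_row; int end_row; };

    /* Each thread filters only its own stripe of the image,
       so no synchronization is needed during the calculation. */
    static void *filter_region(void *arg)
    {
        struct region *r = arg;
        for (int y = r->start_row; y < r->end_row; y++)
            for (int x = 0; x < WIDTH; x++) {
                int v = image[y * WIDTH + x] + 16;     /* stand-in filter */
                image[y * WIDTH + x] = v > 255 ? 255 : (unsigned char)v;
            }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NCPUS];
        struct region regions[NCPUS];
        int rows = HEIGHT / NCPUS;

        memset(image, 100, sizeof image);              /* dummy picture */

        for (int i = 0; i < NCPUS; i++) {
            regions[i].start_row = i * rows;
            regions[i].end_row   = (i == NCPUS - 1) ? HEIGHT : (i + 1) * rows;
            pthread_create(&threads[i], NULL, filter_region, &regions[i]);
        }
        for (int i = 0; i < NCPUS; i++)
            pthread_join(threads[i], NULL);            /* stripes are already combined in place */
        return 0;
    }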

In theory, single-threaded applications should also run faster on a multiprocessor system: the operating system assigns the application to one CPU, while the second CPU handles the operating system's overhead. Looking at the CPU usage of both processors in the system monitor seems to confirm this assumption.

If the performance of a single-threaded application is measured, however, it is often lower on a multiprocessor system than on a PC with just one processor. The same applies to benchmark results, for example from the BAPCo suite, which contains numerous single-threaded applications. The explanation: both processors constantly communicate with each other and continuously synchronize their cache contents. This synchronization overhead slows down the application.

Cache Consistency through MESI

All processors of an SMP system can access the shared main memory. Exclusive resources such as the local processor caches can lead to problems, however. If processor 0 has loaded data from a certain memory area into its cache, the CPU uses the fast cache the next time the data are accessed. This can lead to inconsistencies if processor 1 has overwritten that memory area with new data in the meantime.

For this reason, SMP systems use a cache coherence protocol. It permanently keeps the data in main memory and in the processor caches consistent. A widely used protocol for multiprocessors is MESI (modified, exclusive, shared, invalid). Each cache line can be in one of these four states:

Modified: the line has been changed in this cache and no longer matches main memory; no other cache holds a valid copy.
Exclusive: the line is present only in this cache and matches main memory.
Shared: the line matches main memory and may also be present in other caches.
Invalid: the line contains no valid data and must be fetched again before use.
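The cost of this coherence traffic can be made visible with a small experiment (a sketch, not from the article, assuming a POSIX system; results vary by machine). Two threads increment separate counters that happen to share a cache line, so the line bounces between the caches even though no data are actually shared:

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 100000000UL

    /* Both counters lie in the same cache line. Every write by one CPU
       invalidates that line in the other CPU's cache, so the line bounces
       between the caches (MESI states modified/invalid) although the two
       threads never touch the same variable. Padding each counter out to
       its own 64-byte cache line removes this "false sharing" effect. */
    static struct { volatile unsigned long a, b; } counters;

    static void *count_a(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERATIONS; i++) counters.a++;
        return NULL;
    }

    static void *count_b(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERATIONS; i++) counters.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, count_a, NULL);
        pthread_create(&t2, NULL, count_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%lu b=%lu\n", counters.a, counters.b);
        return 0;
    }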

More Speed: MOESI

The less common MOESI protocol is an extended variant of the MESI cache coherence protocol. In addition to the four cache line states modified, exclusive, shared and invalid, MOESI adds an owner state. If a cache line in one processor holds owner status, other CPUs can read the data from the owner CPU and do not have to fetch it from main memory. This type of cache handling results in fewer memory accesses and higher system performance. MOESI-based multiprocessor chipsets include AMD's 760MP.

This cache sharing is based on a simple principle: processor 0 is the owner of a cache line. If processor 1 issues a read access to this area, the request is forwarded, for example by the 760MP chipset, to the fast cache of processor 0. The slower main memory is not accessed at all. This not only shortens the response time of the read access, it also takes load off the memory bus, which then remains available for other accesses.

Control of Interrupts via APIC

Interrupts in an SMP system are controlled by so-called APICs (advanced programmable interrupt controllers). APICs are a central part of a multiprocessor platform and ensure that interrupts are distributed dynamically to the individual processors, resulting in a balanced interrupt load. An SMP system features two types of APICs. First, there are local APICs, which are integrated into each CPU. Second, there are so-called I/O APICs, which handle the external interrupts of peripheral components; they are usually located in an SMP-capable chipset or in an additional chip. Among other functions, local APICs forward inter-processor interrupts, for example when one processor sends an interrupt to another. This is useful for control signals, for instance.

The APICs communicate with each other over the so-called ICC bus (interrupt controller communication) and share the pending controller work in this way. Since all interrupt messages travel over the ICC bus, the memory bus receives no additional load. This connection also allows one processor to hand interrupt processing over to another CPU, which leads to a balanced load.
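On Linux, the resulting distribution can be inspected without special tools; a minimal sketch (an illustration assuming a Linux system, since /proc/interrupts is Linux-specific) that simply prints the per-CPU interrupt counters:

    #include <stdio.h>

    /* /proc/interrupts lists, per interrupt line, how many interrupts
       each CPU has handled -- a direct view of how the interrupt load
       has been distributed across the processors (Linux only). */
    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) {
            perror("/proc/interrupts");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }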

SMP Operating Systems

In order to take advantage of multiprocessor hardware, the operating system needs to support SMP. The same applies to applications: software that is programmed single-threaded does not run faster in a multiprocessor environment. Systems that run several CPU-intensive programs simultaneously are an exception. If a computer runs a web server, a database and remote access while also handling a number of print and file jobs, even single-threaded applications profit from multiple CPUs. The reason: an MP-capable operating system distributes the different applications across the CPUs. If the CPU statistics show that several programs place a significant load on the processor, an MP system makes sense in any case. It is important to know that almost every operating system comes in different SMP versions with kernels tied to a certain number of processors. Windows 2000 Professional, for example, supports only up to 2 CPUs, while the server versions support up to 32 processors (Windows 2000 Datacenter Server), depending on the hardware. QNX and BeOS handle up to eight processors. The following table lists some x86-compatible operating systems featuring SMP support:

MP operating systems for x86-based CPUs

Operating system                 Manufacturer
BeOS                             Be
Darwin for Intel                 Apple
Linux *                          Linux.org
OS/2 Warp                        IBM
QNX                              QNX
Windows 2003 / XP / 2000 / NT    Microsoft

* Linux distributions offer complete SMP support starting with kernel version 2.4. There is limited SMP support in versions 2.0.x and higher.
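How many processors the running kernel actually makes available can be queried from a program; a minimal sketch (assuming a Linux/glibc or other Unix environment where these sysconf names exist):

    #include <stdio.h>
    #include <unistd.h>

    /* Query how many processors the operating system has configured
       and how many are currently online. */
    int main(void)
    {
        long configured = sysconf(_SC_NPROCESSORS_CONF);
        long online     = sysconf(_SC_NPROCESSORS_ONLN);

        printf("processors configured: %ld\n", configured);
        printf("processors online:     %ld\n", online);
        return 0;
    }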

SMP Applications

SMP applications are common in server and back-office environments. All professional web, SQL and groupware servers are multithreaded and therefore profit from SMP systems.

The situation is different for workstation and desktop applications. Programming SMP-capable software is complex, error-prone and therefore expensive. Software that supports multiprocessing is mostly found in professional environments. Popular programs come from the areas of digital content creation, CAD and simulation, as well as from scientific applications that use the finite element method.

The following table lists some professional MP-capable desktop applications:

Desktop applications for MP systems

Software                         Category                                        Manufacturer
MoviePack                        Video editing                                   AIST
Maya                             Video editing, game development, 3D animation   Alias/Wavefront
Photoshop 6.0                    Image editing                                   Adobe
Premiere 6.0                     Video editing                                   Adobe
3D Studio MAX 4                  3D modelling, rendering, animation              Discreet
Parallel Performance for ANSYS   Add-on for several ANSYS products               ANSYS
Windows Media Encoder 7          Sound and video conversion                      Microsoft
LightWorks 5.6                   3D rendering                                    Lightwork
Parasolid 12.1                   3D modelling                                    UGS
Mental Ray 2.1                   Raytracing and rendering                        Mental Images
VMware 2.0.3                     Creates virtual PCs on a host system            VMware

There are also games that support multiprocessing. The number of titles, however, is very limited:

Games for MP systems

Software          Manufacturer
Quake III Arena   id Software
Falcon 4.0        Microprose
Starsiege         Dynamix

In addition, all 3D games based on the Quake III engine support SMP.

Windows and SMP

Windows 95/98 and Windows ME do not support multiprocessor systems. Windows NT 4 and Windows 2000 use different operating system kernels for single-processor and multiprocessor environments. In the case of Windows NT, only the files hal.dll and ntoskrnl.dll in the system32 folder are affected; during installation, Windows NT copies the correct versions depending on the system it finds.

Windows 2000 ships with numerous versions of hal and ntoskrnl. For example, an ACPI-capable and a standard variant exist for both the single-processor and the multiprocessor kernel. In addition, further system files are exchanged during installation depending on the number of CPUs.

It is possible to install the multiprocessor kernel on a single-processor system. In that case, system performance drops by a few per cent, because the more complex kernel produces significant overhead even on a single-processor machine.

Conclusion

Two CPUs are a sensible way to increase the performance of suitable applications. In the server space, numerous MP-capable applications are already available. The professional workstation segment, however, offers significantly fewer of them. Here it has to be weighed carefully whether the gained computing power and the time savings justify the higher price of an MP system.

For desktop applications and games, there is hardly anything available. The reasons are the complex development and the small installed base in the individual segments.

Single-threaded applications and benchmarks usually run slower on MP systems than on single-processor systems. One reason is the overhead needed to synchronize the processors. Only when several programs place load on the CPUs does the system as a whole gain performance.

(cvi)