Knockout: Intel’s Woodcrest 3.0 GHz outclasses AMD’s Opteron

23.05.2006 by Christian Vilsbeck
Intel’s Woodcrest processor sets the pace with a new architecture: two cores at 3.0 GHz, a 4 MB L2 cache, and a 1333 MHz front side bus. tecCHANNEL tested this Core-based CPU before its launch in June!

The German-language version of this article is available here.

Intel can finally hold its head high again. For a long time, Intel had nothing serious to pit against AMD’s Opteron processor. In particular, Intel’s 2.8 GHz dual-core Xeon Paxville DP for two-way systems was hopelessly inferior to the dual-core Opterons. But Intel is now ringing in a new era with the Xeon Woodcrest, whose new dual-core architecture delivers top-notch performance at significantly lower power consumption.

Part of Intel’s Xeon family, the Woodcrest processor for two-way systems runs at clock speeds of up to 3.0 GHz. Both cores share a single 4 MB L2 cache. Intel has raised the speed of the front side bus to 1333 MHz. Each Woodcrest processor has its own FSB to the chipset of the new platform, code-named Bensley for servers and Glidewell for workstations.

The official unveiling of the Xeon Woodcrest processor is scheduled for June 2006. This Core-architecture processor for the LGA771 socket is pin-compatible with the Xeon 5000 series Dempsey, which will soon become superfluous in the face of the Woodcrest chip’s impressive performance. tecCHANNEL visited Intel’s D1D fab in Hillsboro, Oregon, where we used our own software to conduct an advance test of the new-generation Xeon Woodcrest processor clocked at 3.0 GHz. The Woodcrest chip was pitted against earlier Xeon models, the new Dempsey-based Xeon 5000 series, plus AMD’s dual-core Opteron 280 and the single-core Opteron 254.

The Xeon Woodcrest

Intel will brand the Woodcrest processor for dual-socket systems with 5100 series numbers. We tested the fastest one – Xeon 5160 at 3.0 GHz. According to the roadmap known to tecCHANNEL, the June 2006 launch will include a lineup of the following processors: 5110 (1.66 GHz), 5120 (1.83 GHz), 5130 (2.00 GHz), 5140 (2.33 GHz), and 5150 (2.67 GHz). All these Xeons share a 4 MB L2 cache, but their front side bus differs – the Xeon 5110 and 5120 use a 1066 MHz FSB, while the faster Woodcrest chips have a 1333 MHz FSB.

The 5100 series Xeon chips for the Bensley & Glidewell platform use an LGA771 socket. The new server and workstation motherboards can also be fitted with the Xeon 5000 series Dempsey models. The Xeon 5070 at 3.46 GHz featuring NetBurst architecture has a TDP rating of 130W, compared to a TDP of just 80W for the 3.0 GHz 5160 Woodcrest. Intel rates the slower Woodcrest processors at 40 and 65W TDP.

Intel’s Woodcrest supports EM64T for 64-bit computing, a must for a new architecture. Vanderpool technology, with its VT-x instruction set extensions, provides hardware support for virtualization on these CPUs. A standard feature of the Xeon 5100s is XD technology for enhanced protection against viruses and buffer overflows, supplemented by SpeedStep for dynamically lowering the frequency and core voltage. However, the Core-architecture Xeons lack Hyper-Threading.

Unlike the Dempsey Xeons, Woodcrest places both processor cores on a single die with a shared L2 cache, whereas each Dempsey core has its own 2 MB L2 cache. Both the Woodcrest and Dempsey processors are built on Intel’s new 65-nanometer process technology.

According to our roadmap, Woodcrest processors will be priced in 1,000-unit quantities from US$209 for the Xeon 5110 to US$851 for the top of the line model – Xeon 5160.

Details of the Core architecture used in Woodcrest processors are described in tecCHANNEL’s article Pole Position Change: Intel’s New Core Processors. In-depth information on the new Woodcrest platform is in the article Brand New: Intel’s Xeon Platform - Bensley & Glidewell.

CPU2000: SPECint_base2000

We ran the SPEC benchmarks in a realistic configuration under Windows Server 2003, compiling all integer tests for the base rating with Intel’s C++ 9.0 and MS Visual Studio .NET. We used no libraries specially optimized for a particular processor.

The SPECint_base2000 benchmark runs single-threaded, and thus it does not take advantage of hyper-threading or dual cores. The results serve as indicators of the processor’s integer performance.
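As background on how such a score is formed: in SPEC CPU2000, each benchmark’s ratio is the reference run time divided by the measured run time, scaled by 100, and the overall base score is the geometric mean of those ratios. The following is only a minimal sketch of that arithmetic; the benchmark names and timings are hypothetical placeholders, not values from our test runs.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """Per-benchmark ratio: reference time over measured time, scaled by 100."""
    return 100.0 * reference_seconds / measured_seconds

def geometric_mean(values):
    """Geometric mean, computed via logarithms to avoid overflow."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference_seconds, measured_seconds) pairs, for illustration only.
runs = {
    "bench_a": (1400.0, 70.0),
    "bench_b": (1800.0, 120.0),
    "bench_c": (3000.0, 150.0),
}

ratios = {name: spec_ratio(ref, t) for name, (ref, t) in runs.items()}
score = geometric_mean(ratios.values())
print(f"base score: {score:.1f}")
```

Because the geometric mean is used, a single outlier benchmark influences the overall score less than it would with an arithmetic mean.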

Analysis of Integer Performance

In all integer tests, the Core-based Xeon 5160 (Woodcrest) was well ahead of the NetBurst-based Xeon 5070 (Dempsey). In the single-threaded tests of SPECint_base2000, Woodcrest benefited from its Advanced Smart Cache: when only one core is working, it can use all of the shared 4 MB of L2 cache.

The impact of the large L2 cache was quite evident for 300.twolf, a compute-intensive place-and-route simulation program. The 3.0 GHz Woodcrest was 89% faster than the 3.46 GHz Dempsey, since it can hold all of the required data in its 4 MB L2 cache. The Xeon 5160 even delivered 129% better performance with the 181.mcf planning software.

An L2 cache of 512 KB is quite adequate for the data compression utility 164.gzip. Neither an L2 cache of 2 or 4 MB nor faster memory brings any advantage here. The result therefore reflects pure integer performance, where the Woodcrest was a clear 59% ahead of the NetBurst-based Dempsey. The relative performance was similar for ray tracing with 252.eon, which runs primarily in the L1 cache.
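To make the percentages above easier to interpret, the sketch below shows how an “X% faster” figure follows from two scores, and how it can be normalized for the different clock speeds of the two chips. The per-clock normalization is our own illustrative addition and assumes that performance scales linearly with frequency, which is a simplification; the function names are hypothetical.

```python
def percent_faster(score_a, score_b):
    """How much faster A is than B, in percent, given higher-is-better scores."""
    return (score_a / score_b - 1.0) * 100.0

def per_clock_advantage(percent, clock_a_ghz, clock_b_ghz):
    """Advantage of A over B per clock cycle, assuming linear clock scaling."""
    ratio = 1.0 + percent / 100.0            # A's performance relative to B
    per_clock_ratio = ratio * clock_b_ghz / clock_a_ghz
    return (per_clock_ratio - 1.0) * 100.0

# The 300.twolf figure from the text: 89% faster at 3.0 GHz versus 3.46 GHz.
twolf_per_clock = per_clock_advantage(89.0, 3.0, 3.46)
print(f"300.twolf per-clock advantage: {twolf_per_clock:.0f}%")
```

Under that assumption, the 89% lead at a lower clock speed corresponds to roughly 118% more work per clock cycle.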

CPU2000: SPECint_rate_base2000

We ran the SPEC CPU2000 benchmarks in a realistic configuration under Windows Server 2003, compiling all integer tests for the base rating with Intel’s C++ 9.0 and MS Visual Studio .NET. We used no libraries specially optimized for a particular processor. The results are a good measure of how processors perform when standard software runs in parallel.

The benchmark suite of SPECint_rate_base2000 determines the maximum throughput ratio of integer computations for multi-tasking applications, with multiple copies of the benchmark running in parallel. Typically, the number of tasks/copies reflects the number of virtual processors in the system.

Optimized Data from Manufacturers: SPECint_rate_base2000

The SPEC Web site lists the highly optimized results of CPU2000 benchmarks released by manufacturers of processors and suppliers of servers, workstations, and PCs. Some of these results include several compilers and libraries written specially for the CPUs.

These manufacturer-submitted SPECint_rate_base2000 results show the highest integer performance of processors in a multi-tasking environment.

CPU2000: SPECfp_base2000

We ran the SPEC benchmarks in a realistic configuration under Windows Server 2003, compiling all floating point tests for the base rating with Intel’s C++ 9.0, MS Visual Studio .NET, and Intel’s Fortran 9.0. We used no libraries specially optimized for a particular processor.

The SPECfp_base2000 benchmark runs single-threaded, and thus it does not take advantage of hyper-threading or dual cores. The results serve as indicators of the processor’s floating point performance.

Like the Intel CPUs, AMD’s Opteron processors support SSE3. However, Intel compilers with the SSE3 optimization switch -QxP/fast did not work with AMD CPUs. We tested AMD64 processors with the compiler switch -QxW and SSE2 support.

Although you can use a patch to get around a processor query with Intel compilers, SPEC’s strict rules allow publication of results only with officially available hardware and software. Intel accordingly does not provide support for getting around the CPU query in its compilers.

Floating Point Analysis

Floating point applications in the SPEC CPU2000 benchmark suite are much more compute-intensive than the integer tests. The larger the cache, the better it can buffer slow memory accesses. However, even large caches are of little benefit for some compute-intensive programs; what really counts there is memory bandwidth. This becomes quite apparent in the shallow water modeling of 171.swim. In this case, the Woodcrest with its 4 MB L2 cache is barely faster than the 3.46 GHz Dempsey with its 2 MB L2 cache (the second core’s cache remains unused).

Nevertheless, the four-channel FB-DIMM memory of the Bensley platform boosts both CPUs. In 171.swim, the CPU reads a multitude of data blocks in burst mode from a large 1335 x 1335 data array.

For image recognition with 179.art, for instance, the Woodcrest processor turned out to be 134% faster than the 3.46 GHz Dempsey. This is because the majority of the workload fits into the Woodcrest’s 4 MB shared L2 cache, whereas the Dempsey’s 2 MB cache is inadequate. The Opteron processors, with 1 MB of L2 cache per core, performed even worse.

The Woodcrest processor’s high SSE performance is evident – independent of the workload size – for all floating point applications in the CPU2000 benchmark suite.

CPU2000: SPECfp_rate_base2000

We ran the SPEC CPU2000 benchmarks in a realistic configuration under Windows Server 2003, compiling all floating point tests for the base rating with Intel’s C++ 9.0, MS Visual Studio .NET, and Intel’s Fortran 9.0.

The benchmark suite of SPECfp_rate_base2000 determines the maximum throughput ratio of floating point computations for multi-tasking applications, with multiple copies of the benchmark running in parallel. Typically, the number of tasks/copies reflects the number of virtual processors in the system.

Optimized Data from Manufacturers: SPECfp_rate_base2000

SPEC.org lists the highly optimized results of CPU2000 benchmarks released by manufacturers of processors and suppliers of servers, workstations, and PCs. Some of these results include several compilers and libraries written specially for the CPUs.

These manufacturer-submitted SPECfp_rate_base2000 results show the highest floating point performance of processors in a multi-tasking environment.

Floating Point: Linpack Linux 64-bit

Linpack is a widely used tool for determining the floating point performance of high-end computers. It solves a dense system of linear equations and states the result in flops, that is, floating point operations per second.

We deployed the 64-bit version of Linpack 2.1.2 for the SUSE Linux 64-bit edition. The SMP-capable benchmark requires SSE3 support on EM64T processors. AMD’s Opteron processors with SSE3 also work smoothly with Linpack builds produced by Intel compilers.

With Linpack 2.1.2, all processors use their SSE3 instructions. The Core-based CPUs reached a peak rate of 31.42 GFLOPS in our test with Linpack version 3.0, which is specially optimized for Woodcrest. Version 3.0 takes advantage of 16 new SSE4 instructions in the Core architecture. SSE4 is a provisional name; Intel has not yet settled on an official name for the new multimedia instructions.
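For readers who want to see where a GFLOPS figure comes from: Linpack conventionally credits the solution of a dense n x n system with 2/3·n³ + 2·n² floating point operations and divides that count by the elapsed time. The sketch below shows only this arithmetic; the problem size and run time are hypothetical and are not taken from our measurements.

```python
def linpack_flop_count(n):
    """Conventional operation count for solving a dense n x n linear system."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2

def linpack_gflops(n, elapsed_seconds):
    """Sustained rate in GFLOPS for a solve of size n that took elapsed_seconds."""
    return linpack_flop_count(n) / elapsed_seconds / 1e9

# Hypothetical example: a 10,000 x 10,000 system solved in 30 seconds.
example_rate = linpack_gflops(10_000, 30.0)
print(f"{example_rate:.2f} GFLOPS")
```

Since the cubic term dominates, doubling the problem size increases the work roughly eightfold, which is why Linpack results are always quoted together with the problem size.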

Analysis: SunGard Adaptiv Credit Risk

SunGard’s Adaptiv Credit Risk 2.5 is a financial analysis tool. It applies modified Monte Carlo simulations to predict the future value of an investment based on current market data.

SunGard’s Adaptiv Credit Risk was programmed in C# for Microsoft’s .NET environment. This multi-threaded analysis tool does not use math libraries such as Intel’s MKL or AMD’s Core Math Library (ACML), but it is well suited to multi-processor systems. The SunGard benchmark works primarily with integer operations.
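To illustrate the general idea of a Monte Carlo valuation, the sketch below simulates many possible future values of an investment and averages them. It uses a textbook geometric Brownian motion model with made-up parameters; it is not SunGard’s modified method, whose details are not described here, and every name in it is hypothetical.

```python
import math
import random

def simulate_future_values(start_value, drift, volatility, years, paths, seed=42):
    """Draw `paths` possible end values under geometric Brownian motion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = []
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)  # one standard normal shock per path
        exponent = ((drift - 0.5 * volatility ** 2) * years
                    + volatility * math.sqrt(years) * z)
        values.append(start_value * math.exp(exponent))
    return values

# Assumed inputs: value 100, 5% drift, 20% volatility, one-year horizon.
outcomes = simulate_future_values(100.0, 0.05, 0.20, 1.0, paths=20_000)
mean_value = sum(outcomes) / len(outcomes)
print(f"estimated mean future value: {mean_value:.2f}")
```

Each path is independent of the others, which is why this kind of workload spreads so readily across multiple cores and processors.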

Rendering: CINEBENCH 9.5

CINEBENCH 9.5 is based on Maxon’s Cinema 4D release 9.5 and is designed to execute shading and ray tracing tests. The ray tracing test checks the rendering performance of processors. The graphics card’s performance plays only a minor role in this test, which focuses on the FPU.

For the ray tracing test with CINEBENCH 9.5, more memory and greater FSB bandwidths are of little advantage, since the workload runs mostly in the first two cache levels.

Rendering: 3ds Max

The 3ds Max 7 software from Discreet/Autodesk is designed for professional 3D modeling, animation, and rendering. It makes full use of multi-processing for rendering tasks. The benchmark suite SPECapc from SPEC.org served as the basis for the rendering scenes. The performance of the graphics card was irrelevant for rendering, and we did not use the OpenGL based tests of the SPECapc suite.

Encoding: LAME 3.97a

Alongside the Fraunhofer variants, LAME has become the best-known MP3 codec. The open source LAME codec supports variable and constant bit rates and converts WAV files to MP3 files.

The Technion (Israel Institute of Technology) provided the 32-bit and 64-bit versions of the LAME MP3 encoder. We used a single thread for encoding to assess the performance of the various CPU architectures. LAME 3.97a uses SSE2 instructions for encoding.

Cache/Memory: 32-bit Transfers

We checked the cache and memory performance of the processors with our tecMem program from the tecCHANNEL Benchmark Suite Pro for Windows Server 2003. tecMem measures the effective memory bandwidth between the CPU’s load/store unit and the various levels of the memory hierarchy: L1 cache, L2 cache, and RAM. The results allow a differentiated analysis of load, store, and move operations.
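The principle behind such a measurement can be sketched in a few lines: copy a buffer of a given size repeatedly, time the fastest pass, and divide the bytes moved by the time taken. The code below is only a rough illustration of that idea and not how tecMem works internally; the buffer sizes are assumptions chosen to fall roughly into L1, L2, and RAM, and interpreter overhead means the absolute numbers are far less precise than those of a native tool.

```python
import time

def copy_bandwidth_mb_s(size_bytes, repeats=20):
    """Best-case bandwidth in MB/s for copying size_bytes between two buffers."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                      # bulk "move" of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)                # guard against a zero timer reading
    return size_bytes / best / 1e6

# Assumed working-set sizes: small (cache-resident) up to large (RAM-bound).
sizes = {"16 KB": 16 * 1024, "1 MB": 1024 * 1024, "32 MB": 32 * 1024 * 1024}
results = {label: copy_bandwidth_mb_s(n) for label, n in sizes.items()}
for label, mb_s in results.items():
    print(f"{label:>6}: {mb_s:,.0f} MB/s")
```

Taking the fastest of several passes filters out interruptions by the operating system, so the figure approximates the best bandwidth the memory level can sustain.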

Cache/Memory: 128-bit Transfers

The 128-bit SSE instructions allow the CPU to reach its maximum cache and memory performance.

Summary

In February 2006, Intel boasted that its upcoming processors with the Core architecture would be 20% faster than competing products from its rival AMD. What seemed at the time to be a somewhat presumptuous claim has turned out to be true. Our benchmark tests confirmed that the Xeon Woodcrest processor for servers and workstations far outclasses the rest of the x86 field, including AMD’s Opteron CPUs and the new Xeon 5070 Dempsey with NetBurst architecture.

The respected CPU2000 benchmark suite demonstrated that the 3.0 GHz Woodcrest is between 35% and 77% faster (base and rate) than the 3.46 GHz Xeon 5070. The Woodcrest’s superiority is impressive, and not only in CPU2000. In the other comparisons, with cache-bound rendering programs, audio encoding, and the extremely compute-intensive Linpack tests, the Woodcrest processor was 20% to 50% ahead of the next best competitor. In these tests, no difference whatsoever was discernible between single-threaded and multi-threaded applications.

Intel is planning to unveil the Xeon 5100 Woodcrest processor for the Bensley/Glidewell platform in June 2006. Following this great downfield pass, AMD will have to go all out to catch up. It is doubtful that the Socket F Opteron with a DDR2 memory controller scheduled for Q3 of 2006 will surprise us with such a boost in performance. Nevertheless, one should not underestimate AMD – perhaps a new generation of impressive Opteron processors is just around the corner! (cvi)

Test Platforms

Details on the various test configurations are in the following tecCHANNEL article.