Microsoft SQL Server 2000 (64-bit): Intel Itanium Processor Touchstone

 

Intel Corporation

December 2003

Applies to:
    Microsoft® SQL Server™ 2000 Enterprise Edition (64-bit)

Summary: Learn about the capabilities and performance of SQL Server 2000 (64-bit) and the Itanium architecture. (12 printed pages)

Contents

Introduction
The 64-bit OS
WOW64 & IA32 EL
The 64-bit Development Environment
Itanium Processor-SQL Performance
Summary

Introduction

In April 2003, Microsoft® officially announced Windows Server™ 2003, a long-awaited release that marked the availability of true 64-bit capabilities in the high-volume, mainstream Windows server operating system family. The OS announcement coincided with the announcement and availability of the first 64-bit back office application for Windows: Microsoft SQL Server™ 2000 Enterprise Edition (64-bit). Their availability is a development as significant as the introduction of fully protected virtual addressing in Microsoft Windows NT®. Both are designed and optimized for the Itanium processor, the first high-volume, mainstream 64-bit architecture. 64-bit architectures lead directly to higher performance, increased stability, and lower costs by sharply reducing IO subsystem requirements through support for very large memory configurations (up to half a terabyte in Windows Server 2003). With Windows Server 2003 and SQL Server 2000 it's now possible to fully service database queries and transactions in native 64-bit instructions. The capabilities and performance of 64-bit SQL are a touchstone, an evaluation standard, for the potential of the Itanium® architecture; they clearly demonstrate the direction of the Itanium architecture and how Itanium-based servers seamlessly interoperate with 32-bit application servers in the Microsoft Windows® environment. Database administrators and developers are perfectly poised to take advantage of Itanium.

The 64-bit Software Stack

From the beginning, the Intel® Itanium architecture served as the standard development platform for 64-bit Microsoft product ports. Intel continues to engage Microsoft, building upon five years of intense work on multiple Itanium processor fronts including the operating system, the development environment, and the application layer (including SQL). Intel works hard at technology and methodology transfers, strongly supports 64-bit software development and porting, and provides detailed analysis to guarantee Itanium processor specific optimizations and permanence. For instance, micro-architectural analysis of the code flow in critical areas in the OS and application layer is constantly evaluated for optimal scheduling, latency, and contention. New compiler technologies were also prototyped and pushed to maturity on Itanium processors, and new Microsoft Win32® API extensions now support the Itanium processor's extended virtual address space. Such capabilities are found throughout the Microsoft software stack.

The 64-bit OS

In the four years since the first public demonstration of 64-bit Windows in 1999, the OS has been fully ported and now runs native Itanium instructions for both the kernel and supporting processes. The LLP64 Windows pointer model (int and long remain 32 bits wide, "long long" and pointers are defined as 64 bits wide) provided for minimal code changes during the port. With the exception of architecture-specific design in the kernel (some examples appear in Table 1 below), 32-bit and 64-bit Windows share common source code. For example, the Windows user interface remains the same. All the familiar Windows tools from Microsoft Notepad to Microsoft Internet Explorer can be produced in 32-bit or 64-bit images. The same holds true for application development: 32-bit code and 64-bit code can share common source.
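The practical consequence of LLP64 for shared source is that code must not assume a pointer fits in a long. The following is a minimal sketch, not taken from the Windows sources; the function names are hypothetical and the data model is inferred only from type widths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Name the data model of the current build, judged only by type widths. */
const char *data_model(void)
{
    if (sizeof(void *) == 8 && sizeof(long) == 4) return "LLP64"; /* 64-bit Windows */
    if (sizeof(void *) == 8 && sizeof(long) == 8) return "LP64";  /* most 64-bit Unix systems */
    if (sizeof(void *) == 4 && sizeof(long) == 4) return "ILP32"; /* 32-bit Windows and Unix */
    return "other";
}

/* Portable: uintptr_t is defined to be wide enough to hold a pointer. */
int survives_uintptr(const void *p)
{
    uintptr_t bits;
    assert(sizeof(uintptr_t) >= sizeof(void *));
    bits = (uintptr_t)p;
    return (const void *)bits == p;
}

/* Not portable: returns 0 when the pointer value needs more bits than
   unsigned long has, which can happen under LLP64 where long is 32 bits. */
int survives_unsigned_long(const void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (uintptr_t)(unsigned long)bits == bits;
}
```

Source that stores pointers in uintptr_t (or the equivalent Windows pointer-sized types) compiles unchanged for both 32-bit and 64-bit targets, which is what allows a single code base to produce both images.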

Scaling is a major feature of Windows Server 2003. After almost 10 years of Windows NT availability, 32 processors is the maximum supported configuration for 32-bit platforms. For Itanium, however, the first Windows Server release marked the availability of the several large-scale Itanium-based enterprise systems listed in Table 2 below, with up to 64 processors and 512 GB of memory. This alone is a strong indicator of both the potential of the architecture and the commitment from both Microsoft and Intel. On any OS, the elimination of scaling issues can take years of study and redesign. Among other things, the work involves detailed analysis of bus transactions using hardware traces and dealing with the multiple application classes, their behavior, and backward compatibility requirements. Itanium scaling is a stellar achievement in the very first Windows OS release.

Table 1. Itanium features and design specifics

Itanium Feature Design Specifics
Virtual addressing Windows for Itanium processors uses an 8KB page size versus a 4KB page size for Windows on x86. Page tables are laid out in virtual memory in what the Itanium architecture calls the virtual hashed page table (VHPT). As an aside, it's actually possible to incur a double page fault: a page isn't mapped in the DTLB or ITLB, and the page-table page required for the lookup isn't virtually mapped either. WOW64 (see below) process virtual addressing is required to operate in the low 32 address bits. Transitions into kernel space must be thunked.
Register Set The Itanium processor has 128 integer registers, of which the top 96 are part of a rotating register stack. The OS is required to "spill" the registers out to memory when the register stack overflows in deep call stacks.
Exceptions Itanium processor features like the register stack require redesign of the stack unwinding process used by exception handling. Additionally, the Itanium processor compiler has the capability of issuing speculative code, and under the rare speculation failure, the OS must trap a speculation exception and branch to specific recovery code.
Context Switch As alluded to above, the register environment for the architecture is more robust than on x86-based systems. The OS is required to save and restore a different, larger set of registers during thread context switches. In database environments, however, context switches are driven by IO processing. The Itanium architecture reduces IO requirements, and therefore context switches, through large memory configurations.
Debugging The OS must recognize and provide handlers for Itanium processor specific faults from the debug registers, including the break instruction.

Table 2. Platforms and processor support

Maximum Processor Support Platform
16 Unisys ES7000 Orion 130 Enterprise Server
32 NEC Express 5800/1320Xc
64 HP Integrity Superdome

WOW64 & IA32 EL

Legacy code support has been a hallmark of Windows. In fact, it is still possible to insert a bootable DOS floppy disk into the drive of an 8-processor IA32 (x86)-based server, power it up, and work with the familiar DOS command line interface. DOS will only run with a single processor, but incredibly, it just works.

On Itanium, backward compatibility only extends back to 32-bit applications, not the dark ages of early PC platform definition. For instance, the Itanium processor Extensible Firmware Interface (EFI) precludes DOS booting by essentially providing a miniature OS that inhabits platform NVRAM (not the disk boot sector) and replaces the NT loader. The WOW64 Windows component (Windows on Windows), however, transparently handles 32-bit legacy code support: IA32-based Win32 applications still run on Itanium-based systems (a caveat being that software vendors may still choose to restrict installation of 32-bit executables, such as support tools that haven't yet been ported). The portable executable format (PE32) explicitly identifies the native architecture. At process initialization, the Windows loader intercepts all 32-bit executables and loads them into a truncated 32-bit wide address space as a WOW64 process, where the Itanium processor transparently executes the IA32 instructions.

A full, IA32-specific subsystem is established as part of the standard Windows installation process, and 32-bit stock versions of all the standard libraries normally found in the \windows\system32 directory are placed under \windows\syswow64. User to kernel transitions in ntdll.dll, user32.dll, and gdi32.dll are properly thunked for the Itanium processor 64-bit kernel environment, but otherwise a 32-bit application will load and execute as before.

In the Windows Server 2003 SP1 timeframe, a new Windows component will be enabled: IA32 EL, the IA32 execution layer. IA32 EL offers run-time binary translation of IA32 instructions into IA64 instructions. Such run-time translation techniques are widely used in managed-code environments like Java and Microsoft's .NET.

There are some benefits to IA32 EL versus hardware support. First, Itanium processors offer a subset of the IA32 instruction set, and translation provides the capability to map IA32 instructions onto the rich set of parallel integer and floating point instructions offered in the IA64 instruction set. Compared to the hardware execution of limited, native IA32 instructions, a significant amount of instruction level parallelism can be uncovered. A second benefit concerns minimizing IA32 hardware support, which reduces chip real estate, simplifies architectural implementations, and reduces production costs.

Reliability, Availability, and Serviceability (RAS) and Machine Check Architecture (MCA)

An often-overlooked design feature of Itanium processor platforms running Windows is the built-in MCA capability. MCA provides a methodology for correcting and reporting specific classes of errors to the operating system. Errors like memory subsystem ECC faults or ECC faults in the processor cache can be intercepted, corrected, and reported to the OS, where they can be recorded. This happens "behind the scenes" without user intervention or an unrecoverable system error. Recording (logging) these errors allows Windows to predict future errors based on past events and allows system administrators to identify problematic components to be replaced during regularly scheduled downtimes. Looking forward, for unrecoverable errors, it becomes possible to limit the error propagation to the process level, by doing such things as selectively unmapping memory and terminating a single process. Further details on MCA support can be found with the following links:

The 64-bit Development Environment

Itanium Architectural Features

A full discussion of Itanium microarchitecture features is out of the scope of this article, but a brief summary is in order. At this point, the architecture is most notable for what most sets it apart from the legacy of IA32: a 64-bit address space. In other respects too, it's a very different beast. The compiler plays a larger performance role. The new instruction set was designed for high exposure of instruction level parallelism (ILP), and the compiler exposes this parallelism. Itanium2®-based implementations have been designed with very large and fast caches, to the benefit of server applications. Brief details on Itanium processor specific features are outlined in the table below; full details can be found in the Intel Itanium Architecture Software Developer's Manual.

Table 3. Architectural features and comments

Architectural Feature Comments
Register Stack Engine 128 integer and 128 floating point registers. The compiler is afforded much more flexibility by having access to a large register set. The stack provides for fast procedure calls by passing arguments in registers as opposed to the stack.
Predication Instructions, including branches and memory operations, can be explicitly predicated on specific conditions. For instance, this allows if-then blocks to be laid out flat in IA64 assembly without branch instructions, reducing misprediction.
Large High Speed Cache At present, Itanium2 processors offer on-die L3 cache sizes as large as 6MB. The 6MB cache in turn offers very low latencies, as low as 14 cycles for integer operations.
ILP ILP allows independent instructions to be scheduled on a large number of available functional units in bundles 3 instructions wide. On Itanium2 processors, 2 instruction bundles can be executed per clock, for a maximum of six instruction issues per clock.
Data Speculation C/C++ optimization can be difficult for compilers since it is difficult to disambiguate pointers, and the compiler must explicitly order memory operations for the worst case. Advanced loads allow the compiler to hoist operations that can't be disambiguated further back in the instruction sequence. This allows memory latency to be hidden behind intervening instruction issue before the data is utilized.
Control speculation With the same goal of hiding latency, the compiler is also given the flexibility of scheduling a load instruction before a branch that precedes it.
Performance Monitoring Itanium architecture implementations have the richest set of performance counters of any Intel architecture. These counters allow developers to quickly determine the exact cause of performance bottlenecks using tools like Intel's VTune Performance Analyzer. In fact, the Microsoft SQL group used Itanium processor counters to identify and eliminate performance bottlenecks that had long existed in SQL source code and could not be exposed in any other way.
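
The pointer disambiguation problem described under Data Speculation can be seen in a few lines of C. This is an illustrative sketch with hypothetical function names, not compiler output: the first function shows the conservative ordering the language requires, the second shows the hoisted load a compiler would like to schedule early.

```c
#include <assert.h>
#include <stddef.h>

/* Conservative form: *factor is read on every iteration, because the
   store to dst[k] may have modified the location factor points to. */
void scale_reload(int *dst, const int *factor, size_t n)
{
    size_t k;
    assert(dst != NULL && factor != NULL);
    for (k = 0; k < n; k++)
        dst[k] *= *factor;
}

/* Hoisted form: the load is moved above the loop. This is only
   equivalent when factor does not point into dst. */
void scale_hoisted(int *dst, const int *factor, size_t n)
{
    size_t k;
    int f;
    assert(dst != NULL && factor != NULL);
    f = *factor;
    for (k = 0; k < n; k++)
        dst[k] *= f;
}
```

When factor points at an element of dst, the two functions produce different results, so a compiler cannot make this transformation on its own unless it can prove the pointers are distinct. An advanced load lets the compiler hoist the load anyway and check afterward whether an intervening store touched the same address.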

Toolset specifics

As explained above, Windows Server 2003 for Itanium processors provides native execution for SQL and a 32-bit execution environment, but there are certainly development needs for providing native 64-bit applications that share Itanium processor platforms with SQL. The freely available Windows Platform Software Development Kit (PSDK) provides the 64-bit Windows development environment to make this happen. Included in the PSDK are the complete 64-bit compiler toolset and a couple of 64-bit debuggers. Porting issues are succinctly summarized elsewhere in a great article by Stan Murawski, Beyond Windows XP: Get Ready Now for the Upcoming 64-bit Version of Windows. The article includes details about data types, alignment issues, and a method for getting IA32 builds to coexist with IA64 builds in Visual Studio projects. While the article and the PSDK documentation are excellent primers on porting, the Itanium processor 64-bit compilation environment is an evolution of compiler technology and warrants additional discussion.

The IA64 instruction set was designed with the goal of putting much more capability in the hands of the compiler. The Itanium compiler has the flexibility to issue control or data speculation to hide memory latency, and instruction predication to avoid branch misprediction; memory latency and branch misprediction are the two largest limiters of high-performance program execution (additional details of the Itanium architecture are summarized in the Itanium Architectural Features section above). Optimization capabilities of the compiler include:

  • PoGO
  • WPO/Global scheduling
  • Memory hierarchy optimizations (cache line width and cache size)
  • Branch optimizations
  • Predication
  • Speculation
  • Loop unrolling
  • Object oriented optimizations
  • Rich intrinsic support

Itanium architecture was the development platform for a couple of key technologies, Whole Program Optimization (WPO) and Profile Guided Optimization (PoGO), and these features are fully supported in the PSDK available toolsets. Documentation can be found in the readme.doc file in the bin\Win64 subdirectory of the PSDK installation directory. It is highly recommended that this document be reviewed in order to understand the rich feature set of the IA64 compiler.

WPO allows global optimizations instead of local, procedure-scoped optimizations. This capability enables such optimizations as inlining and custom calling conventions that reduce procedure prolog and epilog overheads. PoGO optimizations should be used by every application built natively for Itanium processors because performance gains in the 20-40% range above basic /Ox optimizations are realistic expectations. Essentially, PoGO uses runtime profile information and heuristics to make intelligent choices about inlining and branch behavior in an executable. PoGO optimization requires an additional flag during object compilation, /GL. Two link phases are then required, one with /LTCG:pgi and one with /LTCG:pgo. After the first link phase, the instrumented executable is run with typical input parameters to collect a profile of its runtime codepath behavior, and that profile is fed into the second link phase. Keep in mind, PoGO is not a requirement for execution and can easily be left to the late stages of the development process.

Large virtual pages

Another issue Itanium processor developers should be aware of: large 64-bit virtual address spaces don't come free. This isn't anything new, and it's a locality issue. It's also as much an issue for 32-bit systems as 64-bit systems, but 64-bit systems, generally speaking, work with much larger amounts of memory. For instance, in the familiar example that follows, there can be huge performance implications:

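Sequence A

    int i, j, sum = 0;
    char array[IEND][JEND];

    /* Inner loop walks along a row: adjacent bytes in memory. */
    for (i = 0; i < IEND; i++) {
        for (j = 0; j < JEND; j++) {
            sum += array[i][j];
        }
    }

Sequence B

    int i, j, sum = 0;
    char array[IEND][JEND];

    /* Inner loop walks down a column: accesses are JEND bytes apart. */
    for (j = 0; j < JEND; j++) {
        for (i = 0; i < IEND; i++) {
            sum += array[i][j];
        }
    }

Sequence C

    int i, j, sum = 0;
    char array[IEND][PAGESIZE];

    /* Each row occupies a full virtual page. */
    for (i = 0; i < IEND; i++) {
        for (j = 0; j < JEND; j++) {
            sum += array[i][j];
        }
    }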

Sequence B computes the same sum as Sequence A, but has poor data locality because C/C++ stores arrays in row-major order: consecutive values of j are adjacent in memory, while consecutive values of i are JEND bytes apart. Depending on the execution environment and the array dimensions, Sequence B might execute several orders of magnitude slower than Sequence A, because every access in Sequence A references the array member adjacent to the previous one, which preserves cache locality.
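
The difference between Sequence A and Sequence B can be made concrete by looking at the distance in bytes between successive inner-loop accesses. The following sketch uses small, arbitrarily chosen dimensions (the values of IEND and JEND are assumptions for illustration) and hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>

#define IEND 64
#define JEND 128

/* Sequence A: inner loop over j, so successive accesses are adjacent. */
long sum_row_major(char a[IEND][JEND])
{
    int i, j;
    long sum = 0;
    assert(a != NULL);
    for (i = 0; i < IEND; i++)
        for (j = 0; j < JEND; j++)
            sum += a[i][j];
    return sum;
}

/* Sequence B: inner loop over i, so successive accesses are a row apart. */
long sum_col_major(char a[IEND][JEND])
{
    int i, j;
    long sum = 0;
    assert(a != NULL);
    for (j = 0; j < JEND; j++)
        for (i = 0; i < IEND; i++)
            sum += a[i][j];
    return sum;
}

/* Bytes between a[i][j] and a[i][j+1]: the size of one element. */
size_t inner_stride_seq_a(char a[IEND][JEND]) { return sizeof a[0][0]; }

/* Bytes between a[i][j] and a[i+1][j]: the size of one whole row. */
size_t inner_stride_seq_b(char a[IEND][JEND]) { return sizeof a[0]; }
```

Both functions return the same total; only the access pattern differs. With a one-byte stride, many consecutive accesses fall in the same cache line, while a stride of a full row touches a new cache line on nearly every access once rows are wider than a line.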

Large pages address this problem at page-level granularity instead of the cache-level granularity above. In Sequence C, assume that the data layout is identical by index to Sequence A and Sequence B. In every outer loop iteration there's a good chance of incurring a data translation lookaside buffer (DTLB) miss because a new mapping from virtual memory page to physical memory page must be looked up and entered into the DTLB. The DTLB caches recent virtual page mappings in order to avoid the secondary references to the page table that would otherwise be required for every virtual address load or store. A DTLB miss is costly; when a mapping request misses in the DTLB, the OS must trap the miss and insert a new mapping into the TLB, and the overhead is several hundred CPU cycles.

Data access patterns dictate how often a DTLB miss is incurred. To reduce the miss rate, Windows Server 2003 supports allocation of virtual pages larger than the standard 8KB size. Large page size (granularity) is specific to the Windows version; for Server 2003, it's 16MB. As an example of the benefit, consider Sequence C again. As long as IEND<2048 (on Server 2003) and the array is allocated within a large page, probably only one TLB miss will be incurred. Large pages are allocated through the standard Win32 VirtualAlloc() API by specifying MEM_LARGE_PAGES in the allocation type. Details can be found in the current VirtualAlloc() API documentation. A quick note: the calling process must hold the lock-pages-in-memory privilege (SeLockMemoryPrivilege, identified in code as SE_LOCK_MEMORY_NAME) for this to work properly.
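
The 2048 figure follows directly from the two page sizes. The short sketch below works through the arithmetic; it assumes that PAGESIZE in Sequence C equals the standard 8KB page and that the array starts on a page boundary, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STANDARD_PAGE ((size_t)8 * 1024)          /* 8KB standard page */
#define LARGE_PAGE    ((size_t)16 * 1024 * 1024)  /* 16MB large page on Server 2003 */

/* Pages of the given size needed to cover `bytes` bytes,
   assuming the region starts on a page boundary. */
size_t pages_spanned(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Number of page-wide rows (as in Sequence C) that fit in one large page. */
size_t rows_per_large_page(void)
{
    return LARGE_PAGE / STANDARD_PAGE;
}
```

An array of 2048 page-wide rows spans 2048 standard pages, each needing its own DTLB entry, but only a single 16MB large page. One more row spills into a second large page.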

Large page allocations offer performance advantages for many situations. For instance, very large hash tables can span several hundred pages, and any single bucket lookup is likely to suffer a DTLB miss. If the hash table is referenced often and the entire table is allocated in large virtual pages, there's a significant increase in the probability that the mappings will remain resident in the DTLB. Another example involves linked list scans. If the linked structures are allocated over a long period of time (without temporal locality), then a pathological condition may be common: every forward pointer dereference suffers a DTLB miss. This can be remedied by allocating memory with large virtual pages, and allocating the linked structures from this area. At this time, unfortunately, Windows doesn't support large pages in standard Win32 heaps.
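
Because the standard heaps cannot be backed by large pages, the linked list remedy means carving nodes out of one contiguous region yourself. The following is a minimal sketch of such a bump allocator. All names are hypothetical, and malloc() stands in for the large-page region so that the sketch stays portable; on Windows the region would instead come from a large-page VirtualAlloc() allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node  { struct node *next; int value; };
struct arena { unsigned char *base; size_t used; size_t capacity; };

/* Reserve one contiguous region. malloc() is a stand-in (assumption)
   for memory obtained in large virtual pages. Returns 1 on success. */
int arena_init(struct arena *a, size_t capacity)
{
    assert(a != NULL);
    a->base = (unsigned char *)malloc(capacity);
    a->used = 0;
    a->capacity = a->base ? capacity : 0;
    return a->base != NULL;
}

/* Hand out the next pointer-aligned chunk, or NULL when the region is full. */
void *arena_alloc(struct arena *a, size_t size)
{
    size_t align = sizeof(void *);
    size_t start = (a->used + align - 1) / align * align;
    if (size > a->capacity || start > a->capacity - size)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Allocate a node from the arena and link it at the head of the list.
   Returns the new head, or NULL if the arena is exhausted. */
struct node *push_front(struct arena *a, struct node *head, int value)
{
    struct node *n = (struct node *)arena_alloc(a, sizeof *n);
    if (n == NULL)
        return NULL;
    n->next = head;
    n->value = value;
    return n;
}

/* Release the whole region at once; individual nodes are never freed. */
void arena_destroy(struct arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->capacity = 0;
}
```

Every node lives inside the same region regardless of when it was allocated, so a scan of the list touches only the pages that back the arena rather than pages scattered across the heap.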

The 64-bit application layer: SQL Server 2000 Enterprise Edition (64-bit)

Like Windows, the 32-bit and 64-bit versions of SQL Server share the same source code, eliminating maintenance nightmares. The differences between the two versions are limited to a few areas such as cacheline size specification and the virtual page size. There are also inline x86 assembly sections for low-level synchronization constructs where the Itanium architecture code utilizes a rich set of native IA64 intrinsics supported by the compiler.

Platform transparency

Architecture transparency is a premier feature of the online database format; the 32-bit and 64-bit formats are identical. The page size remains 8KB; the page header still comprises the first 96 bytes on a page. Embedded pointers are database format specific offsets, never virtual memory addresses. The database is allocated with identical extent sizes. It's possible to create a database on a SQL32 installation, back it up, and restore the database on a SQL64 installation. In Intel's Server Performance Lab, engineers work on comparative performance analysis between multiple 32-bit and 64-bit systems using a single large multi-terabyte disk subsystem and a single database. Backups on a 32-bit system can be restored on a 64-bit system. By design, the data is totally hardware agnostic.
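
The reason offsets make the format hardware agnostic can be shown with a small sketch. The header layout below is purely hypothetical and is not the actual SQL Server page format; only the 8KB page size and the 96-byte header size come from the text above. Fixed-width fields keep the layout the same size on 32-bit and 64-bit builds, and a record is located by its byte offset within the page rather than by a virtual address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   8192u
#define HEADER_SIZE 96u

/* Hypothetical header fields, for illustration only. */
struct page_header {
    uint32_t page_id;
    uint16_t slot_count;
    uint16_t first_record;   /* byte offset from the start of the page */
    uint8_t  reserved[88];   /* pads the header to 96 bytes */
};

struct page {
    struct page_header header;
    uint8_t body[PAGE_SIZE - HEADER_SIZE];
};

/* Resolve an on-page offset to an address in whatever buffer holds the page. */
const uint8_t *record_at(const struct page *p, uint16_t offset)
{
    assert(p != NULL && offset >= HEADER_SIZE && offset < PAGE_SIZE);
    return (const uint8_t *)p + offset;
}

/* Copy a record into the page at the given offset and remember where it is. */
void put_record(struct page *p, uint16_t offset, const void *data, size_t len)
{
    assert(p != NULL && offset >= HEADER_SIZE && len <= PAGE_SIZE - offset);
    memcpy((uint8_t *)p + offset, data, len);
    p->header.first_record = offset;
    p->header.slot_count = 1;
}
```

A page copied byte for byte into a different buffer, at a different address and on a machine with a different pointer width, still resolves its records correctly, which is what a backup and restore between SQL32 and SQL64 relies on.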

Memory management

In 32-bit Windows, there are few options for dealing with the 4GB absolute limit imposed by a 32-bit virtual memory design. The /3GB boot option supported by Windows expands the user addressable virtual memory from 2 gigabytes to 3. Beyond that, developers must depend on physical address extension (PAE), supported on Pentium III and later processors, in combination with the Windows Address Windowing Extensions (AWE). The 32-bit SQL Buffer Manager allocates more than 3GB of physical memory and uses AWE to map a specific physical memory window (or range) into virtual memory. For instance, on a platform with a large 8GB memory configuration, a 5GB dataset can be held entirely in physical memory, and the buffer manager will selectively map subsets of up to 3GB into virtual memory to avoid disk IO. The remaining 2GB is unmapped in virtual memory and must be mapped with an expensive Win32 MapUserPhysicalPages call that unmaps other memory from the 3GB of virtual space. While AWE operations are orders of magnitude less expensive than IO operations (an obvious performance advantage for 32-bit platforms), there's still an order of magnitude difference between AWE and directly referencing a virtual address.
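
The cost pattern of windowed access can be illustrated with a toy model. This is not the AWE API and not how the SQL Buffer Manager is implemented; it is a hypothetical simulation in which a 5-unit dataset resides in physical memory, only 3 units are visible through the virtual window at a time, and a round-robin replacement policy is assumed. The counter stands for the number of expensive remap calls.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW_SLOTS 3   /* units visible in virtual memory at once */
#define DATA_UNITS   5   /* units resident in physical memory */

struct window {
    int    slot_unit[WINDOW_SLOTS];  /* which data unit each slot shows, -1 if none */
    size_t next_victim;              /* round-robin replacement position */
    size_t remaps;                   /* number of remap operations so far */
};

void window_init(struct window *w)
{
    size_t s;
    assert(w != NULL);
    for (s = 0; s < WINDOW_SLOTS; s++)
        w->slot_unit[s] = -1;
    w->next_victim = 0;
    w->remaps = 0;
}

/* Return the slot that shows `unit`, remapping a slot first if the
   unit is not currently visible through the window. */
size_t window_touch(struct window *w, int unit)
{
    size_t s;
    assert(w != NULL && unit >= 0 && unit < DATA_UNITS);
    for (s = 0; s < WINDOW_SLOTS; s++)
        if (w->slot_unit[s] == unit)
            return s;                 /* already mapped: plain memory access */
    s = w->next_victim;               /* not mapped: evict and remap */
    w->slot_unit[s] = unit;
    w->next_victim = (s + 1) % WINDOW_SLOTS;
    w->remaps++;
    return s;
}
```

As long as the working set fits in the window, accesses cost nothing extra. Once it exceeds the window, every access to an unmapped unit pays for a remap, a cost that disappears when the entire dataset is directly addressable.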

SQL Server 2000 (64-bit) eliminates the mapping overhead, and uses direct virtual addressing. Like its 32-bit counterpart, a 64-bit pointer in SQL Server can directly address any region within user space, incurring only the overhead of page faults and processor cache misses. The special AWE support in SQL is simply disabled in the 64-bit versions.

There are unexpected benefits to using 64-bit addressing because not all SQL memory allocation is handled by the buffer manager, and these allocations are always subject to the 3GB virtual limit on 32-bit machines. For instance, internal heap memory, sorting memory, memory needed for recovery and transaction rollbacks, the execution plan, and user connectivity structures are allocated through standard methods. The competition between these special needs and the SQL Buffer Manager can lead to virtual memory resource contention even if there's a surplus of physical memory. As an example, on Itanium processors, SQL can support a far greater number of users. For DB administrators who have encountered problems due to constant database growth, Itanium processors provide a clean and simple solution.

Connectivity

Microsoft Data Access Components (MDAC) provides a whole family of protocols for resolving connectivity issues. MDAC includes the Microsoft SQL Server network libraries for TCP/IP, named pipes, and shared memory, in addition to SQLOLEDB and SQLODBC. After MDAC version 2.7, support for both 32-bit and 64-bit versions was established. 32-bit client connectivity to 64-bit SQL data stores is seamless, along with 64-bit to 64-bit connectivity.

SQL Server 2005

SQL Server 2005 is the next version of SQL Server. It's a major redesign, the first since 2000, and 64-bit support will be a cornerstone. SQL Server 2005 features a new system interface abstraction that will allow a wide variety of non-uniform memory access (NUMA) optimizations for which enterprise class Itanium architecture based platforms will be the major beneficiaries. For instance, SQL will be able to intelligently localize memory accesses, producing lower memory access times, lowering response times, and increasing throughput and overall performance. Over the next few years, expect NUMA specific features to make their way into midrange and low-end systems. Additional reliability, availability, and serviceability capabilities will be integral to SQL Server 2005: database pages will be tagged with a checksum that will help DB administrators identify the very rare but difficult to isolate errors originating on the system bus or IO subsystem. Hot installed memory support allows installation of physical system memory without system shutdown. The new memory will be allocated and used by SQL.

One of the more notable features of SQL Server 2005 is that it will be hosting the common language runtime (CLR), allowing SQL to run .NET enabled stored procedures. The 64-bit .NET Framework is on the verge of release and this means that .NET stored procedures will be write once, run anywhere, including Itanium processors where performance advantages will be immediately felt. Compatibility issues will be irrelevant; it will be possible to evaluate server platforms on price, performance, and RAS features. 64-bit virtual spaces will also eliminate virtual space constraints in SQL .NET AppDomains.

Itanium Processor-SQL Performance

On the same announcement date as Server 2003 and SQL64, Microsoft announced record-breaking benchmark results with its partners. The company hit the ground running with the very first release of a new 64-bit server OS with concrete proof of a well-designed, high performance product. More notable is the competitive environment: other architectures have been on the market for 10-15 years, and in the case of x86, well over 20. The announcement was Itanium processor's first real foray into the enterprise database space. Because performance for any specific architecture matures over several years due to additional software and platform tuning, these initial numbers are indicators of the shape of things to come.

Listed below are a few industry standard benchmarks involving both SQL and the Itanium processor that demonstrate how a 64-bit Itanium architecture server can be quickly integrated into existing data centers with excellent results.

Benchmark Results Comments
Siebel PSPP Scalability Benchmark 30,000 concurrent users resulting in 206,722 business transactions throughput/hour The Siebel benchmark was run on a SQL Server 2000 (64-bit) backend running on the Itanium processor based Unisys ES7000 Orion 130 Enterprise Server. 30,000 concurrent users is the best Siebel PSPP result, matching an IBM AIX/DB2 result on a 32P, 64-bit Power4 Server.
TPC-C (on-line transaction processing) 786,646 tpmC As of 8/28/2003, the HP Integrity Superdome result occupies the #2 overall, non-clustered results. Most notable is that this result is also the lowest of the top 10 throughput results as measured by price-performance, $1.82/tpmC less costly than the IBM Power 4 result. This result also used SQL Server 2000 (64-bit) as the backend.
SAP 2-tier SAP® Standard Application Sales and Distribution Benchmark Itanium 2-based NEC Express5800/1032 occupies a top spot with 2,750 SD benchmark users with an average dialog response time of 1.85 seconds. The achieved throughput was 278,300 fully business processed order line items per hour and 835,000 dialog steps per hour. Also achieved with SQL Server 2000 (64-bit) acting as a database backend.

Summary

The Itanium architecture is the foundation for a full 64-bit, Itanium architecture specific software stack produced with the tools also available in the Microsoft PSDK for software developers at large. Hosted on a native 64-bit OS, Microsoft SQL Server 2000 Enterprise Edition (64-bit) is a major advance and a touchstone for the capabilities of Itanium processors. SQL clearly proves in industry standard benchmarks the power of the architecture and Itanium processor's ability to blend into older 32-bit environments. Beneficial 64-bit examples abound. For instance, line of business (LOB) application performance (SAP and Siebel) can be constrained for large customers in the 32-bit space; these applications directly benefit from the large memory availability on Itanium processor platforms. On-line analytic processing (OLAP) cubes with dimensions with more than 1 million members and data warehousing with large queries and sorts benefit similarly. SQL Server 2000 (64-bit), however, is merely the shape of things to come. Future releases, like SQL Server 2005, will mark the true exploitation of Itanium processor capabilities, and SQL Server 2005 is the first point of interest in a joint hardware-software roadmap that already lays out the introduction of new technology that's going to make enterprise computing more robust, reliable, and powerful in just the next few years.