Introduction to Developing Applications for the 64-bit Itanium-based Version of Windows

 

Microsoft Corporation

Revised June, 2003

Applies to:
   Microsoft® Windows Server™ 2003

Summary: With the introduction of a 64-bit operating system, Microsoft has taken the Windows environment to the next level. The 64-bit version of Windows is an enterprise-level operating system able to run on high-end systems, such as the Intel® Itanium® platform, with 16 terabytes (TB) of address space and up to 64 processors. Migrating existing applications from the 32-bit version of Windows to the 64-bit version can be relatively painless, even allowing both versions to be built from the same code base. The existing facilities provided by the Windows API, together with the high-performing Intel Itanium platform, enable developers to leverage their existing knowledge of the Windows API to build scalable, enterprise-class applications. (19 printed pages)

Download SimpleWebServerSample.exe.

Contents

A Quick Overview of the Itanium
Introduction to the 64-bit Version of Windows
C/C++ Programming for the 64-bit Version of Windows
General Windows Programming
Conclusion
Resources

A Quick Overview of the Itanium

EFI: The Itanium's BIOS

The Itanium's BIOS is not your normal, familiar personal computer BIOS. The Extensible Firmware Interface (EFI) is a layer of abstraction that separates the operating system from the BIOS and hardware. There is the EFI shell, which is similar to a command prompt in Windows. In some aspects, the EFI shell is like having a mini-operating system built in. From the EFI shell you can access drives, including CD-ROMs, run executables, such as the setup program for Windows, or even perform simple text editing. System configuration data is stored in non-volatile memory, rather than on your hard drive, and can be configured through the EFI shell.

EPIC: Intel's In-Order Processor

The Itanium is an in-order processor, meaning that it executes instructions in the order they are provided. This is different from the average x86 processor, which reorders instructions in the pipeline when able, in an attempt to execute instructions in parallel. For the Itanium, the compiler must order instructions explicitly, detect interdependencies between them, and tell the processor which instructions can be executed in parallel. Intel's term for this design is EPIC: Explicitly Parallel Instruction-set Computing. It is the responsibility of the compiler to perform all optimizations; the processor will not reorder anything. This puts a lot more responsibility on the compiler, which is discussed later in this article.

Execution units

The Itanium has nine execution units, consisting of the following:

  • Two integer units
  • Two integer/load-store units
  • Two floating point units
  • Three branch units

The Itanium has a ten-stage pipeline responsible for fetching, decoding, and executing instructions, and can handle up to six instructions at once.

Registers

The Itanium has over 328 registers: 128 64-bit integer general-purpose registers, 128 82-bit floating point registers, 64 1-bit predicate registers, 8 branch registers, and a collection of other registers for various purposes, such as x86 backward compatibility (when running in x86 compatibility mode, the Itanium maps some of the x86 registers onto the 64-bit registers, while providing other registers that are used specifically for the processor's x86 mode).

To help manage its impressive set of registers, the Itanium has the ability to both frame and rotate the registers. The general-purpose registers are broken down into two groups: the first 32 registers are fixed and global. The remaining 96 registers are eligible for both framing and rotating.

Register Frames

The ALLOC instruction is used to set up a register frame. The register frame maps physical registers (hardware) onto logical registers (software), so that when calling a function, rather than pushing and popping all of the arguments, the compiler can allocate a range of registers for the subroutine, some of which may map onto the registers of the parent routine. The registers that overlap between the two are used to pass the parameters. This is much more efficient than pushing and popping the parameters onto the stack. Of course, the traditional method of pushing and popping parameters can still be used.

Since the first 32 registers are fixed, you cannot frame them. Therefore the maximum frame size is (the remaining) 96 registers. Also, only the integer registers can be framed, the floating point and predicate registers cannot.

Register Rotating

Registers can also be rotated or shifted one or more positions. This can help when unrolling loops, so that loops that operate on the same set of registers over and over can run at the same time, using different physical registers, without interfering with each other. With this option, the compiler can further improve instruction parallelism.

Instruction Set

An IA-64 instruction is 41 bits long. Seven bits are used to specify one of the 128 general-purpose registers, and with two source registers and one destination register, that equals 21 bits. Each instruction can specify one of the 64 predicate registers, adding another 6 bits. This puts us at 27 bits and we haven't even specified the actual operation code yet.

Instructions are packaged into 128-bit "bundles." That's three instructions (123 bits), plus a 5-bit template field. These bundles are then assembled into "groups." A group is a set of instructions that can theoretically execute at the same time; the instructions in a group have no interdependencies on each other. At compile time, the compiler must figure this out and group the bundles together. The processor will not double-check the compiler's work; the compiler must get it right. Groups can be arbitrarily long. A bit in the template field signifies the end of a group.

Bundles and groups are different. Bundles are the way the instructions are dispatched to the processor. The Itanium's bus and decode circuitry is 128 bits wide, which happens to be exactly three instructions (the Itanium actually dispatches two bundles at once). Groups are the logical way that instructions interact with each other.

For more information on the Itanium platform and the IA-64 architecture, visit the Intel Itanium Web site. And for information on developing software for the Itanium platform, see the Intel Itanium Developer Center.

Introduction to the 64-bit Version of Windows

It's Just the Windows API

The Microsoft® Windows Server™ 2003 64-bit platform does not require you to learn a new API in order to utilize the benefits of a 64-bit environment. There is no Win64 API; it's still the familiar Win32 API (more appropriately now called the Windows API). There are some new 64-bit-compatible data types, so there may be some minor code changes that you will need to make. The key is that all of the existing Win32 knowledge you have directly applies to the 64-bit version of Windows, and the majority of your code should compile for the 64-bit platform without changes. This also means that you can build both 32-bit and 64-bit versions of your code from a single code base, eliminating the overhead involved with maintaining two code bases.

There are some differences between the two operating system versions that are important to note though. Microsoft has done away with some legacy components, such as the Win16 sub-system—16-bit Windows code is not supported on the 64-bit version of Windows. The POSIX and OS/2 sub-systems are also not supported. There is a new sub-system though, called WOW64.

WOW64

WOW64 is short for Windows-32-on-Windows-64. It provides 32-bit emulation for existing 32-bit applications, enabling most 32-bit applications to run on the 64-bit version of Windows without modification. It is similar to the old WOW32 sub-system, which was responsible for running 16-bit code under the 32-bit version of Windows.

The hardware itself has a 32-bit compatibility mode, which handles actual execution of IA-32 instructions, but the WOW layer handles things like switching the processor between 32-bit and 64-bit modes and emulating a 32-bit system. For example, there are different registry hives for 32-bit and 64-bit programs. There is also a different system directory for 32-bit binaries. The 64-bit binaries still use the System32 directory, so when a 32-bit application is installed on the system, the WOW layer makes sure to put the 32-bit binaries in a new directory: SysWOW64. It does this by intercepting calls to APIs like GetSystemDirectory and returning the appropriate directory depending on whether the application is running under the WOW or not. The same issue can exist with the registry. Since both 32-bit and 64-bit COM servers can be installed on the system under the same class identifier (CLSID), the WOW layer needs to redirect calls to the registry to the appropriate 32-bit or 64-bit hives. The WOW layer also handles mirroring changes between some areas in the registry, in order to make it easier to support interoperability between 32-bit and 64-bit code.
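If your code needs to know whether it is running under WOW64, you can ask the system. The following minimal sketch resolves the IsWow64Process API dynamically, since older versions of Windows do not export it:

#include <windows.h>

typedef BOOL (WINAPI* LPFN_ISWOW64PROCESS)(HANDLE, PBOOL);

BOOL RunningUnderWow64()
{
    BOOL isWow64 = FALSE;

    // IsWow64Process is not exported by older kernels, so resolve it at run time.
    LPFN_ISWOW64PROCESS fnIsWow64Process = (LPFN_ISWOW64PROCESS)
        GetProcAddress(GetModuleHandle(TEXT("kernel32")), "IsWow64Process");

    if (fnIsWow64Process)
        fnIsWow64Process(GetCurrentProcess(), &isWow64);

    return isWow64;
}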

WOW64 is important because it allows you to leverage most of your existing 32-bit code when performance and scalability are not a concern. It is a best-of-both-worlds approach: you can port your service to 64-bit and leave your Microsoft Management Console (MMC) configuration snap-in 32-bit. The 64-bit version of Windows includes both 32-bit and 64-bit versions of MMC. When choosing to leave administration tools as 32-bit there may be some issues with inter-process communication, but protocols such as Remote Procedure Call (RPC) should work between 32-bit and 64-bit processes, as long as the interfaces are designed correctly. Another thing to keep in mind about WOW64 is that it is not designed for applications that require high performance. At the very least, the WOW64 sub-system needs to extend 32-bit arguments to 64 bits, and truncate 64-bit return values to 32 bits. In the worst case, the WOW64 sub-system will need to make a kernel call, involving not only a transition to the kernel, but also a transition from the processor's 32-bit compatibility mode to its native 64-bit mode. Applications won't be able to scale very well when run under WOW64. For those applications that you would like to leave as 32-bit, test them under WOW64. If the performance does not meet your expectations, you'll need to look at migrating the application to 64-bit.

WOW64 is implemented in user mode, as a layer between ntdll.dll and the kernel. WOW64 and some of its support DLLs are the only 64-bit DLLs that can be loaded into a 32-bit process. In all other cases, processes are kept pure: 32-bit processes cannot load 64-bit DLLs, and vice versa.

For more information about WOW64, see "64-bit Windows Programming - Running 32-bit Applications" in the Microsoft® Platform SDK.

Virtual Memory and Address Space

By default, the 32-bit version of Windows is limited to 4 gigabytes (GB) of address space, half of which is reserved for the kernel. This limits your average application to just 2 GB of virtual memory. 2 GB may seem like a lot, but that address space can easily be fragmented by bad allocation algorithms, large file mappings, or excessive use of DLLs. Just take a look at the 'VM Size' column in Task Manager and see how much virtual memory your average application consumes. Of course, just like the old DOS days (with XMS/EMS), there are methods that allow a 32-bit application to access more than 4 GB of physical memory: Physical Address Extensions (PAE) and Address Windowing Extensions (AWE). PAE extends the number of physical address bits from 32 to 36, allowing the system to address up to 64 GB of physical memory. AWE allows an application to map ranges of physical memory beyond 4 GB into its virtual address space. Both of these methods introduce overhead and added code complexity.

The 64-bit version of Windows provides 16 TB of address space, half of which is available to user-mode applications. This means entire databases can be moved into memory, significantly increasing performance, or whole Web sites can be cached in memory. It also enables code to reserve and commit huge contiguous blocks of virtual memory without really having to worry about virtual memory fragmentation, which allows for huge file-mapping objects or shared memory sections as well.
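For illustration, here is a minimal sketch of what that headroom allows; the sizes are arbitrary, and the reservation would fail outright on 32-bit Windows:

#include <windows.h>

int main()
{
    // Reserve 64 GB of contiguous address space (64-bit build only).
    const SIZE_T reserveSize = (SIZE_T)64 * 1024 * 1024 * 1024;
    void* region = VirtualAlloc(NULL, reserveSize, MEM_RESERVE, PAGE_READWRITE);
    if (!region)
        return 1;

    // Commit the first megabyte; commit more only as the data grows.
    VirtualAlloc(region, 1024 * 1024, MEM_COMMIT, PAGE_READWRITE);

    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}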

C/C++ Programming for the 64-bit Version of Windows

/Wp64: Getting the Compiler to Warn You of Potential Issues

The Microsoft® Visual C++® .NET 2002 compiler added the /Wp64 switch, which allows you to test your 32-bit code for 64-bit compatibility issues. The compiler will issue warnings about things like pointer truncation and improper casts. One of the first steps in porting your 32-bit application to the 64-bit version of Windows is to turn on this flag and compile your code like you normally would. The first time, expect to get several warnings. For example, take a look at this snippet of code:

DWORD i = 0;
size_t x = 100;

i = x; // warning C4267: '=' : conversion from 
    // 'size_t' to 'DWORD', possible loss of data.

On a 32-bit platform this code compiles cleanly, because size_t is 32 bits, but on a 64-bit platform size_t is a 64-bit integer. With /Wp64 enabled, the compiler will warn you about situations like this.

Additional examples:

void func(DWORD context)
{
  char* sz = (char*)context; // warning C4312: 'type cast' :
                // conversion from 'DWORD' to
                // 'char *' of greater size
  // Do something with sz..
}

char* string = "the quick brown fox jumped over the lazy dog.";

func((DWORD)string); // warning C4311: 'type cast' :
           // pointer truncation from 'char *' 
           // to 'DWORD'

Once you fix these warnings, retest your 32-bit code to make sure it continues to work as expected. Your 32-bit and 64-bit binaries should build from the same code base; this is a key concept for writing Windows applications moving forward. You need to think about both 32-bit and 64-bit issues from the beginning, and code your application to work for both platforms.
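For illustration, here is one way the snippets above might be fixed so that they compile cleanly for both platforms, using size_t consistently and the DWORD_PTR type introduced in the next section:

#include <windows.h>   // defines DWORD_PTR (via basetsd.h)

size_t i = 0;          // match the type of x instead of truncating to DWORD
size_t x = 100;

void func(DWORD_PTR context)   // DWORD_PTR is always as wide as a pointer
{
  char* sz = (char*)context;   // safe on both 32-bit and 64-bit platforms
  // Do something with sz..
}

void caller()
{
  i = x;  // no warning: both sides are size_t

  char* string = "the quick brown fox jumped over the lazy dog.";
  func((DWORD_PTR)string);     // no pointer truncation
}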

New Data Types

The 64-bit version of Windows uses the LLP64 data model, meaning that the standard C types int and long remain 32-bit integers, the data type size_t maps to the processor's word size (32 bits for IA-32 and 64 bits for IA-64), and __int64 is a 64-bit integer. This was done to assist in porting 32-bit code: you can keep the same code base for both the 32-bit and 64-bit versions of your application.

There is another data model called LP64, which maps the standard C type long to a 64-bit integer while int remains a 32-bit integer. This data model is common on Unix platforms, but it can make it harder to create both 32-bit and 64-bit versions of your application from a single code base. You might notice a common theme here: you should be able to build both versions of your application from a single code base. If you can't do that, you may want to revisit your design. Having a single code base is a huge win, especially if you plan to ship both versions.
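A quick way to see the LLP64 rules in action is to print the sizes of the basic types. The expected values are shown in the comments (this sketch assumes a Microsoft compiler, which provides __int64):

#include <stdio.h>

int main()
{
    //                                              32-bit   64-bit
    printf("int     %u\n", (unsigned)sizeof(int));     // 4        4
    printf("long    %u\n", (unsigned)sizeof(long));    // 4        4
    printf("size_t  %u\n", (unsigned)sizeof(size_t));  // 4        8
    printf("void*   %u\n", (unsigned)sizeof(void*));   // 4        8
    printf("__int64 %u\n", (unsigned)sizeof(__int64)); // 8        8
    return 0;
}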

Polymorphic Types

Since the Win32 API is targeted towards C, there are a lot of cases where you need to cast an integer to a pointer and vice versa. This isn't a problem on 32-bit hardware, where the size of a pointer and the size of an integer are the same, but on 64-bit hardware it's another story. This is where the polymorphic types come into play.

For specific precisions, you can use the fixed-precision data types. Their sizes are consistent, regardless of the word size of the processor. Most of these types contain the precision in their name, as can be seen in the following table:

Table 1. Fixed-Precision Data Types

Type      Definition
DWORD32   32-bit unsigned integer
DWORD64   64-bit unsigned integer
INT32     32-bit signed integer
INT64     64-bit signed integer
LONG32    32-bit signed integer
LONG64    64-bit signed integer
UINT32    Unsigned INT32
UINT64    Unsigned INT64
ULONG32   Unsigned LONG32
ULONG64   Unsigned LONG64

Alternatively, when you need a data type whose precision varies with the word size of the processor, use the pointer-precision data types. These are also known as the "polymorphic" data types. These types generally end in the _PTR suffix, as can be seen in the following table:

Table 2. Pointer-Precision Data Types

Type        Definition
DWORD_PTR   Unsigned long type for pointer precision
HALF_PTR    Half the size of a pointer; use in a structure that contains a pointer and two small fields
INT_PTR     Signed integral type for pointer precision
LONG_PTR    Signed long type for pointer precision
SIZE_T      The maximum number of bytes to which a pointer can refer; use for a count that must span the full range of a pointer
SSIZE_T     Signed SIZE_T
UHALF_PTR   Unsigned HALF_PTR
UINT_PTR    Unsigned INT_PTR
ULONG_PTR   Unsigned LONG_PTR
LPARAM      Synonym for LONG_PTR (defined in WTypes.h)
WPARAM      Synonym for UINT_PTR (defined in WTypes.h)

All of the Win32 APIs that passed parameters or context information via integer arguments have been changed to use these new types. A good example is the SetWindowLong and SetWindowLongPtr functions:

The old way:

LONG SetWindowLong(
    HWND hWnd, 
    int nIndex, 
    LONG dwNewLong);

The new, polymorphic way:

LONG_PTR SetWindowLongPtr(
      HWND hWnd, 
      int nIndex, 
      LONG_PTR dwNewLong);

Notice that the xxxPtr version of the function uses the new polymorphic types. It's fairly common for developers to store context information for a window by storing a pointer in the window's extra data area. Any code that stores a pointer on the 32-bit version of Windows by using the SetWindowLong function, must be changed to call SetWindowLongPtr. The change is simple and quick, as are most of the changes required to use the polymorphic types.
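For illustration, a minimal sketch of the pattern; MyWindowData is a hypothetical application-defined structure:

#include <windows.h>

struct MyWindowData { int value; };   // hypothetical per-window context

LRESULT CALLBACK WindowProc(HWND hWnd, UINT uiMsg, WPARAM wParam, LPARAM lParam)
{
    if (uiMsg == WM_NCCREATE)
    {
        // The pointer handed to CreateWindowEx arrives in CREATESTRUCT.
        CREATESTRUCT* cs = (CREATESTRUCT*)lParam;
        SetWindowLongPtr(hWnd, GWLP_USERDATA, (LONG_PTR)cs->lpCreateParams);
    }

    // GWLP_USERDATA is pointer-sized on both platforms.
    MyWindowData* data = (MyWindowData*)GetWindowLongPtr(hWnd, GWLP_USERDATA);
    // ... use data to handle messages for this particular window ...

    return DefWindowProc(hWnd, uiMsg, wParam, lParam);
}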

Two other good examples are WindowProc and GetQueuedCompletionStatus:

LRESULT CALLBACK WindowProc(
          HWND hWnd, 
          UINT uiMsg, 
          WPARAM wParam, 
          LPARAM lParam);

BOOL GetQueuedCompletionStatus(
          HANDLE hCompletionPort, 
          LPDWORD lpNumberOfBytes,
          PULONG_PTR lpCompletionKey,
          LPOVERLAPPED* lpOverlapped,
          DWORD dwMilliseconds);

WindowProc uses LPARAM, which is a polymorphic type, and GetQueuedCompletionStatus uses ULONG_PTR, which is also a polymorphic type. This allows existing code, which assumes that the size of an integer is the same as the size of a pointer, to continue to work with very little modification.

New Optimization Modes for the Compiler: PoGO and LTCG

The compiler included with Microsoft® Visual Studio® .NET 2002 contains two new optimization modes: Link Time Code Generation (LTCG, also known as Whole Program Optimization) and Profile Guided Optimization (PoGO). Code optimization is more important on the Itanium processor than it was on the x86 platform, because the compiler assumes all of the responsibility for producing efficient code. Both new optimization modes increase build times and require good test scenarios, especially PoGO, due to its need to capture profiling data. LTCG allows the linker to perform optimization across module boundaries, generating code during the link phase to produce more efficient binaries through better inlining or even custom calling conventions. PoGO allows the compiler to optimize according to usage patterns, and requires a two-phase build process. During the first phase, the binary is instrumented to gather profiling data. During the second phase, the profiling data is analyzed and used to guide optimization.

Miscellaneous Performance

Today's compilers are great at producing highly optimized code, and on the Itanium processor the compiler is responsible for a lot more. Since the Itanium is an in-order processor, the compiler must perform optimizations like rearranging instructions so that they can execute in parallel. The predicate registers also give the compiler more freedom in optimizing branching. Using the predicate registers, the compiler can do away with a branch altogether and use the predicate field of the instruction to control whether or not an instruction actually executes. This is good for performance because rather than jumping over a small block of code and invalidating the instruction prefetch, the compiler can tell the processor to conditionally ignore the instructions.

Alignment

It is important to think about alignment on the Itanium processor. Pointers must be aligned on 64-bit boundaries or you will get an alignment exception. You can use the UNALIGNED keyword to allow an unaligned pointer dereference, but you will take a large performance hit. You generally want to let the compiler handle alignment of structures using the #pragma pack directive. This allows you to specify a different alignment for 32-bit and 64-bit (at compile time) instead of manually aligning your structures in code.
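A brief sketch of both techniques; the structure and its fields are illustrative:

#include <windows.h>

#pragma pack(push, 4)      // pack on 4-byte boundaries for an on-disk format
struct OnDiskRecord
{
    DWORD   id;
    DWORD   flags;
    DWORD   sequence;
    __int64 timestamp;     // lands at offset 12: misaligned for a 64-bit load
};
#pragma pack(pop)

// On the Itanium, dereferencing a misaligned field faults unless the
// pointer is declared UNALIGNED; expect a significant performance cost.
__int64 ReadTimestamp(OnDiskRecord UNALIGNED* record)
{
    return record->timestamp;
}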

General Windows Programming

The following information provides practical guidelines to keep in mind when designing for scalability; an introduction to some of the exciting features built into Windows Server 2003 that you can take advantage of; and, a discussion of a few patterns to avoid.

Performance and Scalability

In order to scale, you have to know what that means for your particular scenario. For example, for a Web server, scalability can mean the ability to serve pages in relation to the number of users connected. Think of it in terms of a line graph.

Figure 1. Linear Scaling Example

As the number of users increases, the number of pages per second should also increase. The above graph shows linear scaling: as the number of users triples, so does the number of pages served per second.

Another definition of scaling is in terms of hardware. If I double the number of processors on my system, will the throughput of my Web server double too? The same question applies to RAM, disks, and other resources. Applications need to be designed with this in mind: the number of threads you create should be based on the number of processors in the system and the type of work each thread is doing, the amount of memory used for caching Web page content should be proportional to the amount of RAM available to the application, and so on. This concept is generally termed scaling up. If I build my box bigger and bigger, can I produce more and more?

The other form of scaling is when you are talking about distributed computing or server farms. This is generally termed scaling out. If I double the number of machines in my server farm, will my throughput double as well?

These scenarios need to be taken into account when designing a scalable system. Modern hardware is getting bigger and bigger (Windows on the Itanium supports machines with up to 64 processors), so scaling up needs to be in the forefront of the developer's mind. This is especially true since it is possible for your graph to flatten out, and even start to decline, as you increase resources. If one part of the system cannot scale, it can have a negative impact on the system as a whole.

Threads: How to Use Them Effectively

Dividing your work between threads can simplify your code, and on multiple processor systems can make your code more efficient, but it can also destroy performance and scalability if you don't know what you're doing. For example, if all of the threads in your application need to acquire the same global critical section, the contention for that critical section can have your threads spending most of their time sleeping. It can also cause excessive context switching to occur, which can cause your application to spend a significant portion of its processing time in the system kernel, not even running your code. These issues can be especially bad on multi-processor systems, where your extra processors can end up sitting idle, waiting for access to the shared data.

The ideal number of threads to use is equal to the number of processors in the system. If your threads are independent and processor-bound then they should be able to consume their entire timeslice each time. If you have threads that may perform blocking operations, then you might want to increase the number of threads so that when one is sleeping, another can take its place. You'll want to identify where your threads can block and how often they will block. With this in mind, you can get an idea of how many threads you should have running. You always want to have a thread ready to go for each processor. Otherwise you are wasting processing power. Of course, these are only guidelines, and the only way to be sure that your application is running as efficiently as possible is to profile and test it.
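As a starting point, you can size your thread pool from the processor count at run time; the multiplier here is an illustrative knob, not a rule:

#include <windows.h>

// One thread per processor for CPU-bound work; pass a multiplier
// greater than 1 when threads can block.
DWORD ChooseThreadCount(DWORD threadsPerProcessor)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors * threadsPerProcessor;
}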

Asynchronous I/O: Don't Block Waiting For Your Data

Windows systems based on the NT kernel support asynchronous I/O, also known as overlapped I/O. Most forms of I/O can be done asynchronously, including file I/O and network I/O. For file I/O, you use the ReadFile/WriteFile APIs. By opening the file with the FILE_FLAG_OVERLAPPED flag and specifying an OVERLAPPED structure when reading/writing, the system will notify you when the I/O has completed, allowing you to do other work while waiting. For network I/O using Windows Sockets (WinSock), you create the socket using the WSASocket API with the WSA_FLAG_OVERLAPPED flag, and then specify either an OVERLAPPED structure or a callback function when you call the WSARecv/WSASend APIs. Async I/O is especially effective when you are writing a network server. You can 'queue' multiple receive requests and then go to sleep waiting for one of them to complete. When one completes, you process the incoming data and then 'queue' another receive. This is much better than polling for data using the select API, and it uses system resources much more efficiently.

There are several options for waiting for an asynchronous I/O request to complete:

Calling the GetOverlappedResult API

After issuing your async I/O request, you can use the GetOverlappedResult API to poll the status of your request, or to just wait for the request to complete. When the request completes, GetOverlappedResult will return the number of bytes that were transferred as part of the request.
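A minimal sketch of this pattern, assuming the file handle was opened with FILE_FLAG_OVERLAPPED:

#include <windows.h>

BOOL ReadChunk(HANDLE file, void* buffer, DWORD size, DWORD* bytesRead)
{
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset event

    BOOL ok = ReadFile(file, buffer, size, NULL, &ov);
    if (!ok && GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(ov.hEvent);
        return FALSE;   // the request failed outright
    }

    // ... do other useful work here while the I/O is in flight ...

    // TRUE = block until the request completes.
    ok = GetOverlappedResult(file, &ov, bytesRead, TRUE);
    CloseHandle(ov.hEvent);
    return ok;
}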

Using the HasOverlappedIoCompleted Macro

You can use the HasOverlappedIoCompleted macro to efficiently poll whether or not the request associated with the OVERLAPPED structure has completed. Once the request has completed, you can use the GetOverlappedResult API to get more information about the request (such as the number of bytes transferred).

Specifying an Event in the OVERLAPPED Structure

By specifying an event in the hEvent field of the OVERLAPPED structure, you can perform your own polling, or wait for the request to complete, by passing the event to WaitForSingleObject or WaitForMultipleObjects. The kernel will signal the event when the overlapped operation completes.

Binding the Kernel Object to an I/O Completion Port

I/O completion ports are an extremely useful tool provided by the system. See the following section for information. For an event-driven system, such as a network server waiting for input, I/O completion ports provide the perfect mechanism to wait for and handle incoming events.

I/O Completion Ports: Event Driven I/O

Most Windows developers are familiar with window messages and message queues. Think of I/O completion ports as high-performance, highly-scalable super message queues. If you have an event-driven system, you need to be using completion ports. Completion ports are designed from the ground-up to provide performance. If you are writing code from scratch, you should absolutely be using I/O completion ports. They require some experimenting to get just right, but once you're familiar with how they work, they are pretty simple to use. If you are porting an application from another system or a code base that uses synchronous I/O then you have some work ahead of you, but the benefits are well worth the effort.

You create a completion port using the CreateIoCompletionPort API. This is also the API you use to associate a kernel object with the completion port. Once a file handle or socket handle is associated with a completion port, all I/O requests that complete on that handle will be queued to the completion port.
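A minimal sketch of both uses of the API, creating a port and then associating a device handle and completion key with it:

#include <windows.h>

HANDLE CreatePortAndAssociate(HANDLE device, ULONG_PTR key)
{
    // NumberOfConcurrentThreads = 0 lets the system use the processor count.
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    if (!port)
        return NULL;

    // Associating a handle uses the same API, passing the existing port.
    // The key is returned with every notification for this handle.
    if (!CreateIoCompletionPort(device, port, key, 0))
    {
        CloseHandle(port);
        return NULL;
    }
    return port;
}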

Notifications queued to a completion port are processed in first-in-first-out (FIFO) order. You can also queue custom notifications to a completion port using the PostQueuedCompletionStatus API. Using this custom notification method is a good way to signal your threads to shut down, or to inject any other custom, external event. In the following sample code, PostQueuedCompletionStatus is used to tell the worker threads to exit:

HRESULT StopCompletionThreads()
{
  // Tell the worker threads to shut down
  for (size_t i = 0; i < COMPLETION_THREAD_COUNT; i++)
  {
    assert(g_completionThreads[i]);
    PostQueuedCompletionStatus(g_completionPort, 0, 0, NULL);
  }

  // Wait for the threads to shutdown
  WaitForMultipleObjects(
    COMPLETION_THREAD_COUNT, 
    g_completionThreads, 
    TRUE, 
    INFINITE);

  // Close the handle for each thread
  for (size_t i = 0; i < COMPLETION_THREAD_COUNT; i++)
  {
    CloseHandle(g_completionThreads[i]);
    g_completionThreads[i] = NULL;
  }

  return S_OK;
}

Notice that zero is passed for the dwNumberOfBytesTransferred and dwCompletionKey parameters, and NULL for the OVERLAPPED parameter. This combination of values is what the worker thread checks for in order to shut down:

UINT __stdcall CompletionThread(PVOID param)
{
  BOOL      result      = FALSE;
  OverlappedBase* overlapped    = NULL;
  ULONG_PTR    key        = 0;
  DWORD      numberOfBytes   = 0;

  for (;;)
  {
    result = GetQueuedCompletionStatus(
          g_completionPort, 
          &numberOfBytes, 
          &key, 
          (OVERLAPPED**)&overlapped, 
          INFINITE);
    if (result)
    {
      if (numberOfBytes == 0 && key == 0 && !overlapped)
        break;

      OverlappedCallback callback = 
          overlapped->callback;

      callback(
        NO_ERROR, 
        numberOfBytes, 
        key, 
        overlapped);
    }
    else
    {
      if (overlapped)
      {
        OverlappedCallback callback = 
          overlapped->callback;

        if (callback)
        {
          callback(
            GetLastError(), 
            numberOfBytes, 
            key, 
            overlapped);
        }
      }
    }
  }

  return 0;
}

At the heart of the I/O completion method is the OVERLAPPED structure. The OVERLAPPED structure contains context information specific to each I/O request. Typically, the structure is extended to add your own context information. You get access to this structure (and therefore your context data) when processing the completion notification.

Extend the OVERLAPPED structure either by inheriting from it or by including it as the first field of your own structure, such as in the following:

//C++
struct OverlappedBase : public OVERLAPPED
{
   OverlappedCallback   callback;
};

Or

//C
struct OverlappedBase
{
   OVERLAPPED         overlapped;
   OverlappedCallback   callback;
};

The OVERLAPPED structure contains the following fields:

typedef struct _OVERLAPPED {
   ULONG_PTR   Internal;
   ULONG_PTR   InternalHigh;
   DWORD      Offset;
   DWORD      OffsetHigh;
   HANDLE   hEvent;
} OVERLAPPED;

The Offset and OffsetHigh fields are used to specify the offset when reading or writing to a file. The Internal field contains the status (or error) of the operation. The InternalHigh field contains the number of bytes transferred as part of the I/O request. The Internal and InternalHigh fields are not valid until GetOverlappedResult returns TRUE (or a completion notification is queued to the completion port).

The structure can be extended to include any extra fields you may need. However, keep in mind that the structure must remain available for the lifetime of the I/O request.

The following snippet shows how the OVERLAPPED and OverlappedBase structures are extended for network I/O operations:

#define SOCKET_BUFFER_SIZE     128

#define SOCKOP_ACCEPT        1
#define SOCKOP_RECV         2
#define SOCKOP_SEND         3

struct SocketOverlapped
{
   OverlappedBase  base;
   int        op;
   SOCKET      sock;
   DWORD       numberOfBytes;
   DWORD       flags;
   BYTE       buffer[SOCKET_BUFFER_SIZE];
};

This allows the per-request information to be stored with every I/O request initiated. The op field stores the operation being initiated: accept, receive, or send. The numberOfBytes field contains the count of bytes in the buffer field that are valid (either for sending or receiving).

Divide and Conquer: Let Threads Work Independently

The bane of scalability is contention. When one thread has to wait on another thread, for example, to acquire a lock, then that thread is wasting time, and work that could potentially be done must wait. This leads into thread affinity and Non-Uniform Memory Access (NUMA). If your processing can be partitioned between threads (with no real dependencies between threads), then you can lock each thread onto its own processor. On NUMA systems you can also partition the memory that each thread uses so that the memory is local to the NUMA node.

Thread Affinity

Windows Server 2003 allows you to specify which processors a thread is allowed to run on. This is known as setting the processor affinity of the thread. You use the SetThreadAffinityMask function to set the affinity of a particular thread (the function returns the thread's previous affinity mask). Setting the affinity of a thread is useful in reducing inter-processor bus traffic: when a thread moves from one processor to another, the data in the old processor's cache must effectively be rebuilt on the new one, so a thread bouncing around between processors can cause performance problems. Also, some systems allow you to bind specific device interrupts to a specific processor. In your software you can "bind" a particular thread to that processor and issue/process all I/O for that device from that thread, thereby increasing the potential concurrency of the system (that is, spreading highly active devices like network cards between multiple processors).
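A minimal sketch of pinning the calling thread to a single processor:

#include <windows.h>

// SetThreadAffinityMask returns the previous mask, or zero on failure.
BOOL PinCurrentThreadToProcessor(DWORD processorIndex)
{
    DWORD_PTR mask = (DWORD_PTR)1 << processorIndex;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}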

NUMA

NUMA stands for Non-Uniform Memory Access. On traditional Symmetric Multi-Processing (SMP) systems, all of the processors in the system have equal access to the entire range of physical memory. The problem with traditional SMP systems is that the more processors and memory are added, the higher the bus traffic becomes, which is prohibitive to performance. On a NUMA system, the processors are grouped into smaller systems, each with its own "local" memory. Accessing "local" memory is cheap, while accessing memory on another node can be expensive. Windows will attempt to schedule threads on the same node as the memory they are using, but you can help Windows by using the NUMA functions to improve thread scheduling and memory usage. Use the following functions to determine which processors belong to which node, and to set a thread's affinity for a particular processor/node (a short sketch follows below):

GetNumaHighestNodeNumber

GetNumaProcessorNode

GetNumaNodeProcessorMask

SetThreadAffinityMask

Also, applications that make extensive use of memory can use the following function to improve their memory usage on a NUMA system:

GetNumaAvailableMemoryNode
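For illustration, a minimal sketch that pins the calling thread to the processors of a single NUMA node, so that its memory accesses stay node-local (a 64-bit build is assumed so the 64-bit processor mask fits in DWORD_PTR):

#include <windows.h>

BOOL PinCurrentThreadToNode(UCHAR node)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode) || node > highestNode)
        return FALSE;

    // Get the set of processors that belong to this node.
    ULONGLONG processorMask = 0;
    if (!GetNumaNodeProcessorMask(node, &processorMask))
        return FALSE;

    // Restrict the calling thread to that node's processors.
    return SetThreadAffinityMask(GetCurrentThread(),
                                 (DWORD_PTR)processorMask) != 0;
}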

It is important to start thinking about NUMA and large multi-processor systems, and to design for them from the start. Most initial 64-bit deployments will be on large multi-processor systems with more than eight processors, running huge enterprise applications such as SAP. NUMA is critical for overall scalability and performance.

WinSock Direct

In large datacenters, traffic between servers can exceed the bandwidth of traditional TCP/IP-based networks. System Area Networks (SANs) were designed to solve this problem by offloading some of the network protocol processing from the CPU. This benefits server applications by providing faster communications between servers, thereby improving the performance of scale-out solutions. Most SANs require the application to program directly to the vendor's API, resulting in very few applications that are available for deployment in a SAN environment. Microsoft developed Windows Sockets (WinSock) Direct to provide a common programming interface to the lower-level SAN implementations. WinSock Direct sits underneath the standard WinSock API, but bypasses the kernel networking layers to talk directly to the SAN hardware. Because WinSock Direct sits underneath the existing WinSock API, IT departments can deploy applications in a SAN environment without requiring modifications to the application.

SANs provide reliable, in-order delivery of data through two standard transfer modes: Messages and Remote Direct Memory Access (RDMA). Messages are more like your traditional networking protocols, where packets are sent to a peer, and the peer requests packets from the network. RDMA allows the destination buffer of the packet to be specified.

In general, SAN hardware will implement most of its data transfer capabilities directly in the hardware. This allows the SAN implementation to do things such as bypass the kernel. Processing that is normally provided by the kernel is offloaded directly into the hardware.

WinSock Direct removes the need for applications to program directly to SAN specific APIs. By just installing SAN hardware and the WinSock Direct drivers for the hardware, existing applications can make use of the higher-performance offered by SANs.

The following resources provide additional information on thread affinity, NUMA, and WinSock Direct.

DLLs, Processes, and Threads: Multiple Processors

NUMA Support

WinSock Direct: The Value of System Area Networks

WinSock Direct and Protocol Offload on SANs

Windows Sockets 2.0: Write Scalable WinSock Apps Using Completion Ports

Security

Security should be at the forefront of every developer's mind when writing software. One of the most important tools when writing secure code is in-depth knowledge about the APIs that you are using. APIs that deal with null-terminated strings are a good example to explore. Most APIs will null-terminate for you, but not all of the time. For example, take this excerpt from the remarks section in the Visual Studio .NET product documentation for _snprintf, _snwprintf:

The _snprintf function formats and stores count or fewer characters and values (including a terminating null character that is always appended unless count is zero or the formatted string length is greater than or equal to count characters) in buffer.

Basically, this is saying that if the destination buffer is not large enough, or is exactly the right size, _snprintf will not null-terminate it. This is an incredibly important point. If you have a 1024-character buffer on the stack, and the input from the user is exactly 1024 characters, then the buffer will not be null-terminated, and any code that assumes the buffer is always null-terminated (and less than 1024 characters) is a buffer overflow waiting to happen. The bottom line is: know your APIs. For documentation of Windows Server 2003 APIs, see the Platform SDK on MSDN.
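A defensive sketch of the pattern: reserve the final byte, terminate explicitly, and treat truncation as an error (FormatName and userInput are illustrative):

#include <stdio.h>

void FormatName(const char* userInput)
{
    char buffer[1024];

    // _snprintf does not null-terminate when the output exactly fills
    // (or overflows) the count, so reserve a byte and terminate explicitly.
    int written = _snprintf(buffer, sizeof(buffer) - 1, "%s", userInput);
    buffer[sizeof(buffer) - 1] = '\0';

    if (written < 0)
    {
        // The output was truncated; handle that case deliberately.
    }
}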

All user input should be validated for correctness, whether that means validating against an expected format or validating against a maximum length, and validation should be performed as early as possible. Also, any error messages generated from validation should be careful not to give away too much information. Even though you should always assume that a hacker has access to your source code (and design for security with that in mind), you don't want to make the hacker's job any easier. An error message like "The buffer cannot exceed 1024 characters" tells the hacker right away that 1024 characters is likely the size of any stack buffers that may hold the data. Furthermore, an error message like that means nothing to the average user, and will only cause confusion.

Your application should be designed "secure by default." This means that you need to evaluate your features and their potential security risk. Do not underestimate the ingenuity of a determined hacker. Even if you think a feature is safe, turn it off if it is not essential, and give your users the information they need to decide whether or not to enable it. Developers can make mistakes; you do not want to take chances with your customers' systems and data.

Also, when thinking about security, there is no need to reinvent the wheel. The Windows API provides many useful APIs for performing operations like access checks, cryptography, and storing sensitive data.

If you create named kernel objects, like a named mutex or shared memory section, use Access Control Lists (ACLs) to secure access to the object. Use ACLs on any files that are created (or create the file in a directory that already has the proper ACLs). Think about where you create registry keys too. Should the user be able to change them? Should they be machine-wide? Should an administrator be able to lock down access to them? Remember that user-specific settings should go in HKEY_CURRENT_USER, while machine-wide settings need to go in HKEY_LOCAL_MACHINE.
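As one illustrative sketch, the SDDL helper APIs can build a security descriptor from a string; the descriptor below, which grants full access to administrators only, is an example rather than a recommendation:

#include <windows.h>
#include <sddl.h>

HANDLE CreateSecuredMutex(LPCTSTR name)
{
    SECURITY_ATTRIBUTES sa = { sizeof(sa), NULL, FALSE };

    // D:(A;;GA;;;BA) - allow generic-all access to Builtin Administrators.
    if (!ConvertStringSecurityDescriptorToSecurityDescriptor(
            TEXT("D:(A;;GA;;;BA)"), SDDL_REVISION_1,
            &sa.lpSecurityDescriptor, NULL))
        return NULL;

    HANDLE mutex = CreateMutex(&sa, FALSE, name);
    LocalFree(sa.lpSecurityDescriptor);
    return mutex;
}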

If you need to store sensitive user data, such as username/password combos or personal information, use the operating system-provided cryptographic providers to encrypt the data. Some simple functions to get you started are CryptProtectData and CryptUnprotectData.
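A minimal sketch of sealing a secret with CryptProtectData; the description string is arbitrary, and you link with crypt32.lib:

#include <windows.h>
#include <wincrypt.h>   // CryptProtectData (dpapi.h in newer SDKs)

BOOL ProtectSecret(const BYTE* secret, DWORD secretSize, DATA_BLOB* sealed)
{
    DATA_BLOB input;
    input.pbData = (BYTE*)secret;
    input.cbData = secretSize;

    // The key is derived from the logged-on user's credentials;
    // call CryptUnprotectData later to recover the plaintext.
    // The caller frees sealed->pbData with LocalFree when done.
    return CryptProtectData(&input, L"my secret", NULL, NULL, NULL, 0, sealed);
}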

Conclusion

As you have seen, the process for migrating your applications to a code base that compiles for both the 32-bit and 64-bit versions of Windows is relatively straightforward. There are some scenarios to consider in the form of pointers, truncation, and data alignment. However, these issues are not insurmountable, merely challenging. Keep in mind that in order to build applications that scale well, it is important to consider your use of threads and threading models very carefully; poor thread execution is a scaling killer. For additional resources to enable you to be successful in your work with Windows Server 2003 and the 64-bit operating environment, see the resources listed below.

Resources

Microsoft® Platform SDK

Platform SDK download from the Platform SDK Update Site

Microsoft® Windows® Driver Development Kit documentation

Windows Driver Development Kits download site

Windows Server 2003 Installable File System (IFS) Kit

Getting Ready for 64-bit Windows

Windows Server 2003 Developer Center

Visual Studio .NET Developer Center

.NET Framework Developer Center

Intel Itanium Developer Center

Microsoft TechNet: Windows Server 2003 Resources

Microsoft Product Support Services: Windows Server 2003 Support Center