UNDER THE HOOD

Optimizing DLL Load Time Performance

Matt Pietrek

Download the code for this article: UnderTheHood0500.exe (264KB)

Over the years, one of the dominant themes in my columns and seminars has been the benefits of techniques that optimize your executables. Typically this means basing and binding, but might also include importing functions by ordinal or changing executable page alignment. Intuitively, these strategies should make your code load faster. However, I've always had a nagging question about this topic—namely, just how much of an improvement can be expected with these techniques? This month, I've put my money where my mouth is and come up with some concrete numbers.
In this column, my goal is to measure the load time speed effects of several scenarios:

  • A program using DLLs that have conflicting load addresses versus having each DLL load at its preferred address. This tests the effects of the REBASE program.
  • A program that imports hundreds of APIs the normal way versus having all the APIs bound to their target DLL. This tests the effects of the BIND program.
  • Importing functions by ordinal rather than by name. Doing this requires some work on your part, typically creating and maintaining a .DEF file.

Shortly, I'll explain why the second scenario depends on the first. That is, binding your executables against DLLs that aren't loading at their preferred address is a waste of time. Binding assumes that you have a properly based system where DLLs load according to their header specification.

A Quick Review of Load Time Performance Tuning

Before getting into the performance improvement details, a quick synopsis of basing, binding, and importing by ordinal is in order. Basing is a good place to start. When creating a DLL, the linker assumes that the DLL will load at a particular address. Certain pieces of the code and data contain hardcoded addresses that are only correct if the DLL loads at the preferred address. However, at runtime it's possible that the operating system may have to load the DLL at a different memory location.

To handle the situation where the OS has to move the DLL, the linker adds base relocations to the DLL. Base relocations are addresses that require modification so that they contain the correct address for where the DLL loaded in memory. The more base relocations a DLL has, the more time the OS needs to process them and to load the DLL. A properly based DLL loads at its preferred address, and can skip processing the base relocation records.

When not given explicit directions, the Microsoft® linker creates each DLL with a preferred load address of 0x10000000. If your program uses more than one DLL of your own devising, you'll have multiple DLLs with the same preferred load address. The result is that every DLL but the first one will be relocated by the operating system at load time. This is sometimes referred to as a load address collision. However, if you intervene you'll be able to prevent this from happening.

The REBASE program that comes with Microsoft Visual Studio® and the Platform SDK is a handy tool for getting rid of load address collisions. You supply REBASE with a list of all the modules that make up your program (not counting system DLLs), and it picks new load addresses for the DLLs and modifies them accordingly.

Binding an executable builds on the premise that all DLLs can be tweaked to load at their preferred address. When you import a function from a DLL, the information necessary for the Windows® loader to find the imported function is stored in your executable. Typically, this information is the imported DLL's name and the name of the imported function. When the loader resolves an imported function, it's essentially executing the same code that GetProcAddress uses.

Normally at startup time, the loader spins through all the imported functions and looks up their addresses. However, if the imported DLLs don't change from run to run, the addresses that the loader gets back don't change either. An easy optimization is to write the target function's address to the importing executable, which is exactly what the BIND program does.

Normal Win32® executables have two identical copies of the information needed to look up an imported function. One is called the import address table (IAT), while the other is called the import name table. However, only one copy (the IAT) is required by the Win32 loader. The BIND program takes advantage of the fact that there are two copies of this information and overwrites the IAT entries with the imported function's actual addresses.

At load time, the loader checks to see if everything is kosher, and if so, uses the address that BIND has stored in your IAT. This eliminates the need to look up the function by its name. What if something is different than when the executable was linked? For instance, perhaps the imported DLL got loaded elsewhere. In this case, the loader uses the import name table information to do a normal lookup.

BIND.EXE is the most well-known way to bind an executable. However, it optimizes your executables based upon your system DLLs. If you distribute your program to users, they probably will have different system DLLs, so you'll want to bind your executables on their system. The Windows Installer has the BindImage action, which looks pain-free to use (although I must confess I've never written an installation script). Alternatively, you can use the API BindImageEx that's part of IMAGEHLP.DLL.

The final performance optimization under the microscope this month is importing by ordinal. Normally when you import a function, your binary contains the name of the imported functions. When the Win32 loader looks up the imported names, it has to do string comparisons to match the names you're importing to the names exported by the DLLs you're importing from.

When a Portable Executable binary exports functions, it contains an array of offsets to the exported functions. When you import by ordinal, the importing binary contains an array index (here, called the ordinal) into this array. This method of finding the imported function's address is a simple array lookup, so it's very fast. Importing by name is a lot more work, since the Win32 loader takes the name and does a search to find the corresponding ordinal value. From there, the loader continues as if you had specified an ordinal in the first place. Importing by name just adds an additional layer of code on top of importing by ordinal.

So why does Win32 allow importing and exporting by name? There's a variety of reasons, but two immediately come to mind. First, it can be a pain to keep the same export ordinal assigned to a given API if its DLL evolves over time, as the Win32 system DLLs tend to do. Keeping track of hundreds of functions in a .DEF file can be tedious. In addition, there's more than just one KERNEL32.DLL; there's one for Windows 2000, one for Windows 98, and so on. Second, exporting by name allows you to use the function's name with GetProcAddress, rather than its ordinal value. If you dump a random selection of import libraries from the Platform SDK or Visual Studio, you'll find that most DLLs export by name, but a few export by ordinal.

How do you import and export by ordinal? The importing part is actually done automatically for you by the Microsoft linker. However, in exchange you must export the APIs by ordinal. When the linker generates the import library corresponding to a DLL, it generates import records that will tell the linker how the APIs should be imported. The best way to export by ordinal is to explicitly tell the linker through a .DEF file. For instance, in a .DEF file, you'd have:

EXPORTS 
    MyExportedAPI    @1

If you don't use the @1 modifier, the Microsoft linker exports the API by name.

Besides the faster load time associated with importing and exporting by ordinal, there's another more subtle benefit. When exporting an API by ordinal, you can tell the linker not to store the exported API name in the exporting DLL. This means a smaller export section, and potentially a smaller binary with less data to demand page in. To eliminate the API name, use the NONAME modifier when exporting by ordinal.

EXPORTS 
    MyExportedAPI    @1 NONAME

If you look at MFC42.DLL, you'll see that it exports almost all of its 6000+ APIs by ordinal, and with the NONAME modifier. Imagine the added bulk if MFC42.DLL had to store all 6000 mangled C++ names in its exports!

Creating the Optimization Tests

To test the effects of proper basing, binding, and importing by ordinal, I wrote a program called MakeLoadTimeTest.EXE that generates the program to be benchmarked. The program I created allows me to easily tweak things like the number of imported functions, how many DLLs it calls, and the number of relocations each DLL has. The source for MakeLoadTimeTest is in this month's download files.

The MakeLoadTimeTest.CPP code isn't particularly pretty. However, you don't have to read it too closely, since the things you might want to change are isolated at the top—particularly these three lines:

const unsigned nDLLs = 10; 
const unsigned nExportedFunctions = 100; 
const unsigned nGlobalVariablesPerFunction = 5;

Based on these constants, the generated program (LoadTimeTest) will import 10 DLLs (in addition to KERNEL32.DLL). Each of these DLLs will export 100 functions, and the main executable will import all 100 functions. Finally, each exported function references five global variables. Why reference a global variable? It's an easy way to force a base relocation to be generated. The more base relocations, the more work the loader needs to do when a DLL doesn't load at its preferred address.

Figure 1 shows the code generated for one exported function. It starts out with five global variable declarations (for example, g_var_n2_0). Next is a #ifdef block that lets you decide at compile time whether the function is exported by name or by ordinal. Finally, the function itself (in this case, LoadTimeDLL_10_func_2) simply stores a value into previously declared variables. Why the funny variable and function names? The numbers at the end of the names make them all unique, so I avoid naming collisions.

void * g_var_n2_0; 
void * g_var_n2_1; 
void * g_var_n2_2; 
void * g_var_n2_3; 
void * g_var_n2_4; 
#ifdef ORDINAL_EXPORTS 
#pragma comment( linker, "/EXPORT:_LoadTimeDLL_10_func_2,@2,NONAME") 
#else 
#pragma comment( linker, "/EXPORT:_LoadTimeDLL_10_func_2") 
#endif 
extern "C" void LoadTimeDLL_10_func_2(void) 
{
   g_var_n2_0 = LoadTimeDLL_10_func_2;
   g_var_n2_1 = LoadTimeDLL_10_func_2;
   g_var_n2_2 = LoadTimeDLL_10_func_2;
   g_var_n2_3 = LoadTimeDLL_10_func_2;
   g_var_n2_4 = LoadTimeDLL_10_func_2;
}

Figure 1 A LoadTimeTest Function

As for the LoadTimeTest executable, it only needs a simple function main that references all the functions exported by the generated DLLs. The entire function is over 1000 lines long, but here's the relevant snippet:

#include "LoadTimeTest.H"
#include <windows.h> 

int main(int argc) 
{
     TerminateProcess( GetCurrentProcess(), 0 );
     LoadTimeDLL_1_func_1();
     LoadTimeDLL_1_func_2();
     LoadTimeDLL_1_func_3();
     LoadTimeDLL_1_func_4();
     // ...
}

You might be wondering why I used the TerminateProcess call. In an ideal world, I would be able to time just how long it takes to load my target process, right up to the point where its entry point is invoked. However, I couldn't come up with a simple way to do exactly this.

The hack I eventually decided on is to make the main function call TerminateProcess. This kills the process immediately without sending the DLL_PROCESS_DETACH notifications to the DLLs. In addition, because I don't actually call all the generated APIs (such as LoadTimeDLL_1_func_1), I don't incur the overhead of demand paging the code in. However, because I referenced all the generated exported functions in function main, the loader is forced to load the DLLs and potentially apply the base relocations.

Besides the code generator (MakeLoadTimeTest), and the generated program (LoadTimeTest), there's one more program in this benchmarking suite. The LoadTimer executable runs the LoadTimeTest program and times how long it takes to execute. Because Windows is a preemptive multitasking system, I expended a fair amount of effort to get fairly reliable timing information. The source code in the download files contains all the details, but it's worth summarizing here.

For starters, LoadTimer doesn't rely on a single run of LoadTimeTest. The first time LoadTimeTest runs, disk overhead adds to its load time. Subsequent runs are usually faster because the operating system has cached the pages from the EXE and DLLs. I time 30 invocations of LoadTimeTest and use the fastest run—the one with the least amount of external overhead from factors such as thread switching and interrupt processing.

In order to minimize the overhead of these external events, LoadTimer sets its priority to REALTIME_PRIORITY_CLASS. In addition, when it starts LoadTimeTest.EXE, the code specifies REALTIME_PRIORITY_CLASS for the process. This ensures that both the CreateProcess code executed in the LoadTimer process, as well as the code in the LoadTimeTest process, are run with the highest possible priority. As a result, the effect of external happenings should be minimized.

For the timing, I used the QueryPerformanceCounter API. I originally had tinkered with using the x86 architecture's RDTSC instruction, which can be as accurate as a single CPU clock cycle. However, it requires knowing the CPU's speed to calculate an actual time. While you can read the CPU's speed from the registry in Windows NT® (HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\ System\CentralProcessor\0), this number isn't exact. For instance, on my 550Mhz machine, the registry reports a speed of 548Mhz. (Who can I sue for missing 2Mhz ?) In the end, the granularity of QueryPerformanceCounter seemed perfectly adequate for the durations under consideration.

To make my benchmark code flexible, I made LoadTimer usable on any program that will exit on its own without user intervention. LoadTimer takes a command-line argument, specifying the file name to run. In my test, the command-line argument was LoadTimeTest.EXE.

The LoadTimeTest Benchmark Process

Here are the steps needed to reproduce my results from the code files in the download. First, build MakeLoadTimeTest and LoadTimer from the project files. Next, run MakeLoadTimeTest.EXE. The results will be 11 .CPP and 11 .H files. Running the BuildLoadTimeTest.BAT file compiles these files into LoadTimeTest.EXE and 10 associated DLLs. If you specify "ORDINAL" as the argument to BuildLoadTimeTest.BAT, LoadTimeTest.EXE imports the APIs by ordinal; otherwise, they're imported by name.

If you look at BuildLoadTimeTest.BAT, you'll see that it uses the linker defaults for the 10 DLLs. Thus, all of the generated DLLs will have the same preferred load address: 0x10000000. This is intentional, as it starts the benchmark with nine DLLs that will be relocated at runtime.

Now, on to the actual testing. First, run the command

LoadTimer.EXE LoadTimeTest.EXE

several times, and record the lowest time. This is the worst-case-scenario timing.

Now, let's fix the problem of all those DLLs needing to be relocated. Run the RebaseLoadTimeTest.BAT file, which uses REBASE.EXE on the EXE and the 10 generated DLLs, so that each one has a unique load address. Rerun the timing sequence, and record the lowest time. This gives you a feel for how much rebasing can affect loading times.

Now that the EXE and all the DLLs are loading at their preferred load address, it's worthwhile to see what additional gains can be had by binding them. Run BindLoadTimeTest.BAT and then rerun the timing sequence, again recording the lowest time.

At this point, you should have three load times: the default time without any intervention, the time after rebasing the executables, and the time after basing and binding. To see the effect of importing by ORDINAL, rerun the previous tests, but with one change: at the beginning specify "ORDINAL" as the argument when running BuildLoadTimeTest.BAT.

LoadTimeTest Results

Before getting to the actual numbers, let me first say that I was amazed at how fast programs can load. I intentionally created LoadTimeTest.EXE to make a lot of work for the Win32 loader. It has a fair number of DLLs and lots of exported functions and relocations. Even under the slowest scenario, my machine still loaded the program under Windows 2000 in less than 1/50th of a second. If your program takes a long time to load, don't blame the loader. The problem is almost certainly that somebody's initialization code is taking too long.

Figure 2 shows the results I obtained. The test machine was a Dell XPS T550, a single Pentium III CPU running at 550Mhz. The only visible process running (other than the Explorer shell) was a command prompt. The tests were run from a FAT16 partition so that I could test on both Windows 2000 and Windows 9x.

================================================
Windows 2000 By Name, FAT16

Default:
Fastest time: 58286 ticks, 0.016283 seconds
Ticks: 58286
Ticks/second: 3579545

Based:
Fastest time: 52611 ticks, 0.014698 seconds
Ticks: 52611
Ticks/second: 3579545

Bound:
Fastest time: 49006 ticks, 0.013691 seconds
Ticks: 49006
Ticks/second: 3579545

================================================
Windows 2000 By Ordinal, FAT16

Default:
Fastest time: 57824 ticks, 0.016154 seconds
Ticks: 57824
Ticks/second: 3579545

Based:
Fastest time: 50609 ticks, 0.014138 seconds
Ticks: 50609
Ticks/second: 3579545

Bound:
Fastest time: 49251 ticks, 0.013759 seconds
Ticks: 49251
Ticks/second: 3579545

================================================
Windows 98 SE By Name, FAT16

Default:
Fastest time: 32738 ticks, 0.027438 seconds
Ticks: 32738
Ticks/second: 1193180

Based:
Fastest time: 30150 ticks, 0.025269 seconds
Ticks: 30150
Ticks/second: 1193180

Bound:
Fastest time: 28944 ticks, 0.024258 seconds
Ticks: 28944
Ticks/second: 1193180

================================================
Windows 98 SE By Ordinal, FAT16

Default:
Fastest time: 31812 ticks, 0.026662 seconds
Ticks: 31812
Ticks/second: 1193180

Based:
Fastest time: 29569 ticks, 0.024782 seconds
Ticks: 29569
Ticks/second: 1193180

Bound:
Fastest time: 29513 ticks, 0.024735 seconds
Ticks: 29513
Ticks/second: 1193180

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

================================================
Windows 2000 By Name, Compressed NTFS

Default:
Fastest time: 61087 ticks, 0.017066 seconds
Ticks: 61087
Ticks/second: 3579545

Based:
Fastest time: 53834 ticks, 0.015039 seconds
Ticks: 53834
Ticks/second: 3579545

Bound:
Fastest time: 50388 ticks, 0.014077 seconds
Ticks: 50388
Ticks/second: 3579545


================================================
Windows 2000 By Ordinal, Compressed NTFS

Default:
Fastest time: 58607 ticks, 0.016373 seconds
Ticks: 58607
Ticks/second: 3579545

Based:
Fastest time: 51755 ticks, 0.014459 seconds
Ticks: 51755
Ticks/second: 3579545

Bound:
Fastest time: 50204 ticks, 0.014025 seconds
Ticks: 50204
Ticks/second: 3579545

Figure 2 LoadTimeTest Results

You can slice the data in a variety of ways, but here are the numbers I found interesting. On Windows 2000, properly basing the DLLs improved the load time by roughly 12 percent. Basing and binding the EXE and the DLLs improved the load time by around 18 percent. Importing by ordinal versus importing by name provided a 4 percent improvement. On Windows 98, Second Edition, properly basing the DLLs improved the load time by roughly 8 percent. Basing and binding the EXE and the DLLs improved the load time by around 12 percent. Importing by ordinal versus importing by name provided a mere 2 percent improvement.

Importing by name instead of ordinal doesn't affect the load time very much when the program is properly based and bound. When everything is properly configured, the loader can read the address of the imported function directly from the DLL doing the importing. The loader doesn't need to bother looking up names or indexing into arrays to get the address of the function.

Random Observations

While working through the code, I was thinking about the effects of DLLs loading at a nonpreferred address. In my test code, each of the DLLs exports 100 APIs, yet those APIs are never called. Because of demand paging, it's conceivable that the pages containing those APIs might not be brought into memory. As such, the overhead of applying the base relocations might not apply. With this in mind, I contacted a performance expert in Microsoft research. He told me that under Windows 9x, my hypothesis about not needing to apply the base relocations was correct.

However, under Windows NT and Windows 2000 the pages are set temporarily to read/write, are modified, and then returned to their original permission. As a result, any pages modified in this way are no longer shared between processes. Essentially, under Windows NT and Windows 2000, any executable that does not load at its preferred load address will have all of its code and data demand paged in when the executable first loads.

On a different note, I was surprised at the results of QueryPerformanceFrequency under Windows 2000. Looking at Figure 2, notice that under Windows 2000 the timer operates at 3.57Mhz. This is exactly three times faster than the 1.19 Mhz frequency used on Windows 98. If you're a PC old-timer, you may recall that motherboards traditionally have a 1.19Mhz oscillator that drives the 8254 chip.

Intrigued, I ran my tests on a Pentium Pro 200Mhz running Windows 2000 and got the 1.19Mhz frequency. I then ran LoadTimer on some dual processor machines and found that the frequency matched the CPU speed. The conclusion I've drawn from this is that Windows 2000 observes what the motherboard is capable of doing and uses the best available timer frequency for the QueryPerformanceXXX APIs. Yes, I admit to being a nerd and spending way too much time on this minor detail, but I enjoyed the experimentation and the opportunity to ponder things at the hardware level.


Matt Pietrek performs advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site, at https://www.wheaty.net, has a FAQ page and information on previous columns and articles.


From the May 2000 issue of MSDN Magazine.