CLR Inside Out

Improving Application Startup Performance

Claudio Caldato

Contents

Cold Startup
Identify Code Loaded from Disk
System Assemblies and Other Processes
NGen Performance
Authenticode Verification
Wrapping Up

Waiting for an application to start is frustrating for many users, so focusing on your client application's startup performance can greatly enhance your customers' first and lasting impressions of your handiwork. And because startup performance matters to users, it's worth exploring the factors that affect it so you can avoid the most common mistakes.

Application startups are typically classified as either cold or warm. In the context of a managed application, cold startup is when neither the Microsoft® .NET Framework system assemblies nor the application's code are in memory, so they all need to be fetched from disk. Warm startup is either a subsequent startup of an application or an application startup when most of the system code is already in memory because it was used previously by another managed application.

Cold Startup

In most cases, cold startup is I/O bound. In other words, more time is spent waiting for data than is spent processing instructions. The time it takes to launch the application is equal to the time it takes the OS to fetch code from disk plus the time it takes to perform additional processing such as JITing the IL code and any other initialization that is performed in the startup path of the application. Since processing is usually not the bottleneck in a cold startup, the initial goal of any application startup performance investigation is to reduce disk access by reducing the amount of code that is loaded.

The way in which the application's code is written also has a significant impact on cold startup so it is important to find out if, for example, the application opens additional files or launches other processes that could compete for I/O resources at startup.

Because cold startup is an I/O bound scenario, the use of a traditional CPU profiler (whether instrumentation-based or sampling-based) will not help investigations significantly. Instrumentation-based profilers will report the time spent waiting on I/O as blocked time. The problem is that, even if you are able to attribute the blocked time to a specific call stack, blocked time is counted only once. Any following disk I/O is not taken into account, resulting in a partial picture of the actual contribution of disk I/O to the total execution time.

With sampling-based profilers, the information collected can even be misleading: sampling tracks CPU usage, not I/O, so the time spent waiting on I/O does not appear in the profiler's reports at all.

You can see for yourself that cold startup is I/O bound (as illustrated in Figure 1) by launching your application twice in a row. The first launch will likely be much slower than the second, where most of the code needed for execution is already in memory thanks to the first launch, saving the time otherwise spent on disk access. Of course, to make sure the first launch is indeed a cold startup, restart your machine beforehand and verify that there are no managed applications in the startup folder and that no Windows® service using managed code runs when the user logs on.

Figure 1 Time Reading Disk and CPU Time in Cold Startup

Note that to perform an ideal test of cold startup, you should disable the SuperFetch service, which may otherwise preload some of the code needed by your app, creating a warmer startup scenario. The benefit of measuring with SuperFetch turned off is that you can count on the fact that all the code needed by the application was loaded into memory on application startup, and hence you can more accurately measure the cost of I/O. You should keep in mind, however, that what you are measuring is not necessarily the actual user experience, so don't draw any concrete conclusions about the actual performance of your application with the data you collect with SuperFetch turned off.
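
For reference, SuperFetch runs as the SysMain service, so one way to stop it for a measurement run is from an elevated command prompt, restoring it when you are done. Treat the following commands as a sketch of the measurement setup, not a recommendation to leave the service disabled:

net stop SysMain
sc config SysMain start= disabled

rem restore the service when you are done measuring
sc config SysMain start= auto
net start SysMain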

Two performance counters that you can use to see the impact of I/O on cold startup are % Processor Time and % Disk Read Time. If I/O dominates cold startup, which it should, you will see a big gap between the two, with disk read time far higher than processor time. Performance counters can be collected using PerfMon (see the "Startup Performance Resources" sidebar for more information).
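
If you prefer to capture these two counters programmatically rather than through the PerfMon UI, the following minimal C# sketch uses the System.Diagnostics.PerformanceCounter class with the standard category, counter, and instance names; the sampling interval and duration are arbitrary choices for illustration.

using System;
using System.Diagnostics;
using System.Threading;

class StartupIoMonitor
{
    static void Main()
    {
        // Standard PerfMon names; use a specific disk instance instead of
        // "_Total" if you want to isolate one physical disk.
        PerformanceCounter cpu =
            new PerformanceCounter("Processor", "% Processor Time", "_Total");
        PerformanceCounter disk =
            new PerformanceCounter("PhysicalDisk", "% Disk Read Time", "_Total");

        // Rate counters need two samples; the first NextValue call just primes them.
        cpu.NextValue();
        disk.NextValue();

        for (int i = 0; i < 30; i++)   // sample once per second for about 30 seconds
        {
            Thread.Sleep(1000);
            Console.WriteLine("CPU: {0,6:F1}%   Disk read: {1,6:F1}%",
                cpu.NextValue(), disk.NextValue());
        }
    }
}

Start the monitor first, then launch the application you are measuring; the cold and warm runs should produce patterns similar to the ones in Figures 1 and 2.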

In Figure 1, the red line represents % disk read time and the green line represents % processor time. In the case of cold startup, you can see that CPU usage is relatively low compared with the time spent reading from disk.

The second time the application is launched, you are in the warm startup scenario, so the performance counters should show a different picture. In Figure 2, the scenario is CPU bound, and as you can see, % disk read time is very low compared with % processor time.

Figure 2 Shorter Times in Warm Startup

Warm startup is CPU-bound because the code is already in memory (so there is no need for additional I/O) but code needs to be JITed before the application can run. With the .NET Framework today, the native code generated by the JIT is not saved from one execution of an application to the next.

If warm startup is not significantly faster than cold startup, you need to discover what is consuming CPU cycles (since most of the code is already in memory during a warm startup, the bottleneck is unlikely to be I/O). The likely culprits, then, are large volumes of code that must be JITed or complex computations that the application performs at startup.

To determine if JITing is the issue, you can check the performance counter .NET CLR Jit\% Time in Jit. If the value is not high (say, it stays below 30 to 40 percent for most of the startup time), JIT compilation is not likely to be a major contributor, and you should use a profiler to determine which functions in your app consume the most CPU time. Keep in mind that the counter is updated only when methods are actually JITed. This means that after the last method has been JITed, the counter will keep reporting its last value; it will not drop to zero. Therefore, make sure you look at the counter only for the first few seconds of the app's startup; if the counter increases very quickly during that window, the spike in CPU utilization is caused by the JIT compiler.
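
As a rough sketch of the same check in code, you can sample the .NET CLR Jit\% Time in Jit counter for the first few seconds after launching your application; the instance name is the process name of the application being measured ("MyApp" below is a placeholder).

using System;
using System.Diagnostics;
using System.Threading;

class JitMonitor
{
    static void Main()
    {
        // ".NET CLR Jit" is the CLR-provided category; the instance is the
        // process name of the application under test (placeholder here).
        PerformanceCounter jit =
            new PerformanceCounter(".NET CLR Jit", "% Time in Jit", "MyApp");

        // Sample only the first few seconds of startup; after the last method
        // is JITed the counter simply keeps reporting its last value.
        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(500);
            Console.WriteLine("% time in JIT: {0:F1}", jit.NextValue());
        }
    }
}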

You should also be aware that any application loaded when the user logs on has to compete for I/O with other services and applications, making startup time even worse. Therefore, avoid adding applications to the startup group whenever possible. (A good tool for determining which applications are set to run at startup is AutoRuns, available at technet.microsoft.com/sysinternals/bb963902.)

Identify Code Loaded from Disk

Startup Performance Resources

The following previously published CLR Inside Out columns all contain valuable information for further investigation:
Improving Application Startup Time by Claudio Caldato
msdn.microsoft.com/msdnmag/issues/06/02/CLRInsideOut
Includes some useful tips on how to write code in such a way that it has less impact on the application startup.
The Performance Benefits of NGen by Surupa Biswas
msdn.microsoft.com/msdnmag/issues/06/05/CLRInsideOut
Provides a good overview of the NGen technology with a focus on performance. This article is useful for understanding the costs and benefits of using NGen when considering different scenarios such as warm startup, code sharing, cold startup, and so on.
Investigating Memory Issues by Claudio Caldato and Maoni Stephens
msdn.microsoft.com/msdnmag/issues/06/11/CLRInsideOut
This piece is not related to startup per se, although it discusses working set, which is another important performance metric. The article covers some more advanced techniques that can be used to determine why a managed application is using too much memory.

Also, you might want to check out the following articles that provide further information to help you reduce startup time:
Bug Bash: Let the CLR Find Bugs for You with Managed Debugging Assistants by Stephen Toub
msdn.microsoft.com/msdnmag/issues/06/05/BugBash
Talks about MDAs, which are debugging aids that work with the CLR to provide information on the state of the runtime. They provide information about runtime events that you would ordinarily not be able to generate.
Inside the Windows Vista Kernel by Mark Russinovich
microsoft.com/technet/technetmag/issues/2007/03/VistaKernel
Explains how SuperFetch operates and how ReadyBoost uses USB flash storage to supplement system memory.
Signing and Checking Code with Authenticode
msdn2.microsoft.com/ms537364
Gives you all the information you need to understand the Authenticode signing process.
Runtime Profiling
msdn2.microsoft.com/w4bz2147
Explains how to use PerfMon and what the different counters mean.

The next step is to determine what is loaded from disk and find out if there is code that is being loaded unintentionally. The quickest way to determine what is loaded into memory is to use the VADump tool (you can find it in the Windows Platform SDK). Figure 3 shows an excerpt of the report that's generated by running the following command:

VADump -sop <proc ID>

Figure 3 VADump Output

Category                    Total             Private  Shareable    Shared
                            Pages    KBytes    KBytes     KBytes    KBytes
Page Table Pages              177       708       708          0         0
Other System                   39       156       156          0         0
Code/StaticData              8169     32676      2160       8336     22180
Heap                        14042     56168     56168          0         0
Stack                           0         0         0          0         0
Teb                             0         0         0          0         0
Mapped Data                     8        32         0          4        28
Other Data                      1         4         4          0         0

Total Modules                8169     32676      2160       8336     22180
Total Dynamic Data          14051     56204     56172          4        28
Total System                  216       864       864          0         0
Grand Total Working Set     22436     89744     59196       8340     22208

Module Working Set Contributions in pages
    Total   Private  Shareable    Shared  Module
       72         2         70         0  HeadTrax - HeadTrax.exe
      107         7          0       100  ntdll.dll
       37         4          6        27  mscoree.dll
       77         3          0        74  KERNEL32.dll
        6         2          0         4  LPK.DLL
       27         4          0        23  USP10.dll
      116         4          0       112  comctl32.dll
      878        23         79       776  mscorwks.dll

Heap Working Set Contributions
    0 pages from Process Heap (class 0x00000000)
    0 pages from Process Heap (class 0x00000000)
 9332 pages from Process Heap (class 0x00000000)
      0x0255850F - 0xC255350F 9332 pages
    0 pages from Process Heap (class 0x00000000)
    0 pages from Process Heap (class 0x00000000)
 4710 pages from Process Heap (class 0x00000000)
      0x00040000 - 0x10040000 4710 pages
    0 pages from Process Heap (class 0x00000000)

Stack Working Set Contributions
    0 pages from stack for thread 00001018
    0 pages from stack for thread 000017EC
    0 pages from stack for thread 0000187C

One important thing to keep in mind is that VADump shows only what is loaded in memory at the time the tool is run, so it might miss modules that are loaded in memory for only a short period of time. It also doesn't show the part of the app (either code or data) that has been paged out to disk. The goal is to review the VADump report to determine if it makes sense to load all the modules in the list. For instance, if your application does not use XML and you see that System.Xml was loaded, you need to investigate.

You can find out what loaded an assembly by using the sxe command in the Windows debugger (windbg). The "sxe ld:<dll name>" command causes the debugger to break when the specified DLL is loaded. You can then check the call stack to find out which function caused the DLL to be loaded in memory. This aspect of the investigation should not be underestimated. It is very easy to lose sight of what the application actually loads in memory.
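
If you would rather stay in managed code, a complementary trick (a sketch, not a substitute for the debugger) is to hook the AppDomain.AssemblyLoad event as early as possible in Main and log the stack trace at the moment each assembly is loaded:

using System;

static class AssemblyLoadTracer
{
    // Call this as the first thing in Main so later loads are captured.
    // Assemblies loaded before the handler is installed are not reported.
    public static void Install()
    {
        AppDomain.CurrentDomain.AssemblyLoad += delegate(object sender, AssemblyLoadEventArgs args)
        {
            Console.WriteLine("Loaded: " + args.LoadedAssembly.FullName);
            Console.WriteLine(Environment.StackTrace);   // shows what triggered the load
        };
    }
}

Writing to the console from the handler pulls in a little extra code itself, so treat the output as a development-time diagnostic rather than something to ship.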

System Assemblies and Other Processes

Once you have eliminated the loading of all unnecessary assemblies at startup (for further improvement, you can also modify the application code to delay some of the initialization work done at startup), the next thing to do is to reduce the amount of code that is loaded from system assemblies. Unfortunately, I know of no tools that can tell you how much code is pulled in when a system API is used. That would be very useful because it would let developers choose APIs in their startup code that need less code to be loaded from the system assemblies. Until such tools become available, you can get an approximate page cost of an API by using an instrumentation-based profiler (for instance, the Visual Studio® Performance Tools).

By looking at the profile data, you can try to avoid APIs that involve system calls with big call trees (a big, deeply nested call tree means that the code for each method in the tree had to be pulled from disk, so the size of the tree is a rough way to approximate how expensive the call is). If you can implement the same functionality by calling an API that doesn't have a deep system call tree, you will save time. This is not a scientific approach because it is not easy to determine how much code can be saved by cutting a call tree, but generally the bigger the call tree, the greater the amount of code loaded from disk.

In some cases, your application might explicitly or implicitly launch other processes at startup. You can easily find out what they are by using the -o option in the Windows debugger (windbg), which causes the debugger to attach to any child process. A typical example of a process being launched implicitly occurs when the app uses XML serialization and doesn't precompile the serialization classes (using the Sgen utility); when this happens, the C# compiler is launched to compile them. Launching other processes is usually a very expensive operation that can have a significant impact on startup.
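
As a hedged illustration of the XML serialization case, the hypothetical Order type below is serialized without a precompiled serializers assembly, so the first XmlSerializer construction generates and compiles a temporary serialization assembly, which is what launches the C# compiler. Running Sgen at build time (for example, sgen /assembly:MyApp.exe, which produces MyApp.XmlSerializers.dll next to the application) moves that work out of the startup path.

using System;
using System.Xml.Serialization;

// Hypothetical type used only for illustration.
public class Order
{
    public int Id;
    public string Customer;
}

class Program
{
    static void Main()
    {
        // Without a precompiled MyApp.XmlSerializers.dll, this constructor
        // generates, compiles, and loads a temporary serialization assembly,
        // adding a compiler launch and extra I/O to the startup path.
        XmlSerializer serializer = new XmlSerializer(typeof(Order));
        serializer.Serialize(Console.Out, new Order { Id = 1, Customer = "Contoso" });
    }
}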

NGen Performance

Native image generation (NGen) always helps warm startup because it avoids the cost of JITing code. NGen can also help in cold startup scenarios if mscorjit.dll does not need to be loaded because all the code used by the application has been precompiled with NGen. If even one module lacks a corresponding native image, however, mscorjit.dll will still be loaded. Then not only will that code be JITed, consuming CPU cycles, but many pages in the NGen images will also be touched because the JIT compiler needs to read their metadata, resulting in an even worse startup. For this reason, it is recommended that you remove any code that might cause JITing during startup. Of course, whether to take this approach can only be determined by measuring cold startup performance with and without native images: the actual benefit of NGen on cold startup depends on the application's code and size, so it is not guaranteed to deliver a significant improvement even if no JITing happens at startup.
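
For reference, a typical way to precompile an application and its dependencies is to run NGen from the framework directory; the assembly name below is a placeholder, and the framework version in the path depends on the runtime you target:

%WINDIR%\Microsoft.NET\Framework\v2.0.50727\ngen.exe install MyApp.exe
%WINDIR%\Microsoft.NET\Framework\v2.0.50727\ngen.exe display MyApp

ngen install precompiles the assembly and, by default, the assemblies it references; ngen display lets you confirm which native images are actually present, so you can spot modules that will still be JITed at run time.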

One way to determine if and when JITing is happening is to use a Managed Debugging Assistant (MDA). The JitCompilationStart MDA allows you to either break into the debugger or print debug information when a method is being JITed. The MDA can be enabled by setting an environment variable, as follows:

COMPLUS_MDA=JitCompilationStart

The application will break into the debugger when code is being JITed. The MDA can also be set using the registry or the application's .config file. See the "Startup Performance Resources" sidebar for more details on how to use MDA.

In general, in order to ensure that NGen is going to benefit cold startup performance, make sure that:

  • The entire application is NGen-ed.
  • There is no rebasing. Rebasing is a very expensive operation, and rebased code cannot be shared. You can find more details on how to set the base address at msdn.microsoft.com/msdnmag/issues/06/05/CLRInsideOut.
  • Assemblies are installed in the Global Assembly Cache (GAC). Strong name verification requires touching the entire file, but it is skipped for assemblies installed in the GAC.

Authenticode Verification

Assemblies can be Authenticode-signed using the signcode tool. Authenticode verification always has a negative impact on startup because the signature on an Authenticode-signed assembly must be verified. That verification involves validating the certificate chain up to the Certificate Authority (CA) that issued the signing certificate, which is a very expensive operation and may even require network access if the CA certificate is not installed locally on the machine.

Ideally, you want to avoid Authenticode signing of your assemblies and use strong name signatures instead. If Authenticode signing cannot be avoided, the verification can be skipped in the .NET Framework 3.5 by using the following configuration option:

<configuration>
  <runtime>
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>

Note, however, that even when Authenticode signing is necessary, most of the time required for the verification can still be saved simply by installing the CA certificate on the client machine.
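
If you do switch to strong names, the usual workflow is to generate a key pair with the Strong Name tool and pass it to the compiler; the file and assembly names below are placeholders:

sn -k MyKeyPair.snk
csc /keyfile:MyKeyPair.snk /target:library MyLibrary.cs

Unlike Authenticode, strong name verification does not involve certificate chains, and as noted above it is skipped entirely for assemblies installed in the GAC.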

Wrapping Up

In order to achieve good cold startup performance, the best thing to do is to keep the code that executes at startup very lean. This means postponing initialization that is not strictly needed, checking all references to make sure they are not loaded too soon, and trying to use classes and methods that don't require a lot of code to be loaded. Remember, the goal is to reduce disk access. This is not an easy task, but Xperf, a new tool included in the upcoming Windows Server® 2008 SDK, uses Event Tracing for Windows (ETW) to track loaded modules, context switches, and other events that help determine what happens during application startup. With Xperf it will be possible to collect very precise metrics on application startup time. The "Startup Performance Resources" sidebar contains many more references that you'll find helpful.

Send your questions and comments to clrinout@microsoft.com.

Claudio Caldato is a Program Manager for Performance and Garbage Collector on the CLR team at Microsoft.