Troubleshooting Common Problems with Applications: Debugging in the Real World

 

Mark Long
Microsoft Corporation

October 2000

Summary: Provides an introduction to debugging software. Focuses on quickly identifying common hard-failure scenarios. (16 printed pages)

Introduction

It happens with any software development: you write the application, test it, and send it out into the wide world. Then a user calls up to tell you that your pride and joy has fallen flat on its face. What do you do next? You debug. There are a few good books on the subject of debugging, and at least one is truly excellent: Debugging Applications by John Robbins. These books will teach you how to debug anything; certainly they will tell you far more than this article ever could. Rather than attempting such a broad look at debugging here, we'll narrow our focus, examining how to quickly identify some common hard-failure scenarios.

Note   We will not address debugging of logic problems here, since that is another art entirely and is largely dependent on the application type.

First Steps

First, you need to ask a few questions to narrow down the problem:

  • Does the application fail on multiple machines?
  • Does the problem always happen after the same set of actions?
  • Did the application work until recently? If so, what has changed?
  • What exactly do we mean by "failure?" Is the failure an access violation, a hang, or an error message?

Does the Application Fail on Multiple Machines?

If the failure is confined to a specific machine, something is almost certainly broken on that machine. You can be pretty certain that what is broken is some part of the system that your application uses. Since it is usually very difficult to get the user to install any utilities on a target system, it is best to do as much as possible from your end.

If the application fails on a number of machines but works on others, it is advisable to look for similarities between the working systems. This may mean looking beyond the immediate computers in question and considering the wider system. For example, if your application works well for the accounts department but fails for the sales department, it would be worth checking to see if the problem is rooted in user account privileges.

Each machine on which the application fails is really a separate case, however. Multiple machines that show the same unintended behavior may share the same root cause, but you need to identify the cause of the failure on a single machine before you apply the fix to other systems or decide whether there really is a single shared problem.

Before you start to look at configuration, it is a good idea to ask these questions:

  • Did the application installation appear to work without problem?
  • Is there anything unusual installed on the machine in question? Curious shareware utilities and beta products are favorites.

Restarting in Safe Mode

Even if no unusual software is present on the machine, it is a good idea to terminate all other tasks (typically using Microsoft® Windows® Task Manager, since some tasks or processes have no visible user interface) and check to see if the problem still occurs. If the operating system is Microsoft Windows 95 or Microsoft Windows 98, you can go one step further and boot up in safe mode.

To restart in safe mode

  • On a Windows 95-based computer, press F5 when you see the Starting Windows 95 message.
  • On a Windows 98-based computer, when the computer restarts, press and hold the SHIFT key until Windows 98 starts in safe mode.

Safe mode is a diagnostic mode of Windows—it loads a VGA video driver and a minimal set of known drivers for other hardware. The Knowledge Base (KB) article How Windows 95 Performs a Safe-Mode Start (Q122051) has more details about safe mode. Because so little other software is loaded, you can be sure that any problems that you are still seeing are unlikely to be due to interaction with other software. If your application works in safe mode, your best bet is to first try it in normal mode, but with VGA mode selected. If it still fails, start stripping down the system. This is often a painful process for the user but there is no way to avoid this pain.

Microsoft Windows NT® 4.0 doesn't have a safe mode (Windows 2000 does; press F8 during startup and choose Safe Mode from the advanced options menu), but both systems are much less prone to interaction between software, because of their much higher level of process isolation. Each process lives in its own virtual address space and is largely unaware of other processes.

Examining the Files

If the application is still failing on a target machine, you need to start looking at the versions of files installed on that machine. There are two approaches to this task: a targeted examination, or a scattergun approach. You should start with the targeted approach and only use the scattergun approach as a last resort. To target the examination, you need to know the files on which your application relies. You can make a very good start on this with the Dependency Walker that ships with Microsoft Visual Studio®, or download it as part of the Windows NT Resource Kit included in Windows NT 4.0 Service Pack 4.0. For details, see the KB article Windows NT Service Pack 4.0 Tools Not Included on CD-ROM (Q206848). The Dependency Walker will tell you which DLLs you link to statically in your application. (It can't tell you anything about components that you load dynamically via LoadLibrary or COM.) Armed with the information from the Dependency Walker, as well as your knowledge of your own application, you can get the user to report the versions of each of the files used.
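If you would rather not rely on the user reading version numbers out of Explorer, a small helper can do the reporting for you. The following is a minimal sketch (the helper itself is not one of the tools mentioned above) that prints the version of a single file using the Win32 version APIs; link it with version.lib:

   #include <windows.h>
   #include <stdio.h>

   void ReportFileVersion(const char *szFile)
   {
      DWORD dwHandle = 0;
      DWORD cb = GetFileVersionInfoSize(szFile, &dwHandle);
      if (cb == 0)
      {
         printf("%s: no version information\n", szFile);
         return;
      }

      void *pBlock = malloc(cb);
      if (pBlock == NULL)
         return;

      VS_FIXEDFILEINFO *pffi = NULL;
      UINT cbffi = 0;

      if (GetFileVersionInfo(szFile, 0, cb, pBlock) &&
          VerQueryValue(pBlock, "\\", (void**)&pffi, &cbffi))
      {
         // dwFileVersionMS/LS hold the four version fields as packed WORDs.
         printf("%s: %u.%u.%u.%u\n", szFile,
                HIWORD(pffi->dwFileVersionMS), LOWORD(pffi->dwFileVersionMS),
                HIWORD(pffi->dwFileVersionLS), LOWORD(pffi->dwFileVersionLS));
      }
      free(pBlock);
   }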

Once you have a list of what components are installed on the failing machine, you can compare it with the requirements of your application. For example, if your application needs MDAC 2.1 or later, does the user actually have the right version installed? Of course, it is hard to look at a file version and know where it comes from—unless you know about the MSDN Online Support DLL Help Database. DLL Help is a marvelous tool for those languishing in DLL Hell.

The scattergun approach is similar except that you check all the DLLs: those in %windir%\system[32] and those salted around the system, such as Program Files\Common Files. If you have to adopt the scattergun approach, it is probably a lost cause, and rebuilding the machine from a freshly formatted hard drive is almost certainly going to be the best approach. You may encounter a lot of resistance to this last resort, but it typically takes only a few hours, which is much quicker than attempting an intelligent piecemeal repair.

Does the Problem Always Happen After the Same Set of Actions?

If the application always crashes at the same point, it is likely that you can reproduce the error in your development environment. This makes debugging a lot simpler, since you can often just walk through the offending routine with a debugger. Of course, there are some scenarios where this can be tricky, as when the failure occurs only under load or when the failure is in a COM component running under Active Server Pages (ASP).

Whatever your user may report, failures are hardly ever random; there is almost always a pattern. If the failure seems to follow no pattern, it is likely that you are looking at a memory corruption. Memory accesses outside of your process space will normally fail with an access violation, but an invalid access to somewhere in your process space will normally succeed. The exception happens on Windows NT and Windows 2000, when your code tries to write to a code segment, since the code segments are normally read-only. Windows 95 and Windows 98 empower the user to write anywhere in the user's process space, no matter how unfortunate the results of that action.
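To make that distinction concrete, here is a minimal sketch (the addresses and the function DoWork are purely illustrative) of the three outcomes just described:

   void DoWork(void);               // stands in for any function in your application

   char  aBuffer[16];
   char *pWild;

   pWild = aBuffer + 200;           // past the buffer, but still inside your process:
   *pWild = 'X';                    // usually "succeeds" and silently corrupts something

   pWild = (char*)0xC0000000;       // outside the user-mode address space:
   *pWild = 'X';                    // access violation

   pWild = (char*)DoWork;           // a code page; on Windows NT and Windows 2000,
   *pWild = 'X';                    // writing to it is an access violation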

Did the Application Work until Recently?

It often happens that you're told that an application "used to work fine," and the user will assure you that nothing has changed in the meantime. Users generally mean that nothing very much has changed. They may have installed a new application or changed a video card, so it is often worth pressing the user a little on this point. If something has changed, your next action is usually clear.

In one possible case where the application fails without anything having changed, data that the application depends upon has gone bad. If your application uses databases at all, now would be a splendid time to start doing some integrity checks on the database.

Another case where you get sudden failures happens when you are relying on some network resource. Network resources are normally administered by a company's IT department rather than directly by your user. It is not unknown for IT departments to change the name of a server or modify the access rights on a share.

People are often more optimistic about their ability to solve problems than they should be. If an application fails because the network link to the database failed, users often try to fix the problem by changing the ODBC settings; because of this, the application still fails to function when the link is restored. More computer-literate users may even attempt to optimize a system for better performance by marking a COM component as free-threaded rather than apartment-threaded. It is often worth checking to see that your application settings have not mysteriously changed.

What Exactly Do You Mean by "Failure?"

Initial problem reports can be a little vague. Sometimes the report will simply say that "it" doesn't work. Other times, the report may be painstakingly precise but still lack vital information; for example, it may state the address of an access violation but not the name of the executable. Determining the exact nature of the problem depends wholly on getting the best possible report from your user. Is the problem an access violation, an error message, or a hang?

Note   A user will sometimes feel uncomfortable when quizzed by a developer. From the user's point of view, your application has made the user stop working, and the time spent telling you about the problem seems to be delaying you from fixing it. However, talking with the user is the only way you will find the problem and improve your application. Even if the user has made an "unfortunate" decision, you will know to guard against anyone else doing the same thing.

Access Violations

Access violations, invalid page faults, general protection faults, unrecoverable application errors: the name changes, but the nature of the beast is always the same. Your code has tried to access a bit of memory that was not available to it, perhaps memory outside of your process space, or memory that is protected within your process space, such as a code page in a Windows NT or Windows 2000 process. A discussion of some of the common causes of access violations follows.

Levels of information

There are essentially five levels of information that you can get about an access violation:

  • An address and module reference.
  • An event log report. You get an event log report if a configured COM+ component (in Windows 2000) or an MTS component (in Windows NT 4.0) causes an unhandled access violation. The event log entry will typically have an address and a module name, and it may include some stack information, depending on what COM+ or MTS was doing with your component at the time of the violation. It will also have a reference to a line of source code in the operating system; that reference is intended for the Microsoft developers in case the problem turns out to be a bug in COM+ or MTS itself, which is rare, so it probably won't be meaningful to you.
  • A DrWatson log file. A DrWatson log is better than nothing: it shows the assembler for the code that was executing at the time of the failure, the registers, and the stack, which is often the most important information of all.
  • A user dump. A user dump is the most useful postmortem debugging information that you can get. It contains all the code in the process, the data segment, and the heap. This can be immensely helpful, but it also means that the file tends to be huge. A 32-megabyte (MB) user dump is not unusually large.
  • A live reproduction of the problem. This is always the best possible scenario, because it takes all of the guesswork out of debugging. You can watch your code fail right in front of you. The one exception occurs when running your code in the debugger changes the behavior of the application.

Windows will always try to give you as much information about the failure as it can, but unless you compile with symbols, there is nothing much it can tell you except the hex addresses of your procedures. Even that can be misleading, since Windows will give you an address as an offset from the last known symbol. For a COM DLL with no symbols, this is often DllUnregisterServer. This tends to be a short routine—typically less than 255 bytes—so if you see an address along the lines of DllUnregisterServer+0x123456, you can be pretty sure that the problem has nothing to do with DLL registration and you need to supply symbols. Symbols are time stamped, and the timestamp must match the timestamp of the DLL or EXE for them to be used. Accordingly, you should always compile the final release with symbols and always save the symbols with the source. That is not to say that it must be a debug build: adding symbols has no effect on the speed of an optimized build and typically adds less than 32 bytes to the executable size. Having good symbols can cut debugging time dramatically.
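With Microsoft Visual C++ 6.0, for example, a release build can carry full symbols if you generate debug information in both the compiler and the linker. The command line below is only a typical sketch (the file names are placeholders; adapt it to your own build system); /OPT:REF and /OPT:ICF are included because specifying /DEBUG would otherwise turn those optimizations off:

   cl /nologo /O2 /Zi myapp.cpp /link /DEBUG /PDB:myapp.pdb /OPT:REF /OPT:ICF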

WinDbg

If you have to do postmortem debugging, the tool of choice is WinDbg (often called Windbag). This tool is available from the Microsoft Web site as part of the Windows 2000 Customer Support Diagnostics package, which also contains a number of other excellent debugging tools.

There are a number of good KB articles on WinDbg.

WinDbg also works on Windows 95 and Windows 98, but there is no way of getting a process dump on these platforms, so there are fewer advantages to using WinDbg.

Note   A little assembly level knowledge goes a long way when debugging, and the most complete assembly reference, Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, is available for free on the Intel website. The Pentium datasheets list all the instructions in detail.

Working through a buggy application

Among access violations, there are some old favorites that you will see over and over in debugging. For those of us who do a lot of debugging, they are favorites because they are easy to fix. Everyone has made these mistakes at some time or another. I have written a very buggy application to show what they look like at an assembler level.

Bad pointers (simple or calculated) are a common problem. The most likely causes are an uninitialized pointer (often as a part of a structure) or a pointer that was good at one time, but one for which the memory is no longer allocated.

Let's look at some source code:

   int *p;
   int *q;
   int i;

   p = (int*)malloc(100000);
   q = p;                             // q aliases p
   p = (int*)realloc((void*)p, 100);  // the block may move, so q is already suspect
   free(p);                           // now q points to freed memory

   for (i = 0; i < 1000; i += 4)
   {
      q = (int*)(unsigned long)q + i; // walk the stale pointer
      *q = 1234;                      // write through it: access violation
   }

Now we'll look at the assembler:

00401296   call        dword ptr ds:[4022F8h]
0040129C   mov         esi,eax
0040129E   push        64h
004012A0   push        esi
004012A1   call        dword ptr ds:[402304h]
004012A7   push        eax
004012A8   call        dword ptr ds:[402308h]
004012AE   add         esp,10h
004012B1   xor         eax,eax
004012B3   add         esi,eax
004012B5   add         eax,10h
004012B8   cmp         eax,0FA0h
004012BD   mov         dword ptr [esi],4D2h  Fault
004012C3   jl          004012B3

Clearly, this is a wildly inappropriate thing to do with a pointer, and Windows objects to it with an access violation. The line "mov dword ptr [esi], value" is characteristic of a direct memory dereference: *q = 1234 in this case. The ESI register holds the value of the pointer q, and 0x4D2 is 1234 in hex. The value that is being put into memory can be a valuable clue.

Often the error will seem to be in a system routine. Look at this code fragment:

   void* p = (void*)1234;
   void* q;

   RtlMoveMemory(p, q, 200);

The access violation occurs in the C run-time DLL (MSVCRTD.DLL here)—but the fault is of course in the calling code. Look at the top of the stack:

0012f2d0 00401a39 000004d2 cccccccc 000000c8 MSVCRTD!memmove+0x9e
0012f33c 5f4373fc 0012f5dc 00133300 0060dd38 AV!CAboutDlg__OnMemsetBadPointer+0x39 
0012f374 5f437b2b 0012fa2c 000003e9 00000000 MFC42D!Ordinal563+0x133
0012f3cc 5f43374b 000003e9 00000000 00000000 MFC42D!Ordinal3657+0x274
0012f3fc 5f42fa63 000003e9 00000000 00000000 MFC42D!Ordinal3658+0x24

This is output from WinDbg, and it is worth taking a moment to identify what is here. The first number is the frame pointer; that is to say, it is the start of the stack frame where local data and parameters are stored.

The second number is the return address, where control goes after returning from this function. The third, fourth, and fifth numbers are the first three parameters passed to the function—these are the second most useful things to know in a stack trace. Finally, the most important thing to know: the location in the code—where we are and how we got there.

Note   The details of stack frame usage are available on MSDN. See Matt Pietrek's Microsoft Systems Journal column "Under the Hood," in February 1998, and his two-part column in April 1997 and May 1997.

There are a few things to note about location:

  • The first is that the routine in the runtime is memmove, which is perhaps not what you would expect. The Microsoft Visual C++ compiler is very smart when it comes to code generation and will take the easiest route. While writing this article, I called memset to set 20 bytes to a value. Rather than have the overhead of a loop, the compiler did five long movs.

  • The second thing to note is that the parameters are quite clear—0x4d2 is our old friend 1234. However, cccccccc is less familiar. That is the uninitialized pointer q—it has taken on that value because the memory that represents the pointer was uninitialized. We can clearly see that the values being passed into the runtime are invalid in this case, and it is worth looking at other patterns that you might see while debugging:

    Table 1. Potential patterns

    Pattern     Description
    0xFDFDFDFD  "No man's land" guard bytes placed around heap allocations
    0xDDDDDDDD  Freed heap memory
    0xCDCDCDCD  Uninitialized (newly allocated) heap memory
    0xCCCCCCCC  Uninitialized locals (on the stack)

These values are undocumented and subject to change, but any sort of a signpost can be helpful while debugging. Just because you don't see those values doesn't mean that the values in memory are valid, of course. These are just some of the common patterns.

COM calls can give rise to some interesting errors. Consider the following code fragment:

      // make an unsafe copy. Should use AddRef.
      pView2 = pView1;
      pView1->Release();
      // Some other code - let COM garbage collect
      ....
      pView2->AddRef();

COM will unload the server at some indefinite point after the reference count goes to zero. In this case, I have deliberately copied the pointer (which doesn't increment the reference count) and then destroyed the object. This means that pView2 is a pointer to a VTable that may or may not still be in memory and points to an object that may or may not be in memory. That is the sort of thing that can be easily missed in testing. If we look at the assembler for this, it is quite characteristic:

mov         edx,dword ptr [ebp-0Ch]
mov         eax,dword ptr [edx]
mov         esi,esp
mov         ecx,dword ptr [ebp-0Ch]
push        ecx
call        dword ptr [eax+4]  Fault
cmp         esi,esp
call        __chkesp (004010f0)

A VTable is just a table of function pointers, and the faulting instruction is an indirect call through that table. Early and late binding look similar at this level, because late binding is just a call through the Invoke method of an IDispatch-based interface, which is itself a vtable call.
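The fix is simply to take a reference of your own before the original pointer is released. A minimal sketch, using the same hypothetical pView pointers, follows:

      // Safe copy: take our own reference before releasing pView1.
      pView2 = pView1;
      pView2->AddRef();
      pView1->Release();
      // Some other code - the object cannot be unloaded,
      // because pView2 still holds a reference.
      ....
      pView2->Release();   // release the copy when we are finished with it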

The other common error is running off the end of an array. Let's look at another code fragment:

   int aItems[10];
   int iIndex;

   iIndex = 12345678;
   aItems[iIndex] = 1;

This is clearly a bad thing to do. Look at the assembler again:

mov         dword ptr [ebp-30h],0BC614Eh ; set iIndex (0BC614Eh = 12345678)
mov         eax,dword ptr [ebp-30h]      ; ebp is the frame pointer; iIndex is a local
mov         dword ptr [ebp+eax*4-2Ch],1  ; fault

Quite often the calculated address will be outside of the array but within the process space. This won't give an error, but it will silently corrupt some part of your program's data. In these cases, an access violation is really a windfall, since finding memory corruption is otherwise very difficult. Of course, many of the access violations that you see will be caused by memory corruption, rather than being the cause of it.

Real life examples are normally more obscure than these. For example, in one case I was writing a compiler that needed to do a string lookup as part of the lexical analyzer. I looted code from an interpreter that I had written the year before. The lookup routine returned a token number that corresponded with an item in the array for the keyword. However, if the keyword was not found, the routine returned –1 (which I had forgotten). This outcome resulted in accessing the –1th element of an array, which happened to be on the stack. This memory access overwrote the return address of the routine; therefore, it caused a crash in a random memory location when the code returned to the wrong place.
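Stripped to its essentials, that bug looked something like the following sketch (the names and the keyword count are invented for illustration):

   #define NUM_KEYWORDS 32

   int FindKeyword(const char *szWord);   // returns a token index, or -1 if not found

   void Tokenize(const char *szWord)
   {
      int aTokens[NUM_KEYWORDS];           // a local array, so it lives on the stack
      int iToken = FindKeyword(szWord);

      aTokens[iToken] = 1;                 // when iToken is -1 this writes just below
                                           // the array, onto other stack data such as
                                           // the saved return address
   }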

Error Messages

There are essentially three classes of error messages:

  • "You can't do that" messages for the user.
  • A diagnostic message from your application, which has trapped and handled an error.
  • An unhandled error that makes your application fail.

The first type of error should be unproblematic. The second type shouldn't be a problem in a well-written application. Each error message should at least give the error number, the name of the routine, and the name of the application. The application name sounds like an odd choice—you would reasonably expect your users to know what application they are running, but it is not unknown for a user to assume that any error message that pops up is an error in the application that owns the window currently at the front.

Given that we are living in an age of n-tier applications, you have to make intelligent choices about where to log errors. Hopefully, errors will be fairly rare, so the log file, the Windows NT event log, or the database (whatever is being used to log progress) doesn't have to be a fast store, but it does have to be nontransactional. If you are writing server-based components, you could do much worse than logging errors to the event log. This can be read remotely and can be sent in to you via e-mail if need be—and that can be a great help.
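If you do log to the event log from a server-based component, the core of it is only a few calls. The sketch below is a minimal example (the source name "MyServer" is an assumption, and a production component would also register a message file so that the text displays cleanly in Event Viewer); link with advapi32.lib:

   #include <windows.h>

   void LogError(const char *szMessage)
   {
      // NULL server name means the local machine's event log.
      HANDLE hLog = RegisterEventSource(NULL, "MyServer");
      if (hLog != NULL)
      {
         const char *rgsz[1];
         rgsz[0] = szMessage;
         ReportEvent(hLog, EVENTLOG_ERROR_TYPE,
                     0,          // category
                     0,          // event identifier
                     NULL,       // no user SID
                     1, 0,       // one string, no raw data
                     rgsz, NULL);
         DeregisterEventSource(hLog);
      }
   }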

Unhandled errors are the biggest problem. Generally, an application doesn't handle an error if it has no expectation of getting that error (there is an old Unix adage that you should never test for an error condition that you don't know how to handle). This generally results in your application dying rather messily.

There are two main types of errors that tend to go unhandled: return value errors and exceptions. Exception handling is something that you should be doing if you are programming in C++ and something that you should consider in Microsoft Visual Basic® and in Java. Both these languages will handle exceptions for you, but their handling is limited to logging or displaying the raw error and terminating your application. The documentation for these languages covers error handling well. Exception handling (and especially unhandled exception handling!) is well covered in John Robbins' Bugslayer columns from August 1998 and December 1999, which can be found on MSDN. Of course, an access violation is just another exception.

Although exceptions are well documented on MSDN, it is worth discussing them a little at this point. There are two types of exceptions: first chance and second chance exceptions. These exceptions are actually identical, except for one difference: the stage at which they are handled. A first chance exception may mean that something has gone wrong, but in practice it is more likely that an application is simply using the exception handling mechanism to handle a normal case. A handled first chance exception is not generally a problem and can be ignored. Visual Basic often uses Inexact Floating Point exceptions as part of its run time system, and as long as there is a first chance exception handler (and there is in this case), this exception indicates that there is no problem. An unhandled first chance exception is thrown to the next level and becomes a second chance exception, which is handled by a second chance exception handler. If the application is running normally, this is likely to be DrWatson.exe. If the application is being debugged, the debugger will normally act as a second chance exception handler and drop into break mode. The exception to this occurs when you are debugging Visual Basic code under the Visual Basic IDE, since this IDE does not handle low-level errors.
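If you want your application to record something useful before it dies, rather than leaving everything to DrWatson, you can install your own last-chance handler with SetUnhandledExceptionFilter. The sketch below is minimal (where the log goes is up to you; here it simply writes to stderr) and just records the exception code and address; John Robbins' Bugslayer columns mentioned above take this idea much further:

   #include <windows.h>
   #include <stdio.h>

   LONG WINAPI LastChanceHandler(EXCEPTION_POINTERS *pep)
   {
      // Keep this code simple: the process is already in trouble.
      fprintf(stderr, "Unhandled exception 0x%08lX at address 0x%p\n",
              pep->ExceptionRecord->ExceptionCode,
              pep->ExceptionRecord->ExceptionAddress);
      return EXCEPTION_EXECUTE_HANDLER;   // let the process terminate
   }

   int main()
   {
      SetUnhandledExceptionFilter(LastChanceHandler);

      int *p = NULL;
      *p = 1;            // force an access violation to exercise the handler
      return 0;
   }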

Unchecked return values do not cause your application to fail at the actual point of failure. They mainly lead to errors when you are expecting an API to allocate a buffer for some task. If the API fails and you dereference the supposedly allocated memory, you end up back looking at access violations.
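A caricature of the pattern (the allocation size is deliberately absurd so that it fails) looks like this:

   // The allocation fails and returns NULL, but the return value is never checked...
   char *pBuffer = (char*)malloc(0x7FFFFFFF);

   // ...so the failure shows up later, as an access violation at or near address 0.
   pBuffer[0] = 'x';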

Hangs

There are two types of hangs: busy hangs and waiting hangs. With a busy hang, the application is in a tight loop. With a waiting hang, all the threads are waiting on something. The first priority when troubleshooting a hung process is to determine which sort of hang you have. Fortunately, this is easy to do: just look at the processor utilization of the task. If the utilization is very high, you have a busy hang—the process is spinning in a loop. If the utilization is very low, you have a waiting hang, which is normally due to a deadlock.

The situation is somewhat more complex when you have a multithreaded application, because not all of the threads are in the same state. It is normal for some threads in an application to be busy, with other threads sitting idle; that is the normal synchronization mechanism. If all the threads are busy or all the threads are idle, you have a problem.

Very often the hang occurs only when the process is under load, or it cannot be reproduced in-house at all. This makes conventional debugging techniques ineffective, especially since running the application under a debugger often changes the timing of the process, preventing or changing the failure.

The best approach is to attach a debugger to the process after it has hung (if you can reproduce the problem or remote debug) or create a process dump. A process dump is almost as good as a live debug in these cases, because the state doesn't change much over time; that is as good a definition of a hang as any. The only real disadvantage of a process dump is that you lose some handle information.

DotCrash and UserDump are the two major tools for creating process (user mode) dumps, both of which are shipped as part of the Platform SDK and several other SDKs. Both do much the same thing, but UserDump has the advantage that it doesn't terminate the process. This means that you can perform two user dumps over a short period and see if any threads have changed state. Unfortunately, this functionality is only available on Windows NT or Windows 2000. These problems are very difficult to solve on Windows 95 and Windows 98.

Before we start looking at the dumps, there is another tool that is invaluable for troubleshooting these scenarios. Perfmon (again, available on Windows NT and Windows 2000 only) lets you monitor almost every aspect of a process. There are some key things to watch for:

  • Number of threads. If you have a COM server (perhaps under MTS) that is blocking, you will typically see the number of threads grow as new threads are spun up to serve the incoming calls. This is characteristic of a process that is in trouble.
  • Processor utilization. Watching for processor utilization can be done on a per-process and a per-thread basis, so you can watch individual threads lock up.
  • Memory and handles. These are most often useful when checking for leaks, but a surprising number of hangs are leak related. In the typical pattern, the application leaks until a resource is no longer available. This causes an allocation to fail, changing the path of execution, which leads to the hang.
  • Thread wait state. This can give you some insight into what your threads are doing. If most of them are waiting, you probably have a high degree of serialization. If all the threads are cycling between states, the process is highly parallel.

Busy loops generally have the thread inside your code. As long as you have your symbols, it should be immediately apparent where the code is looping. Busy loops can be as simple as:

   for (;;);
   return 0;

A more common scenario is a while loop that never ends:

   int iLoop = 0;

   while (iLoop < 10)
      if (iLoop % 2)   // iLoop starts even, so it is never incremented
         iLoop++;

Deadlocked processes typically have most of the threads in either WaitForSingleObject or WaitForMultipleObjects. You will also often see calls to Sleep or SleepEx; these are common when a thread has performed a wait with a timeout and is sleeping before waiting again. The problem is never in these routines, since they are the core of multithreading and have been subject to excellent testing. The key is to look at the stacks of each of the threads, which can be time consuming. My debugger of choice for these cases is WinDbg, since it is well suited to handling large numbers of threads. The Microsoft Visual C++ debugger is friendlier in that there is a GUI interface for more of the functionality, but this interface generally gets in the way when you need to look at many threads. Under WinDbg, the command "~*kb" lists the stacks of all the threads in the process.

Deadlocks are caused when a thread waits for an event that never occurs. Typically, a thread will be waiting for another thread. If the thread has gone away before signaling, or is itself waiting on an event that will never occur, the first thread will block forever. The classic case of this deadlock is called "deadly embrace."

For example, suppose two threads each require both Mutex A and Mutex B. Thread 1 acquires Mutex A and thread 2 acquires Mutex B. Thread 1 then blocks when it asks for Mutex B, and thread 2 blocks when it asks for Mutex A. Neither thread can release the mutex that it holds because it is blocked. The solution is for both threads to request Mutex A first: the second thread will then block on Mutex A and never seize Mutex B, so when the first thread is finished, it releases both Mutex A and Mutex B, and the second thread can run to completion. A leaked mutex (or critical section, which is a special case of a mutex) will also cause an almost immediate deadlock, because there is no owning thread left to release it.
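A minimal sketch of the deadly embrace described above (the thread functions and mutex names are invented for illustration) looks like the following; acquiring the mutexes in the same order in both threads is the fix:

   #include <windows.h>
   #include <process.h>

   HANDLE g_hMutexA;
   HANDLE g_hMutexB;

   void __cdecl Thread1(void *pArg)
   {
      WaitForSingleObject(g_hMutexA, INFINITE);   // holds A...
      Sleep(100);
      WaitForSingleObject(g_hMutexB, INFINITE);   // ...and now waits for B
      ReleaseMutex(g_hMutexB);
      ReleaseMutex(g_hMutexA);
   }

   void __cdecl Thread2(void *pArg)
   {
      WaitForSingleObject(g_hMutexB, INFINITE);   // holds B...
      Sleep(100);
      WaitForSingleObject(g_hMutexA, INFINITE);   // ...and now waits for A: deadlock.
      // The fix is to acquire the mutexes in the same order in every
      // thread; that is, to wait for A here before waiting for B.
      ReleaseMutex(g_hMutexA);
      ReleaseMutex(g_hMutexB);
   }

   int main()
   {
      g_hMutexA = CreateMutex(NULL, FALSE, NULL);
      g_hMutexB = CreateMutex(NULL, FALSE, NULL);

      _beginthread(Thread1, 0, NULL);
      _beginthread(Thread2, 0, NULL);

      Sleep(INFINITE);   // both worker threads block forever
      return 0;
   }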

You can also get a waiting hang if a thread ends up waiting on itself, which can happen if your application gets its events confused. For example:

   HANDLE hEvent;

   hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

   WaitForSingleObject(hEvent, INFINITE);   // nothing will ever signal hEvent, so this never returns

The most famous case of single-threaded deadlock is a little sneakier than that, however. If you load a DLL from inside DllMain, the new DLL's DllMain must be called, as always happens after a LoadLibrary. But the loader uses a critical section to prevent DllMain from becoming re-entrant, so the call blocks until the current caller of DllMain is done. Since the current caller is on the same thread, you have a deadlock.

Note   You may get the report that the application worked well until it was tried on a multiprocessor machine. In fact, problems that occur on a multiprocessor machine can almost always be duplicated on a single processor system, but they tend to be much less common because the single processor imposes a degree of serialization on the process.

Other Clues

When running under the debugger, you may sometimes get a message saying that a user breakpoint has been hit, and the breakpoint seems to be in a system function. This is normally a sign that something is not as it should be. Look at the following code fragment:

   void * lpMem = malloc(8192);

   free(lpMem);
   free(lpMem);

The second call to free is invalid and generates an INT 3 exception: user breakpoint hit. This exception is not normally fatal outside of the debugger but is a signal that you have a bug in your code. This can be more serious in a multithreaded environment. For example, the following code sometimes hangs:

#include <stdlib.h>
#include <process.h>

void threadfunc(void * pmem)
{
   // Do something with pmem
   
   //...

   // Best free the memory now we've finished with it.

   free(pmem);
}

int main(int argc, char* argv[])
{
   void * pmem = malloc(8192);

   // Start a worker thread that will free pmem when it has finished with it:

   _beginthread( threadfunc, 0, (void *) pmem );

   // Clean up. Bug: the worker thread may also free pmem, so the same
   // block can be freed twice, or used by the thread after it is freed.
   free(pmem);

   return 0;
}

On the subject of prevention, Asserts can save you many late nights and have no impact on the size or performance of your release builds. Ideally, each Assert should test only a single condition, since this makes it clear which condition failed. There really isn't any point in trying to save space or time by not using Asserts, since only debug builds ever contain the Assert code.
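For example, using the C run-time assert macro (the MFC ASSERT and CRT _ASSERTE macros behave in much the same way), each condition gets its own line, and the code disappears entirely when NDEBUG is defined, as it is for a typical release build. The function and parameter names below are invented for illustration:

   #include <assert.h>

   void ProcessItems(const char *szName, int cItems)
   {
      assert(szName != NULL);   // one condition per Assert, so a failure
      assert(cItems > 0);       // tells you exactly which test failed
      // ...
   }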

Traps to Watch Out For

Debugging can be a rewarding task, in much the same way that solving a jigsaw puzzle can be fun. However, there are a few cases where it seems that someone has stolen some of the pieces or reshaped them in new and interesting ways. Here are some of the problems that may challenge you.

Bad Symbols or No Symbols

A problem with symbols makes debugging difficult. Fortunately, WinDbg will use the exports of any DLLs that it finds to try to give you some idea of where you are in each DLL. This can be misleading, however, so it is a good idea to look at the offset given from the start of the routine. If the offset is very small, the address probably is part of that routine; it is suspect if it is larger than 200 bytes or so. A larger offset generally indicates that you might be anywhere other than the stated function.

Bad symbols can be worse than no symbols, since you get misleading information rather than no information. WinDbg lets you work with mismatched symbols, but MSVC does not. You can take a chance on them and may find them tolerably accurate, but working with mismatched symbols isn't something to do if there is any chance of finding good symbols.

Nonsense Code at the Point of Failure

The appearance of nonsense code is probably due to a jump into an area of memory that doesn't hold code. If you see instructions that don't disassemble, what the process is trying to run is not code. This normally occurs after memory corruption: possibly the code segment has been overwritten (more likely on Windows 95 and Windows 98), possibly a pointer to a method of a class has been overwritten, or, if it is a COM class, the class is no longer in memory. It is most likely to be evidence of a corrupted stack—the return address is stored in the stack frame, and if it gets corrupted, the processor returns to the wrong place. This causes your code to crash a long way from the actual error, which is normally very difficult to debug after the event. There is no way to make this sound like a good thing.

Corrupted Stack

If the addresses on the stack do not match any procedures or are outside of the process space, you have a corrupted stack. You may get lucky and have part of the stack in one piece, but it is often impossible to make much progress without a good stack. The function that has overwritten the stack may be in the stack, but it is just as likely that the function has executed and gone.

Partial Stacks in a Process Dump

A stack can be up to 1 MB in size, and there may be many threads in a process. This means that not all of the stack is present in a process dump. If you lose the head of the stack, you can find it very difficult to work out how you came to a particular point. Your best chance under such circumstances is to get multiple process dumps and hope that you can catch the call at an earlier stage and marry the stacks together. This process is prone to error but often can be useful.

Corrupted User Dump

A corrupted user dump is of no use, sadly. User dumps created with DotCrash or UserDump are almost always sound, but a dump created by DrWatson may not be valid. This really isn't DrWatson's fault, since the process was clearly in trouble prior to the generation of the process dump, and anything that you do to an unstable process may fail.

Tools

There are many other tools that help debugging on a day-to-day basis.

For debugging ISAPI problems, including in-process COM components called from ASP, the IIS Exception Monitor is without parallel. It is documented in Aaron Barth's article Troubleshooting with Exception Monitor, which includes links for downloading the tool.

For spying on a process, FileMon and RegMon are staple tools. They report all file accesses and registry accesses without any instrumentation in your code. They can be downloaded from the excellent Sysinternals Web site, https://www.sysinternals.com/.

MuTek's BugTrapper is a relatively unobtrusive tool that allows you to discover what your application really is doing. It can also be used to get diagnostic traces off a remote machine, which can be especially helpful.

NuMega SoftIce is a very good debugger, especially for device driver issues and other low-level work. BoundsChecker, from the same company, is a "must have" tool for finding memory leaks.