Concurrent Affairs

Performance-Conscious Thread Synchronization

Jeffrey Richter

Code download available at: ConcurrentAffairs0510.exe (117 KB)
Browse the Code Online

Contents

Thread Synchronization Mind-Set
Performance
Spin-Wait Lock

In my career, I have architected and implemented many thread synchronization techniques. This has provided me with a lot of experience that has shaped the way I now think about thread synchronization problems. In this new column about concurrency, I will discuss my way of thinking on the subject and offer some code that I think you'll find useful when dealing with thread synchronization problems yourself.

To really grasp the concepts in this column, you must have some understanding of thread synchronization already. In particular, you should have knowledge and experience with existing Windows® thread synchronization primitives such as the interlocked functions, mutexes, semaphores, events, and critical sections. It is also helpful to understand the .NET Framework Monitor and ReaderWriterLock classes of the System.Threading namespace.

Thread Synchronization Mind-Set

Thread synchronization is required when two or more threads might access a shared resource at the same time. A resource can be as simple as a block of memory or a single object, or it can be much more complex, like a collection object that contains thousands of objects inside it, each of which may contain other objects as well. Sometimes it is also necessary to treat multiple objects as a single resource for the purposes of thread synchronization. For example, your application might need to add or remove items to or from a queue and update a count of items in the queue in a thread-safe manner. In this example, the queue and the count data must always be manipulated together in an atomic fashion. For purposes of this text, I would consider the queue object and the count object as a single resource.

No programmer in his right mind wants to think about thread synchronization while coding his app, because synchronization has little to do with accomplishing the program's true purpose. In addition, writing thread synchronization code is difficult, and getting it wrong can leave resources in inconsistent states, causing unpredictable behavior. Furthermore, adding thread synchronization makes your code run slower, hurting performance and reducing scalability. But, unfortunately, thread synchronization is a necessary part of life.

Performance

Thread synchronization code makes your application run slower. In the case where there is only one thread attempting to access a resource, this additional code is pure overhead. So how do you protect performance?

First, don't allow threads to transition from user mode to kernel mode; on Windows, this is a very expensive operation. Second, don't let a thread enter a wait state. A thread that is waiting isn't running, and a thread that isn't running is exhibiting the worst performance possible. Fortunately, a thread can only ever enter a wait state by calling into the kernel, so if you follow the first rule (avoid calling into the kernel), the second rule follows automatically. Third, use as few threads as possible to accomplish your task. In an ideal world, there would never be more threads in existence than there are CPUs in your computer; on a machine with four CPUs, there really shouldn't be a need for more than four threads. Why? Because when more runnable threads exist than CPUs, the operating system schedules the threads to the CPUs in a round-robin fashion every few milliseconds, and this context switching is more pure overhead. If the number of runnable threads is equal to or less than the number of CPUs, the operating system just runs the threads, no context switching occurs, and performance is at its best.

Now, we don't live in an ideal world, so we usually see the number of threads in existence far outnumber the number of CPUs in the machine. In fact, on my single-CPU machine I currently have 425 threads in existence—a far cry from the ideal. Windows gives each process its own thread in order to isolate processes from one another, so if one process enters an infinite loop, other processes can still execute. I currently have 45 processes running on my machine, which accounts for 45 threads out of the 425. Why the remaining 380 threads? Many processes also use multiple threads for isolation. For example, the common language runtime (CLR) has a finalizer thread that wants to run in a predictable manner regardless of what some other thread happens to do. There is no way for me to tell how many of the 380 threads are running on my machine for this reason but I'm sure it's quite a few.

The last reason why so many threads exist on my machine is because many (or dare I say most) programmers don't know how to use threads efficiently. Over the years, programmers have been taught how to program in a way that is actually detrimental to building high-performance, scalable applications. And operating systems like Windows have provided APIs, samples, and documentation that continue to perpetuate this. In fact, I have also been guilty of perpetuating this way of programming and I will continue doing it in this column, mostly because it is too late now to change. Changing would require too much re-architecting. However, in the third installment of this column, I will explain how to architect an application for the highest degree of performance and scalability, and this architecture will reduce the number of threads that your application thinks it needs.

If you find yourself asking any of the following questions, then you know that you have most likely architected your application incorrectly and you should re-architect it:

  • What is the maximum number of threads Windows supports?
  • What is the maximum number of threads that I can create in a single process?
  • What is the maximum number of threads that I can have in the thread pool?
  • Can I increase the maximum number of threads that I can have in the thread pool?
  • Should I have a 1-to-1 correspondence between users and threads? For example, I know a few server applications that use one thread per connected user or client.
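To illustrate the alternative to a thread-per-item (or thread-per-client) design, here is a minimal sketch of my own (the class and method names are hypothetical, not part of this column's code download) that queues work items to the CLR thread pool, which sizes itself to the machine, instead of creating a dedicated thread for each item:

```csharp
using System;
using System.Threading;

public static class ThreadPoolSketch
{
    // Process 'items' work items on the thread pool instead of
    // creating one thread per item; returns the number processed
    public static Int32 ProcessAll(Int32 items)
    {
        Int32 processed = 0;
        Int32 pending = items;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            for (Int32 i = 0; i < items; i++)
            {
                ThreadPool.QueueUserWorkItem(delegate(Object state)
                {
                    // Do the real per-item work here...
                    Interlocked.Increment(ref processed);

                    // Signal when the last item finishes
                    if (Interlocked.Decrement(ref pending) == 0) done.Set();
                }, i);
            }
            done.WaitOne(); // Block until every queued item has run
        }
        return processed;
    }
}
```

The pool reuses a small number of threads across all of the items, so the thread count stays close to the number of CPUs regardless of how many items (or clients) there are.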

Let's focus more on the performance of user-mode to kernel-mode transitions. I wrote some code (see Figure 1) that runs a series of tests that measure the performance of various operations.

Figure 1 Performance Test

using System; 
using System.Threading;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public sealed class Program 
{
    public static void Main() 
    {
        const Int32 count = 200 * 1000 * 1000; // 200,000,000 iterations
        Stopwatch sw;

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; n++) M();
        Console.WriteLine(sw.Elapsed + "\tM");

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; n++) Thread.SpinWait(1);
        Console.WriteLine(sw.Elapsed + "\tSpinWait");

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; /*n++*/) n++;
        Console.WriteLine(sw.Elapsed + "\t++");

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; /*n++*/) Interlocked.Increment(ref n);
        Console.WriteLine(sw.Elapsed + "\tIncrement");

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; n++) SwitchToThread();
        Console.WriteLine(sw.Elapsed + "\tSwitchToThread");

        sw = Stopwatch.StartNew();
        for (int n = 0; n < count; n++) Thread.Sleep(0);
        Console.WriteLine(sw.Elapsed + "\tSleep");

        using(WaitHandle h = new ManualResetEvent(false))
        {
            sw = Stopwatch.StartNew();
            for (int n = 0; n < count; n++) h.WaitOne(0, false);
            Console.WriteLine(sw.Elapsed + "\tWaitOne");
        }

        Console.WriteLine("Hit Enter");
        Console.ReadLine();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void M() { }

    [DllImport("Kernel32", ExactSpelling = true)]
    private static extern Boolean SwitchToThread();
}

Running these tests on my personal machine yielded the results shown in Figure 2. Notice that the operations that cause a user-mode-to-kernel-mode transition are very expensive. The time to call Sleep is more than 100 times longer than the time to call an empty method. And the time to call WaitOne (to test a kernel object and always immediately return since timeout is 0) is about 330 times longer than the time to call an empty method!

Figure 2 Test Results

Test                           Time for 200,000,000     Transitions to
                               Operations (seconds)     Kernel Mode
Empty method call                               1.0     No
Thread.SpinWait(1)                              2.4     No
Interlocked.Increment(Int32)                    2.8     No
SwitchToThread()                               73.0     Yes
Thread.Sleep(0)                               115.2     Yes
WaitHandle.WaitOne(0, false)                  328.2     Yes

Now, the first two operations don't actually attempt anything related to thread synchronization. The third operation, Interlocked.Increment, does, and because the Interlocked family of methods executes entirely in user mode, you can see that the Interlocked methods hold the key to performing really fast thread synchronization. Unfortunately, though, they are very limited in their capabilities.

The Interlocked methods are usually used to perform atomic, thread-safe operations on Int32 values. There are also Interlocked methods that manipulate Int64, Single, Double, and IntPtr values, as well as references to reference type objects, but these methods are rarely used by anyone and I will not spend time on them here. In fact, the use of Interlocked methods on 64-bit values (Int64 and Double) while running on a 32-bit operating system is strongly discouraged because the CLR offers no way to guarantee that the 64-bit value is aligned properly in memory (and the Interlocked methods require this on some CPUs). In fact, even when running on 64-bit operating systems it is best to avoid the Int64 and Double versions of the Interlocked methods entirely. However, IntPtr and 64-bit object references are guaranteed to be aligned properly and, therefore, you can call the Interlocked methods that work with the Object type successfully.
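As a quick illustration of the reference-based overloads (this snippet is mine, not from the code download), here is a common lazy-initialization pattern that uses CompareExchange on an Object reference; whichever thread publishes its candidate first wins, and every thread sees the same instance:

```csharp
using System;
using System.Threading;

public static class LazyInit
{
    private static Object s_resource; // Shared reference, starts out null

    public static Object GetResource()
    {
        if (s_resource == null)
        {
            // Publish a candidate only if no other thread beat us to it;
            // CompareExchange stores the candidate only when s_resource
            // is still null, and it does so atomically
            Object candidate = new Object();
            Interlocked.CompareExchange(ref s_resource, candidate, null);
        }
        return s_resource; // Guaranteed non-null after the call above
    }
}
```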

Here are the prototypes of the Interlocked methods that operate on Int32 values:

public static class Interlocked 
{
    public static Int32 Increment(ref Int32 location);
    public static Int32 Decrement(ref Int32 location);
    public static Int32 Add(ref Int32 location1, Int32 value);
    public static Int32 Exchange(ref Int32 location1, Int32 value);
    public static Int32 CompareExchange(ref Int32 location1, 
        Int32 value, Int32 comparand);
    ...
}

It is important to note that Exchange and CompareExchange return the original value that was in the location before the call, while Increment, Decrement, and Add return the new (post-operation) value rather than the original value.
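To make the return-value distinction concrete, consider this small example of my own (not from the code download):

```csharp
using System;
using System.Threading;

public static class ReturnValueDemo
{
    public static void Main()
    {
        Int32 x = 5;

        Int32 r1 = Interlocked.Increment(ref x);
        Console.WriteLine(r1); // 6 -- the NEW value (x is now 6)

        Int32 r2 = Interlocked.Exchange(ref x, 10);
        Console.WriteLine(r2); // 6 -- the ORIGINAL value (x is now 10)

        Int32 r3 = Interlocked.CompareExchange(ref x, 20, 10);
        Console.WriteLine(r3); // 10 -- the ORIGINAL value (x is now 20)
    }
}
```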

While the Interlocked methods are great when you want performance, they are limited in functionality. For example, it might be nice to have a way to perform a bit-wise AND, OR, or XOR of two Int32 values in an atomic way. It might also be nice to have methods that could do an atomic bit-test-and-set or bit-test-and-reset, for example. While there are no Interlocked methods for any of these operations, it is possible to build them yourself so that they work atomically. The code in Figure 3 is an example of an Interlocked AND method.

Hyper-Threaded CPUs

Hyper-threading is a technology on Xeon, Pentium 4, and later CPUs. These processor chips have multiple "logical" CPUs, each of which can run a thread. Each thread has its own architectural state (set of registers) but all threads share main execution resources such as the CPU cache. When one thread is paused, the CPU automatically executes another thread. A pause can be a cache miss, branch misprediction, waiting for results of a previous instruction, and so on. Intel reports that hyper-threaded CPUs improve throughput somewhere between 10 and 30 percent. For more information about hyper-threaded CPUs, see Yaniv Pessach's article on the subject in the June 2005 issue of MSDN Magazine, available at Make It Snappy: Juice Up Your App with the Power of Hyper-Threading.

When executing spin loops on hyper-threaded CPUs, you need to force the current thread to pause so that the other thread has access to the chip's resources. The x86 architecture supports the PAUSE assembly language instruction. The PAUSE instruction ensures that a memory order violation is avoided, improving performance. In addition, the instruction reduces power consumption by placing a hiccup into what would be a very hot, tight loop. On x86, the PAUSE instruction is equivalent to a REP NOP instruction, which makes it compatible on earlier IA-32 CPUs that do not support hyper-threading. PAUSE causes a finite delay (0 on some CPUs). In the Win32 API, the x86 PAUSE instruction is executed by calling the YieldProcessor macro defined in WinNT.h. This macro exists so that you can write unmanaged code in a CPU architecture-independent way. Use of the macro expands the code inline, thus avoiding the overhead of a function call. In the .NET Framework, the Thread class's SpinWait method internally calls the Win32 YieldProcessor macro. In fact, SpinWait is effectively implemented like this on x86 platforms (you should never pass 0 to SpinWait, always pass 1 or greater):

void Thread.SpinWait(Int32 iterations) 
{
   for (Int32 i = 0; i < iterations; i++) YieldProcessor();
}

Figure 3 Interlocked AND Method

public static class InterlockedEx 
{
    public static Int32 And(ref Int32 target, Int32 with) 
    {
        Int32 startVal, currentVal = target;

        // Don't access target in the loop except in an attempt 
        // to change it because another thread may be touching it 
        do 
        {
            // Record this iteration's starting value
            startVal = currentVal;

            // Calculate the desired value (this example ANDs the bits)
            Int32 desiredVal = startVal & with;

            // NOTE: The thread could be pre-empted here

            // if (target == startVal) target = desiredVal;
            // Value prior to potential change is returned
            currentVal = Interlocked.CompareExchange(
                ref target, desiredVal, startVal);

            // If the starting value changed during this iteration, repeat 
        } 
        while (startVal != currentVal);

        // Return this iteration's starting value
        return currentVal;
    }
}

To atomically mask an Int32 value down to its low-order byte, you'd call the method like this:

Int32 x;    // Some Int32 value that is accessed by multiple threads
...
InterlockedEx.And(ref x, 0x000000FF); // This call is in some method

This AND method saves the original value of the target (x) in startVal, calculates the new desired value (startVal & with), and saves the result in desiredVal. Then it calls CompareExchange: if the target (x) is still equal to startVal, x is set to desiredVal. CompareExchange returns the value that was in x at the time of the call. If x still holds the value that was read at the top of the loop, the loop exits and that original value is returned to the caller. This is the most common case: x is not expected to change between the top of the loop and the call to CompareExchange, so the loop body usually executes just once and the method returns immediately.

In fact, the only time the loop will execute more than once is if another thread just happens to modify x after the first thread has saved x's value and before it calls CompareExchange. If this contention happens, then the first thread loops around and just attempts to do the operation again. In most applications, it is very unlikely that this contention will actually occur and, therefore, it is very unlikely that the loop will ever actually loop. If contention does occur, then it is very unlikely that there will be contention the second or subsequent times through the loop.

Techniques like this can be used to write additional interlocked operations. An interlocked OR method can be implemented simply by changing the line of code in Figure 3 that performs the & operation to instead perform an | operation. You can also use this technique to build more powerful methods, such as an interlocked test-and-set and an interlocked max and min.
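For example, here is how an interlocked Max might look. This is a sketch of my own that follows the Figure 3 pattern (the InterlockedEx2 class name is hypothetical); it atomically raises the target to at least the given value:

```csharp
using System;
using System.Threading;

public static class InterlockedEx2
{
    // Atomically set target to the larger of target and value,
    // using the same CompareExchange retry loop as Figure 3
    public static Int32 Max(ref Int32 target, Int32 value)
    {
        Int32 startVal, currentVal = target;
        do
        {
            // Record this iteration's starting value
            startVal = currentVal;

            // Calculate the desired value
            Int32 desiredVal = Math.Max(startVal, value);

            // Publish it only if target hasn't changed in the meantime
            currentVal = Interlocked.CompareExchange(
                ref target, desiredVal, startVal);

            // If the starting value changed during this iteration, repeat
        }
        while (startVal != currentVal);

        // Return the value target held before the successful exchange
        return currentVal;
    }
}
```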

Spin-Wait Lock

It is possible to create a very simple and fast thread synchronization lock class using Interlocked's Exchange method. This type of lock is called a Spin Lock because the thread that is attempting to lock the resource logically associated with the spin lock spins in a loop until it is granted access. Figure 4 shows the code for my SpinLock class; Figure 5 shows some code that demonstrates how to define a class that uses a SpinWaitLock object in order to allow only one thread at a time access to a resource.

Figure 5 SpinWaitLock Object

public class UseSomeResource {
    private SpinWaitLock m_swl = new SpinWaitLock();

    public void AccessResource() {
        // Acquire the lock before entering the try block so the finally
        // block can never release a lock this thread failed to acquire
        m_swl.Enter();
        try 
        {
            // Access the resource...
        }
        finally {
            m_swl.Exit();
        }
    }
}

Figure 4 Spin Lock Implementation

// NOTE: This is a value type so it works very efficiently when used as
// a field in a class. Avoid boxing this or you will lose thread safety!
public struct SpinWaitLock
{
    private const Int32 c_lsFree = 0;
    private const Int32 c_lsOwned = 1;
    private Int32 m_LockState; // Defaults to 0=c_lsFree

    public void Enter() 
    {
        Thread.BeginCriticalRegion();
        while (true) 
        {
            // If resource available, set it to in-use and return
            if (Interlocked.Exchange(
                ref m_LockState, c_lsOwned) == c_lsFree) 
            {
                return;
            }

            // Efficiently spin, until the resource looks like it might 
            // be free. NOTE: Just reading here (as compared to repeatedly 
            // calling Exchange) improves performance because writing 
            // forces all CPUs to update this value
            while (Thread.VolatileRead(ref m_LockState) == c_lsOwned) 
            {
               StallThread();
            }
        }
    }

    public void Exit() 
    { 
        // Mark the resource as available
        Interlocked.Exchange(ref m_LockState, c_lsFree);
        Thread.EndCriticalRegion();
    }

    private static readonly Boolean IsSingleCpuMachine =
        (Environment.ProcessorCount == 1);

    private static void StallThread() 
    {
        // On a single-CPU system, spinning does no good
        if (IsSingleCpuMachine) SwitchToThread();
      
        // Multi-CPU system might be hyper-threaded, let other thread run
        else Thread.SpinWait(1);
    }

    [DllImport("kernel32", ExactSpelling=true)]
    private static extern void SwitchToThread();
}

Let's say that a thread creates an instance of UseSomeResource and then calls the AccessResource method. AccessResource then attempts to enter the SpinWaitLock field, m_swl. SpinWaitLock's Enter method calls Exchange to atomically set the m_LockState field to 1 (c_lsOwned). If no thread was using this lock, then m_LockState was 0 (c_lsFree) and Exchange returns 0; the while loop doesn't execute and Enter returns immediately. Wow, this is fast!

However, if the lock was in use, m_LockState was 1, Exchange returns 1, and the loop executes. In the loop, StallThread checks how many CPUs are in the machine; if there is more than one CPU, the loop cycles and checks again to see whether m_LockState has returned to 0. Hopefully, the thread that currently "owns" the resource is running on another CPU, will complete its task, and will call Exit very soon. When a thread calls Exit, Exchange is called to set m_LockState back to 0, allowing another thread to "own" the resource.

If the number of CPUs is 1, StallThread calls SwitchToThread before looping around again. The reason is that on a single-CPU machine, the thread that "owns" the resource is most likely not running at this moment, so it cannot give up the resource by calling Exit; continued spinning would simply waste CPU time. The call to SwitchToThread causes the operating system to context-switch to another runnable thread, hopefully the thread that owns the resource, which can then complete its task and call Exit. Note that the Win32® SwitchToThread API is not wrapped by a managed equivalent, so I must interop out to the unmanaged API. SwitchToThread is ideal because it allows another thread to run for one time slice, even a lower-priority thread. For more information, see SwitchToThread in the Win32 SDK documentation.

Let me point out some other facts about the SpinWaitLock class. Spin locks are extremely fast. When there is no contention, Enter simply changes a value from 0 to 1 in a thread-safe way and returns; Exit sets the 1 back to a 0 and returns. When there is contention, StallThread is called.

The purpose of StallThread is to increase efficiency when contention on the SpinWaitLock object is detected. On a multiprocessor machine, the thread calls Thread.SpinWait, which keeps the thread in user mode; it never transitions to kernel mode and never enters a wait state. Thread's SpinWait method was added to support hyper-threaded CPUs. If your code is running on a machine with hyper-threaded CPUs, this method pauses the current thread so that the other logical CPU's thread can use the chip's execution resources (for more information about hyper-threaded CPUs, see the sidebar "Hyper-Threaded CPUs"). When there is contention on a single-processor machine, I do force the thread to transition to kernel mode (by calling SwitchToThread), for if I didn't, CPU time would be wasted as the thread spun without any hope of finding the lock released.

It's impossible to know for sure if spinning in user mode would yield better performance than transitioning to kernel mode, context-switching to another thread, and returning back to user mode. It depends on how much time the other thread needed to complete its task and this information cannot be determined. Furthermore, instrumenting your code in order to attempt to determine it would negatively impact performance, which is what you're trying to improve. There is just no way to guarantee a win with this. Also, there are many factors that impact this: CPU make and model, CPU speed (GHz), hyper-threading, dual-core, other threads running on the system, and so forth.

The SpinWaitLock class makes use of two static methods on the Thread class new to the Microsoft® .NET Framework 2.0: BeginCriticalRegion and EndCriticalRegion. These methods are important for any custom locking mechanism, especially when used in custom CLR hosts. For an in-depth look at these APIs, see Stephen Toub's article on reliability in the .NET Framework in this issue of MSDN® Magazine.

The SpinWaitLock class should only be used to guard tasks that are known to be very short in duration. If a wait could be long or of unknown length, threads that want to own the resource should enter an efficient kernel-mode wait state so that the CPU can run another thread (possibly the thread that currently owns the lock). And, if a thread is going to wait a long time, then the performance of the user-mode-to-kernel-mode transition doesn't matter anyway.
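If you need to handle both cases, a lock can spin briefly in user mode and then fall back to a kernel object. The following is just a sketch of that idea, not code from this column's download; the class name, the spin count of 4000, and the AutoResetEvent-based fallback are all my own choices:

```csharp
using System;
using System.Threading;

public sealed class SpinThenWaitLock : IDisposable
{
    private Int32 m_lockState; // 0 = free, 1 = owned
    private readonly AutoResetEvent m_waiterEvent = new AutoResetEvent(false);

    public void Enter()
    {
        // First, spin briefly in user mode hoping the owner exits quickly
        for (Int32 i = 0; i < 4000; i++)
        {
            if (Interlocked.Exchange(ref m_lockState, 1) == 0) return;
            Thread.SpinWait(1);
        }

        // Spin budget exhausted; fall back to an efficient kernel-mode
        // wait so the CPU can run other threads (perhaps the owner)
        while (Interlocked.Exchange(ref m_lockState, 1) != 0)
        {
            m_waiterEvent.WaitOne();
        }
    }

    public void Exit()
    {
        Interlocked.Exchange(ref m_lockState, 0); // Mark the lock free
        m_waiterEvent.Set(); // Wake one waiter, if any
    }

    public void Dispose() { m_waiterEvent.Close(); }
}
```

Because the auto-reset event stays signaled until a waiter consumes it, a Set that races with a thread about to call WaitOne is not lost; the waiter simply wakes immediately and retries the Exchange.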

SpinWaitLocks are unfair. If multiple threads are calling Enter at the same time, the threads are not guaranteed to be serviced in a first-in-first-out (FIFO) order. In fact, if the lock is extremely hot (meaning it is accessed frequently), it is possible (but unlikely) that some threads may not get serviced at all. I will talk more about fairness in an upcoming column.

Send your questions and comments for Jeffrey to  mmsync@microsoft.com.

Jeffrey Richter is a cofounder of Wintellect, a training and consulting firm. He is the author of several books, including Applied Microsoft .NET Framework Programming (Microsoft Press, 2002). Jeffrey is also a contributing editor to MSDN Magazine and has been consulting with Microsoft since 1990.