.NET Matters

Deserialization Progress, and More

Stephen Toub

Code download available at: NET Matters 2006_12.exe (161 KB)

Q In my Windows® Forms application, I use a BinaryFormatter to save a large amount of application state, sometimes serializing to disk very large objects. At a later point, this state file needs to be loaded and deserialized by the application, and this deserialization process can take a measurable amount of time. Currently for both serialization and deserialization, I show a marquee progress bar, but that only provides an indication to the user that something is happening. For deserialization, I'd really like to show a standard progress bar that indicates how much of the deserialization has completed and how much still remains. BinaryFormatter doesn't provide any such feature, and I'm at a bit of a loss for how to extend the class to provide the behavior I need. Do you have any suggestions?

A BinaryFormatter is a difficult beast to extend, as it exposes a very simple interface. It's sealed with no virtual members, so you can't derive from it and override a key member as you might to add functionality to another class. And it doesn't expose any events you can hook to listen for progress notifications and the like. Instead, you'll need to get a bit creative.

Getting quickly to the punch line, BinaryFormatter consumes the stream to be deserialized as needed to parse it; this means that if we can figure out how much of the stream the BinaryFormatter has read, we can create a good approximation of how much of the deserialization process has been completed.

Figure 1 shows a first attempt at creating a utility method that allows you to track the progress of deserialization. The generic method accepts the stream to be deserialized and a callback delegate that will be invoked by the method to alert the caller of progress updates. Internally, it creates a BinaryFormatter and deserializes the stream, but before doing so, it first creates a Timer set to fire every second. Thus, every second the Timer invokes a delegate on a background thread; that delegate examines the current position and length of the stream and uses that information to supply the callback with a number representing the percentage of deserialization completed. Unfortunately, as you can probably glean from the figure title, this method is rife with bugs. Most importantly, instance methods on Stream are not guaranteed to be thread-safe, so we shouldn't be accessing the Position and Length properties from a secondary thread as we're doing here. Additionally, there are several race conditions that will cause exceptions to be thrown, such as if deserialization finishes and the stream is closed just as the Timer's callback is being raised (you can't access the properties of a disposed FileStream). In short, this is a bad solution.

Figure 1 Buggy Deserialization Progress Tracking

public static T Deserialize<T>(
    Stream stream, ProgressChangedEventHandler callback)
{
    if (stream == null) throw new ArgumentNullException("stream");
    if (!stream.CanSeek)
        throw new ArgumentException("Stream must be seekable.", "stream");

    Timer timer = null;
    if (callback != null)
    {
        timer = new Timer(delegate
        {
           callback(null, new ProgressChangedEventArgs(
               (int)(stream.Position * 100.0 / stream.Length), null));
        }, 
        null, 0, 1000);
    }

    BinaryFormatter formatter = new BinaryFormatter();
    T graph = (T)formatter.Deserialize(stream);

    if (timer != null) timer.Dispose();

    return graph;
}

A much better solution involves the creation of a custom decorator stream. I discussed the decorator design pattern and its application to streams in my September 2005 column, but in short, a decorator is an object that has the same interface as another object it contains (in object-oriented terms, it is an object that has both an "is-a" and a "has-a" relationship with a specific type). So, for example, System.IO.Compression.GZipStream is a decorator, as it both derives from Stream and contains a Stream. A user constructs a GZipStream by supplying to its constructor a Stream that should be compressed or decompressed, and calls made to GZipStream's Stream implementation delegate to the underlying Stream's implementation, doing additional work before and after.
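
To make the decorator relationship concrete, here is a minimal sketch of my own (not from the column's download) showing GZipStream acting as a decorator: it is a Stream, it contains a Stream, and every call to it does extra work (compression or decompression) before delegating to the contained MemoryStream:

```csharp
using System;
using System.IO;
using System.IO.Compression;

class DecoratorDemo
{
    static void Main()
    {
        byte[] original = new byte[1024]; // all zeros, so highly compressible

        // GZipStream "is-a" Stream and "has-a" Stream: writes are compressed
        // and then forwarded to the contained MemoryStream.
        byte[] compressed;
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, CompressionMode.Compress))
            {
                gz.Write(original, 0, original.Length);
            }
            // MemoryStream.ToArray is documented to work even after Close.
            compressed = ms.ToArray();
        }

        // Decompression decorates in the other direction: reads from the
        // GZipStream pull compressed bytes from the contained stream.
        byte[] roundTripped = new byte[original.Length];
        using (MemoryStream ms = new MemoryStream(compressed))
        using (GZipStream gz = new GZipStream(ms, CompressionMode.Decompress))
        {
            int read, offset = 0;
            while (offset < roundTripped.Length &&
                   (read = gz.Read(roundTripped, offset,
                                   roundTripped.Length - offset)) > 0)
            {
                offset += read;
            }
        }

        Console.WriteLine(compressed.Length < original.Length); // True
    }
}
```

The ReadProgressStream we're about to build follows exactly this shape, with progress reporting in place of compression as the "extra work."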

Instead of supplying to BinaryFormatter the actual Stream to be deserialized, we can supply to it a surrogate decorator Stream that contains the Stream to be deserialized. This container stream can implement its Read method to retrieve the data from the contained Stream, but also to raise an event that contains information about the current position and length of the stream.

Whenever I set out to implement a new decorator stream, I typically start with a helper base class, like the one shown in Figure 2. This ContainerStream class provides a constructor that accepts and stores the Stream to be contained, a protected property exposing to derived types the underlying Stream, and then implements all of the methods and properties on Stream to delegate to the contained Stream. This is useful since most of the methods and properties on Stream are abstract, and this helper lets you avoid implementing every method in your own derived type if all you're interested in doing is implementing one or two specific methods.

Figure 2 ContainerStream

public abstract class ContainerStream : Stream
{
    private Stream _stream;

    protected ContainerStream(Stream stream)
    {
        if (stream == null) throw new ArgumentNullException("stream");
        _stream = stream;
    }

    protected Stream ContainedStream { get { return _stream; } }

    public override bool CanRead { get { return _stream.CanRead; } }

    public override bool CanSeek { get { return _stream.CanSeek; } }

    public override bool CanWrite { get { return _stream.CanWrite; } }

    public override void Flush() { _stream.Flush(); }

    public override long Length { get { return _stream.Length; } }

    public override long Position
    {
       get { return _stream.Position; }
       set { _stream.Position = value; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return _stream.Read(buffer, offset, count);
    }

    ...
}

My ReadProgressStream implementation is shown in Figure 3. It derives from ContainerStream and overrides the Read method. The Read override calls to the base class's Read method (which in turn calls the contained Stream's Read), and then if anyone has registered with ReadProgressStream's ProgressChanged event, it raises the event with the current percentage of progress made reading through the stream (to improve performance, it only raises the event if the integral percentage has changed in value since the last call to Read).

Figure 3 ReadProgressStream

public class ReadProgressStream : ContainerStream
{
    private int _lastProgress = 0;

    public ReadProgressStream(Stream stream) : base(stream) 
    {
        if (!stream.CanRead || !stream.CanSeek || stream.Length <= 0)
            throw new ArgumentException(
                "Stream must be readable, seekable, and non-empty.", "stream");
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int amountRead = base.Read(buffer, offset, count);
        if (ProgressChanged != null)
        {
            int newProgress = (int)(Position * 100.0 / Length);
            if (newProgress > _lastProgress)
            {
                _lastProgress = newProgress;
                ProgressChanged(this, 
                    new ProgressChangedEventArgs(_lastProgress, null));
            }
        }
        return amountRead;
    }

    public event ProgressChangedEventHandler ProgressChanged;
}

With this class in place, tracking the progress of deserialization is straightforward, and can be done simply by passing a ReadProgressStream rather than the actual stream to be deserialized to the BinaryFormatter:

public static T Deserialize<T>(
  Stream stream, ProgressChangedEventHandler callback)
{
  using (ReadProgressStream rps = new ReadProgressStream(stream))
  {
    rps.ProgressChanged += callback;
    BinaryFormatter formatter = new BinaryFormatter();
    return (T)formatter.Deserialize(rps);
  }
}

This works, accomplishing the goal of being able to generally track the progress of deserialization. Unfortunately, as the Deserialize<T> method is currently written, it has a major performance problem. The BinaryFormatter makes many, many requests to the Stream's Read method, frequently for only a few bytes at a time. While we haven't added much code to the Read method, it's enough to create a significant bottleneck.

To improve performance, we need a way to throttle the number of times the ReadProgressStream's Read method is actually called. To solve this, it's important to realize that this problem isn't new. (Very few problems actually are, and life gets a lot easier if one accepts that most problems have already been solved one way or another.)

For example, FileStream is typically used to read and write data from disk, and disk is relatively slow. If every request to a FileStream required disk access, applications that relied on reading and writing files with FileStream would slow to a crawl. Instead, FileStream buffers data from the disk, reading more than it needs to on small read requests such that future requests for that data may be satisfied out of the buffer cache rather than requiring additional accesses to the disk.
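
To see just how dramatic the coalescing effect is, here's a quick sketch of my own (hypothetical, not from the column) using a stream that counts calls to its Read method. Issuing many tiny reads through a BufferedStream means the underlying stream sees only one Read per buffer fill:

```csharp
using System;
using System.IO;

// Counts how many times Read is actually invoked on this stream.
class CountingStream : MemoryStream
{
    public int ReadCalls;
    public CountingStream(byte[] data) : base(data) { }
    public override int Read(byte[] buffer, int offset, int count)
    {
        ReadCalls++;
        return base.Read(buffer, offset, count);
    }
}

class BufferingDemo
{
    static void Main()
    {
        byte[] data = new byte[64 * 1024];
        byte[] tiny = new byte[4];

        // Thousands of 4-byte reads made directly against the stream.
        CountingStream direct = new CountingStream(data);
        while (direct.Read(tiny, 0, tiny.Length) > 0) { }
        Console.WriteLine("Direct Read calls: " + direct.ReadCalls);

        // The same 4-byte reads through a BufferedStream: the underlying
        // stream is only read once per 4KB buffer fill.
        CountingStream buffered = new CountingStream(data);
        using (BufferedStream bs = new BufferedStream(buffered))
        {
            while (bs.Read(tiny, 0, tiny.Length) > 0) { }
        }
        Console.WriteLine("Buffered underlying Read calls: " + buffered.ReadCalls);
    }
}
```

With a 64KB stream, the direct approach issues over sixteen thousand Read calls, while the buffered approach issues only a handful against the underlying stream.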

That same solution of buffering data can be used to fix our performance problem. We want to limit the number of times ReadProgressStream's Read method is actually invoked, and one way to do that is by forcing the caller to make several large requests rather than many small requests. And doing that requires adding just one more line of code:

public static T Deserialize<T>(
  Stream stream, ProgressChangedEventHandler callback)
{
  using (ReadProgressStream rps = new ReadProgressStream(stream))
  using (BufferedStream bs = new BufferedStream(rps))
  {
    rps.ProgressChanged += callback;
    BinaryFormatter formatter = new BinaryFormatter();
    return (T)formatter.Deserialize(bs);
  }
}

Now instead of the BinaryFormatter directly accessing the ReadProgressStream, it goes through a System.IO.BufferedStream. This dramatically improves the performance of the solution to the point where it's actually usable. However, it introduces a new minor issue, albeit one we can overcome with a few more lines of code. The constructor for BufferedStream used in the previous example defaults the buffer size to 4KB. This means that if we're dealing with a relatively small stream, we're not going to make very many calls to the underlying ReadProgressStream's Read method, and we won't get many progress notifications. Given that it's a small stream, and thus one that should deserialize very quickly, this might not be an issue for you.

If, however, it is a problem and you need more frequent updates, you can change the size of the buffer using a second constructor on BufferedStream that accepts the buffer size to use in addition to the stream to wrap. An easy way to determine the buffer size is to take the length of the stream and divide it by one hundred (since ProgressChangedEventArgs uses an integral value for the percentage completed, typical values fall between 0 and 100, inclusive), but for extremely large streams, you might end up with huge buffer sizes. To counteract that, I've chosen to go with whichever is smaller: 1/100th of the stream's length or 4096 bytes. My solution that includes that calculation is shown in Figure 4. You should of course test this formula in your own scenarios, and make changes if it's an inappropriate choice for your system.

Figure 4 Deserialization with ProgressChanged Callbacks

public static T Deserialize<T>(
    Stream stream, ProgressChangedEventHandler callback)
{
    if (stream == null) throw new ArgumentNullException("stream");

    using (ReadProgressStream cs = new ReadProgressStream(stream))
    {
        cs.ProgressChanged += callback;

        const int defaultBufferSize = 4096;
        int onePercentSize = (int)Math.Ceiling(stream.Length / 100.0);

        using (BufferedStream bs = new BufferedStream(cs, 
            onePercentSize > defaultBufferSize ? 
            defaultBufferSize : onePercentSize))
        {
            BinaryFormatter formatter = new BinaryFormatter();
            return (T)formatter.Deserialize(bs);
        }
    }
}

Q I'm using the SortedList<TKey, TValue> collection from the System.Collections.Generic namespace, and the performance is not what I'd expect. What sort of algorithms are used under the covers for this collection?

A Like List<T>, SortedList<TKey, TValue> wraps an underlying array. (SortedList actually wraps two arrays, one for keys and one for values.) When you add an item to a SortedList, it uses a binary search to find the correct place to insert the item into the array, ensures that the array is large enough to hold the existing items plus the new one, shifts the existing items to make room, and stores the new item in the correct location. Depending on where you're inserting into the list, that shift operation can be expensive. Take the following two loops as an example. The first inserts the numbers from 1 to 10,000 into the list, whereas the second inserts the same numbers in the opposite order, starting with 10,000 and going down to 1:

for (int i = 1; i <= 10000; ++i) list.Add(i, i); // loop #1

for (int i = 10000; i >= 1; --i) list.Add(i, i); // loop #2

On my machine, the first loop is more than 30 times faster than the second, and that gap will continue to grow as the number of elements increases. What's happening here is that in the first loop I'm always adding to the end of the array, which means no elements need to be shifted on an insertion. In the second loop, I'm always adding to the beginning, which means all elements need to be shifted on every insertion. This degrades the algorithm to O(n²), the worst-case complexity for an insertion sort of this nature.
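
You can reproduce the asymmetry yourself with a quick Stopwatch-based sketch along these lines (my own illustration; absolute numbers will vary by machine):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class SortedListInsertDemo
{
    static long TimeInserts(IEnumerable<int> keys)
    {
        SortedList<int, int> list = new SortedList<int, int>();
        Stopwatch sw = Stopwatch.StartNew();
        foreach (int i in keys) list.Add(i, i);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static IEnumerable<int> Ascending(int n)
    {
        // Always appends at the end of the array: no shifting required.
        for (int i = 1; i <= n; ++i) yield return i;
    }

    static IEnumerable<int> Descending(int n)
    {
        // Always inserts at the front: every existing element is shifted.
        for (int i = n; i >= 1; --i) yield return i;
    }

    static void Main()
    {
        const int N = 50000;
        Console.WriteLine("Ascending:  {0} ms", TimeInserts(Ascending(N)));
        Console.WriteLine("Descending: {0} ms", TimeInserts(Descending(N)));
    }
}
```

The ascending run stays fast as N grows, while the descending run's time grows quadratically, for exactly the reasons described above.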

The moral of the story is that you need to pick your data structures carefully based on the scenarios in which you plan to use them. Moreover, once you've chosen a data structure, make sure you're using it in a manner that provides optimal performance. As I've just shown, simply changing the order in which you perform certain operations can have a significant impact.

Q I need to convert a standard file path to an 8.3 file path. Does any support for this exist in the .NET Framework?

A Sure, but only in the sense that through P/Invoke you have access to the entire suite of Win32® APIs. This conversion can be accomplished easily using the GetShortPathName function exposed from kernel32.dll, as shown in Figure 5. It accepts three parameters: the string containing the path to be converted, an output memory buffer to contain the converted string, and an integer containing the size of that buffer in characters. If GetShortPathName is called with a buffer that's too small to receive the converted string (or, more specifically, if the value in cchBuffer is too small), GetShortPathName will return the size of the buffer required to store the string.

Figure 5 GetShortPathName

[DllImport("kernel32.dll", CharSet=CharSet.Auto, SetLastError=true)]
public static extern uint GetShortPathName(
    string lpszLongPath, StringBuilder lpszShortPath, uint cchBuffer);
 
private static string GetShortPathName(FileInfo file)
{
    StringBuilder buffer = new StringBuilder();
    string path = file.FullName;
    uint length = GetShortPathName(path, buffer, 0);
    if (length > 0)
    {
        buffer.Capacity = (int)length;
        GetShortPathName(path, buffer, length);
        return buffer.ToString();
    }
    return null;
}

You can take advantage of this by making two calls to GetShortPathName; the first passes in a buffer size of 0 in order to force GetShortPathName to return the needed buffer size. A second call to GetShortPathName is then made with an appropriately sized buffer. When calling through P/Invoke Win32 functions that expect a pointer to a string buffer to contain the output value, you typically use a StringBuilder, preallocated through its Capacity property with a buffer big enough to hold the output string. After the P/Invoke call, the result can be retrieved easily using StringBuilder's ToString method.

Send your questions and comments for Stephen to netqa@microsoft.com.

Stephen Toub is the Technical Editor for MSDN Magazine.