Implementing XmlReader Classes for Non-XML Data Structures and Formats

 

Ralf Westphal
Microsoft MSDN Regional Director, Germany

December 2001

Summary: Developers can create a notation for any data structure using only elements and attributes, and can access any data structure with an XmlReader or XmlDocument—as long as they can "translate" it to XML. (22 printed pages)

Note This article assumes you are familiar with XML, C# and the Microsoft .NET XmlReader class.

Contents

Introduction
Mapping Data Structures to XML
Transforming Arbitrary Data Sources to XML
Conclusion
Appendix: A Custom XmlReader for the File System

Introduction

The Microsoft® .NET Framework XmlReader class provides a very easy pull model for traversing XML data. From a more general point of view, though, XML is just one possible representation for hierarchical information. Thus, an XmlReader can be viewed as a class encapsulating just some arbitrary data structure. It "maps" this data structure/data format to the "XML world" with its nodes, elements and attributes. Now, if handling hierarchical information with an XmlReader is so easy, why not use XmlReader classes to access non-XML data? Why not, for example, transform portions of the file system to HTML using XSLT? Why not query a CSV file with XPath? With custom XmlReader classes this is easy to do.

When you are using an XmlTextReader or an XmlDocument, do you really care if the data "behind" it actually is XML? Do you care if a file containing text complying with the XML well-formedness rules gets loaded and parsed? I'd say no, you don't care much. You care mainly about the object model you are using to traverse or manipulate the data. The textual notation of XML is relevant to you only as a persistence format for storing or exchanging data.

So if you don't care much about the data format when working with, for example, an XmlReader, what's so good about it? It's the way it lets you access hierarchical data, that is, data that conforms to the XML Infoset. To put it simply, we've come to like thinking in terms of nested elements (the text bracketed by starting/ending tags in XML) that can have attributes (the name/value pairs in an XML start tag) when describing/accessing data. We can create a notation for any data structure using only elements and attributes. That means we could access any (!) data structure with an XmlReader or XmlDocument, as long as we can "translate" it to XML, so to speak.

Mapping Data Structures to XML

Look at this XML excerpt for a moment:

<ini>
   <section name="My ISP">
      <key name="Encoding" value="1"/>
      <key name="Type" value="1"/>
      <key name="DialParamsUID" value="546395"/>
      <key name="Device" value="AVM NDIS WAN CAPI Driver"/>
      …
   </section>
   <section name="My VPN">
      <key name="Encoding" value="1"/>
      <key name="Type" value="2"/>
      <key name="DialParamsUID" value="26787788"/>
      <key name="Guid" value="013CECC19484F7428730EFD8E53088B3"/>
      …
   </section>
   …
</ini>

Look familiar? You probably recognized the data as stemming from a rasphone.pbk file. Here's the relevant part of the original file:

[My ISP]
Encoding=1
Type=1
AutoLogon=0
DialParamsUID=546395
Device=AVM NDIS WAN CAPI Driver
…
[My VPN]
Encoding=1
Type=2
AutoLogon=0
DialParamsUID=26787788
Guid=013CECC19484F7428730EFD8E53088B3
…

The XML shown above is just another representation of the dial-up networking information. If Microsoft® Windows® stored the data as XML, we could access it very easily in our .NET applications using an XmlTextReader. But rasphone.pbk is not an XML file. Currently, the only way to get at the information is through INI file API functions like GetPrivateProfileString(), because the .NET Framework Base Class Library does not contain classes for working with INI files.

However, what if we could make any INI file look like an XML file? Or, could we make the file system's hierarchy of directories and files look like a huge XML file, or the registry, or a CSV file? We would then be able to access all those different data structures/formats and information stores with just one API: the XmlReader interface. The advantages of this strategy are as obvious as those of ODBC:

  • We could use XPath to query all data sources.
  • We could transform data from all sources using XSLT.
  • We would not need to familiarize ourselves with many different APIs for all the data sources.
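
To make the first point concrete, here is a minimal sketch of an XPath query against data in the <ini>/<section>/<key> notation shown above. It runs on an XML string rather than on a custom reader, and the names IniXPathDemo and Lookup are hypothetical helpers introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class IniXPathDemo
{
    // Sample data in the <ini>/<section>/<key> notation used above.
    public const string Xml =
        "<ini>" +
        "<section name='My ISP'><key name='Encoding' value='1'/><key name='Type' value='1'/></section>" +
        "<section name='My VPN'><key name='Encoding' value='1'/><key name='Type' value='2'/></section>" +
        "</ini>";

    // Returns the value of one key in one section, or null if it is absent.
    public static string Lookup(string xml, string section, string key)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // The names are inserted verbatim; this sketch assumes they
        // contain no quote characters.
        string query = string.Format(
            "/ini/section[@name='{0}']/key[@name='{1}']/@value", section, key);

        XPathNavigator hit = nav.SelectSingleNode(query);
        return hit == null ? null : hit.Value;
    }
}
```

Calling IniXPathDemo.Lookup(IniXPathDemo.Xml, "My VPN", "Type") should select the value attribute of the matching <key> element. The same query would work unchanged if the XPathDocument were filled from any XmlReader that presents an INI file in this notation.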

Dynamic Mapping

What needs to be done to reach this vision? The easiest way is to write a transformation tool as shown in Figure 1. The tool translates a data structure to an XML file or stream, which is then read using an XmlTextReader.

Figure 1. Transforming an INI file to an intermediate XML representation in order to read it using an XmlTextReader

However, this solution looks kind of clumsy. Is there really a need to generate an intermediate XML representation? The answer is "no," as shown in Figure 2.

Figure 2. Reading an INI file using a custom XmlReader without the need for an intermediate XML representation

The application interacts with a class (XmlINIReader) derived from XmlReader, which is specifically designed for reading INI files instead of XML files. This custom XmlReader class makes an INI file look like the XML file in Figure 1. Here is an example of how we could use it to traverse rasphone.pbk and dump its contents:

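XmlReader ini = new XmlINIReader("rasphone.pbk");
while(ini.Read())
{
   // only start tags carry attributes; skip end tags such as </section>
   if(ini.NodeType != XmlNodeType.Element)
      continue;

   switch(ini.Name)
   {
      case "section":
         Console.WriteLine("[{0}]", ini["name"]);
         break;
      case "key":
         Console.WriteLine("{0}={1}", ini["name"], ini["value"]);
         break;
      default:
         break;
   }
}
ini.Close();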

Our little application just sees an XmlReader-derived class, passes it a file name, and then reads all the elements from it one after another. If it encounters a <section> or <key> element, it dumps it to the console. The application does not have a clue that the XmlINIReader is not parsing an XML file. Since XmlINIReader and XmlTextReader expose the same interface (both are derived from XmlReader), the INI file looks like an XML file.

If we can build a class derived from XmlReader, we can dynamically present any data structure or source to the outside as a stream of XML elements and attributes, without the need to actually transform the source to an intermediate XML format.

Implementing Custom XmlReader Classes

Before we can start building classes like XmlINIReader we should take a minute to see how we can generalize the approach.

We are starting with a given data structure/source XYZ that we want to look at through an XmlReader. For that we need to derive an XmlXYZReader from XmlReader. At the same time, we need to be clear about how the data source should be made to appear as XML to the outside; we need an XML Schema that any application can expect the XmlXYZReader to adhere to. The stream of elements and attributes that we pull from the data source with our custom XmlReader needs to follow this schema, like the intermediate XML representation in Figure 1 follows the XML Schema below.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
      elementFormDefault="qualified">
   <xsd:element name="ini">
      <xsd:complexType>
         <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:element name="section">
               <xsd:complexType>
                  <xsd:sequence minOccurs="0" maxOccurs="unbounded">
                     <xsd:element name="key">
                        <xsd:complexType>
                           <xsd:attribute name="name" type="xsd:string" 
                                 use="required"/>
                           <xsd:attribute name="value" type="xsd:string" 
                                 use="required"/>
                        </xsd:complexType>
                     </xsd:element>
                  </xsd:sequence>
                  <xsd:attribute name="name" type="xsd:string" 
                        use="required"/>
               </xsd:complexType>
            </xsd:element>
         </xsd:sequence>
         <xsd:attribute name="filename" type="xsd:string" use="required"/>
      </xsd:complexType>
   </xsd:element>
</xsd:schema>

Figure 3 summarizes the different parts of a custom XmlReader. The XML Schema acts only as documentation and as a guideline for the XmlXYZReader and its users; it does not need to exist as a file and is not used by the XmlXYZReader. Also, no user of the XmlXYZReader class would need the XML Schema to validate XmlXYZReader output because the XML returned from the XmlXYZReader is valid by definition. A custom XmlReader does not read XML (which might be invalid) but instead produces it, so to speak, and thus it is always correct.

Figure 3. Schematic view of a custom XmlReader

Transforming Arbitrary Data Sources to XML

Before we actually plunge into deriving custom XmlReader classes, we should take a moment to ponder the transformation process. The task at hand is to read arbitrary data and map it to a read-only, forward-only stream of nested XML elements with attributes.

If the transformation process could advance through the data source at its own pace, reading and processing source constructs until all data has been transformed, then the mapping process would be very straightforward. The following code sample shows a parser for INI files that outputs XML compliant with the previous XML Schema.

string line;
StreamReader sr = new StreamReader("rasphone.pbk");
line = sr.ReadLine();
// the schema requires a filename attribute on the root element
Console.WriteLine("<ini filename='rasphone.pbk'>");
do
{
   // skip all non-section-heading lines
   while(line != null && (line == "" || line[0] != '[')) 
      line = sr.ReadLine();

   if(line != null)
   {
      // the section name is the text between '[' and ']'
      Console.WriteLine("\t<section name='{0}'>", 
            line.Substring(1).Split(new char[] {']'})[0]);
      line = sr.ReadLine();

      if(line != null)
      {
         do
         {
            // skip all comments and blank lines; find key=value or 
            // section heading
            while(line != null && (line == "" || line[0] == ';' || line[0] 
                     != '[' && line.IndexOf('=')<0))
               line = sr.ReadLine();

            if(line != null && line[0] != '[')
            {
               // split at the first '=' only, so values may contain '='
               string[] keyValue = line.Split(new char[] {'='}, 2);
               // note: attribute values are not XML-escaped in this sample
               Console.WriteLine("\t\t<key name='{0}' value='{1}'/>", 
                     keyValue[0], keyValue[1]);
               line = sr.ReadLine();
            }
            else
               break;
         } while (true); 
      }
      Console.WriteLine("\t</section>");
   }
}
while(line != null);
Console.WriteLine("</ini>");
sr.Close();

Unfortunately, the transformation process behind a custom XmlReader cannot step through a data source at its own leisure and spit out XML elements/attributes. Instead, it must advance in a non-continuous way, in small bursts, if you will. Each time the custom XmlReader's Read() method is called, the transformation process reads only as much data from the source as it needs to determine the next XML element, including all its attributes. This is especially important for large data sources that cannot be loaded into memory in their entirety (for example, the file system or the registry). (Custom XmlReaders also appear as XML pull model parsers.)

Then, the next time Read() is called, the process advances its "read head" just enough to "parse" the data source's content and determine the next element. To be able to do that, though, the transformer needs to remember some state between Read() calls; it must remember where its current position is as well as what kind of element it produced last time. Only this state information lets the process know "what to do next."

The transformation process acts almost like a parser in a compiler. A language parser is driven by a syntax definition that tells it what symbols to expect next. The symbols are assembled from the source code's character stream by the lexical analyzer.

The transformation parser, on the other hand, is driven by the XML Schema, which is also a kind of syntax definition. Or you could say that the transformation parser is a manifestation of an XML Schema. Given a particular element, the schema tells the transformer which elements could come next and thereby how to interpret the data source.

Plus, there is another "complication": to express the nesting of elements correctly, Read() returns every non-empty element twice. First it returns the starting tag (for example, <section name="My ISP">) including all attributes, and later it returns the ending tag (for example, </section>). Since an arbitrary number of nested elements could be read in between, the transformation process needs to remember any element for which it has not yet returned an ending tag.
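
The node sequence a custom reader has to imitate can be observed with the standard XmlReader on real XML. The following is a minimal sketch; NodeTraceDemo and Trace are hypothetical names used only here, and XmlReader.Create() is the factory method of current .NET versions rather than part of the Beta 2 library this article was written against.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeTraceDemo
{
    // Lists the node type and name reported after each Read() call.
    public static List<string> Trace(string xml)
    {
        List<string> nodes = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                string kind = reader.NodeType.ToString();
                // Empty elements such as <key .../> never get an EndElement node.
                if (reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement)
                    kind += " (empty)";
                nodes.Add(kind + ": " + reader.Name);
            }
        }
        return nodes;
    }
}
```

For the input <ini><section name='s1'><key name='k' value='v'/></section></ini>, the trace contains five entries: start tags for ini and section, one empty key element, and then EndElement nodes for section and ini in that order.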

Using a State Automaton in the Transformation Process

Looking at the facts, we might be wondering how to make the transformation process work in little steps, keep state, and remember to close all elements opened. The solution lies in the similarity to parsers in compilers. They use implicit or explicit state automatons to keep track of where they are in the parsing process, and their "memory for unfinished business" is an implicit or explicit stack. Implicit state/stack information is kept on the parser's processor stack if it is implemented according to the recursive descent model. But state/stack information can also be stored in explicit data structures like a state table and a stack.

So, how can we apply this to our transformation process? First, let's try to depict the XML Schema as a state graph, as shown in Figure 4 below.

Figure 4. State graph for the INI file-to-XML transformation process.

The state graph of the automaton contains a node for each element of the XML Schema representing the data source to be transformed. There are even separate states for the starting and ending tags of each element.

The arrows between states stand for calls to Read() on the XmlXYZReader default interface. Each Read() moves the automaton from a source state to a destination state. The destination state determines which element gets "output." To be more precise, it determines the values of certain properties (such as Name, Value, and NodeType) of the XmlXYZReader until the next Read() or some other method call like MoveToAttribute().

To illustrate how this works, let's step through the state graph and translate some data source to XML. Here's a simple INI file, and we're going to assume there is a custom XmlReader called XmlINIReader:

[section1]
key1=value1
[section2]
[section3]
key2=value2
key3=value3

Table 1 lists the state transitions through which the transformation process passes. Each row represents a call to Read().

Table 1. State transitions during INI-file processing (the numbers in [] refer to edges between states in Figure 4)

| Source State | Condition/Transition | Output | Dest. State |
|---|---|---|---|
| * | - [1] | <ini> | ini |
| ini | there is a section in the file: [2] | <section name="section1"> | section |
| section | there is a key in the section: [3] | <key name="key1" value="value1"/> | key/ |
| key/ | there is another key in the section: [4] | - | - |
| | there are no more keys in the section: | </section> | /section |
| /section | there is another section in the file: [5] | <section name="section2"> | section |
| section | there is a key in the section: [3] | - | - |
| | there is no key in the section: | </section> | /section |
| /section | there is another section in the file: [5] | <section name="section3"> | section |
| section | there is a key in the section: [3] | <key name="key2" value="value2"/> | key/ |
| key/ | there is another key in the section: [4] | <key name="key3" value="value3"/> | key/ |
| key/ | there is another key in the section: [4] | - | - |
| | there are no more keys in the section: | </section> | /section |
| /section | there is another section in the file: [5] | - | - |
| | there are no more sections in the file: | </ini> | /ini |
| /ini | - | - | /* |

The processor starts in an initial state labeled "*". There is no decision to be made and the first element "returned" by Read() is the starting tag <ini>.

On the next Read(), the processor knows only the current (source) state. Since the XML Schema shown earlier specifies that an <ini> element may contain zero or more <section> elements, the processor checks whether it can find a section in the file. For that, of course, it needs to move its "read head" within the INI source file. In our case, it finds "[section1]" and sets the overall state of the XmlINIReader to <section name="section1">. Its Name property would now return "section", its name attribute would hold "section1", and its NodeType property would be set to XmlNodeType.Element.

The next call to Read() finds the transformer in source state "section" (according to the last element output), so it needs to check if there are any key/value pairs in the current section. Yes, there is at least one, so it sets the overall state of the XmlINIReader to <key name="key1" value="value1"/>. <key> elements have empty content (i.e. no text nodes and no child elements), which is why the destination state is labeled "key/". There won't be an ending tag later on.

When Read() is called again, the transformation process must check if there are any more key/value pairs in the current section. If so, it would continue as before. If not, as is the case with the above example, it does nothing. The nesting of elements at this point looks like this:

<ini>
   <section name="section1">
      <key name="key1" value="value1"/>

Since there are no more key/value pairs, the <section> element needs to be closed, so an EndElement node is generated. By not issuing another element, the state automaton falls back to the previous element (<section>) and sets the XmlINIReader state to "</section>".

For this to work, the transformer must keep track of the elements opened. This is best done using an internal explicit stack. At each point in time it contains the current branch of the XML elements produced. The stack's maximum depth is thus limited to the maximum depth of the resulting XML. For an INI file, this is three elements: <ini>, <section>, <key>.

Each time the processor "outputs" a new element, it is pushed onto the stack. And each time it does not "output" a new element, the topmost one is popped off the stack.

From the walk-through so far, the remaining state transitions should be fairly easy to understand.

Here's a summary of the most important points:

  • No XML is actually output (but I'll use this term anyway because it is more "visual"); instead the XML in the Output column is assigned to the XmlINIReader's "inner state" which can be accessed from the outside through properties like Name, NodeType, and Value, with which you are familiar.

  • The decision of what kind of "Output" to produce is based on the current state (source state) and an examination of the data source at the "read head's" position.

  • Ending tags for elements are generated implicitly: an element that just got pushed onto the stack/just got output puts the XmlReader into a starting tag state. We can access its attributes, ask if it is empty, and so on. Its NodeType is set to XmlNodeType.Element.

    Then, when further processing piles more elements on top of this one and later pops them off, the element's state changes once it resurfaces on the stack top. If the processor "returns" to the element, the XmlReader is put into the ending tag state. Its NodeType is set to XmlNodeType.EndElement, and the element is removed from the stack the next time Read() gets called.

    This behavior ensures that nested elements get opened and closed properly without any additional information besides the stack of elements forming the current XML element branch.

    One word about empty elements (such as <key/>): They don't cause an ending tag to be output. Instead, the processor removes them immediately before it pushes another element onto the stack. Empty elements without a closing tag thus are automatically closed.
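
The points above can be sketched in a few dozen lines. This is not the CustomXmlReader class that accompanies this article: Elem, IniAutomaton, Step() and Run() are hypothetical names, the sketch returns tag strings instead of implementing XmlReader, it does not escape attribute values, and it assumes a C# compiler with generics.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical description of one element: name, emptiness, attribute text.
public sealed class Elem
{
    public readonly string Name;
    public readonly bool IsEmpty;
    public readonly string Attrs;
    public Elem(string name, bool isEmpty, string attrs)
    { Name = name; IsEmpty = isEmpty; Attrs = attrs; }
}

public sealed class IniAutomaton
{
    private readonly TextReader _src;
    private readonly Stack<Elem> _open = new Stack<Elem>(); // current element branch
    private string _line;         // the "read head": next unprocessed line
    private string _state = "*";  // state reached by the previous step

    public IniAutomaton(TextReader src) { _src = src; }

    private static bool IsSection(string l) { return l.Length > 0 && l[0] == '['; }

    // One call corresponds to one Read(): returns the tag "output", or null at the end.
    public string Step()
    {
        // An element closed by the previous step, or an empty element, leaves the stack now.
        if (_open.Count > 0 && (_state[0] == '/' || _open.Peek().IsEmpty))
            _open.Pop();

        Elem next = NextElement(_state);
        if (next != null)
        {
            _open.Push(next);
            _state = next.IsEmpty ? next.Name + "/" : next.Name;
            return "<" + next.Name + next.Attrs + (next.IsEmpty ? "/>" : ">");
        }
        if (_open.Count == 0) return null;   // nothing left to close
        _state = "/" + _open.Peek().Name;    // fall back to the enclosing element
        return "</" + _open.Peek().Name + ">";
    }

    // Decides the next element from the current state and the read head.
    private Elem NextElement(string state)
    {
        switch (state)
        {
            case "*":
                _line = _src.ReadLine();
                return new Elem("ini", false, "");
            case "ini":
            case "/section":
                while (_line != null && !IsSection(_line)) _line = _src.ReadLine();
                if (_line == null) return null;
                string section = _line.Substring(1).Split(']')[0];
                _line = _src.ReadLine();
                return new Elem("section", false, " name=\"" + section + "\"");
            case "section":
            case "key/":
                // Skip blank lines, comments and anything without '='.
                while (_line != null && !IsSection(_line)
                       && (_line.Length == 0 || _line[0] == ';' || _line.IndexOf('=') < 0))
                    _line = _src.ReadLine();
                if (_line == null || IsSection(_line)) return null;
                string[] kv = _line.Split(new char[] { '=' }, 2);
                _line = _src.ReadLine();
                return new Elem("key", true,
                    " name=\"" + kv[0] + "\" value=\"" + kv[1] + "\"");
            default:
                return null;
        }
    }

    // Convenience helper: runs the automaton to the end and collects all tags.
    public static List<string> Run(string ini)
    {
        IniAutomaton a = new IniAutomaton(new StringReader(ini));
        List<string> tags = new List<string>();
        for (string tag = a.Step(); tag != null; tag = a.Step()) tags.Add(tag);
        return tags;
    }
}
```

Running IniAutomaton.Run() on the six-line sample INI file above yields eleven tags in the order of the Output column of Table 1. Note that the only data-source-specific code is NextElement(); the stack handling in Step() knows nothing about INI files.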

Implementing a Generic Custom XmlReader

After having stressed your imagination quite a bit by talking about state automatons and stepping through a state transition trace, let's see how we can actually implement a custom XmlReader based on what we've seen.

Seemingly the simplest thing to do would be to derive a custom XmlReader straight from XmlReader:

class XmlXYZReader : XmlReader
{
   …
}

Next, implement at least all abstract methods plus the transformation process for the particular data structure according to some state graph derived from an XYZ XML Schema.

Well, you can do that—and in fact I started out like this when I began coding for this article—but it's a huge effort, at least, if you plan to develop more than one custom XmlReader in your life. The effort lies in repeatedly implementing all those abstract methods that XmlXYZReader inherits from XmlReader.

Fortunately, members such as the MoveToAttribute() method or the NodeType property don't need to be implemented differently for a custom XmlReader for INI files versus a custom XmlReader for the file system. (This is at least true for the majority of custom XmlReaders you are going to develop. In any case you are always free to override and enhance any of the inherited members.)

Instead of deriving a custom XmlReader from XmlReader, we should develop a class CustomXmlReader which implements all the functionality needed again and again, including the general handling of transformation state information and the element stack. This would lead to:

class CustomXmlReader : XmlReader
{
   …
}
class XmlXYZReader : CustomXmlReader
{
   …
}

The benefit is obvious: the hard work of implementing the abstract XmlReader methods needs to be done only once. When writing a custom XmlReader, we can focus on whatever is necessary regarding the specific data source.

But how would we be able to introduce behavior specific to a data source if our custom XmlReader inherits all functionality from CustomXmlReader? The answer lies in concentrating the handling of state transitions into just one virtual method, GetNextElement(), as shown in Figure 5.

Figure 5. Concentrating the handling of state transitions into just one virtual method

Almost all functionality specific to transforming a data source to XML concerns the transition from source to destination state. Only this code must be implemented in a custom XmlReader; it gets called from the generic CustomXmlReader class via the virtual method GetNextElement().

CustomXmlReader keeps track of the current state and the element stack. Whenever its implementation of Read() gets called, it asks GetNextElement() to decide which element to output. GetNextElement(), however, is an abstract method and (almost) the only method we need to implement on any custom XmlReader class like XmlINIReader.

A Custom XmlReader for INI Files

Now that we've reached code-level explanations referring to methods and classes, let's have a look at an actual implementation of a custom XmlReader. Accompanying this article are a couple of implementations, e.g. for CSV-files, the file system (see Appendix A), text files, and INI files. The XmlINIReader for INI files is very straightforward and works right on the text file level with a stream as its "read head," so I think it makes a good sample—we've seen the XML Schema for INI files and know the state graph. The complete code is shown below. Please note again, we don't need to implement all the abstract XmlReader methods because XmlINIReader is derived from CustomXmlReader.

using System.IO;
using System.Xml;

public class XmlINIReader : CustomXmlReader
{
   private string m_filename;
   private StreamReader m_ini;   // stream over the INI file
   private string m_line;        // "read head": next unprocessed line
   private string m_baseURI;


   public XmlINIReader(string filename)
   {
      m_ini = new StreamReader(filename);

      m_filename = filename;
      m_baseURI = "file://" + m_filename;
   }


   public override void Close()
   {
      if(System.Xml.ReadState.Closed != m_readState)
      {
         m_ini.Close();
         m_readState = System.Xml.ReadState.Closed;
      }
   }


   protected override CustomElement GetNextElement(string currentState)
   {
      CustomElement nextElem = null;

      switch(currentState)
      {
         case "*": 
            nextElem = new CustomElement("ini", "", false, m_baseURI);
            nextElem.AddAttribute("filename", m_filename, m_baseURI);
            m_line = m_ini.ReadLine();
            break;

         case "ini":
         case "/section":
            if(m_line != null)
            {
               // find next section
               while(m_line != null && (m_line == "" || m_line[0] != '[')) 
                  m_line = m_ini.ReadLine();

               // if there is a next section, output <section name="...">
               if(m_line != null)
               {
                  nextElem = new CustomElement("section", "", false, 
                        m_baseURI);
                  // the section name is the text between '[' and ']'
                  nextElem.AddAttribute("name", 
                     m_line.Substring(1).Split(new char[] {']'})[0], 
                     m_baseURI);
                  m_line = m_ini.ReadLine();
               }
            }
            break;

         case "section":
         case "key/":
            if(m_line != null)
            {
               // find next key/value pair - or next section
               while(m_line != null && (m_line == "" || m_line[0] == ';' 
                        || m_line[0] != '[' && m_line.IndexOf('=')<0))
                  m_line = m_ini.ReadLine();

               // if there is a next key, output <key name="…" value="…"/>
               if(m_line != null && m_line[0] != '[')
               {
                  // split at the first '=' only, so values may contain '='
                  string[] keyValue = m_line.Split(new char[] {'='}, 2);

                  nextElem = new CustomElement("key", "", true, 
                     m_baseURI);
                  nextElem.AddAttribute("name", keyValue[0], m_baseURI);
                  nextElem.AddAttribute("value", keyValue[1], m_baseURI);

                  m_line = m_ini.ReadLine();
               }
            }
            break;
      }
      return nextElem;
   }
}

The class constructor and the Close() method are very straightforward. They open/close the stream for accessing the INI file. The base class CustomXmlReader implements all other methods that a client of the class would need to use (see the above code sample in which XmlINIReader is used to dump the contents of rasphone.pbk for a sample usage of some of those methods).

The most important method for any custom XmlReader is GetNextElement(), which gets called by the base class for every call to Read(). It must determine if there is a next element to pass back and put on the internal element stack (increasing the nesting), or if none exist for the moment—for the current source state, for example—thus causing the top element on the stack to be closed and popped (decreasing the nesting).

CustomXmlReader passes the current state into GetNextElement(), so our own custom XmlReader does not need to remember what it did the last time it got called. Depending on the data structure to transform, however, it does keep explicit state between calls to GetNextElement() in order to be able to determine the next element. But this state has nothing to do with the state graph or element stack; rather, it's concerned only with the "read head" on the data source.

Let's take a closer look at GetNextElement() of XmlINIReader to understand how the transformation process works:

  • Start state "*"
    The first time Read() gets called, the custom XmlReader is in its start state. From there, GetNextElement() always moves to the "ini" state. It "outputs" an <ini filename="…"> start tag.

    case "*": 
       nextElem = new CustomElement("ini", "", false, m_baseURI);
       nextElem.AddAttribute("filename", m_filename, m_baseURI);
       m_line = m_ini.ReadLine();
       break;
    

    CustomElement is a class provided by the CustomXmlReader assembly. Our custom XmlReader classes use a CustomElement instance to pass back information about the element to "be returned" by Read(). It represents just the starting/ending tags of an element as well as its attributes. Nested elements get their own instances on future calls to Read(). XmlReader classes work on an element-by-element basis without looking ahead down the XML tree. However, a CustomElement gets marked either as empty (that is, not containing any other elements) or as non-empty. For empty elements, CustomXmlReader won't generate an XmlNodeType.EndElement node. Wherever possible, a custom XmlReader should look ahead just a little bit in order to know if the next element is actually empty. If it is in doubt, however, or wants to avoid look-ahead, the XmlReader can always mark the element as non-empty. If the element later turns out to be empty, only an ending tag (but no content elements) will get generated.

  • State "ini"
    If the current state is "ini", it's time to look for the first section in the INI file. The method scans the file until it finds a line starting with a "[". If none is found, no next element is returned to CustomXmlReader.Read() and the element on the stack top (<ini>) is closed; an </ini> is output.

    Closing an element always happens when no CustomElement is passed back to Read(), enabling the auto-closing of elements on the stack. This way we don't need to implement all states in the state graph (in this case, "/ini").

    If a section is found, however, a <section> element is created:

    nextElem = new CustomElement("section", "", false, m_baseURI);
    nextElem.AddAttribute("name", 
       m_line.Substring(1).Split(new char[] {']'})[0], m_baseURI);
    

    The destination state becomes "section". Please note that the method does not look ahead to determine whether or not the section is empty. The method always marks the section as non-empty (parameter value false in CustomElement-constructor).

  • State "section"
    Once a section has been opened, the transformation process needs to look for key/value pairs within it. The method skips all empty lines and all comment lines (starting with a ";") until it reaches EOF or a section (meaning, there are no more key/value pairs) or a line containing a "=".

    If there are no more key/value pairs, no new element is returned, so the enclosing <section> element gets closed and popped off the element stack.

    If there is a key/value pair, a <key> element with key name and value as attributes is created:

    string[] keyValue = m_line.Split(new char[] {'='}, 2);
    nextElem = new CustomElement("key", "", true, m_baseURI);
    nextElem.AddAttribute("name", keyValue[0], m_baseURI);
    nextElem.AddAttribute("value", keyValue[1], m_baseURI);
    

    Since the <key> element always is empty, the destination state becomes "key/". No ending tag will be generated on the next Read(); CustomXmlReader will simply pop the element off the element stack.

  • State "key/"
    If the current state is "key/" (that is, a <key> element has just been output), the transformer also needs to check if there is another key/value pair in the section. It thus behaves as if it's in state "section".

  • State "/section"
    If a section has just ended because there were no more key/value pairs in it, GetNextElement() acts as if it has just started after having issued the <ini> root element: it looks for the next section. This state and the "ini" state therefore can be handled together.

    As you can see, GetNextElement() behaves like the earlier code sample that dumps an INI file to XML, except that the code is not run from start to finish but is called several times. On each call it generates only one output node. Its internal state determines what to do and what to look for each time. That is the nature of a pull model parser.

Conclusion

Being able to work with any data structure through an XML API like XmlReader offers many advantages. We can navigate through the data using an XML DOM, use XPath to query it, or use XSLT to transform it to other formats. The following is an example for finding files of a certain size in a directory tree:

XmlFileSystemReader fs = new XmlFileSystemReader(path);
XPathDocument doc = new XPathDocument(fs);
fs.Close();
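XPathNavigator nav = doc.CreateNavigator();
// Select() returns an iterator over the matching nodes; the navigator itself does not move
XPathNodeIterator files = nav.Select("//file[@length>=6656]");
while(files.MoveNext())
   Console.WriteLine("{0} {1}", files.Current.Name, files.Current.GetAttribute("name", ""));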

A custom XmlReader for the file system (XmlFileSystemReader) maps the directory/file hierarchy on a hard disk to XML, which is then read into an XPathDocument. By creating an XPathNavigator, we're able to query it using XPath. Likewise, we could transform the directory information to HTML or some other format using an XslTransform object.
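The transformation to HTML might look roughly like the following sketch. It rests on several assumptions: an XmlReader over a small hand-written XML string stands in for XmlFileSystemReader, the sample data (including the rootPath and length values) is made up, the names DirListingToHtml and ToHtml are invented for this example, and XslCompiledTransform is used because it superseded XslTransform in later versions of the .NET Framework.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DirListingToHtml
{
    // Made-up sample data; element and attribute names follow the
    // appendix sample, the values are placeholders.
    public const string SampleXml =
        "<filesystem rootPath='demo'>" +
        "<dir name='Debug'>" +
        "<file name='dir.xml' length='8228'/>" +
        "<file name='ini.xml' length='512'/>" +
        "</dir></filesystem>";

    // Lists the names of all files of at least 6656 bytes as an HTML list.
    private const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html'/>
  <xsl:template match='/'>
    <ul>
      <xsl:for-each select='//file[@length &gt;= 6656]'>
        <li><xsl:value-of select='@name'/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>";

    // Accepts any XmlReader, so a custom reader could be passed in
    // instead of the text-based stand-in used here.
    public static string ToHtml(XmlReader input)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        StringWriter html = new StringWriter();
        xslt.Transform(input, null, html);   // null: no stylesheet parameters
        return html.ToString();
    }
}
```

It could be called with `DirListingToHtml.ToHtml(XmlReader.Create(new StringReader(DirListingToHtml.SampleXml)))`. Because ToHtml() depends only on the XmlReader base class, the stylesheet neither knows nor cares whether the nodes come from an XML file or from a directory tree.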

With the approach shown, implementing a custom XmlReader is quite straightforward. The only tricky thing is to keep the "read head" state, so GetNextElement() can move forward in small steps with each Read(). The sample custom XmlReader classes accompanying this article show how to do that for different scenarios, such as using streams to read files, or using the System.IO.Directory class with a stack to traverse the directory hierarchy (see the code sample in the Appendix).

Before you start, however, you should draw up an XML Schema onto which you want to map a data structure. With some sample data in hand, it will be quite easy to come up with a state graph and the code for GetNextElement(). There is just one possible difficulty: the state automaton implemented by GetNextElement() relies on the unambiguous naming of states. As long as elements (XML nodes with tags) are issued and turned into states, this is of little concern. However, if you issue text nodes, such as "hello, world!" in <greeting>hello, world!</greeting>, it can be difficult to determine what to do on the next call to Read(), which passes only a very general name for the current state to GetNextElement(). You should therefore avoid outputting XML text nodes and instead deliver data in attributes.

The CustomXmlReader class I implemented for you to derive from was written for the Microsoft .NET Framework Beta 2. Since the Beta 2 documentation was pretty sketchy on some points, some of the less important methods (such as ResolveEntity) haven't been implemented. Another reason is that some methods seem to be of less value, since the data sources a custom XmlReader accesses are not XML themselves. Entity handling, for example, does not seem to be necessary, since non-XML data sources do not contain entity references.

That said, take the code provided as a guideline and a proof of concept for accessing arbitrary data sources like any XML file by using custom XmlReader classes. Play around with the sample custom XmlReader classes and let your ideas flow. You'll certainly discover many more benefits of this approach to data access.

Appendix: A Custom XmlReader for the File System

The following sample XML was produced using the custom XmlReader for the file system:

<filesystem rootPath="…">
   <dir name="TestApp" creationTime="2001-07-01T14:47:57"
         lastAccessTime="2001-07-07T08:36:09"
         lastWriteTime="2001-07-05T14:47:38" isHidden="false"
         isReadOnly="false" isCompressed="false" isEncrypted="false"
         isArchive="false">
      <dir name="bin" … >
         <dir name="Debug" … >
            <file name="dir.xml" creationTime="2001-07-06T19:46:08"
                  lastAccessTime="2001-07-06T19:46:10"
                  lastWriteTime="2001-07-02T13:20:18" isHidden="false"
                  isReadOnly="false" isCompressed="false"
                  isEncrypted="false" isArchive="false" length="8228"/>
            <file name="ini.xml" … />

The following is an excerpt from the custom XmlReader for the file system. A stack (m_dirs) of directories (class DirInStack) is kept to track the current position between calls to GetNextElement(), and to determine which directories (and files) have been "visited":

public class XmlFileSystemReader : CustomXmlReader
{
   private class DirInStack
   {…}

   private string m_rootPath;
   private Stack m_dirs;

   public XmlFileSystemReader(string rootPath)
   {
      m_rootPath = rootPath;
      m_dirs = new Stack();
   }
   …
   protected override CustomElement GetNextElement(string currentState)
   {
      DirInStack nextDir;
      CustomElement nextElem = null;

      switch(currentState)
      {
         case "*": 
            // initial state: issue the <filesystem> root element
            nextElem = new CustomElement("filesystem", "", false, "");
            nextElem.AddAttribute("rootPath", m_rootPath, "");
            break;

         case "filesystem": 
            // root element issued: start the traversal with the root directory
            nextDir = new DirInStack(m_rootPath);
            m_dirs.Push(nextDir);
            nextElem = nextDir.CreateCustomElement();
            break;

         case "dir":
            // first subdir or start with files
            nextElem = GetNextElemFromStacktop();
            break;

         case "/dir":
         case "dir/":
            // next dir or start with files
            m_dirs.Pop();
            if(m_dirs.Count > 0)
               nextElem = GetNextElemFromStacktop();
            break;

         case "file/":
            // next file
            nextElem = ((DirInStack)m_dirs.Peek()).NextFile();
            break;
      }

      return nextElem;
   }
   …
}

Ralf Westphal is a freelance author and consultant on Microsoft software technologies. He is one of the Microsoft MSDN Regional Directors for Germany, and a frequent speaker at developer conferences like Microsoft Developer Days (DevDays), CMP Software Development, SIGS XML One, or Comdex. From 1998 through 2001 he was editor-in-chief of Germany's largest software developer magazine, BasicPro for Microsoft Visual Basic® Programmers.