XML Tools Update

 

Chris Lovett
Microsoft Corporation

June 2003 (Revised March 2004)

Applies to:
    Microsoft® .NET Framework
    Microsoft Visual Studio® .NET

Summary: Learn how the .NET Framework's set of classes and associated tools for manipulating XML data can be woven together to form an impressive array of options for constructing XML processing solutions. (11 printed pages)

Contents

Introduction
XML Reader and Writers
Schema Tools
XPathNavigators
Aaron Skonnard's Custom Navigators and Readers
Xslt Related Tools
XmlResolvers
Summary

Introduction

The Microsoft® .NET Framework provides a powerful set of classes for manipulating XML data. These classes mostly reside in the System.Xml namespace. Since the release of Microsoft Visual Studio® .NET back in February 2002, people have been busily building various extensions and tools on top of this framework. This article summarizes a few of the most useful tools and shows how they can be woven together to form an impressive array of options for constructing your XML processing solutions.

Most of these tools are available for download from gotdotnet.com. Direct links are provided to each tool, but in case those links don't work you can search the User Samples on gotdotnet.com.

XML Reader and Writers

XmlDiff and XmlPatch

A common requirement when managing XML data across multiple tiers in a distributed system is replicating changes to an XML document across those tiers. This is in fact exactly how the ListEditor works. The XmlDiff and XmlPatch classes, which are available for download from Microsoft XML Diff and Patch 1.0, provide a generalization of this capability. The cool thing about these classes is that they operate at the reader/writer level, so the DiffGram can be efficiently transmitted to the other machine and applied there.

So now a client and server can stay in sync by communicating DiffGrams back and forth, gaining added efficiency because the DiffGrams are likely to be smaller than the original documents. Because it is implemented on the XmlReader/XmlWriter interfaces, this Diff and Patch capability should just plug into your existing XML processing pipeline.

SgmlReader

One of the biggest challenges facing developers who are adding XML support to their applications is how to easily migrate legacy data into the XML frameworks. One of the most common forms of such legacy data is HTML. HTML is very close to XML, but not quite the same. HTML is an SGML grammar. XML DTDs are a simplified version of SGML DTDs and the two are not compatible, so the XmlTextReader cannot read SGML DTDs, nor can it read legacy HTML data. This is where the SgmlReader comes to the rescue. The SgmlReader can load an SGML DTD, then read SGML instance data (for example HTML) and convert it to XML. The SgmlReader implements the XmlReader interface so that it can plug into the rest of your XML processing pipeline.

The SgmlReader provides built-in knowledge of the HTML DTD, so HTML conversion is very easy. The following example loads the HTML from xml.com into an XmlDocument, queries it for the headlines using an XPath expression, and then prints out the URLs found in those headlines.

SgmlReader r = new SgmlReader();
r.DocType = "HTML";
r.Href = "http://www.xml.com/";
r.WebProxy = ...;

XmlDocument doc = new XmlDocument();
doc.Load(r);

// find featured articles
string query = 
  "/html/body/table[3]/tr/td[4]/table[1]/tr/td/p[@class='secondary']";
foreach (XmlElement n in doc.SelectNodes(query))
{
    XmlElement a = (XmlElement)n.SelectSingleNode(".//a[img]");
    string href = a.GetAttribute("href");
    Console.WriteLine(href);
}

XmlCsvReader

Talking about legacy data, not only is HTML a common source of potential XML, but .csv files (otherwise known as comma-separated values files) are also very common. The XmlCsvReader provides support for reading .csv files and returning the data as XML. The XmlCsvReader provides several properties for specifying whether the .csv file contains column names and, if not, what XML element names to use instead, whether to return the columns as XML elements or attributes, and so on. The XmlCsvReader implements the XmlReader interface, so you can use it to load an XmlDocument or a DataSet, or to perform other kinds of pipelined XML processing.
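To make the idea concrete, here is a minimal sketch of turning delimited text into XML using only System.Xml. It is not the XmlCsvReader itself (which exposes the data through the XmlReader pull model); it pushes rows into an XmlWriter instead. The class name CsvToXmlSketch, the method Convert, and the root and row element names are hypothetical, and the parsing is deliberately naive: it assumes a header row and no quoted fields or embedded commas.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only; not part of the XmlCsvReader download.
public static class CsvToXmlSketch
{
    // Reads comma-separated text (first line = column names) and writes
    // one <rowName> element per record, with one child element per column.
    public static void Convert(TextReader csv, XmlWriter writer,
                               string rootName, string rowName)
    {
        string header = csv.ReadLine();
        if (header == null)
            throw new InvalidDataException("The input has no header line.");

        string[] columns = header.Split(',');
        writer.WriteStartElement(rootName);

        string line;
        while ((line = csv.ReadLine()) != null)
        {
            if (line.Length == 0) continue;      // skip blank lines
            string[] fields = line.Split(',');   // naive: no quoted fields

            writer.WriteStartElement(rowName);
            for (int i = 0; i < columns.Length && i < fields.Length; i++)
            {
                // EncodeLocalName makes an arbitrary column title a legal XML name.
                writer.WriteElementString(
                    XmlConvert.EncodeLocalName(columns[i].Trim()),
                    fields[i].Trim());
            }
            writer.WriteEndElement();
        }
        writer.WriteEndElement();
    }
}
```

Because the output goes to an XmlWriter, the same method can feed a file, a response stream, or any other writer-based stage in a pipeline.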

XmlNodeWriter

The System.Xml.XmlNodeReader returns an XmlReader interface over an XmlNode tree. The XmlNodeWriter does the complement to this, namely, it builds an XmlNode tree as a response to calls to the XmlWriter interface.

This is handy when someone has a class that can write to an XmlWriter but does not provide an XmlReader pull-model interface. A perfect example of such a class is the XmlSerializer. The following code uses the XmlNodeWriter to write object data into an XmlDocument:

Customer c = new Customer();
XmlDocument doc = new XmlDocument();
XmlSerializer s = new XmlSerializer(typeof(Customer));
XmlNodeWriter w = new XmlNodeWriter(doc, false);
s.Serialize(w, c);

Notice that this approach to loading an XmlDocument from a serialized object graph is far more efficient than writing the XML to a StringWriter and then parsing the XML via an XmlTextReader.
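For readers without the XmlNodeWriter download, the same writer-builds-a-tree pattern can be sketched with classes that ship in later versions of the framework. This assumes .NET Framework 2.0 or later, where XPathNavigator.AppendChild() returns an XmlWriter that creates nodes in the underlying XmlDocument; it is a stand-in for the XmlNodeWriter, not its implementation. The Customer type and the SerializeToDomSketch helper are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical sample type; XmlSerializer needs a public type
// with a public parameterless constructor.
public class Customer
{
    public string Name = "Contoso";
    public int Id = 42;
}

public static class SerializeToDomSketch
{
    // Serializes an object graph straight into a new XmlDocument,
    // with no intermediate string and no second parse.
    public static XmlDocument ToDocument(object graph)
    {
        XmlDocument doc = new XmlDocument();
        XmlSerializer serializer = new XmlSerializer(graph.GetType());

        // The writer returned by AppendChild() builds DOM nodes as it is written to.
        using (XmlWriter writer = doc.CreateNavigator().AppendChild())
        {
            serializer.Serialize(writer, graph);
        }
        return doc;
    }
}
```

Calling SerializeToDomSketch.ToDocument(new Customer()) should yield a document whose root element is named after the type, which can then be queried with SelectNodes like any other XmlDocument.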

XmlPrettyPrinter

By default the XmlDocument object and the XmlSerializer do not "format" the XML output that they generate so a common request from developers is for a tool that pretty prints XML. The ppxml command line tool does exactly this.

It turns out that formatting XML using the XmlTextWriter is very easy. It's just a matter of creating an XmlReader and pumping that through an XmlTextWriter with Formatting.Indented turned on as follows:

XmlTextReader r = new XmlTextReader(inputFile);
XmlTextWriter w = new XmlTextWriter(outputFile, Encoding.UTF8);
r.WhitespaceHandling = WhitespaceHandling.None;
w.Formatting = Formatting.Indented;
while (!r.EOF) {
  w.WriteNode(r, true);
}
// flush the output and release both files
w.Close();
r.Close();

This produces nicely formatted XML output. You'll probably find the command line executable version of this algorithm very handy.
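The same reader-to-writer pump works on in-memory data, which is convenient for logging or debugging. The following is a minimal sketch of that variation, not the ppxml source; the PrettyPrintSketch class and its Format method are hypothetical names.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only; not the ppxml source.
public static class PrettyPrintSketch
{
    // Returns an indented copy of the given XML text.
    public static string Format(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        // Drop existing whitespace so the writer controls all indentation.
        reader.WhitespaceHandling = WhitespaceHandling.None;

        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;

        // WriteNode copies everything from the reader's current position.
        while (!reader.EOF)
        {
            writer.WriteNode(reader, true);
        }
        writer.Close();
        reader.Close();
        return output.ToString();
    }
}
```

For example, PrettyPrintSketch.Format("<a><b>x</b><c/></a>") puts each child element on its own line, indented by the writer's default of two spaces.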

XmlAttributeFormatter

Talking about pretty printing, one thing that is not so easy with the XmlTextWriter is formatting attributes so they appear on separate lines, because the XmlTextWriter does not provide this option. Fortunately there is an XmlAttributeFormatter, which is built into the ppxml command line tool, that provides this functionality. The XmlAttributeFormatter is implemented by subclassing XmlTextWriter and overriding the WriteEndAttribute method as well as WriteStartElement and WriteEndElement.

XmlStats

I saw a request from Andreas Lang on the microsoft.public.dotnet.xml newsgroup for help in finding a tool that produces statistics about XML documents, like the number of elements and attributes and so on. The XmlReader interface is perfect for producing such a tool, so I cranked out the XmlStats tool the same day and posted it to the newsgroup. Andreas was very pleased and provided his own improvements to the tool, which I've since published on gotdotnet.com. It now produces a report including a tally of each type of element found in the document. For example, when running through the ot.xml benchmark file you get the following:

Elem/attr   Count    Chars
bktlong        39      999
bktshort       39      285
book           39        0
chapter       929        0
chtitle       929     9051
fm              1        0
p           23148  3188396
tstmt           1        0
ttitle          1       17
v           23145        0
vn          23145    38150
The really cool thing about this tool is how fast it runs. On my 1.7GHz Dell Precision 530 machine it cranks through about 10 megabytes per second, with about 12% of that time spent in hashtable lookups for collecting the stats on each element.
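The core of such a tool is a single forward pass over an XmlReader. The sketch below shows one way that loop might look; it is not the XmlStats source. The XmlStatsSketch and NodeStats names are hypothetical, a generic Dictionary-style collection stands in for the hashtable, and only elements (not attributes) are tallied. Characters are attributed to the element that directly contains the text.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical types for illustration only; not the XmlStats source.
public sealed class NodeStats
{
    public int Count;    // number of occurrences of the element
    public long Chars;   // characters of text directly inside it
}

public static class XmlStatsSketch
{
    public static SortedDictionary<string, NodeStats> Collect(XmlReader reader)
    {
        var stats = new SortedDictionary<string, NodeStats>(StringComparer.Ordinal);
        var open = new Stack<NodeStats>();   // stats of the currently open elements

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    NodeStats entry;
                    if (!stats.TryGetValue(reader.Name, out entry))
                    {
                        entry = new NodeStats();
                        stats.Add(reader.Name, entry);
                    }
                    entry.Count++;
                    // Empty elements (<x/>) never produce an EndElement node.
                    if (!reader.IsEmptyElement) open.Push(entry);
                    break;

                case XmlNodeType.EndElement:
                    open.Pop();
                    break;

                case XmlNodeType.Text:
                case XmlNodeType.CDATA:
                    if (open.Count > 0) open.Peek().Chars += reader.Value.Length;
                    break;
            }
        }
        return stats;
    }
}
```

Because nothing is buffered beyond the per-name tallies, memory use stays flat regardless of document size, which is what makes the reader-based approach a good fit for large benchmark files.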

Schema Tools

XSD Inference

With so many tools and standards now using XML Schema Definitions (XSD), developers are madly scrambling to create XML schemas for their data. This work can be rather tedious, especially if you don't have any other kind of schema information available from which the XML Schemas could be generated automatically.

The XSD Inference tool can help out in this case. There is similar functionality already available in the DataSet class if you use the XmlReadMode.InferSchema option on the ReadXml method. However, the XSD Inference tool is smarter, supports a lot more of XSD, and is not limited to just producing "relational" schemas for the DataSet class to consume. In fact, it is likely that it will produce schemas that the DataSet class cannot consume. Word has it that this improved version of XSD inference will be built into a future version of the .NET Framework.

This tool can infer a whole slew of different simple types, from Boolean and short to float, dateTime and duration. Then it infers the appropriate element, complexType, sequence, choice and attribute declarations based on what it finds in the input documents.

The cool thing about this tool is that you can "train" it using more than one input file. The usage is quite simple: you just create an InferSchema class and call one of the following methods:

public XmlSchemaCollection InferSchema(XmlReader); 
public XmlSchemaCollection InferSchema(XmlReader, XmlSchemaCollection);

You use the second method to "refine" the schemas with more input data. There is a live demo you can play with to get a feel for it. Check it out.
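The train-and-refine pattern can be sketched without the download by using the System.Xml.Schema.XmlSchemaInference class that ships with .NET Framework 2.0 and later. That class is an assumption of this example rather than the tool described above: it has a similar pair of InferSchema overloads but works with XmlSchemaSet instead of XmlSchemaCollection. The InferenceSketch helper is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper; uses the built-in XmlSchemaInference class
// (.NET Framework 2.0+), not the downloadable XSD Inference tool.
public static class InferenceSketch
{
    // Infers a schema from the first sample, then refines it with each
    // additional sample. Returns null when no samples are supplied.
    public static XmlSchemaSet Infer(params string[] samples)
    {
        XmlSchemaInference inference = new XmlSchemaInference();
        XmlSchemaSet schemas = null;

        foreach (string xml in samples)
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                schemas = (schemas == null)
                    ? inference.InferSchema(reader)            // first pass
                    : inference.InferSchema(reader, schemas);  // refinement pass
            }
        }
        return schemas;
    }
}
```

Feeding it two samples that disagree (for example, one where an attribute value looks like an integer and one where it looks like a decimal) is a quick way to see how the inferred declarations loosen as more input is supplied.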

Dtd2Xsd

Another tedious task, not automated by any tools provided by Visual Studio .NET, is the process of converting DTDs to XML Schema Definitions (XSDs). The Dtd2Xsd tool can help you with this task. There are many different DTD-to-XSD conversion tools out there, but this one does a particularly nice job of dealing with grouping constructs so that the resulting XML Schema is not so verbose. DTDs use "parameter entities" to group common sets of declarations. For example, the HTML DTD contains the following parameter entity:

<!ENTITY % fontstyle "TT | I | B | U | S | STRIKE | BIG | SMALL">

This parameter entity declaration is then used in every HTML element that allows these kinds of elements as children. If these "groups" are not preserved in the resulting XHTML Schema then the schema bloats out to 240kb. If the "groups" are preserved as "<xs:group>" elements then the above parameter entity is defined in the schema as follows:

  <xs:group name="fontstyle">
    <xs:choice>
      <xs:element ref="tt" />
      <xs:element ref="i" />
      <xs:element ref="b" />
      <xs:element ref="u" />
      <xs:element ref="s" />
      <xs:element ref="strike" />
      <xs:element ref="big" />
      <xs:element ref="small" />
    </xs:choice>
  </xs:group>

Then the schema shrinks down to a more manageable 52kb, which is not bad considering the HTML DTD was 41kb.

XPathNavigators

ObjectXPathNavigator

Are you tired of writing complicated foreach loops in your code searching for objects? Well imagine being able to query your own custom ShoppingBasket objects finding all items in the basket with a line item total (unit price * quantity) over $2,000 by using an XPathNavigator as follows:

XPathNavigator    nav = new ObjectXPathNavigator( basket );
XPathNodeIterator iter = nav.Select( "/*/*[(@UnitPrice * @Quantity) > 2000]" );

The ObjectXPathNavigator was written by Steve Saxon of Dell Corporation and is described in detail with source code on MSDN.

It shows how the XPathNavigator interface can be implemented over any arbitrary object graph providing an "XPath Data Model" view of the data in the object graph, including the ability to query that data using XPath query expressions and being able to "transform" that data using XSLT stylesheets.
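To see what such a query selects, the sketch below runs an arithmetic predicate of this kind over plain XML using the built-in XPathDocument. It does not use the ObjectXPathNavigator; the basket markup, the attribute names, and the BasketQuerySketch helper are all assumptions made for illustration. With the ObjectXPathNavigator the same expression would be evaluated against object properties instead of attributes in a document.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical helper; queries plain XML rather than an object graph.
public static class BasketQuerySketch
{
    // Returns the Name attribute of every line item whose
    // UnitPrice * Quantity is greater than 2000.
    public static List<string> ExpensiveItems(string basketXml)
    {
        XPathDocument doc = new XPathDocument(new StringReader(basketXml));
        XPathNavigator nav = doc.CreateNavigator();

        // XPath converts the attribute values to numbers before multiplying.
        XPathNodeIterator iter =
            nav.Select("/*/*[(@UnitPrice * @Quantity) > 2000]");

        List<string> names = new List<string>();
        while (iter.MoveNext())
        {
            names.Add(iter.Current.GetAttribute("Name", ""));
        }
        return names;
    }
}
```

Given a basket with a laptop (1500 x 2), a mouse (25 x 4) and a monitor (700 x 3), only the laptop and the monitor pass the predicate.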

WritableXPathNavigator

The WritableXPathNavigator adds to an XPathNavigator the much-needed ability to extract the XML content of a node as a string or write it to a stream. One can wrap any existing XPathNavigator with a WritableXPathNavigator as shown below:

XPathDocument doc = new XPathDocument("books.xml");
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator ni = nav.Select("/catalog/book[title='Creepy Crawlies']");
ni.MoveNext();
WritableXPathNavigator wnav = new WritableXPathNavigator(ni.Current);
Console.WriteLine(wnav.OuterXml); 
Console.WriteLine(wnav.InnerXml);

Being able to extract the XML content of the node that an XPathNavigator is positioned on makes manipulating XML with an XPathNavigator a lot easier.

Aaron Skonnard's Custom Navigators and Readers

Aaron Skonnard released a whole slew of useful XPathNavigators and XmlReaders on his web site at XML in .NET. Included in his downloadable package are the following:

  • FileSystemNavigator—an XPathNavigator over the Windows File System starting with mycomputer as the root element. This allows you to select files by their type and properties using XPath expressions.
  • RegistryNavigator—an XPathNavigator over the System Registry. This allows you to query information in the registry using XPath expressions.
  • AssemblyNavigator—provides an XPathNavigator over the .NET Managed Assembly returning the namespaces and types defined in that assembly.
  • NavigatorReader—an XmlReader implementation over any XPathNavigator.

Xslt Related Tools

Nxslt

A long time ago now (1998, if I recall correctly) Microsoft shipped a command line utility called msxsl on MSDN. Nxslt is the much-requested managed version of that tool, implemented by Oleg Tkachenko.

It has a rich set of command line options for controlling whitespace handling, validation, getting timing information, and customizing the XmlResolver used by the XslTransform.

MultiXmlTextWriter

One of the options provided by nxslt is "Allows multiple output documents". So how did Oleg do that when XslTransform does not provide such a feature? Well, fortunately, Oleg has packaged his solution to this interesting problem in the MultiXmlTextWriter. The way this works is that the XSLT stylesheet declares the special namespace http://exslt.org/common and contains the following special markup:

<exsl:document href="toc.html" indent="yes">…</exsl:document>

The MultiXmlTextWriter then identifies this special output and redirects everything inside it to the specified file.

To use it you simply pass a MultiXmlTextWriter to the XslTransform as follows:

XPathDocument doc = new XPathDocument("book.xml");
XslTransform xslt = new XslTransform();
xslt.Load("style.xsl");               
MultiXmlTextWriter multiWriter = 
new MultiXmlTextWriter("index.html", Encoding.UTF8);
multiWriter.Formatting = Formatting.Indented;
xslt.Transform(doc, null, multiWriter);

This transform will produce both the "index.html" file and the "toc.html" file as outputs. Oleg provides excellent documentation.

XmlResolvers

XmlAspResolver

A less well-known class in the System.Xml namespace is the XmlResolver. The XmlResolver is called when XML readers and documents want to load external XML resources, like other documents, DTDs and external entities. There is a default XmlResolver provided by the system that takes care of most cases, which is why you have probably never had to mess with it, but in a few situations you will need to.

For example, suppose a client sends the following document to your ASP.NET server:

<!DOCTYPE test SYSTEM "test.dtd">
<test>
this is a test
</test>

And suppose you want to validate this XML according to the test.dtd that you have stored in the same location as your ASP.NET page that is receiving this HTTP POST.

Well, if you try and load this XML using the XmlDocument you will get the following error:

Exception Details   System.IO.FileNotFoundException: Could not find file "C:\WINDOWS\system32\test.dtd".

Notice that the XmlDocument is trying to load the DTD "test.dtd" from the wrong location. The underlying XmlTextReader is using the current directory instead of Server.MapPath() to try and find the "test.dtd" resource. To fix this, the XmlAspResolver is what you need.

The following code shows how to use it:

<%@LANGUAGE=C# src="XmlAspResolver.cs"%>
<%@Import Namespace="System.Xml"%>
<%@Import Namespace="System.Text"%><%
Response.ContentType = "text/xml";
Response.Clear();
XmlDocument doc = new XmlDocument();
doc.XmlResolver = new XmlAspResolver(Context);
doc.Load(Request.InputStream);
XmlTextWriter w = new XmlTextWriter(Response.OutputStream, Encoding.UTF8);
doc.WriteTo(w);
w.Flush();
%>
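The general technique behind a resolver like this is to override how relative URIs are turned into absolute ones. The following is a minimal sketch of that idea and not the XmlAspResolver source: BaseDirectoryResolver is a hypothetical class that takes a plain directory path, whereas in an ASP.NET page the directory would presumably be obtained from Server.MapPath().

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver for illustration only; not the XmlAspResolver source.
public class BaseDirectoryResolver : XmlUrlResolver
{
    private readonly Uri baseDirectory;

    public BaseDirectoryResolver(string directory)
    {
        string full = Path.GetFullPath(directory);
        // A trailing separator makes Uri treat the path as a directory.
        if (!full.EndsWith(Path.DirectorySeparatorChar.ToString()))
        {
            full += Path.DirectorySeparatorChar;
        }
        baseDirectory = new Uri(full);
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // A document loaded from a stream has no base URI, so the default
        // resolver falls back to the current directory. Use ours instead.
        if (baseUri == null)
        {
            return new Uri(baseDirectory, relativeUri);
        }
        return base.ResolveUri(baseUri, relativeUri);
    }
}
```

It would be assigned to doc.XmlResolver before calling Load, in the same way as the resolver in the page above. Only URI resolution changes; fetching the resource is still handled by the inherited XmlUrlResolver.GetEntity.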

XmlCachingResolver

Another common operation when dealing with XML data that contains HTTP DTD or Schema references is to want to cache those DTD's and/or schemas locally to avoid expensive HTTP requests upon every document read operation. This is what the XmlCachingResolver does.

The XmlCachingResolver caches the entities in memory, but could be easily modified to cache the entities on the local hard drive for a more scalable solution.

The following is an example usage:

    XmlCachingResolver cache = new XmlCachingResolver();
    XmlDocument doc = new XmlDocument();
    doc.XmlResolver = cache;
    doc.Load(url);
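An in-memory cache of this kind can be sketched in a few lines by overriding GetEntity. The class below illustrates the technique and is not the XmlCachingResolver source: InMemoryCachingResolver and its Misses counter are hypothetical, entries never expire, and only Stream results are supported.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical resolver for illustration only; not the XmlCachingResolver source.
public class InMemoryCachingResolver : XmlUrlResolver
{
    private readonly Dictionary<Uri, byte[]> cache = new Dictionary<Uri, byte[]>();

    // Number of times the underlying resource actually had to be fetched.
    public int Misses { get; private set; }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (ofObjectToReturn != null &&
            ofObjectToReturn != typeof(Stream) &&
            ofObjectToReturn != typeof(object))
        {
            throw new XmlException("Only Stream results are supported.");
        }

        byte[] bytes;
        lock (cache)
        {
            if (!cache.TryGetValue(absoluteUri, out bytes))
            {
                // First request: fetch through the base resolver and keep a copy.
                using (Stream source = (Stream)base.GetEntity(absoluteUri, role, typeof(Stream)))
                using (MemoryStream copy = new MemoryStream())
                {
                    source.CopyTo(copy);
                    bytes = copy.ToArray();
                }
                cache[absoluteUri] = bytes;
                Misses++;
            }
        }
        // Each caller gets its own read-only stream over the cached bytes.
        return new MemoryStream(bytes, false);
    }
}
```

Swapping the dictionary for files in a local cache directory would give the disk-backed variant mentioned above, at the cost of having to decide when a cached copy is stale.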

XInclude.NET

XInclude is a World Wide Web Consortium standard for describing generalized links between XML documents. See XML Inclusions (XInclude) Version 1.0.

Besides describing the links, sometimes it is useful to automatically embed the contents of a linked XML document and make it part of the parent document, similar to the way XML DTD Entities work.

Oleg has provided a very nice open source implementation called XInclude.Net on gotdotnet.com. This includes full documentation and a directory of test cases that illustrate the various issues of XInclude, like dealing with circular references. It also correctly inserts the xml:base attributes into the final document so that you can keep track of where the various fragments of the document came from.

XInclude.Net also provides something similar to the XPointer scheme for including a fragment from another document. As of this writing, only XPath point locations are supported. The difference is that XPath can only select a set of matching nodes (called a point location), whereas full XPointer support would also include "range" locations, which can specify a beginning pattern and an ending pattern.

For example, the following XInclude reference includes all "book" elements from the specified test2.xml file:

<xi:include href="test2.xml#xpath1(//book)" . . .

This xpath1 scheme was proposed by Simon St. Laurent and fits into the overall XPointer framework specified by the W3C.

Summary

I hope you enjoyed this little tour through the various XML tools that have appeared over the last year or so. There is an almost infinite number of ways to assemble all these classes into your own custom high performance XML Processing Pipeline. It is exciting to see so much community effort involved in filling out the capabilities of the System.Xml namespace.

I also want to thank all those folks who have reported bugs and made suggestions on how to improve these tools. I look forward to what other fun stuff folks can come up with over the next year.