Understanding XML

 

Dare Obasanjo
Microsoft Corporation

July 2003

Summary: Learn how the Extensible Markup Language (XML) facilitates universal data access. XML is a plain-text, Unicode-based meta-language: a language for defining markup languages. It is not tied to any programming language, operating system, or software vendor. XML provides access to a plethora of technologies for manipulating, structuring, transforming and querying data. (14 printed pages)

Introduction
XML Everywhere
The XML 1.0 Syntax
The Infoset and the XML Family of Technologies
Conclusion
Further Reading

Introduction

The Extensible Markup Language (XML) was originally envisioned as a language for defining new document formats for the World Wide Web. XML is derived from the Standard Generalized Markup Language (SGML), and can be considered to be a meta-language: a language for defining markup languages. SGML and XML are text-based formats that provide mechanisms for describing document structures using markup tags (words surrounded by '<' and '>'). Web developers may notice some similarity between HTML and XML, which is due to the fact that they are both derived from SGML.

As the use of XML has grown, it is now generally accepted that XML is not only useful for describing new document formats for the Web but is also suitable for describing structured data. Examples of structured data include information that is typically contained in spreadsheets, program configuration files, and network protocols.

XML is preferable to previous data formats because XML can easily represent both tabular data (such as relational data from a database or spreadsheets) and semi-structured data (such as a Web page or business document). Popular pre-existing formats either work well for tabular data but handle semi-structured data poorly, as comma-separated value (CSV) files do, or, like RTF, are too specialized for semi-structured text documents. This has led to the widespread adoption of XML as the lingua franca of information interchange.

XML Everywhere

Besides being able to represent both structured and semi-structured data, XML has a number of characteristics that have caused it to be widely adopted as a data representation format. XML is extensible, platform-independent, and supports internationalization by being fully Unicode compliant. The fact that XML is a text-based format means that when the need arises, one can read and edit XML documents using standard text-editing tools.

XML's extensibility manifests itself in a number of ways. First of all, unlike HTML it does not have a fixed vocabulary. Instead, one can define vocabularies specific to particular applications or industries using XML. Secondly, applications that process or consume XML formats are more resistant to changes in the structure of the XML being provided to them than applications that use other formats, as long as such changes are additive. For instance, an application that depends on processing a <Customer> element with a customer-id attribute typically would not break if another attribute, such as last-purchase-date, were added to the <Customer> element. Such flexibility is uncommon in other data formats and is a significant benefit of using XML.
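As a rough illustration of this resilience to additive change, the sketch below reads the customer-id attribute from two versions of a <Customer> element. The class name, the method name, and the attribute values are hypothetical and chosen only for this example. The choice of XmlDocument is also an assumption; any XML API that looks attributes up by name would behave similarly.

```csharp
using System;
using System.Xml;

public class AdditiveChangeSketch {

  // Reads the customer-id attribute and ignores anything else on the element.
  public static string ReadCustomerId(string xml) {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xml);
    return doc.DocumentElement.GetAttribute("customer-id");
  }

  public static void Main(string[] args) {
    // Hypothetical sample values, used only for illustration.
    string original = "<Customer customer-id=\"cust0921\" />";
    string extended = "<Customer customer-id=\"cust0921\" last-purchase-date=\"2003-07-01\" />";

    // The same code handles both versions because it asks for the attribute by name.
    Console.WriteLine(ReadCustomerId(original));
    Console.WriteLine(ReadCustomerId(extended));
  }
}
```

Both calls return the same identifier. A consumer that depended on attribute position or on a fixed record layout would not have this property.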

XML is not tied to any programming language, operating system or software vendor. In fact, it is fairly straightforward to produce or consume XML using a variety of programming languages. Platform independence makes XML very useful as a means for achieving interoperability between different programming platforms and operating systems.

The benefits of exposing data as XML have been acknowledged by many, and have led to a proliferation of XML data sources. Business documents, databases and inter-business communication are all examples of information sources that are moving or have moved to using XML as a representation format. Microsoft products such as Microsoft Office®, Microsoft SQL Server™ and the Microsoft .NET Framework enable end users and developers to produce and consume documents, network messages and other data as XML.

The XML 1.0 Syntax

As mentioned earlier, the W3C XML 1.0 recommendation describes a text-based format for describing structured and semi-structured data using syntax similar to HTML.

XML and HTML Compared

Both HTML and XML documents are made up of elements, each of which consists of a "start tag" (such as <order>), an "end tag" (such as </order>), and the information between the two tags (referred to as the contents of the element). Elements can be annotated with attributes that contain metadata about the element and its contents.

However, there are significant differences between HTML and XML. XML is case sensitive while HTML is not. This means that in XML the start tags <Table> and <table> are different, while in HTML they are the same. Another difference between HTML and XML is that XML introduces the concept of well-formedness. The well-formedness rules of XML remove some of the ambiguity inherent in processing markup languages like HTML by enforcing rules such as mandating that all attribute values must be in quotes, and that all elements must either have both a start tag and an end tag or explicitly indicate that they are empty elements. A succinct description of well-formedness is given in section D.2 of the XML FAQ.
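The rules above can be observed with any conforming XML parser. The following is a minimal sketch, assuming the XmlDocument class in the .NET Framework is used as the parser; the class name, the IsWellFormed helper, and the sample strings are hypothetical.

```csharp
using System;
using System.Xml;

public class WellFormednessSketch {

  // Returns true if the parser accepts the text as well-formed XML.
  public static bool IsWellFormed(string xml) {
    try {
      new XmlDocument().LoadXml(xml);
      return true;
    } catch (XmlException) {
      return false;
    }
  }

  public static void Main(string[] args) {
    // Quoted attribute value and an explicitly empty element: accepted.
    Console.WriteLine(IsWellFormed("<table border=\"1\"><tr/></table>"));
    // Attribute value without quotes: rejected.
    Console.WriteLine(IsWellFormed("<table border=1></table>"));
    // <tr> has a start tag but no end tag: rejected.
    Console.WriteLine(IsWellFormed("<table><tr></table>"));
    // Start and end tag differ in case, so they do not match: rejected.
    Console.WriteLine(IsWellFormed("<Table></table>"));
  }
}
```

Only the first string is accepted. An HTML browser would typically render all four without complaint, which is the ambiguity that well-formedness is meant to remove.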

The most significant difference between HTML and XML is that HTML has predefined elements and attributes whose behavior is well specified, while XML does not. Instead, document authors can create their own XML vocabularies that are specific to their application or business needs. XML vocabularies currently exist for a large number of industries and applications from financial filings (XBRL) and financial services (FpML) to Web documents (XHTML) and network protocols (SOAP). The lack of emphasis on predefined elements and attributes that specify how an XML document is rendered or displayed enables document authors to focus on creating documents that contain only relevant semantic information for their particular problem domain. The separation of content from presentation enabled by XML vocabularies allows for greater reuse of information and content repurposing.

The Anatomy of an XML Document

Below is a sample XML document that represents a customer order for a music store. A point of note is how the document easily represents both the rigidly structured data that describes information about compact discs as well as the semi-structured data containing special instructions and comments about a specific customer.

<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="orders.xsl"?>

<order id="ord123456">
  <customer id="cust0921">
    <first-name>Dare</first-name>
    <last-name>Obasanjo</last-name>
    <address>
      <street>One Microsoft Way</street>
      <city>Redmond</city>
      <state>WA</state>
      <zip>98052</zip>
    </address>
  </customer>
  <items>
    <compact-disc>
      <price>16.95</price>
      <artist>Nelly</artist>
      <title>Nellyville</title>
    </compact-disc>
    <compact-disc>
      <price>17.55</price>
      <artist>Baby D</artist>
      <title>Lil Chopper Toy</title>
    </compact-disc>
  </items>

  <!--  Always go the extra mile for the customer -->
  <special-instructions xmlns:html="http://www.w3.org/1999/xhtml">
    <html:p>If customer is not available at the address then attempt to
      leave package at one of the following locations listed in order of
      which should be attempted first
    <html:ol>
      <html:li>Next Door</html:li>
      <html:li>Front Desk</html:li>
      <html:li>On Doorstep</html:li>
    </html:ol>
    <html:b>Note</html:b> Remember to leave a note detailing where 
      to pick up the package.
    </html:p>
  </special-instructions>
</order>

The document begins with the optional XML declaration, which specifies which version of XML is being used and the character encoding used by the document. This is followed by the xml-stylesheet processing instruction, which binds a style sheet containing formatting instructions to the XML document so that user applications such as Web browsers can render it in a more attractive manner. Processing instructions are generally used to embed application-specific information in an XML document. For instance, most applications that process the contents of the above document would ignore the xml-stylesheet processing instruction. On the other hand, applications used for displaying XML documents, such as a Web browser, would use the information in the processing instruction to locate the style sheet that contains special instructions for displaying the document.
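To make the roles of the XML declaration and the processing instruction more concrete, the sketch below loads a shortened, hypothetical version of the order document and inspects both. It assumes the XmlDocument class is used. The class name and the FindStylesheetData helper are invented for this example.

```csharp
using System;
using System.Xml;

public class PrologSketch {

  // Returns the data of the first xml-stylesheet processing instruction
  // at the top level of the document, or null if there is none.
  public static string FindStylesheetData(XmlDocument doc) {
    foreach (XmlNode node in doc.ChildNodes) {
      XmlProcessingInstruction pi = node as XmlProcessingInstruction;
      if (pi != null && pi.Target == "xml-stylesheet") {
        return pi.Data;
      }
    }
    return null;
  }

  public static void Main(string[] args) {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(
      "<?xml version=\"1.0\" encoding=\"iso-8859-1\" ?>" +
      "<?xml-stylesheet type=\"text/xsl\" href=\"orders.xsl\"?>" +
      "<order id=\"ord123456\" />");

    // The XML declaration is exposed as its own node type.
    XmlDeclaration decl = doc.FirstChild as XmlDeclaration;
    Console.WriteLine("Version={0}, Encoding={1}", decl.Version, decl.Encoding);

    // An application that does not render the document can simply skip this node.
    Console.WriteLine("Stylesheet data: {0}", FindStylesheetData(doc));
  }
}
```

The processing instruction data is handed to the application as an uninterpreted string. Deciding what href means is left to applications that care about style sheets.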

Unicode + Angle Brackets = Interoperability

Because the XML 1.0 syntax is text based and fairly straightforward to parse, XML has emerged as the premier data interchange format when cross-platform interoperability is required. The wide availability of XML parsers for many of the popular operating systems makes it easy for disparate parties on different platforms to standardize on XML as the interchange format when they need to share information.

Being based on Unicode makes XML suitable for sharing information across global networks such as the World Wide Web.

The Infoset and the XML Family of Technologies

Although the platform interoperability and extensibility gained by using the text-based XML syntax are excellent advantages of using XML as a data representation format, they are only one aspect of XML's usefulness to application developers. Another major benefit of using XML is that it gives one access to a plethora of technologies for manipulating, structuring, transforming and querying data.

The XML Infoset

The W3C XML Information Set recommendation describes an abstract representation of an XML document. The XML Infoset is primarily meant to act as a set of definitions used by XML technologies to formally describe what parts of an XML document they operate upon. Several W3C XML technologies are described in terms of the XML Infoset, including SOAP 1.2, XML Schema, and XQuery.

The XML Infoset is a tree-based hierarchical representation of an XML document. An XML document's information set consists of a number of information items, which are abstract representations of the components of an XML document. There are information items representing the document, its elements, attributes, processing instructions, comments, characters, notations, namespaces, unparsed entities, unexpanded entity references, and the document type declaration. The XML Infoset is an official attempt to define what should be considered to be significant information in an XML document. For example, the infoset does not distinguish between the two forms of empty element. So the following

  <test></test>
  <test/>

are considered equivalent according to the XML Infoset. Similarly, the kind of quotation marks used for attributes is not considered significant; thus, the elements

  <test attr='value'/>
  <test attr="value"/>

are considered equivalent according to the XML Infoset. A list of aspects of XML 1.0 syntax that are not considered significant by the XML Infoset is provided in Appendix D of the W3C XML Information Set recommendation.
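The equivalences described above can be checked informally by parsing each form and comparing what the parser reports. The sketch below is only an approximation of an infoset comparison. It assumes XmlDocument as the parser and compares just the element name, the number of children, and the attribute names and values. The class name and the Describe helper are hypothetical.

```csharp
using System;
using System.Xml;

public class InfosetEquivalenceSketch {

  // Summarizes the information this sketch compares: element name,
  // number of child nodes, and attribute names and values.
  public static string Describe(string xml) {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xml);
    XmlElement e = doc.DocumentElement;

    string summary = e.Name + " children=" + e.ChildNodes.Count;
    foreach (XmlAttribute a in e.Attributes) {
      summary += " " + a.Name + "=" + a.Value;
    }
    return summary;
  }

  public static void Main(string[] args) {
    // The two forms of empty element produce the same summary.
    Console.WriteLine(Describe("<test></test>") == Describe("<test/>"));
    // The kind of quotation mark is not part of the attribute value.
    Console.WriteLine(Describe("<test attr='value'/>") == Describe("<test attr=\"value\"/>"));
  }
}
```

Both comparisons report equality. Note that some APIs retain syntactic detail that goes beyond the Infoset; for example, the IsEmpty property of XmlElement records which empty-element form was used, so that the document can be written back in the same form.") != InfosetEquivalenceSketch.Describe("<test/>")) throw new Exception("empty element forms differ");
if (InfosetEquivalenceSketch.Describe("<test attr='value'/>") != InfosetEquivalenceSketch.Describe("<test attr=\"value\"/>")) throw new Exception("quote styles differ");
if (InfosetEquivalenceSketch.Describe("<test attr='value'/>") != "test children=0 attr=value") throw new Exception("unexpected summary");
</test>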

The XML Information Set recommendation describes the concept of synthetic infosets, which are infosets created by means other than parsing a textual XML document. Synthetic infosets pave the way for processing non-XML data using XML technologies, as long as this data can be mapped to an XML Infoset. An example of processing a synthetic infoset is the ObjectXPathNavigator, which enables one to query objects in the .NET Framework using XPath or transform them using XSLT.
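The idea can be sketched without the ObjectXPathNavigator itself. The example below is a much simpler stand-in and not that class: it builds an in-memory tree directly from an array of strings, so no XML text is ever parsed, and then queries the tree with XPath. The class name, the FromArtists helper, and the element names are hypothetical.

```csharp
using System;
using System.Xml;

public class SyntheticInfosetSketch {

  // Builds a tree of nodes from an array of strings. The data never
  // exists as angle-bracket text; it is mapped straight to nodes.
  public static XmlDocument FromArtists(string[] artists) {
    XmlDocument doc = new XmlDocument();
    XmlElement root = doc.CreateElement("artists");
    doc.AppendChild(root);
    foreach (string name in artists) {
      XmlElement e = doc.CreateElement("artist");
      e.InnerText = name;
      root.AppendChild(e);
    }
    return doc;
  }

  public static void Main(string[] args) {
    XmlDocument doc = FromArtists(new string[] { "Nelly", "Baby D" });

    // XPath works on the tree regardless of how the tree was produced.
    XmlNode second = doc.SelectSingleNode("/artists/artist[2]");
    Console.WriteLine(second.InnerText);
  }
}
```

A fuller implementation would expose the objects through a custom navigator rather than copying them into a DOM, but the principle is the same: XML tools operate on the information items, not on the text.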

Schema Languages

An XML schema language is used to describe the structure and content of an XML document. For instance, a schema can be used to specify a document that consists of one or more compact-disc elements, each of which contains a price, title, and artist element as children. During document interchange, an XML schema describes the contract between the producer and consumer of XML, since it describes what constitutes a valid XML message between the two parties. Although a number of schema languages exist for XML, from DTDs to XDR, the one that currently rules the roost is the W3C XML Schema Definition Language, typically abbreviated as XSD.

XSD is unique among XML schema languages because it is the first to attempt to expand the role of an XML schema outside of its traditional role of describing the contract between two entities exchanging documents. XSD introduces the concept of a Post Schema Validation Infoset (PSVI). A conformant XSD processor accepts an XML Infoset as input and transforms it into a PSVI upon validation. A PSVI is the original input XML Infoset with new information items added and new properties added to existing information items. The W3C XML Schema recommendation lists the contributions to the Post Schema Validation Infoset.

One important class of PSVI contributions is type annotations. Elements and attributes become strongly typed and have datatype information associated with them. Such strongly-typed XML is very versatile because it can now be mapped to objects using technologies like the .NET Framework's XmlSerializer, mapped to relational tables using technologies like SQLXML and the .NET Framework's DataSet, or it can be processed using XML query languages that take advantage of strong typing, such as XPath 2.0 and XQuery.

Below is a sample schema fragment that describes the items element in the sample document in the section entitled The Anatomy of an XML Document.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="items">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="compact-disc" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="compact-disc">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="price" type="xs:decimal" />
      <xs:element name="artist" type="xs:string" />
      <xs:element name="title" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>
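A schema like this one can be applied from code. The following sketch assumes a version of the .NET Framework that provides XmlDocument.Validate and the XmlSchemaSet class (2.0 or later, which postdates the APIs used elsewhere in this article). The class name, the LoadValidated helper, and the sample documents are hypothetical. The schema is the fragment above with whitespace removed.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public class SchemaSketch {

  const string Schema =
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
    "<xs:element name='items'><xs:complexType><xs:sequence>" +
    "<xs:element ref='compact-disc' minOccurs='0' maxOccurs='unbounded'/>" +
    "</xs:sequence></xs:complexType></xs:element>" +
    "<xs:element name='compact-disc'><xs:complexType><xs:sequence>" +
    "<xs:element name='price' type='xs:decimal'/>" +
    "<xs:element name='artist' type='xs:string'/>" +
    "<xs:element name='title' type='xs:string'/>" +
    "</xs:sequence></xs:complexType></xs:element>" +
    "</xs:schema>";

  // Loads the XML text and validates it against the schema above.
  // Returns the validated document, or null if any error was reported.
  public static XmlDocument LoadValidated(string xml) {
    XmlDocument doc = new XmlDocument();
    doc.Schemas.Add(null, XmlReader.Create(new StringReader(Schema)));
    doc.LoadXml(xml);

    bool valid = true;
    doc.Validate(delegate(object sender, ValidationEventArgs e) {
      if (e.Severity == XmlSeverityType.Error) valid = false;
    });
    return valid ? doc : null;
  }

  public static void Main(string[] args) {
    XmlDocument good = LoadValidated(
      "<items><compact-disc><price>16.95</price>" +
      "<artist>Nelly</artist><title>Nellyville</title></compact-disc></items>");

    // After validation, each node carries the type assigned by the schema.
    XmlNode price = good.SelectSingleNode("/items/compact-disc/price");
    Console.WriteLine("Type of price: {0}", price.SchemaInfo.SchemaType.TypeCode);

    // "cheap" is not a valid xs:decimal, so this document is rejected.
    XmlDocument bad = LoadValidated(
      "<items><compact-disc><price>cheap</price>" +
      "<artist>Nelly</artist><title>Nellyville</title></compact-disc></items>");
    Console.WriteLine("Second document valid: {0}", bad != null);
  }
}
```

The SchemaInfo property is one way the .NET Framework surfaces PSVI contributions such as type annotations; the price element should be reported as a decimal rather than as untyped text.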

Tree Model-based APIs

A tree model API exposes an XML document as a tree of nodes, which is typically loaded in memory all at once. The most popular tree model API for XML is the W3C Document Object Model (DOM). The DOM enables the programmatic reading, manipulation, and modification of an XML document.

Below is an example of using the XmlDocument class in the .NET Framework to obtain the artist name and title of the first compact-disc in an items element.

using System; 
using System.Xml; 

public class Test{

  public static void Main(string[] args){

    XmlDocument doc = new XmlDocument(); 
    doc.Load("test.xml"); 

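    // Assumes test.xml has <items> as its document element, so the first
    // child of the document element is the first <compact-disc> element.
    XmlElement firstCD = (XmlElement) doc.DocumentElement.FirstChild;
    XmlElement artist  =
      (XmlElement) firstCD.GetElementsByTagName("artist")[0];
    XmlElement title   =
      (XmlElement) firstCD.GetElementsByTagName("title")[0];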

    Console.WriteLine("Artist={0}, Title={1}", artist.InnerText, title.InnerText);
  }
}
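The example above only reads from the tree. Since the DOM also supports modification, here is a small sketch of changing a document in memory, again assuming the XmlDocument class. The class name, the Reprice helper, the on-sale attribute, and the new price are hypothetical and not part of the sample order.

```csharp
using System;
using System.Xml;

public class DomEditSketch {

  // Changes the price of the first compact-disc element, marks it
  // with an attribute, and returns the modified document as text.
  public static string Reprice(string xml, string newPrice) {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xml);

    XmlElement cd = (XmlElement) doc.SelectSingleNode("/items/compact-disc[1]");
    cd["price"].InnerText = newPrice;     // replace existing content
    cd.SetAttribute("on-sale", "true");   // add new metadata

    return doc.OuterXml;
  }

  public static void Main(string[] args) {
    string xml =
      "<items><compact-disc><price>16.95</price>" +
      "<artist>Nelly</artist><title>Nellyville</title></compact-disc></items>";
    Console.WriteLine(Reprice(xml, "14.95"));
  }
}
```

Because the whole tree is in memory, edits can be made anywhere in the document and in any order before it is saved, which is the main convenience a tree model offers over the streaming APIs described later.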

Cursor-based APIs

An XML cursor API can be thought of as a lens that moves across the XML document, focusing on distinct aspects of the document as directed. The XPathNavigator class in the .NET Framework is an example of an XML cursor API. The advantage of XML cursor APIs over tree model APIs is that the entire XML document does not have to be in memory, which allows the producer of the XML to optimize by generating the document on an "as needed" basis.

Below is an example of using the XPathNavigator class in the .NET Framework to obtain the artist name and title of the first compact-disc in an items element.

using System; 
using System.Xml; 
using System.Xml.XPath; 

public class Test{

  public static void Main(string[] args){

    XmlDocument doc = new XmlDocument(); 
    doc.Load("test.xml"); 

    XPathNavigator nav = doc.CreateNavigator(); 

    nav.MoveToFirstChild(); //move from root node to document element (items)
    nav.MoveToFirstChild(); //move from items element to first compact-disc element
    
    //move from compact-disc element to artist element 
    nav.MoveToFirstChild();
    nav.MoveToNext(); 
    string artist = nav.Value; 

    //move from artist element to title element
    nav.MoveToNext(); 
    string title = nav.Value; 

    Console.WriteLine("Artist={0}, Title={1}", artist, title);
  }
}

Streaming APIs

A streaming API for processing XML enables one to process an XML document without storing much more than the context of the current node being processed in memory. Such APIs make it possible to process large XML files without incurring an unbearably large memory footprint. There are two main classes of streaming APIs for XML processing: push-based XML parsers and pull-based XML parsers.

Push-based parsers such as SAX work by moving across an XML stream then "pushing" events to registered event handlers (callback methods) when XML nodes are encountered. Pull-based parsers such as the XmlReader class in the .NET Framework act as forward-only cursors over an XML stream.
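The difference between the two models can be sketched by layering a tiny push interface on top of a pull parser. The code below is not SAX and not a .NET Framework API: the TextHandler delegate and the PushSketch class are hypothetical, and the sketch assumes a .NET Framework version that provides XmlReader.Create (2.0 or later). The loop in Parse pulls nodes from the reader, while the caller of Parse only registers a callback and waits to be called.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical callback type for this sketch; not part of the .NET Framework.
public delegate void TextHandler(string elementName, string text);

public class PushSketch {

  // Drives a pull-based XmlReader and pushes each text node, together with
  // the name of the most recently opened element, to the handler.
  public static void Parse(string xml, TextHandler handler) {
    using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
      string current = null;
      while (reader.Read()) {
        if (reader.NodeType == XmlNodeType.Element) {
          current = reader.Name;
        } else if (reader.NodeType == XmlNodeType.Text) {
          handler(current, reader.Value);
        }
      }
    }
  }

  // Collects "name=text" pairs by registering a callback with Parse.
  public static List<string> Collect(string xml) {
    List<string> seen = new List<string>();
    Parse(xml, delegate(string name, string text) { seen.Add(name + "=" + text); });
    return seen;
  }

  public static void Main(string[] args) {
    string xml = "<compact-disc><artist>Nelly</artist><title>Nellyville</title></compact-disc>";
    foreach (string entry in Collect(xml)) {
      Console.WriteLine(entry);
    }
  }
}
```

In the push model the parser owns the loop and the application reacts; in the pull model the application owns the loop and can stop reading as soon as it has what it needs.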

Below is an example of using the XmlReader class in the .NET Framework to obtain the artist name and title of the first compact-disc in an items element.

using System; 
using System.Xml; 

public class Test{

  public static void Main(string[] args){

    string artist = null, title = null; 
    XmlTextReader reader = new XmlTextReader("test.xml"); 

    reader.MoveToContent(); //move from root node to document element (items)

    /* keep reading until we get to the first <artist> element */
    while(reader.Read()){

      if((reader.NodeType == XmlNodeType.Element) && reader.Name.Equals("artist")){

        artist = reader.ReadElementString();
        title  = reader.ReadElementString(); 
        break; 
      }
    }
    Console.WriteLine("Artist={0}, Title={1}", artist, title);
  }
}

XML Query

In some cases, attempting to extract information from an XML document using an API may be too cumbersome either because the criteria for finding the data are non-trivial or the API fails to expose certain aspects of an XML document that are amenable to certain queries. XML query languages such as XPath 1.0 and the forthcoming XQuery provide rich mechanisms for extracting information from XML infosets.

Below is an example showing how to use XPath to obtain the artist name and title of the first compact-disc in an items element.

using System; 
using System.Xml.XPath; 

public class Test{

  public static void Main(string[] args){
    
    XPathDocument doc   = new XPathDocument("test.xml"); 
    XPathNavigator nav  = doc.CreateNavigator(); 

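    // The union selects both nodes. They are returned in document order,
    // so the artist element comes before the title element.
    XPathNodeIterator iterator = nav.Select("/items/compact-disc[1]/artist | /items/compact-disc[1]/title");

    iterator.MoveNext();
    Console.WriteLine("Artist={0}", iterator.Current.Value);

    iterator.MoveNext();
    Console.WriteLine("Title={0}", iterator.Current.Value);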

  }
}

XML Transformation

There is often a need to transform XML documents from one vocabulary to another. Sometimes this is so they can be rendered in a print-friendly format or in a Web browser; at other times it is to convert documents received from an external entity to a format with which one is more familiar.

XSLT is the premier XML transformation language. A transformation expressed in XSLT describes rules for transforming a source tree into a result tree. The transformation is achieved by associating patterns with templates. A pattern is an XPath expression, and can be thought of as a regular expression that matches parts of an XML source tree as opposed to matching parts of a string. A pattern is matched against elements in the source tree. On successful matches, a template is instantiated to create part of the result tree. In constructing the result tree, elements from the source tree can be filtered and reordered, and arbitrary structure can be added.

The following XSLT style sheet converts an items element into an XHTML Web page containing a table of compact disc information.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns="http://www.w3.org/1999/xhtml">

<xsl:output method="xml" indent="yes"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" />


    <xsl:template match="/">

    <html lang="en" xml:lang="en">
     <head>
      <title>Order Information - ord123456</title>
     </head>
     <body>
       <table border="1">
        <tr><th>Artist</th><th>Title</th><th>Price</th></tr>

        <xsl:for-each select="items/compact-disc">
        <tr>
        <td><xsl:value-of  select="artist" /></td>
        <td><xsl:value-of  select="title" /></td>
        <td><xsl:value-of  select="price" /></td>
        </tr>
        </xsl:for-each>

       </table>


     </body>     
    </html>
     
   </xsl:template>

</xsl:stylesheet>

The XHTML document produced by this style sheet is shown below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Order Information - ord123456</title>
  </head>
  <body>
    <table border="1">
      <tr>
        <th>Artist</th>
        <th>Title</th>
        <th>Price</th>
      </tr>
      <tr>
        <td>Nelly</td>
        <td>Nellyville</td>
        <td>16.95</td>
      </tr>
      <tr>
        <td>Baby D</td>
        <td>Lil Chopper Toy</td>
        <td>17.55</td>
      </tr>
    </table>


  </body>
</html>

which looks like this when rendered in a Web browser:

Artist    Title              Price
Nelly     Nellyville         16.95
Baby D    Lil Chopper Toy    17.55
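A transformation of this kind can also be run from code. The sketch below assumes the XslCompiledTransform class, which is available in the .NET Framework 2.0 and later. To keep the example short it uses a reduced, hypothetical style sheet that emits plain text instead of XHTML; the class name and the Transform helper are likewise invented for the example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public class TransformSketch {

  // A reduced style sheet: one template matching the root, which
  // writes "artist - title;" for each compact-disc element.
  const string Stylesheet =
    "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>" +
    "<xsl:output method='text'/>" +
    "<xsl:template match='/'>" +
    "<xsl:for-each select='items/compact-disc'>" +
    "<xsl:value-of select='artist'/> - <xsl:value-of select='title'/>;" +
    "</xsl:for-each>" +
    "</xsl:template>" +
    "</xsl:stylesheet>";

  // Applies the style sheet above to the given XML text.
  public static string Transform(string xml) {
    XslCompiledTransform xslt = new XslCompiledTransform();
    xslt.Load(XmlReader.Create(new StringReader(Stylesheet)));

    StringWriter output = new StringWriter();
    xslt.Transform(XmlReader.Create(new StringReader(xml)), null, output);
    return output.ToString();
  }

  public static void Main(string[] args) {
    string xml =
      "<items>" +
      "<compact-disc><price>16.95</price><artist>Nelly</artist><title>Nellyville</title></compact-disc>" +
      "<compact-disc><price>17.55</price><artist>Baby D</artist><title>Lil Chopper Toy</title></compact-disc>" +
      "</items>";
    Console.WriteLine(Transform(xml));
  }
}
```

The price elements are filtered out simply because no template or instruction copies them, which illustrates how the result tree can omit or reorder parts of the source tree.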

Conclusion

XML is more than just a text format for describing documents. It is a mechanism for describing structured and semi-structured data, which provides access to a rich family of technologies for processing such data. Powerful abstractions like the XML Information Set open the door to processing non-textual data such as file systems, the Windows® registry, relational databases and even programming language objects using XML technologies. XML brings us one step closer to universal data access.

Further Reading

XML in 10 Points

Lessons from the Component Wars: An XML Manifesto

XML Information Set Recommendation