The XML Litmus Test

 

Dare Obasanjo
Microsoft Corporation

October 12, 2004

Summary: Dare Obasanjo provides some simple guidelines for determining when XML is the appropriate technology to use in a software application or architecture design. (6 printed pages)

Introduction

As the popularity of XML has grown, its usage has spread to every nook and cranny of software application development. This is because, unlike previous data formats, XML can easily represent both rigid, highly structured data (such as relational data from a database, program configuration files, or spreadsheets) and semi-structured data (such as a Web page or business document). Beyond that, XML has a number of characteristics that have caused it to be widely adopted as a data representation format. It is extensible, platform-independent, and supports internationalization by being fully Unicode compliant. Because XML is a text-based format, one can read and edit XML documents using standard text-editing tools when the need arises. Together, these properties have led to the widespread adoption of XML as the lingua franca of information interchange.

This popularity has meant that in certain cases XML is chosen as a data representation format when it is not the best tool for the job. This article provides a clear set of guidelines for determining when XML is the right choice as a data representation for a particular situation. Additionally, some examples of appropriate and inappropriate uses of XML are described.

The XML Litmus Test

In my article Understanding XML I described some of the benefits of using XML as a data representation format. Because XML is widely supported across software platforms, its primary benefit is that applications on different platforms can interoperate more easily using XML than with most other data representation formats. The second major benefit of using XML is a side effect of its popularity: there are a large number of off-the-shelf tools for dealing with XML, including parsers, schema languages, query languages, programming models, editors, and many more. No other data representation format has such a wide array of tools for processing instances of the format on so many platforms.

The aforementioned benefits give us the XML Litmus Test. XML is the appropriate tool for the job if the following criteria are satisfied by choosing XML as the data representation format for a given application:

  1. There is a need to interoperate across multiple software platforms.
  2. One or more of the off-the-shelf tools for dealing with XML can be leveraged when producing or consuming the data.
  3. Parsing performance is not critical.
  4. The content is not primarily binary content, such as a music or image file.
  5. The content does not contain control characters other than carriage return, line feed, or tab; any others are illegal in XML.

If the expected usage scenario does not satisfy at least four of the above criteria, then it may not make much sense to use XML as the data representation format for the situation in question.
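As a rough illustration, the checklist above can be expressed as a small scoring function. The Python sketch below is not taken from any library: the `Scenario` fields and the function names are hypothetical, and the control-character check covers only characters below U+0020, which is an assumption about what criterion 5 is meant to catch.

```python
from dataclasses import dataclass

# Characters below U+0020 that XML 1.0 does allow.
_ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def has_illegal_control_chars(text):
    """Return True if text holds a control character below U+0020
    other than tab, line feed, or carriage return."""
    return any(ord(ch) < 0x20 and ch not in _ALLOWED_CONTROLS for ch in text)


@dataclass
class Scenario:
    """Hypothetical description of a usage scenario (illustrative names)."""
    needs_cross_platform_interop: bool
    can_reuse_xml_tools: bool
    parsing_performance_critical: bool
    primarily_binary: bool
    sample_content: str


def litmus_score(s):
    """Count how many of the five criteria the scenario satisfies."""
    checks = [
        s.needs_cross_platform_interop,                   # criterion 1
        s.can_reuse_xml_tools,                            # criterion 2
        not s.parsing_performance_critical,               # criterion 3
        not s.primarily_binary,                           # criterion 4
        not has_illegal_control_chars(s.sample_content),  # criterion 5
    ]
    return sum(checks)


def passes_litmus_test(s, threshold=4):
    """Apply the 'at least four criteria' rule from the text."""
    return litmus_score(s) >= threshold


feed = Scenario(True, True, False, False, "Plain text\nwith line breaks")
audio = Scenario(False, False, True, True, "\x00\x01 raw samples")
print(passes_litmus_test(feed), passes_litmus_test(audio))
```

A checklist like this is only a starting point; the trade-offs behind each answer still call for judgment.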

Even when the usage of XML satisfies the criteria, there are still trade-offs to consider. For example, because XML is a text-based format that uses redundant tags for labeling content, there are often complaints about the bloat caused by choosing it as a data representation format. However, where XML brings the benefit of interoperability and a wide array of technologies for describing and processing the data, the trade-off is worth it. This is exactly the trade-off made in using XML as the basis for network protocols such as XMPP and SOAP.
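To make the size cost concrete, the following Python sketch serializes one small record as XML and as an ad hoc comma-delimited line, then compares the lengths. The record and the delimited format are invented for illustration. The numbers it prints say nothing about real protocols such as SOAP or XMPP.

```python
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"day": "31", "month": "Aug", "year": "2002"}

# XML form: every value is wrapped in an opening and a closing tag.
root = ET.Element("date")
for name, value in record.items():
    ET.SubElement(root, name).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Ad hoc delimited form: values only; the field order must be agreed
# on out of band, which is where the interoperability cost hides.
delimited_text = ",".join(record.values())

print(xml_text)
print(delimited_text)
print(len(xml_text), "characters versus", len(delimited_text))
```

The delimited line is far shorter, but it carries no field names, has no escaping rules, and needs a custom parser on every platform that consumes it.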

Using XML as an application configuration file format is another area where a trade-off has to be considered. On the surface, XML seems more complex than most application configuration files need, and in most cases this is true. However, XML brings a number of benefits to the table. These include internationalization, built-in mechanisms for creating comments and escaping text, and widely deployed tools for consuming and producing XML. It is often the case that the cost of writing a parser and other tools for a custom configuration file format outweighs any benefit from that format's simplicity over XML. Considering this trade-off, XML is usually, but not always, the right tool for the job when designing application configuration files.
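The sketch below shows what those benefits look like in practice, using the `xml.etree.ElementTree` parser from Python's standard library. The `configuration` and `setting` element names are hypothetical, not a real application's schema. Comments, escaped markup characters, and non-ASCII text are all handled by the parser with no custom code.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document; element and attribute names
# are invented for illustration.
CONFIG = """<configuration>
  <!-- Comments are part of XML, so no custom syntax is needed. -->
  <setting name="greeting" value="Grüße &amp; &lt;welcome&gt;"/>
  <setting name="retries" value="3"/>
</configuration>"""


def load_settings(xml_text):
    """Return a dict that maps each setting name to its string value."""
    root = ET.fromstring(xml_text)
    return {s.get("name"): s.get("value") for s in root.iter("setting")}


settings = load_settings(CONFIG)
print(settings["greeting"])      # escaped characters come back unescaped
print(int(settings["retries"]))  # values are strings until converted
```

The entire "parser" for this format is one call to `ET.fromstring`, which is the reuse argument in miniature.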

The rest of this article applies the XML Litmus Test to a number of established uses of XML.

Some Appropriate Uses of XML

XML was originally intended as a language for defining new document formats for the World Wide Web. XML is derived from the Standard Generalized Markup Language (SGML), and can be considered a meta-language: a language for defining markup languages. The benefits of using XML on the Web are readily apparent. On a global network such as the World Wide Web, interoperability across platforms is key, as is XML's full support for internationalization through Unicode compliance.

One popular usage of XML on the Web today is for XML syndication feeds such as RSS 1.0 or RSS 2.0. Below is a sample RSS 2.0 feed.

<rss version="2.0">
  <channel>
    <title>MSDN XML Developer Center</title>
    <link>https://msdn.microsoft.com/xml/</link>
    <description>
      Extensible Markup Language (XML) is the universal
      format for data on the Web. XML allows developers to easily
      describe and deliver rich, structured data from any application
      in a standard, consistent way. XML does not replace HTML; rather,
      it is a complementary format.
    </description>
    <item>
      <title>XML Files: XPath Selections and Custom Functions, and More</title>
      <link>https://msdn.microsoft.com/msdnmag/issues/03/02/xmlfiles/TOC.asp</link>
      <description>
        Get your questions about XPath selections,
        custom functions, and more answered in this month's column.
      </description>
    </item>
    <item>
      <title>Extreme XML: XML Serialization in the .NET Framework</title>
      <link>https://msdn.microsoft.com/library/en-us/dnexxml/html/xml01202003.asp</link>
      <description>
        Dare Obasanjo discusses XML serialization and how
        you can use it within the .NET Framework to improve
        interoperability and meet W3C standards.
      </description>
    </item>
  </channel>
</rss>

Using XML as the data representation format for syndication feeds passes the XML Litmus Test. Interoperability and support for internationalization are goals of every Web technology, so the benefits of using XML for content syndication are clear in this regard. Secondly, syndication feeds are primarily textual and do not usually contain significant amounts of binary data or control characters. Finally, a number of applications for processing and displaying syndication feeds can be built using off-the-shelf XML technologies. For example, desktop aggregators like RSS Bandit use XSLT to display RSS feeds in multiple layouts. Another example is the MSDN Web site, which uses XSLT to render RSS feeds in Web browsers in a human-readable format, as shown by the RSS feed for the MSDN XML Developer Center.
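The point about off-the-shelf tools can be seen with any general-purpose XML parser. The examples above use XSLT; the Python sketch below substitutes the standard library's `xml.etree.ElementTree` module, since Python ships no XSLT processor. It lists the items in a trimmed copy of the sample feed. The `list_items` helper is a hypothetical name, and the sketch assumes the RSS 2.0 layout shown in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample feed (descriptions omitted for brevity).
FEED = """<rss version="2.0">
  <channel>
    <title>MSDN XML Developer Center</title>
    <link>https://msdn.microsoft.com/xml/</link>
    <item>
      <title> XML Files: XPath Selections and Custom Functions, and More</title>
      <link>
        https://msdn.microsoft.com/msdnmag/issues/03/02/xmlfiles/TOC.asp
      </link>
    </item>
    <item>
      <title> Extreme XML: XML Serialization in the .NET Framework </title>
      <link>
        https://msdn.microsoft.com/library/en-us/dnexxml/html/xml01202003.asp
      </link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("./channel/item"):
        # findtext returns None when the child is missing, so fall back
        # to an empty string before stripping surrounding whitespace.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


for title, link in list_items(FEED):
    print(title, "->", link)
```

No RSS-specific library is involved, which is the reuse that criterion 2 of the litmus test rewards.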

Another area where XML usage is beginning to flourish is as the data representation format for business documents. Not only have office productivity suites such as Microsoft Office begun to use XML as the basis of their document storage formats, but a number of vertical industries have standardized on XML schemas for information interchange. Examples of standardized XML vocabularies for information exchange in specific vertical industries include HR-XML for human resources, HL7-XML for the healthcare industry, XBRL for financial reporting, and many more.

Applying the criteria in the XML Litmus Test to business documents also shows this to be an appropriate usage of XML. First of all, business documents are primarily text, so they satisfy the criteria around binary data and illegal XML characters. When exchanging business documents with other entities, ensuring that the documents are platform neutral is a significant factor. This is particularly important because it is unlikely that different business entities run their software on the same platforms, or even that different organizations within a particular business entity run a homogeneous network. In typical usage scenarios for business documents, parsing performance is important but not critical. No one wants a word processor to take minutes to load a file, but using XML does not add enough overhead to make load times unacceptable. Finally, the wide array of tools for processing XML comes in very handy for creating and managing business document workflows. Declarative schema languages such as W3C XML Schema and Schematron can be used to enforce constraints on business documents entering or leaving the system. XML transformation languages such as XSLT can be used to convert between different XML business document formats or to publish them in non-XML formats, such as HTML or PDF. Documents can be filtered and routed using XML pattern languages, such as XPath, to find documents with characteristics of interest. And the list goes on.
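As a small illustration of filtering with a pattern language, the Python sketch below selects documents from a batch using path expressions. It relies on `xml.etree.ElementTree`, which supports only a limited subset of XPath 1.0. The `batch`, `order`, and `invoice` vocabulary is invented for the example and does not come from any of the industry standards named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical batch of business documents.
BATCH = """<batch>
  <order id="1" currency="USD"><total>120.00</total></order>
  <order id="2" currency="EUR"><total>75.50</total></order>
  <invoice id="3" currency="EUR"><total>10.00</total></invoice>
</batch>"""


def select_ids(batch_xml, path):
    """Return the id attribute of every element matched by path."""
    root = ET.fromstring(batch_xml)
    return [element.get("id") for element in root.findall(path)]


# Orders denominated in euros, for example to route to a European office.
euro_orders = select_ids(BATCH, "./order[@currency='EUR']")

# Any kind of document denominated in euros.
euro_documents = select_ids(BATCH, "./*[@currency='EUR']")

print(euro_orders, euro_documents)
```

The routing rule lives in a one-line path expression rather than in hand-written traversal code, so it can be changed without touching the program logic.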

An Inappropriate Use of XML

One usage of XML that I consider inappropriate is as a syntax for programming languages, as is done in o:XML. Consider the following o:XML fragment for a method that takes three string arguments and returns a date element.

<!-- procedure definition -->
<o:procedure name="ex:formatDate">
  <o:param name="day"/>
  <o:param name="month"/>
  <o:param name="year" />
  <o:do>
    <date>
      <day><o:eval select="$day"/></day>
      <month><o:eval select="$month"/></month>
      <year><o:eval select="$year"/></year>
    </date>
  </o:do>
</o:procedure>

<!-- procedure call -->
<!-- 'year' has a default value and so is optional -->
<ex:formatDate year="2002" month="'Aug'" day="31"/>

Compare that snippet to the following XQuery expression that does the same thing:

declare function ex:formatDate($day as xsd:string, $month as xsd:string, $year as xsd:string) 
   as element(date)
{
    <date>
      <day>{$day}</day>
      <month>{$month}</month>
      <year>{$year}</year>
    </date>
};

ex:formatDate("31", "Aug", "2002")

The XQuery version is less verbose than the o:XML version, yet performs the same task. The question then is whether the benefits of XML make up for the increased verbosity of the formatDate function in o:XML. Applying the XML Litmus Test, I'd say the answer is no. First, the ASCII-based text format of a language such as XQuery is just as interoperable as an equivalent language that has similar identifiers but uses XML syntax instead. Secondly, of the myriad tools for working with XML, the main one that applies in the case of o:XML is the XML parser, which makes processing language tokens in o:XML easier than parsing XQuery. However, as someone who's written his share of language parsers (in a previous life I designed and implemented SiXDML), I'd argue that the bulk of the work of creating a compiler has little to do with processing language tokens and more to do with the execution logic. The development time saved by reusing an XML parser is a small part of the overall cost of writing a compiler, whereas every user who chooses o:XML over XQuery has to deal with the language's more complex syntax and increased verbosity.

Conclusion

XML is a powerful tool in the software developer's toolbox, but like all tools it has tasks it excels at and others for which it is not as well suited. The XML Litmus Test provides a simple set of guidelines for weighing the costs and benefits of using XML in a software application or architecture.

Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespaces of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).

Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.