Things to Know and Avoid When Querying XML Documents with XPath

 

Dare Obasanjo
Microsoft Corporation

June 30, 2002

The Best Laid Plans of Mice and Men

Inspiration for this article arose from a weekend where nothing that I expected to work on actually panned out. My significant other decided to take a celebratory trip to Las Vegas with a coworker, which coincided quite nicely with my plans to go to Ikea and pick up a bookcase so I could finally unpack my books since moving to Redmond a few months ago. After trundling around Ikea for two hours, I found a bookcase on display that fit the color scheme of my living room, only to find out that some of the necessary pieces were out of stock. I ended up ordering the bookcase and went home empty-handed. Unfortunately, I had already unpacked my books at home and they were strewn all over my living room. This became the perfect opportunity to catalog my burgeoning library, and of course I chose to use XML to achieve this task.

More Than Meets the Eye

The primary purpose of the XML catalog I was building was to have a central place where I stored information about the books I owned that was flexible enough to allow querying and different kinds of presentation, and was also portable. Below is a snippet of my first pass at such a document:

Books.xml

<?xml version="1.0" encoding="UTF-8" ?> 
<bk:books xmlns:bk="urn:xmlns:25hoursaday-com:my-bookshelf" on-loan="yes" >
 <bk:book publisher="IDG books" on-loan="Sanjay" >
  <bk:title>XML Bible</bk:title> 
  <bk:author>Elliotte Rusty Harold</bk:author>
 </bk:book>
 <bk:book publisher="QUE">
  <bk:title>XML By Example</bk:title> 
  <bk:author>Benoit Marchal</bk:author>
 </bk:book>
</bk:books>

I wanted to be able to track whether I had books loaned out by using an on-loan attribute. The on-loan attribute on the root element specified that at least one book was out, and the same attribute on each book element specified the person to whom the book had been lent. In retrospect, this may not have been the best design because it leads to unnecessary coupling between the root element and its children, but bear with me, this was only my first pass.

With my simple format designed, I decided to run some practice queries on it to see if I was satisfied with the format. The first query I tried using the SelectSingleNode method in the System.Xml.XmlNode class was the following:

   //*[position() = 1]/@on-loan 

I intended this to mean "select all the nodes in the document, then give me the on-loan attribute of the first one." The query returned the following:

   on-loan="yes" 

So, the answer to the question, "Do I have any books out?" was yes. However, something interesting happened when I simulated what would happen if I failed to update the on-loan value on the root element when one of my books was lent out. I removed the on-loan attribute from the root element and ran the query again. The result was as follows:

   on-loan="Sanjay"

This result is the value on one of the children of the root element. Suspecting a bug, I tried this on MSXML as well and got similar results. Further investigation led me to enlightening discussions with a number of XPath experts on my team and further reading of the XPath recommendation. What I discovered was that as with any non-trivial language designed by multiple parties, there are a number of quirks, idiosyncrasies, inconsistencies, and just plain pitfalls to avoid when dealing with XPath.

Abbreviations and What They Really Mean

The XPath recommendation lists a number of axes that contain nodes that are related to the currently selected node (also known as the context node). To reduce verbosity, a number of abbreviations for certain commonly used axes were specified. The table below shows these abbreviations and their equivalent axes.

Abbreviation    Axis
.               self::node()
..              parent::node()
//              /descendant-or-self::node()/
@               attribute::

There is also the fact that the default axis used on every location step or path expression is the child:: axis. Thus, /bk:books/bk:book is actually equivalent to /child::bk:books/child::bk:book, but is much easier to type.

The * node test is used to select all nodes of the principal node type of the current axis. * is a node test, not an abbreviation for a step. Finally, a predicate with a number in it is equivalent to checking to see if the position of the context node is the same as that number. This means that the query /bk:book[1] is equivalent to /bk:book[position()=1].
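To make the equivalences above concrete, here is a minimal sketch that evaluates abbreviated and unabbreviated forms side by side. It assumes Java's built-in javax.xml.xpath API as a stand-in for the .NET and MSXML processors discussed in this article; the class name AbbreviationDemo and its eval helper are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AbbreviationDemo {
    static final String BOOKS_NS = "urn:xmlns:25hoursaday-com:my-bookshelf";

    // Books.xml from the article, written on one logical line.
    static final String BOOKS_XML =
        "<bk:books xmlns:bk='" + BOOKS_NS + "' on-loan='yes'>"
      + "<bk:book publisher='IDG books' on-loan='Sanjay'>"
      + "<bk:title>XML Bible</bk:title><bk:author>Elliotte Rusty Harold</bk:author></bk:book>"
      + "<bk:book publisher='QUE'>"
      + "<bk:title>XML By Example</bk:title><bk:author>Benoit Marchal</bk:author></bk:book>"
      + "</bk:books>";

    /** Evaluates an XPath 1.0 expression against Books.xml and returns its string value. */
    static String eval(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(BOOKS_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map the bk prefix used in the expressions to the catalog namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "bk".equals(prefix) ? BOOKS_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String namespaceURI) { return null; }
                @Override public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Each pair holds an abbreviated query and its unabbreviated equivalent.
        String[][] pairs = {
            {"count(/bk:books/bk:book)", "count(/child::bk:books/child::bk:book)"},
            {"count(//bk:title)", "count(/descendant-or-self::node()/child::bk:title)"},
            {"string(/bk:books/bk:book[1]/@publisher)",
             "string(/child::bk:books/child::bk:book[position()=1]/attribute::publisher)"},
            {"name(/bk:books/bk:book[1]/bk:title/..)",
             "name(/bk:books/bk:book[1]/bk:title/parent::node())"}
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " -> " + eval(pair[0]));
            System.out.println(pair[1] + " -> " + eval(pair[1]));
        }
    }
}
```

Both members of each pair should print the same value, since the abbreviated form is only shorthand for the longer one.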

Given the information above, we can go back to my original problem query and see why it gave the unexpected results. //*[position() = 1]/@on-loan is actually an abbreviation for /descendant-or-self::node()/child::*[position() = 1]/@on-loan, which visits every node in the document and retrieves the on-loan attribute of the first child element of each of those nodes. With the on-loan attribute removed from the root element, the first such attribute in document order is the one on the first bk:book element. Judicious use of parentheses quickly fixes the problem: (//*)[position() = 1]/@on-loan, which is short for (/descendant-or-self::node()/child::*)[position() = 1]/@on-loan, is actually what I wanted.

Funny enough, shortly after figuring out the problem I realized that a simpler and more efficient query for doing what I required would have been:

   /*/@on-loan

This is a better solution because it only needs to look at the root element of the document. I'll leave you with one more example that highlights why one should think about what abbreviations represent in certain cases to avoid befuddling results.

Abbreviation   Full Query                                             Query Results
//*[1]         /descendant-or-self::node()/child::*[position()=1]     Selects every element that is the first child element of its parent.
(//*)[1]       (/descendant-or-self::node()/child::*)[position()=1]   Selects the first element in the document.
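The difference between the parenthesized and unparenthesized queries can be reproduced with a short sketch. It assumes Java's javax.xml.xpath API in place of the SelectSingleNode call used in the article, and takes the string value of the result, which is the string value of the first selected node in document order. The class name FirstNodeDemo and its helpers are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FirstNodeDemo {
    /** Builds Books.xml; the root's on-loan attribute is present only when requested. */
    static String booksXml(boolean rootHasOnLoan) {
        return "<bk:books xmlns:bk='urn:xmlns:25hoursaday-com:my-bookshelf'"
            + (rootHasOnLoan ? " on-loan='yes'" : "") + ">"
            + "<bk:book publisher='IDG books' on-loan='Sanjay'>"
            + "<bk:title>XML Bible</bk:title><bk:author>Elliotte Rusty Harold</bk:author></bk:book>"
            + "<bk:book publisher='QUE'>"
            + "<bk:title>XML By Example</bk:title><bk:author>Benoit Marchal</bk:author></bk:book>"
            + "</bk:books>";
    }

    /** Returns the string value of the first node selected, or "" if nothing is selected. */
    static String eval(String expression, boolean rootHasOnLoan) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(booksXml(rootHasOnLoan))));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] queries = {
            "//*[position() = 1]/@on-loan",    // on-loan of every first child element
            "(//*)[position() = 1]/@on-loan",  // on-loan of the first element only
            "/*/@on-loan"                      // on-loan of the root element
        };
        for (String query : queries) {
            System.out.println(query
                + " | root has on-loan: '" + eval(query, true) + "'"
                + " | root lacks on-loan: '" + eval(query, false) + "'");
        }
    }
}
```

With the attribute removed from the root, the first query falls through to the value on the first book, while the other two select nothing and yield the empty string.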

Improving Our Math Skills

Queries involving relational or arithmetic operators and strings typically lead to counter-intuitive results. XPath converts all operands in an expression involving relational or arithmetic operators to numbers. Strings that aren't entirely numeric values are converted to NaN (not a number). The following table shows some XPath expressions, what they are implicitly converted to, and the result of the expression.

Expression    Implicit Conversion    Results
'5' + 7       5 + 7                  12
'5' + '7'     5 + 7                  12
5 + 'a'       5 + NaN                NaN
'5' < 7       5 < 7                  True
'5' < '7'     5 < 7                  True
'5' < 'b'     5 < NaN                False
'a' < 'b'     NaN < NaN              False
'a' > 'b'     NaN > NaN              False

It is important to note that the comparison operators (<, >, <=, >=) do not perform lexicographical comparison of string values.
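The rows of the table can be checked mechanically. The following is a small sketch, again assuming Java's javax.xml.xpath API rather than the processors used in the article; ConversionDemo and its number and bool helpers are hypothetical names. The expressions contain only literals, so a trivial placeholder document serves as the context.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ConversionDemo {
    /** Evaluates an expression against a placeholder document, returning the requested type. */
    static Object eval(String expression, QName type) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<placeholder/>")));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc, type);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    /** Result converted with XPath number() semantics. */
    static double number(String expression) {
        return (Double) eval(expression, XPathConstants.NUMBER);
    }

    /** Result converted with XPath boolean() semantics. */
    static boolean bool(String expression) {
        return (Boolean) eval(expression, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) {
        // Arithmetic: string operands are converted to numbers first.
        for (String e : new String[] {"'5' + 7", "'5' + '7'", "5 + 'a'"}) {
            System.out.println(e + " = " + number(e));
        }
        // Relational operators: also numeric, never lexicographical.
        for (String e : new String[] {"'5' < 7", "'5' < '7'", "'5' < 'b'", "'a' < 'b'", "'a' > 'b'"}) {
            System.out.println(e + " is " + bool(e));
        }
    }
}
```

Note that both 'a' < 'b' and 'a' > 'b' come out false, because each side becomes NaN before the comparison takes place.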

Another interesting arithmetic quirk is that although unary minus is defined (for example, -6 is a valid XPath expression), unary plus is not (+6 is not a valid XPath expression). Even more surprising is that multiple negations can be stacked together and still be valid. Thus, ------6 is a valid XPath expression equivalent to the value 6.

XPath's lack of support for scientific/exponential notation tends to trip people up because it is supported both by popular query languages like SQL and popular programming languages like C++.

Expressions that combine arithmetic and relational operations on node sets may also lead to surprising results. Arithmetic operations on node sets convert the value of the first node in the set to a number while relational operators evaluate whether any node in the node set satisfies the condition. Below is an XML document that will be used to show how arithmetic operations and relational operators can lead to expressions that are not associative.

Numbers.xml

<Root>
 <Numbers>
  <Integer value="4" />
  <Integer value="2" />
  <Integer value="3" />
 </Numbers>
 <Numbers>
  <Integer value="2" />
  <Integer value="3" />
  <Integer value="6" />
 </Numbers>
</Root>

The following table shows the lack of associativity of arithmetic operations.

Expression:   Root/Numbers[Integer/@value > 4 - 1]
Results:
   <Numbers>
    <Integer value="4" />
    <Integer value="2" />
    <Integer value="3" />
   </Numbers>
   <Numbers>
    <Integer value="2" />
    <Integer value="3" />
    <Integer value="6" />
   </Numbers>
Explanation:  Selects all the <Numbers> elements in the document that have at least one <Integer> element with a value attribute whose value is greater than 4 minus 1.

Expression:   Root/Numbers[ 1 + Integer/@value > 4]
Results:
   <Numbers>
    <Integer value="4" />
    <Integer value="2" />
    <Integer value="3" />
   </Numbers>
Explanation:  Selects all the <Numbers> elements in the document where 1 plus the value attribute of the first <Integer> element is greater than 4.

If XPath were algebraically associative, then both queries would return the same results.
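A short sketch can confirm that the two queries select different numbers of <Numbers> elements. It assumes Java's javax.xml.xpath API as the processor; NumbersDemo and its count helper are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NumbersDemo {
    // Numbers.xml from the article.
    static final String NUMBERS_XML =
        "<Root>"
      + "<Numbers><Integer value='4'/><Integer value='2'/><Integer value='3'/></Numbers>"
      + "<Numbers><Integer value='2'/><Integer value='3'/><Integer value='6'/></Numbers>"
      + "</Root>";

    /** Returns how many nodes the given path selects, with the document node as context. */
    static int count(String path) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(NUMBERS_XML)));
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + path + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + path, e);
        }
    }

    public static void main(String[] args) {
        // "Any" semantics: some Integer/@value in each group is greater than 3.
        System.out.println(count("Root/Numbers[Integer/@value > 4 - 1]"));
        // "First" semantics: only the first Integer/@value takes part in the addition.
        System.out.println(count("Root/Numbers[1 + Integer/@value > 4]"));
    }
}
```

The first query matches both groups; the second matches only the group whose first value is 4, because 1 + 2 is not greater than 4 for the other group.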

When Is A Set Not A Set?

Although node-sets are unordered collections just like sets from mathematics (or your favorite programming language), they are often treated dissimilarly from sets in the mathematical sense. Some operations in XPath use first semantics when dealing with node sets, while others use any semantics. First semantics means that the value of the node set for that operation is obtained from the first node in the set, while any semantics means that the operation on the node-set is dependent on whether any node in the set satisfies the condition. The section entitled Improving Our Math Skills describes situations where any and first semantics are used.

Another characteristic of XPath node sets that makes them different from mathematical sets is that XPath doesn't directly provide mechanisms for performing set operations like subset, intersection, or symmetric difference. Michael Kay, author of XSLT Programmer's Reference 2nd edition, originally discovered how to use combinations of the count() function and the union operator | to mimic the missing set operators. Below is an XSLT style sheet that performs set operations on the XML document from the previous section along with its output.

STYLESHEET

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
 
 <xsl:output method="text" />

 <xsl:variable name="a" select="/Root/Numbers[1]/Integer/@value"/> 
 <xsl:variable name="b" select="/Root/Numbers[1]/Integer/@value[. > 2]"/> 
 <xsl:variable name="c" select="/Root/Numbers[1]/Integer/@value[. = 3]"/> 

 <xsl:template match="/">
 
 SET A: { <xsl:for-each select="$a"> <xsl:value-of select="." />, </xsl:for-each> }
 SET B: { <xsl:for-each select="$b"> <xsl:value-of select="." />, </xsl:for-each> }
 SET C: { <xsl:for-each select="$c"> <xsl:value-of select="." />, </xsl:for-each> }

  a UNION b:  { <xsl:for-each select="$a | $b"> <xsl:value-of select="." 
/>, </xsl:for-each> }
  b UNION c:  { <xsl:for-each select="$b | $c"> <xsl:value-of select="." 
/>, </xsl:for-each> }
  a INTERSECTION b:  { <xsl:for-each select="$a[count(.|$b) = count($b)]"> 
<xsl:value-of select="." />, </xsl:for-each> }
  a INTERSECTION c:  { <xsl:for-each select="$a[count(.|$c) = count($c)]"> 
<xsl:value-of select="." />, </xsl:for-each> }
  a DIFFERENCE b:  { <xsl:for-each select="$a[count(.|$b) != count($b)] | 
$b[count(.|$a) != count($a)]"> <xsl:value-of select="." />, </xsl:for-each> }
  a DIFFERENCE c:  { <xsl:for-each select="$a[count(.|$c) != count($c)] | 
$c[count(.|$a) != count($a)]"> <xsl:value-of select="." />, </xsl:for-each> }
  a SUBSET OF b:  { <xsl:value-of select="count($b | $a) = count($b)"/> }
  b SUBSET OF a:  { <xsl:value-of select="count($b | $a) = count($a)"/> }
 
 </xsl:template>

</xsl:stylesheet>

OUTPUT

  SET A: { 4, 2, 3,  }
  SET B: { 4, 3,  }
  SET C: { 3,  }

  a UNION b:  { 4, 2, 3,  }
  b UNION c:  { 4, 3,  }
  a INTERSECTION b:  { 4, 3,  }
  a INTERSECTION c:  { 3,  }
  a DIFFERENCE b:  { 2,  }
  a DIFFERENCE c:  { 4, 2,  }
  a SUBSET OF b:  { false }
  b SUBSET OF a:  { true }
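
The same count() and union idiom also works in plain XPath, outside of XSLT. The sketch below assumes Java's javax.xml.xpath API and, instead of XSLT variables, builds each query by splicing the path for one set into the predicate of another. SetOperationsDemo and its helper methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SetOperationsDemo {
    static final String NUMBERS_XML =
        "<Root>"
      + "<Numbers><Integer value='4'/><Integer value='2'/><Integer value='3'/></Numbers>"
      + "<Numbers><Integer value='2'/><Integer value='3'/><Integer value='6'/></Numbers>"
      + "</Root>";

    // The same three sets as the $a, $b and $c variables in the style sheet.
    static final String A = "/Root/Numbers[1]/Integer/@value";
    static final String B = A + "[. > 2]";
    static final String C = A + "[. = 3]";

    /** Nodes of x that are also in y. */
    static String intersection(String x, String y) {
        return x + "[count(. | " + y + ") = count(" + y + ")]";
    }

    /** Nodes that are in exactly one of x and y. */
    static String symmetricDifference(String x, String y) {
        return x + "[count(. | " + y + ") != count(" + y + ")] | "
             + y + "[count(. | " + x + ") != count(" + x + ")]";
    }

    /** True when every node of x is also in y. */
    static String subsetOf(String x, String y) {
        return "count(" + x + " | " + y + ") = count(" + y + ")";
    }

    static Object eval(String expression, javax.xml.namespace.QName type) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(NUMBERS_XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc, type);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    /** Comma-separated values of the selected attribute nodes. */
    static String values(String expression) {
        NodeList nodes = (NodeList) eval(expression, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getNodeValue());
        }
        return String.join(",", out);
    }

    static boolean holds(String expression) {
        return (Boolean) eval(expression, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) {
        System.out.println("a INTERSECTION b: " + values(intersection(A, B)));
        System.out.println("a DIFFERENCE c:   " + values(symmetricDifference(A, C)));
        System.out.println("b SUBSET OF a:    " + holds(subsetOf(B, A)));
    }
}
```

Splicing the paths as strings keeps the sketch free of a variable resolver; in a style sheet, the variable-based form shown above is easier to read.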

A final point of difference between node sets and mathematical sets is that node sets are typically ordered. The W3C XPath recommendation describes them as unordered, but XSLT does specify an ordering for node-sets.

Crises of Identity

In XPath, there are no constructs for directly determining the identity of nodes or the equivalence of nodes in different node sets. Comparisons such as whether the node returned by /bk:books is the same as that returned by /bk:books/bk:book[1]/parent::* are not directly supported. Comparisons using the = operator on node sets do not compare the node sets as a whole, but use any semantics instead. From the W3C XPath recommendation:

"If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true."

To bring this point home, here is a table showing the results of performing comparison operations on node sets from my XML catalog format from the introduction. Be aware that these look like contradictory results on first viewing.

Expression                                Results   Explanation
//bk:book = /bk:books/bk:book[1]          TRUE      Does at least one node in //bk:book have the same string value as another in /bk:books/bk:book[1]?
//bk:book != /bk:books/bk:book[1]         TRUE      Does at least one node in //bk:book have a different string value from another in /bk:books/bk:book[1]?
not(//bk:book = /bk:books/bk:book[1])     FALSE     The opposite of the answer to the question "Does at least one node in //bk:book have the same string value as another in /bk:books/bk:book[1]?"

Node identity can be mimicked using the XPath count() function and determining whether the union of two node-sets of the same size has the same size as either of the node sets, or, in the case of singleton node sets, whether the size of the union is equal to 1. For example, the following query returns TRUE in this case because both nodes are the same.

    count(/bk:books | /bk:books/bk:book[1]/parent::*) = 1
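
Both the seemingly contradictory comparisons and the count()-based identity test can be tried out with a short sketch. It assumes Java's javax.xml.xpath API as the processor; IdentityDemo and its test helper are hypothetical names.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IdentityDemo {
    static final String BOOKS_NS = "urn:xmlns:25hoursaday-com:my-bookshelf";
    static final String BOOKS_XML =
        "<bk:books xmlns:bk='" + BOOKS_NS + "' on-loan='yes'>"
      + "<bk:book publisher='IDG books' on-loan='Sanjay'>"
      + "<bk:title>XML Bible</bk:title><bk:author>Elliotte Rusty Harold</bk:author></bk:book>"
      + "<bk:book publisher='QUE'>"
      + "<bk:title>XML By Example</bk:title><bk:author>Benoit Marchal</bk:author></bk:book>"
      + "</bk:books>";

    /** Evaluates a boolean-valued XPath expression against Books.xml. */
    static boolean test(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(BOOKS_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "bk".equals(prefix) ? BOOKS_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String namespaceURI) { return null; }
                @Override public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return (Boolean) xpath.evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // "Any" semantics make both = and != true for the same pair of node sets.
            "//bk:book = /bk:books/bk:book[1]",
            "//bk:book != /bk:books/bk:book[1]",
            "not(//bk:book = /bk:books/bk:book[1])",
            // Identity: the union of two singleton sets has size 1 only for the same node.
            "count(/bk:books | /bk:books/bk:book[1]/parent::*) = 1",
            "count(/bk:books/bk:book[1] | /bk:books/bk:book[2]) = 1"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + test(expression));
        }
    }
}
```

The last expression is a counterexample added for contrast: two distinct book elements form a union of size 2, so the identity test fails as it should.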
  

Node identity can also be mimicked by using the generate-id() function in XSLT. The XSLT FAQ has an example of using generate-id().

I Exist Therefore I Am

Although there is no explicit mechanism for testing the existence of a node, it does happen implicitly in a number of expressions involving node sets. A non-existent node set is represented as an empty node set. The empty node set is implicitly converted to the empty string or NaN in situations involving string and numeric operations respectively. This series of implicit conversions may lead to confusing results if queries are performed without looking at the instance document to determine which ones occurred because of the empty node-set and which ones didn't. Below are a few examples of queries that involve the empty node-set and how these implicit conversions affect them.

Expression                            Results
/NonExistentNode + 5                  NaN
/NonExistentNode = 5                  False
/NonExistentNode != 5                 False
concat(/NonExistentNode, "hello")     "hello"
/Root[@nonExistentAttribute]          No Results Returned
/Root[@nonExistentAttribute < 5]      No Results Returned
/Root[@nonExistentAttribute > 5]      No Results Returned

Since it is possible for a node to contain the empty string, it is typically best to test for the existence of a node by using the boolean() function and not by checking for the string value of a node. For example, the following query, which returns FALSE, is the best way to tell for sure that there is no NonExistentNode in the document.

   boolean(/NonExistentNode) 
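
The sketch below runs the empty node-set queries from the table and shows why boolean() is a safer existence test than a string comparison. It assumes Java's javax.xml.xpath API as the processor. The sample document with an <Empty/> child, the class name EmptyNodeSetDemo, and its helpers are all hypothetical additions for illustration.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmptyNodeSetDemo {
    // A root with one child that exists but has an empty string value.
    static final String XML = "<Root><Empty/></Root>";

    static Object eval(String expression, QName type) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc, type);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    static double number(String expression) {
        return (Double) eval(expression, XPathConstants.NUMBER);
    }

    static boolean bool(String expression) {
        return (Boolean) eval(expression, XPathConstants.BOOLEAN);
    }

    static String string(String expression) {
        return (String) eval(expression, XPathConstants.STRING);
    }

    public static void main(String[] args) {
        System.out.println(number("/NonExistentNode + 5"));             // NaN
        System.out.println(bool("/NonExistentNode = 5"));               // false
        System.out.println(bool("/NonExistentNode != 5"));              // false
        System.out.println(string("concat(/NonExistentNode, 'hello')"));
        System.out.println(number("count(/Root[@nonExistentAttribute < 5])"));

        // A string test cannot tell an empty element from a missing one...
        System.out.println(bool("string(/Root/Empty) = ''"));           // true
        System.out.println(bool("string(/NonExistentNode) = ''"));      // true
        // ...but boolean() can.
        System.out.println(bool("boolean(/Root/Empty)"));               // true
        System.out.println(bool("boolean(/NonExistentNode)"));          // false
    }
}
```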

Namespaces and XPath Redux

The primary pitfall when dealing with namespaces in XPath was covered in my last column and involves having to create mappings between prefixes and namespace names in expressions, even if the document uses a default namespace.

An interesting thing to note is that there is always at least one namespace node available on every element: http://www.w3.org/XML/1998/namespace, which is the XML namespace. For example, take a look at the following query:

/bk:books/namespace::*

This query returns the following:

urn:xmlns:25hoursaday-com:my-bookshelf
http://www.w3.org/XML/1998/namespace

The returned items are the namespace nodes available on the root element of the Books.xml document.
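
As a rough check of the namespace axis, the following sketch counts the namespace nodes on the root element and looks for the implicit XML namespace among them. It assumes Java's javax.xml.xpath API with a namespace-aware DOM; other processors may expose namespace nodes differently. NamespaceAxisDemo and its helpers are hypothetical names.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceAxisDemo {
    static final String BOOKS_NS = "urn:xmlns:25hoursaday-com:my-bookshelf";
    static final String BOOKS_XML =
        "<bk:books xmlns:bk='" + BOOKS_NS + "' on-loan='yes'>"
      + "<bk:book publisher='QUE'><bk:title>XML By Example</bk:title></bk:book>"
      + "</bk:books>";

    static Object eval(String expression, QName type) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace nodes to be visible
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(BOOKS_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "bk".equals(prefix) ? BOOKS_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String namespaceURI) { return null; }
                @Override public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return xpath.evaluate(expression, doc, type);
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    /** Number of namespace nodes in scope on the root element. */
    static int namespaceCount() {
        return ((Double) eval("count(/bk:books/namespace::*)", XPathConstants.NUMBER)).intValue();
    }

    /** True when a namespace node with the given URI is in scope on the root element. */
    static boolean inScope(String uri) {
        return (Boolean) eval("boolean(/bk:books/namespace::*[. = '" + uri + "'])",
                XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) {
        System.out.println("namespace nodes on root: " + namespaceCount());
        System.out.println("bookshelf namespace in scope: " + inScope(BOOKS_NS));
        System.out.println("XML namespace in scope: " + inScope(XMLConstants.XML_NS_URI));
    }
}
```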

The Untouchables

There is certain information in an XML document that is transparent, or in some cases, invisible to XPath. The XML declaration at the top of an XML document is an example of an XML construct that is invisible to XPath. This means that there is no way to query for the version, encoding, or standalone status of an XML document via XPath.

Syntactic constructs for introducing text that is replaced during the process of parsing the XML document, such as CDATA sections and parsed entities, are similarly transparent to XPath. XPath treats replacement text as regular text nodes.
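
The transparency described in this section can be observed with a small sketch. It assumes Java's javax.xml.xpath API; the sample <note> document, the class name TransparencyDemo, and its helpers are hypothetical and not taken from the catalog format above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TransparencyDemo {
    // An XML declaration, a CDATA section, and a predefined entity reference.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<note>1 <![CDATA[<]]> 2 &amp; more</note>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** String value of the expression's result. */
    static String string(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, parse());
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + expression, e);
        }
    }

    /** Number of nodes the given path selects. */
    static int count(String path) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + path + ")", parse(), XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("failed to evaluate: " + path, e);
        }
    }

    public static void main(String[] args) {
        // CDATA and entity replacement text show up as ordinary character data.
        System.out.println(string("string(/note)"));
        // The XML declaration is not a node: the root has a single child, the note element.
        System.out.println(count("/node()"));
        System.out.println(count("/processing-instruction()"));
    }
}
```

Nothing in the result reveals which characters came from a CDATA section or an entity reference, and no query reaches the declaration at the top of the document.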

Acknowledgements

There were numerous contributors to the above list of XPath quirks and idiosyncrasies including Julia Jia, Karthik Ravindran, Martin Gudgin, Michael Brundage, and Michael Rys. Some aspects of this article were inspired by e-mail written by Philip Wadler and Michael Kay.

Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespaces of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).

Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.