Improving XML Document Validation with Schematron

 

Dare Obasanjo
Microsoft Corporation

September 2004

Applies to:
   XML
   Schematron

Summary: Dare Obasanjo describes how to use the Schematron XML validation language to enforce constraints on XML documents beyond the capabilities of the W3C XML Schema. (10 printed pages)

Contents

Introduction
The Sensational Six: The Basic Elements of Schematron
Combining Schematron with W3C XML Schema Validation
Using Schematron in the .NET Framework
Conclusion

Introduction

An XML schema is used to describe the structure of an XML document by specifying the valid elements that can occur in a document, the order in which they can occur, and expressing constraints on certain aspects of these elements. As usage of XML and XML schema languages has become more widespread, two primary usage scenarios have developed around XML document validation and XML schemas.

  1. Describing and enforcing the contract between producers and consumers of XML documents: XML schemas are a fairly terse, machine-readable way to describe what constitutes a valid XML document according to a particular XML vocabulary. A schema can thus be considered a "contract" between the producer and the consumer of an XML document. Typically, the consumer ensures that the document received from the producer conforms to the contract by validating it against the schema. This description covers a wide array of XML usage scenarios, from business entities exchanging XML documents to applications that use XML configuration files, and many situations in between.
  2. Creating the basis for processing and storing typed data represented as XML documents: As XML became popular as a way to represent rigidly structured, strongly typed data such as the content of a relational database or programming language objects, the ability to describe the datatypes within an XML document became important. This led to the creation of XML schema languages that provide mechanisms for converting an input XML infoset into a type-annotated infoset (TAI), in which element and attribute information items are annotated with a type name. The W3C XML Schema Recommendation describes the creation of a type-annotated infoset as a consequence of validating a document against a schema: during validation, the input XML infoset is converted into a post-schema-validation infoset (PSVI), which among other things contains type annotations. However, practical experience has shown that full document validation is not needed to create type-annotated infosets, and many applications that use XML schemas to create strongly typed XML, such as XML<->object mapping technologies, do not perform it.

Currently, the most popular XML schema language is the W3C XML Schema Definition language (XSD). Although XSD can satisfy scenarios involving type-annotated infosets, it is fairly limited when it comes to describing constraints on the structure of an XML document. Many common idioms in XML vocabulary design are impossible to express using the constraints available in W3C XML Schema. The three most commonly requested constraints that W3C XML Schema cannot describe are:

  1. The ability to specify a choice of attributes. For example, a server-status element should either have a server-uptime attribute or a server-downtime attribute.
  2. The ability to group elements and attributes into model groups. Although one can group elements using compositors such as xs:sequence, xs:choice, and xs:all, the same cannot be done with both elements and attributes. For example, one cannot create a choice between one set of elements and attributes and another.
  3. The ability to vary the content model based on the value of an element or attribute. For example, if the value of the status attribute is available then the element should have an uptime child element; otherwise it should have a downtime child element. The technical name for such constraints is co-occurrence constraints.

Although these idioms are widely used in XML vocabularies it isn't possible to describe them using W3C XML Schema, which makes it difficult to rely on schema validation for enforcing the message contract. This article describes how to layer such functionality on top of the W3C XML Schema language using Schematron.
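
As a preview, the first and third constraints above might be written in Schematron roughly as follows. This is only a sketch: the elements it uses are explained in the next section, the pattern ids and names are made up for illustration, and it assumes that the status attribute and the uptime and downtime children belong to the same un-namespaced server-status element, which the examples above do not actually state.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>Constraints on server-status (illustrative sketch)</title>

  <!-- Constraint 1: a choice of attributes -->
  <pattern id="attributeChoice" name="Choice of attributes">
    <rule context="server-status">
      <!-- The union holds one node only when exactly one attribute is present -->
      <assert test="count(@server-uptime | @server-downtime) = 1">
        A server-status element must have either a server-uptime or a
        server-downtime attribute, but not both
      </assert>
    </rule>
  </pattern>

  <!-- Constraint 3: a co-occurrence constraint driven by an attribute value -->
  <pattern id="coOccurrence" name="Content depends on status">
    <rule context="server-status[@status = 'available']">
      <assert test="uptime and not(downtime)">
        An available server must have an uptime child and no downtime child
      </assert>
    </rule>
    <rule context="server-status[not(@status = 'available')]">
      <assert test="downtime and not(uptime)">
        A server that is not available must have a downtime child and no
        uptime child
      </assert>
    </rule>
  </pattern>
</schema>
```

Each constraint is an ordinary XPath test evaluated against the element selected by the rule, which is why conditions that span several attributes or children are easy to state here.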

The Sensational Six: The Basic Elements of Schematron

The Schematron assertion language provides a mechanism for making assertions about the validity of an XML document using XPath expressions. There are six commonly used elements in a Schematron document: schema, ns, pattern, rule, assert, and report. The namespace URI for the elements used by the Schematron assertion language is http://www.ascc.net/xml/schematron.

  1. The <schema> element: The document element of the schema. Its child elements are the title element that contains a human readable name of the schema, ns elements that are used for specifying namespace<->prefix bindings used by the XPath expressions in the schema, phase elements that describe groups of patterns which should be executed together, pattern elements that contain groups of rules to validate the document against, and a diagnostics element that contains one or more diagnostic elements that can be used to provide finer grained error messages when a document fails an assertion. The schema element also has the following attributes: id, version, schemaVersion, fpi, defaultPhase and icon. Below is an example of the schema element:

    <schema xmlns="http://www.ascc.net/xml/schematron"
            schemaVersion="1.01" >
      <title>A Schema for Books</title>
      <ns prefix="bk" uri="https://www.example.com/books" />
      <pattern id="authorTests">
        <rule context="bk:book">
          <assert test="count(bk:author) != 0">
            A book must have at least one author
          </assert>
        </rule>
      </pattern>
      <pattern id="onLoanTests">
        <rule context="bk:book">
          <report test="@on-loan and not(@return-date)">
            Every book that is on loan must have a return date
          </report>
        </rule>
      </pattern>
    </schema>
    
  2. The <ns> element: This is used to specify the prefix<->namespace bindings used by the XPath expressions in the pattern, rule, and assert elements. It has two required attributes, prefix and uri, which define the namespace prefix and the namespace name to which the prefix is bound. Below is an example of the ns element:

    <ns prefix="bk" uri="https://www.example.com/books" />
    
  3. The <pattern> element: A pattern contains a list of rule elements. The pattern element also has the following attributes: id, name, see, and icon. The primary purpose of patterns is to group together similar assertions so one can create phases where combinations of different patterns are executed for different stages in the validation pipeline. Below is an example of the pattern element:

    <pattern id="authorTests"
             see="https://www.example.com/books/guidelines.html"
             name="Test for non-zero number of authors">
      <rule context="bk:book">
        <assert test="count(bk:author) != 0">
          A book must have at least one author
        </assert>
      </rule>
    </pattern>
    
  4. The <rule> element: The assert and report elements are contained within rule elements. The rule element has a context attribute that contains an XPath expression. All nodes from the input document that match the expression in the context attribute are tested against each assert and report element within the rule to see whether they satisfy the assertion. A rule also has an abstract attribute, which provides a macro inclusion mechanism: when the value of the attribute is true, the contents of the rule can be included in other rule elements. A rule element can have one or more extends elements that reference an abstract rule. During validation, each extends element of a rule is replaced with the contents of the target abstract rule. A rule can also have a key element, which provides a mechanism for defining cross-references between parts of a document, analogous to the key element in XSLT. The key element has a name attribute for identifying the key and a path attribute containing the XPath expression to match against. Below is an example of the rule element:

        <rule context="bk:book" role="authorCountRule">
          <assert test="count(bk:author)!= 0">
           A book must have at least one author
          </assert>
        </rule>
    
  5. The <assert> element: The assert element provides a mechanism for testing whether a statement (that is, an assertion) about an element's content model is true. The test attribute of this element contains an XPath expression. If the outcome of converting the results of the XPath query to a Boolean value using the XPath boolean() function is false then a validation error has occurred. In this case the contents of the assert element are emitted as the error message. The assert element allows mixed content, text interspersed with elements, for the error messages. Most of the elements allowed as children are for visual layout and are borrowed from HTML: p, emph, and dir. The final allowed child element is the name element. When the error message is emitted the name element is replaced with the name of the context element. To enable schema authors to reuse error messages, an assert element can have a diagnostics attribute that references one or more diagnostic elements in the diagnostics element of the schema. That way if a validation error occurs, both the content of the assert element and the content of the referenced diagnostic elements are emitted. The assert element also has the following other attributes: id, role, subject, and icon. Below is an example of the assert element:

    <assert test="count(bk:author)!= 0">
           A book must have at least one author
    </assert>
    
  6. The <report> element: The report element is exactly the same as the assert element, with one crucial difference: a validation error has occurred if the outcome of converting the results of the XPath query in the test attribute to a Boolean value using the XPath boolean() function is true rather than false. Below is an example of the report element:

     <report test="@on-loan and not(@return-date)">
      Every book that is on loan must have a return date
     </report>
    

Additional information about the elements in Schematron can be obtained from the ZVON Schematron reference.
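
Several of the features described above (phases, abstract rules with extends, the name element, and diagnostics) do not appear in the short examples. The following sketch shows how they might fit together in the books schema, assuming Schematron 1.5 syntax. The ids loanChecks, hasAuthor, and addAuthor, the pattern names, and the message text are invented for illustration.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>A Schema for Books (extended sketch)</title>
  <ns prefix="bk" uri="https://www.example.com/books" />

  <!-- A phase lists the patterns that should be executed together -->
  <phase id="loanChecks">
    <active pattern="onLoanTests" />
  </phase>

  <pattern id="authorTests" name="Author tests">
    <!-- An abstract rule has no context; it is only used through extends -->
    <rule abstract="true" id="hasAuthor">
      <assert test="count(bk:author) != 0" diagnostics="addAuthor">
        A <name/> must have at least one author
      </assert>
    </rule>
    <!-- The extends element is replaced by the contents of hasAuthor -->
    <rule context="bk:book">
      <extends rule="hasAuthor" />
    </rule>
  </pattern>

  <pattern id="onLoanTests" name="Loan tests">
    <rule context="bk:book">
      <report test="@on-loan and not(@return-date)">
        Every book that is on loan must have a return date
      </report>
    </rule>
  </pattern>

  <!-- Reusable messages referenced from the diagnostics attribute -->
  <diagnostics>
    <diagnostic id="addAuthor">
      Add a bk:author child element to the book element
    </diagnostic>
  </diagnostics>
</schema>
```

When the loanChecks phase is selected, only the onLoanTests pattern is evaluated. When the author assertion fails, the name element in the message is replaced with the name of the context element, and the text of the addAuthor diagnostic is emitted along with it.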

Combining Schematron with W3C XML Schema Validation

One problem with using Schematron for validation is that it makes specifying the structure of an XML document cumbersome. On the other hand, this task is fairly straightforward in W3C XML Schema. Well, you can have your cake and eat it, too, since it is possible to embed Schematron rules in W3C XML Schema.

The W3C XML Schema recommendations allow applications to extend schema validation by adding application specific data in xs:appinfo elements within the xs:annotation of a particular schema element. One can embed Schematron pattern elements within these extension blocks, which can then be applied as part of the schema validation process. Namespaces that are used by patterns should be declared in an xs:annotation at the top level of the schema using ns elements. The following example shows a W3C XML Schema that utilizes embedded Schematron rules to define constraints that are beyond the capabilities of W3C XML Schema. Specifically the pattern enforces that if the optional on-loan attribute appears on a book element, then there must be a return-date attribute as well.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:sch="http://www.ascc.net/xml/schematron"
    targetNamespace="https://www.example.com/books"
    xmlns:bk="https://www.example.com/books"
    elementFormDefault="qualified" >

  <xs:annotation>
    <xs:appinfo>
     <sch:title>Schematron validation</sch:title>
     <sch:ns prefix="bk" uri="https://www.example.com/books"/>
    </xs:appinfo>
   </xs:annotation>

 <xs:element name="books"> 
  <xs:complexType>
   <xs:sequence>  
    <xs:element name="book" type="bk:bookType" maxOccurs="unbounded">
      <xs:annotation>
       <xs:appinfo>
        <sch:pattern id="onLoanTests">
          <sch:rule context="bk:book">
           <sch:report test="@on-loan and not(@return-date)">
           Every book that is on loan must have a return date
           </sch:report>
          </sch:rule>
        </sch:pattern>
       </xs:appinfo>
      </xs:annotation>
    </xs:element>
   </xs:sequence> 
  </xs:complexType>
 </xs:element>

 <xs:complexType name="bookType">
  <xs:sequence>
   <xs:element name="title" type="xs:string" />
   <xs:element name="author" type="xs:string" />
   <xs:element name="publication-date" type="xs:date" />
  </xs:sequence>
  <xs:attribute name="publisher" type="xs:string" use="required" />
  <xs:attribute name="on-loan" type="xs:string"  use="optional" />
  <xs:attribute name="return-date" type="xs:date"  use="optional" />
 </xs:complexType>

</xs:schema>
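
To make the effect of the embedded rule concrete, here is a hypothetical books.xml instance document. It assumes that the on-loan and return-date attributes are declared as optional in the schema, as the description above requires. The titles, names, and attribute values are made up for illustration.

```xml
<books xmlns="https://www.example.com/books">

  <!-- Not on loan, so no return date is needed: passes both checks -->
  <book publisher="Example Press">
    <title>First Example Title</title>
    <author>A. Author</author>
    <publication-date>2003-05-20</publication-date>
  </book>

  <!-- On loan with a return date: passes both checks -->
  <book publisher="Example Press" on-loan="reader-17" return-date="2004-10-01">
    <title>Second Example Title</title>
    <author>B. Author</author>
    <publication-date>2001-11-02</publication-date>
  </book>

  <!-- On loan without a return date: structurally valid under the
       assumption above, but the Schematron report fires -->
  <book publisher="Example Press" on-loan="reader-42">
    <title>Third Example Title</title>
    <author>C. Author</author>
    <publication-date>1999-03-14</publication-date>
  </book>

</books>
```

For the third book, the test @on-loan and not(@return-date) evaluates to true, so the report message "Every book that is on loan must have a return date" is emitted. The XSD content model and attribute types alone cannot express this dependency between the two attributes.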

Using Schematron in the .NET Framework

This discussion of the added flexibility gained from embedding Schematron rules in XML schemas would be moot if there were no way to actually do this in the .NET Framework. Schematron.NET, written by Microsoft XML MVP Daniel Cazzulino, is an implementation of Schematron for the .NET Framework that provides classes for validating XML documents against Schematron schemas and against XML schemas containing embedded Schematron rules. The class most users will interact with directly is NMatrix.Schematron.Validator. The following is an overview of the API for this class.

Constructors

  1. An overload that indicates whether the IXPathNavigable instance returned as a result of calling Validate() should be an XmlDocument or XPathDocument.

    public Validator(NavigableType type) 
    
  2. An overload that indicates what format the output should be in.

    public Validator(OutputFormatting format) 
    
  3. An overload that indicates what format the output should be in and whether the IXPathNavigable instance returned as a result of calling Validate() should be an XmlDocument or XPathDocument.

    public Validator(OutputFormatting format, NavigableType type) 
    
  4. Default constructor that sets XPathDocument to be the IXPathNavigable type returned by Validate() methods and uses OutputFormatting.Log as the formatter.

    public Validator()
    

Properties

  1. The evaluation context.

    public EvaluationContextBase Context { get; set; } 
    
  2. Whether the type returned by the Validate() method is an XmlDocument or an XPathDocument.

    public NavigableType ReturnType { get; set; } 
    
  3. The formatter used for generating the output from validation.

    public IFormatter Formatter {get; set;}
    
  4. An identifier for the phase element whose patterns should be evaluated during the validation episode.

    public string Phase {get; set;}
    

Methods

  1. Adds an XML schema or Schematron rules file to the Validator's set of schemas.

    public void AddSchema(XmlSchema schema)
    public void AddSchema(Schema schema)
    public void AddSchema(Stream input)
    public void AddSchema(TextReader reader)
    public void AddSchema(XmlReader reader)
    public void AddSchema(string uri)
    
  2. Adds a collection of XML schemas or Schematron rules files to the Validator's set of schemas.

    public void AddSchemas(XmlSchemaCollection schemas)
    public void AddSchemas(SchemaCollection schemas)
    
  3. Validates the input document against the provided Schematron rules. If the Schematron rules are embedded in an XML schema, XSD validation is not performed.

    public void ValidateSchematron(IXPathNavigable source)
    public void ValidateSchematron(XPathNavigator nav)
    
  4. Validates the input document against the provided schemas. If the provided schema is an XML schema with embedded Schematron rules, then both Schematron and XSD validation are performed.

    public IXPathNavigable Validate(string uri)
    public IXPathNavigable Validate(Stream input)
    public IXPathNavigable Validate(XmlReader reader)
    public IXPathNavigable Validate(TextReader reader)
    

Sample

The following code sample validates a books.xml file against the XML schema from the Combining Schematron with W3C XML Schema Validation section of this article.

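using System;
using System.Xml;
using NMatrix.Schematron;

class Program {

  public static void Main(string[] args) {

    try {
      // The default constructor returns an XPathDocument and uses OutputFormatting.Log.
      Validator validator = new Validator();

      // books.xsd is the XML schema with embedded Schematron rules shown earlier.
      validator.AddSchema("books.xsd");

      // Validate() performs both XSD and Schematron validation of books.xml.
      validator.Validate(new XmlTextReader("books.xml"));

    } catch (Exception e) {
      // Print any error raised while loading the schema or validating the document.
      Console.WriteLine(e);
    }
  }
}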

Conclusion

This article shows that it is possible to use Schematron together with W3C XML Schema to have one's cake and eat it, too. One can create typed XML documents using XSD and still get rich, declarative validation of business rules from the schema language. Developers using the .NET Framework can start using Schematron today by downloading Schematron.NET, either from SourceForge or as the assembly in the download attached to this article.

Start using Schematron today for your XML validation needs; it truly offers the best of both worlds.