Using the XSD Inference Utility

 

Nithya Sampathkumar
Microsoft Corporation

February 12, 2003

Summary: Discusses the Microsoft XSD Inference utility, which simplifies the task of writing XML Schema by automatically generating schemas from instance documents. The inferred schema can then be refined with related document instances so that it can be used to describe and validate a whole class of XML documents. (6 printed pages)

Download the Microsoft XSD Inference utility from GotDotNet.

XSD Inference Tool Use Scenarios

Scenario 1

Susan works for a bookstore as a developer. Her company decided to convert its documents to XML so they could be shared with subsidiaries. The company also wanted to ensure that every subsidiary uses the same XML tags, attributes, and element order. Susan suggested using XML Schemas to achieve this goal. While writing the schemas, Susan realized that it is much easier to write a document than to write the schema for that document.

Scenario 2

Amy is an XML developer. She often has to deal with XML documents containing document type definitions (DTDs). Amy needs to generate schemas for these documents because schemas are more powerful than DTDs and are the recommended way of describing XML documents.

Scenario 3

John is a Web services developer. His company deployed a Web service using the popular "rpc/encoded" mode of the Web Services Description Language (WSDL). Following the latest recommendations (work in progress) from the Web Services Interoperability Organization (WS-I), John's company wants to migrate the WSDL description of the service to the "document/literal" style to make it more interoperable. The most challenging part of authoring a "document/literal" WSDL file is creating an XML Schema for the SOAP message body and header.

Scenario Synopsis

John, Amy, and Susan have instance documents that describe the data and need to generate schemas for them that conform to the W3C XML Schema specification. They also realize that these schemas are tedious to write by hand.

The XSD Inference utility solves this problem by automatically generating a schema for a given XML document. The tool can also refine the schema it generates based on additional XML documents. Refinement is especially useful when no single XML document captures all the variations in the data and several documents are needed to cover them.

Inference Usage

Let's go back to Susan's bookstore for a demonstration of how the XSD Inference utility can be used. There is an instance document for every book in the store, and we need to come up with one schema that can describe and validate all of them. Book1.xml and Book2.xml are two such book instances. They have slightly different structures and they exist in separate files. So, we need to infer the schema for one book (say, Book1.xml) and then refine the inferred schema using Book2.xml and any other book documents that need to be covered. The samples for Book1.xml and Book2.xml are as follows:

Book1.xml:

<book year="1994" xmlns="www.samplebookstore.com">
    <title>TCP/IP Illustrated</title>
    <author>Stevens W.</author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
</book>
 

Book2.xml:

<book year="2000" xmlns="www.samplebookstore.com"
      xmlns:sale="www.samplebookstore.com/sale">
    <title>Data on the Web</title>
    <author>Abiteboul Serge</author>
    <author>Buneman Peter</author>
    <publisher>Morgan Kaufmann</publisher>
    <price>39.95</price>
    <sale:price>20</sale:price>
    <editor>
        <name>Gerbarg Darcy</name>
        <affiliation>CITI</affiliation>
    </editor>
</book>

Let's see how we can use the Infer class from the Microsoft.XSDInference namespace to come up with the schema for Book1.xml and Book2.xml. The Infer class has two InferSchema overloads: one for inferring a schema and one for refining a schema that has already been inferred. The overload that we will use for inferring accepts an XmlReader over the document whose schema is to be inferred and returns an XmlSchemaCollection containing the inferred schema.

The code using the InferSchema method is given below. We create an XmlTextReader (from the System.Xml namespace) named book1 to read Book1.xml, and then pass book1 to InferSchema:

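// Requires: using System.Xml; using System.Xml.Schema; using Microsoft.XSDInference;
Infer testInfer = new Infer();
XmlTextReader book1 = new XmlTextReader("Book1.xml");
// The returned collection holds the schema inferred for Book1.xml.
XmlSchemaCollection xsc = testInfer.InferSchema(book1);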

After inferring the schema, we can write it out with the code snippet below. The code loops through the XmlSchemaCollection returned by the Infer class and prints the target namespace of each schema, followed by the schema itself. A schema may import other schemas, so the inner foreach loops through the imported schemas and prints them as well.

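foreach (XmlSchema xs in xsc)
{
    // Print the target namespace, then the schema itself.
    Console.WriteLine(xs.TargetNamespace);
    xs.Write(Console.Out);
    // The Includes collection also holds the schema's xs:import items.
    foreach (XmlSchemaImport xsi in xs.Includes)
    {
        Console.WriteLine(xsi.Namespace);
        xsi.Schema.Write(Console.Out);
    }
}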

The code above produces the following schema:

<xs:schema xmlns:tns="www.samplebookstore.com" 
attributeFormDefault="unqualified" elementFormDefault="qualified" 
targetNamespace="www.samplebookstore.com" 
xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string" />
        <xs:element name="author" type="xs:string" />
        <xs:element name="publisher" type="xs:string" />
        <xs:element name="price" type="xs:decimal" />
      </xs:sequence>
      <xs:attribute name="year" type="xs:unsignedShort" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

Let's refine this schema using Book2.xml, which has a slightly different structure from Book1.xml. Book1.xml has only one author, whereas Book2.xml has two authors. In addition, Book2.xml has a <sale:price> element and an <editor> element that holds <name> and <affiliation> elements.

To refine, we'll use the InferSchema overload that takes an XmlReader and an XmlSchemaCollection. This method looks in the collection for a schema whose targetNamespace matches the namespace of the document being read, and refines that schema based on the new document. If no such schema exists in the collection, it infers a schema for the document and adds it to the collection.

The code to refine the schema we inferred for Book1.xml using Book2.xml is given below:

XmlTextReader book2 = new XmlTextReader("Book2.xml");
// Refine the schema already in xsc using the second document.
xsc = testInfer.InferSchema(book2, xsc);
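
The same infer-then-refine pattern extends to any number of documents. As a rough sketch for readers who do not have the utility installed, the loop below uses the System.Xml.Schema.XmlSchemaInference class that ships with .NET Framework 2.0 and later. It is a different API from the Microsoft.XSDInference utility described in this article, and it works with an XmlSchemaSet rather than an XmlSchemaCollection. The class name RefineSketch, the method InferFromAll, and the shortened in-memory book documents are illustrative assumptions.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class RefineSketch
{
    // Infers a schema from the first document, then refines it with each later one.
    // Uses XmlSchemaInference (.NET Framework 2.0+), not the Infer class from the article.
    public static XmlSchemaSet InferFromAll(params string[] documents)
    {
        XmlSchemaInference inference = new XmlSchemaInference();
        XmlSchemaSet schemas = null;
        foreach (string xml in documents)
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                schemas = (schemas == null)
                    ? inference.InferSchema(reader)            // first document: infer
                    : inference.InferSchema(reader, schemas);  // later documents: refine
            }
        }
        return schemas;
    }

    public static void Main()
    {
        // Shortened, hypothetical versions of the two book documents.
        string book1 =
            "<book xmlns='www.samplebookstore.com'>" +
            "<title>TCP/IP Illustrated</title><author>Stevens W.</author></book>";
        string book2 =
            "<book xmlns='www.samplebookstore.com'>" +
            "<title>Data on the Web</title><author>Abiteboul Serge</author>" +
            "<author>Buneman Peter</author></book>";

        XmlSchemaSet schemas = InferFromAll(book1, book2);
        foreach (XmlSchema schema in schemas.Schemas())
        {
            Console.WriteLine(schema.TargetNamespace);
            schema.Write(Console.Out);
        }
    }
}
```

Passing each reader together with the schemas gathered so far mirrors the two-argument InferSchema call shown above: every new document can only loosen the schema, so the order in which documents are supplied should not matter much for the final result.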

We can write out the refined schema using the printing code we saw earlier. It outputs two schemas: the refined schema for book and the schema that it imports.

Schema for book

<xs:schema xmlns:tns="www.samplebookstore.com" 
attributeFormDefault="unqualified" elementFormDefault="qualified" 
targetNamespace="www.samplebookstore.com" 
xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="www.samplebookstore.com/sale" />
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string" />
        <xs:element maxOccurs="unbounded" name="author" type="xs:string" />
        <xs:element name="publisher" type="xs:string" />
        <xs:element name="price" type="xs:decimal" />
        <xs:element minOccurs="0" 
xmlns:q1="www.samplebookstore.com/sale" ref="q1:price" />
        <xs:element minOccurs="0" name="editor">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string" />
              <xs:element name="affiliation" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="year" type="xs:unsignedShort" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
     

Imported Schema

<xs:schema xmlns:tns="www.samplebookstore.com/sale" 
attributeFormDefault="unqualified" 
elementFormDefault="qualified" 
targetNamespace="www.samplebookstore.com/sale" 
xmlns:xs="http://www.w3.org/2001/XMLSchema">
       <xs:element name="price" type="xs:unsignedByte" />
</xs:schema>

Because Book2.xml has more than one author, the refining algorithm added maxOccurs="unbounded" to the author element. It also added the editor element and a reference to sale:price, each with minOccurs="0", because Book1.xml has neither.

Because Book2.xml contains a <sale:price> element, a new schema that defines <sale:price> was created for the www.samplebookstore.com/sale namespace and added as an import to the Book schema. The Book schema is a valid W3C schema and can be used to validate both Book1.xml and Book2.xml.
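
One way to check that claim is to read each instance document through an XmlValidatingReader that has the schema collection attached. The sketch below is illustrative only: the class name ValidateSketch and the method IsValid are hypothetical, and a small hand-written schema stands in for the collection that InferSchema would return, so that the example runs without the utility.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidateSketch
{
    private static int errorCount;

    // Reads the whole document with schema validation switched on and
    // reports whether it passed without validation errors.
    public static bool IsValid(XmlTextReader source, XmlSchemaCollection schemas)
    {
        errorCount = 0;
        XmlValidatingReader reader = new XmlValidatingReader(source);
        reader.ValidationType = ValidationType.Schema;
        reader.Schemas.Add(schemas);
        reader.ValidationEventHandler += new ValidationEventHandler(OnValidationEvent);
        while (reader.Read()) { }
        reader.Close();
        return errorCount == 0;
    }

    // Counts errors only; warnings are ignored.
    private static void OnValidationEvent(object sender, ValidationEventArgs e)
    {
        if (e.Severity == XmlSeverityType.Error)
        {
            errorCount++;
            Console.WriteLine(e.Message);
        }
    }

    public static void Main()
    {
        // Assumption: a cut-down, hand-written schema standing in for the inferred one.
        string xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' " +
            "targetNamespace='www.samplebookstore.com' elementFormDefault='qualified'>" +
            "<xs:element name='book'><xs:complexType><xs:sequence>" +
            "<xs:element name='title' type='xs:string' />" +
            "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        XmlSchemaCollection xsc = new XmlSchemaCollection();
        xsc.Add("www.samplebookstore.com", new XmlTextReader(new StringReader(xsd)));

        string good = "<book xmlns='www.samplebookstore.com'><title>TCP/IP Illustrated</title></book>";
        string bad = "<book xmlns='www.samplebookstore.com'><isbn>0</isbn></book>";
        Console.WriteLine(IsValid(new XmlTextReader(new StringReader(good)), xsc));
        Console.WriteLine(IsValid(new XmlTextReader(new StringReader(bad)), xsc));
    }
}
```

With the real inferred collection, the same IsValid helper could be called once for Book1.xml and once for Book2.xml, passing an XmlTextReader opened on each file.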

Some details about the Infer class should be noted:

  1. If entity references are returned by the XmlReader, an exception is thrown. So, if the XML instance document has entity references, you need to create a reader that expands the entities, such as XmlValidatingReader, and pass it to InferSchema. If you are working with XML files from an untrusted source and for security reasons you do not want to expand the entities, you must create an XmlResolver to pass in to the XmlValidatingReader.
  2. Any <![CDATA[ … ]]> sections in the XML instance document are treated as text.
  3. The <!DOCTYPE … > nodes in the XML instance document are ignored.
  4. xsi:type, xsi:schemaLocation, and xsi:noNamespaceSchemaLocation are ignored when inferring schema.
  5. If the instance XML document passed into the Infer class is a schema, an exception is thrown.
  6. If there is an inline schema in the document passed into the Infer class, <xs:any processContents="skip" /> is generated for it.
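
As an illustration of the first note, the sketch below wraps an XmlTextReader in an XmlValidatingReader configured to expand general entities, so that downstream code sees text nodes rather than entity reference nodes. The class name EntitySketch, the method CreateExpandingReader, and the sample document with its internal pub entity are assumptions made for the example. Only an internal entity is used, so no XmlResolver is involved here.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EntitySketch
{
    // Wraps an XmlTextReader in an XmlValidatingReader that expands general
    // entities but performs no DTD or schema validation.
    public static XmlValidatingReader CreateExpandingReader(TextReader input)
    {
        XmlTextReader textReader = new XmlTextReader(input);
        XmlValidatingReader reader = new XmlValidatingReader(textReader);
        reader.ValidationType = ValidationType.None;
        reader.EntityHandling = EntityHandling.ExpandEntities;
        return reader;
    }

    public static void Main()
    {
        // Hypothetical instance document that declares and uses an internal entity.
        string xml =
            "<!DOCTYPE book [<!ENTITY pub \"Addison-Wesley\">]>" +
            "<book><publisher>&pub;</publisher></book>";

        XmlValidatingReader reader = CreateExpandingReader(new StringReader(xml));
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.EntityReference)
                Console.WriteLine("unexpanded entity: " + reader.Name);
            else if (reader.NodeType == XmlNodeType.Text)
                Console.WriteLine("text: " + reader.Value);
        }
        reader.Close();
    }
}
```

A reader created this way is the kind of reader the first note suggests passing to InferSchema in place of a bare XmlTextReader.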

Summary

By using the Microsoft.XSDInference.Infer class, a developer can easily infer a schema for an instance document. The inferred schema can be refined with related document instances so that it can be used to describe and validate a whole class of XML documents.

Questions and comments regarding this article can be posted at https://www.gotdotnet.com/community/messageboard/MessageBoard.aspx?id=207.