An XML Overview Towards Understanding SOAP

 

Scott Seely
Microsoft Developer Network

November 2001

The following article is an excerpt of Chapter 2 from Scott Seely's book, SOAP: Cross Platform Web Service Development Using XML, Prentice Hall-PTR, © 2002.

Summary: This article explains what you need to know about XML in order to understand SOAP. You will learn the basics about Uniform Resource Identifiers, XML, XML Schema, and XML Namespaces. (23 printed pages)

Introduction

When looking for a way to express the SOAP payload, the authors of the specification had a number of ways they could have gone. They could have invented their own protocol, declared that CORBA or DCOM will now be known as SOAP, or invented something new by combining existing technologies. In the end, they chose to minimize the amount of required invention by combining existing technologies. To express the content of a SOAP message, the authors chose the eXtensible Markup Language, XML.

XML contains a large number of features, far more than SOAP uses or needs. For example, the SOAP specification states, "A SOAP message MUST NOT contain a Document Type Declaration. A SOAP message MUST NOT contain Processing Instructions" (SOAP 1.1 Specification, section 3, "Relation to XML"). For each XML standard that SOAP does adopt, the specification states how that feature will be used. You will see this in chapter 3 when looking at SOAP serialization. As we will see later, this decision makes it fairly easy to implement solutions using SOAP because developers do not need to have a full-fledged XML parser in order to use SOAP. In order to understand SOAP, we need to understand the following items first:

  • Uniform Resource Identifiers
  • XML Basics
  • XML Schemas
  • XML Namespaces
  • XML Attributes

Uniform Resource Identifiers

In order to access a unique item over the Internet, you need to know how to identify that one object amongst everything else out there. Uniform Resource Identifiers, or URIs, provide a way of uniquely identifying those many different items. RFC1630 describes URIs in detail and spells out the rules for using many different protocols within the URI framework. A URI has the form:

<scheme>:<scheme-specific-part>

When the scheme-specific-part contains slashes ('/'), those slashes indicate some hierarchical structure within the path.

Uniform Resource Locators

The best-known type of URI is the Uniform Resource Locator, or URL. Like all URIs, a URL follows the <scheme>:<scheme-specific-part> method of addressing. Table 2.1 identifies the schemes named by RFC1738 and RFC1808. (You can obtain the source to these and other RFCs from ftp://ftp.ietf.org/rfc.) Using these schemes, we can connect to various places on the Web using nothing but a URL translator such as Internet Explorer or Netscape Navigator. URLs define this layout for the scheme-specific-part:

//<user>:<password>@<host>:<port>/<url-path>

If you are at all familiar with URLs, you know that a good number of the items in the above layout are optional. More often than not, you type in URLs such as:

http://www.scottseely.com (my web site) or

ftp://ftp.scottseely.com (my ftp site)

The various parts of the scheme syntax identify:

  • user: User name at the target location (optional)
  • password: The password assigned to user (optional)
  • host: The IP address or fully-qualified domain name of a network host (required)
  • port: Identifies the port to use when establishing a connection. Most protocols identify a default port number. For example, HTTP uses port 80 by default (optional)
  • url-path: Contains details of how to access the specified resource. The "/" immediately after the host or port is not a part of the url-path.
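
As a rough illustration of the layout above, the following Python sketch splits a URL into the parts just listed using the standard library's urllib.parse module. The URL itself (user name, password, port, and path) is invented for this example and does not refer to a real account or file.

```python
from urllib.parse import urlsplit

# Hypothetical URL, made up for illustration; it fills in every optional part.
url = "ftp://scott:secret@ftp.scottseely.com:2121/pub/readme.txt"
parts = urlsplit(url)

print(parts.scheme)    # the <scheme> before the first colon
print(parts.username)  # <user>
print(parts.password)  # <password>
print(parts.hostname)  # <host>
print(parts.port)      # <port>, converted to an integer
print(parts.path)      # "/" followed by the <url-path>

# The short form most people type leaves the optional parts out.
minimal = urlsplit("http://www.scottseely.com")
print(minimal.port)    # None: the URL names no port, so the protocol default applies
```

Note that the library reports the path with its leading "/", even though the slash is not formally part of the url-path.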

Table 2.1. Currently available URL Schemes

Scheme Name    Description
ftp            File Transfer Protocol
http           Hypertext Transfer Protocol
gopher         The Gopher Protocol
mailto         Electronic mail address
news           USENET news
nntp           USENET news using NNTP access
telnet         Reference to interactive sessions
wais           Wide Area Information Servers
prospero       Prospero Directory Service

Uniform Resource Names

Uniform Resource Names (URNs) are much less familiar to the average Web user than the ubiquitous URL. Unlike a URL, a URN does not resolve to a unique, physical location. URNs serve as persistent resource identifiers. They allow other collections of identifiers from one namespace to be mapped into URN-space. Because of this requirement, the URN syntax provides the ability to pass and encode character data using existing protocols. RFC2141 defines how to create and use a URN. The production for a URN follows the general rules for a URI. In general, it looks like this:

<URN> ::= "urn:" <NID> ":" <NSS>

A URN uses the string "urn:" to identify the scheme. NID specifies the Namespace ID and NSS specifies the Namespace Specific String. When interpreting URNs we look to the NID to tell us how to interpret the NSS. When reading or creating a URN, the initial construct "urn:" <NID> is case-insensitive.
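
The production above can be sketched in a few lines of Python. The helper name split_urn is made up for this example, and the function only separates the three pieces; it does not check the full character rules that RFC2141 places on the NID and NSS.

```python
def split_urn(urn):
    """Split a URN into (NID, NSS) following "urn:" <NID> ":" <NSS>."""
    scheme, first_colon, rest = urn.partition(":")
    nid, second_colon, nss = rest.partition(":")
    if scheme.lower() != "urn" or not first_colon or not second_colon:
        raise ValueError(f"not a URN: {urn!r}")
    if not nid or not nss:
        raise ValueError(f"URN is missing its NID or NSS: {urn!r}")
    # "urn:" and the NID are case-insensitive, so the NID is normalized here.
    # The NSS is returned untouched because its rules depend on the NID.
    return nid.lower(), nss


# Illustrative URN; the book number is the ISBN used later in this chapter.
nid, nss = split_urn("URN:ISBN:0-13-014714-1")
print(nid, nss)
```

Because the leading construct is case-insensitive, "URN:ISBN:..." and "urn:isbn:..." yield the same NID.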

URLs and URNs represent two common uses for a URI. Later in this chapter we will see yet another use of URIs: XML Namespaces.

XML Basics

When XML first hit and the trade press began reviewing it in the 1996-1997 timeframe, I dug around looking for examples of what XML looked like. I was surprised at how many industry wonks were saying that it was the next big thing but then would not (or could not) show what this markup language looked like. Given the hype and lack of examples, I imagined it must be a fairly complex, ornery beast. After a few months of hype, developers began writing articles on the topic giving out the details I wanted to know. Some of these articles described it as a descendant of SGML, only better suited for development. How was it made to work better in the program development area? SGML offers you extraordinary levels of flexibility but makes it very difficult to implement a full-featured SGML parser. XML more or less defines a concrete set of rules that readers and writers of XML data must follow. Because the language definition for XML is more rigid, it is easier to create conforming documents and parsers. Do not get the wrong idea, though: XML is a subset of SGML.

Anyhow, after the digerati calmed down and the developers got their chance to speak up, I got really excited. Why? Well, the first thing was that I finally saw some practical applications of XML. It works as a data language that both machines and people can easily understand. If you have ever read or written HTML, you will find XML fairly easy to understand and use. Like HTML, it contains begin tags and end tags. Unlike HTML, XML requires every begin tag to have a matching end tag. End tags look like their matching begin tag with a leading "/". Let's jump in and take a look at what XML can look like.

The following XML shows one way of encoding the contents of a library:

<?xml version="1.0"?>
<Library>
    <Book>
        <Title>Green Eggs and Ham</Title>
        <Author>Dr. Seuss</Author>
    </Book>
    <Book>
        <Title>Windows Shell Programming</Title>
        <Author>Scott Seely</Author>
    </Book>
    <Picture>
         <Title>American Gothic</Title>
         <Artist>Grant Wood</Artist>
    </Picture>
</Library>

Even if you have never read XML in your life, the above makes a fair amount of sense. The document demonstrates a number of the rules found in an XML document. The first line in the above sample is the XML declaration, which states the version of XML used by the document. Documents do not have to include this declaration, but normally you should include it. All XML documents must have one enclosing element (the XML declaration does not count as an enclosing element). The Library element wraps the entire document above. It contains three sub-elements: two books and a picture.

As you may guess, not one word in the above XML document is an XML keyword. If you want to be a freewheeling XML author, all you need to do is watch the spelling in your tag names and make sure that every begin tag has an end tag. Writing XML documents this way can cause problems. For example, you could accidentally write this:

<Library>
    <Book>
        <Title>Green Eggs and Ham</Title>
        <Author>Dr. Seuss</Author>
    </Book>
    <Bokk>
        <Title>Windows Shell Programming</Title>
        <Author>Scott Seely</Author>
    </Bokk>
    <Picture>
         <Title>American Gothic</Title>
         <Artist>Grant Wood</Artist>
    </Picture>
</Library>

As a human reader, you recognize that the author of the document misspelled "Book" for the book, Windows Shell Programming. The parser, however, will accept the document without realizing that you have two books in the library list. Instead, it will think you have one Book, one Bokk, and one Picture. If you want the XML parser to do some checking for you and only read valid constructs, you can use something called a Document Type Definition (DTD) or an XML Schema. DTDs are not covered in this book because section 3 of the SOAP specification specifies that a SOAP message "MUST NOT contain a Document Type Declaration." If you really must know how to use DTDs, see the recommended reading list at the end of the chapter. With a few exceptions (e.g., publishing and document management), you should always use XML Schema (XSD) to describe data.
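
To make the difference between "well-formed" and "what you meant" concrete, here is a small Python sketch using the standard xml.etree.ElementTree parser, which checks only well-formedness. It parses the misspelled document from above (shortened slightly) and counts the children of Library by tag name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

document = """<Library>
    <Book>
        <Title>Green Eggs and Ham</Title>
        <Author>Dr. Seuss</Author>
    </Book>
    <Bokk>
        <Title>Windows Shell Programming</Title>
        <Author>Scott Seely</Author>
    </Bokk>
    <Picture>
        <Title>American Gothic</Title>
        <Artist>Grant Wood</Artist>
    </Picture>
</Library>"""

# The parser accepts the document: every begin tag has a matching end tag.
root = ET.fromstring(document)
counts = Counter(child.tag for child in root)
print(counts)  # one Book, one Bokk, one Picture

# A begin tag closed by a differently spelled end tag is a different story:
# that breaks well-formedness, so the parser refuses it.
try:
    ET.fromstring("<Library><Book></Bokk></Library>")
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True
print(mismatch_rejected)
```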

XML Schemas

An XML Schema provides a superset of the capabilities found in a DTD. They both provide a method for specifying the structure of an XML element. Whereas both schemas and DTDs allow for element definitions, only schemas allow you to specify type information. All XML data is character-based. It will specify a 4 as the character 4, rarely as the binary representation 0100. (XML does allow for encoding binary data within the message. This method allows us to send things such as image data inside of an XML message.) We can enhance the library example to demonstrate the benefits of schemas over DTDs. We will add copyright date to the book information.

A simple DTD will have elements that contain other elements and/or character data. The simplest element declaration would declare the element name and the contents as character data:

<!ELEMENT element-name (#PCDATA)>

An element may also consist of other elements. If an element contains exactly one instance of a given element, we would have the following DTD:

<!ELEMENT parentElement (childElement)>
<!ELEMENT childElement (#PCDATA)>

Alternatively, the parentElement might contain zero or more childElements. We indicate this using an asterisk, *.

<!ELEMENT parentElement (childElement*)>
<!ELEMENT childElement (#PCDATA)>

Finally, you can also indicate composition of elements in a DTD. For example, parentElement might contain two different pieces of data.

<!ELEMENT parentElement (childElem1, childElem2)>
<!ELEMENT childElem1 (#PCDATA)>
<!ELEMENT childElem2 (#PCDATA)>

If we wanted to generate a DTD for a library of books, it might look like this:

<!ELEMENT Library (Book*)>
<!ELEMENT Book ( Title, Author*, Copyright )>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Copyright (#PCDATA)>

The Library consists of zero or more elements of type Book. Each Book has a Title, zero or more elements of type Author, and a Copyright. The Title, Author, and Copyright elements all contain character data. Rewriting the library example to use the DTD, we have the following XML document:

<?xml version="1.0" ?>
<!DOCTYPE Library PUBLIC "." "Library.dtd" >
<Library>
    <Book>
        <Title>Green Eggs and Ham</Title>
        <Author>Dr. Seuss</Author>
        <Copyright>1957</Copyright>
    </Book>
    <Book>
        <Title>Windows Shell Programming</Title>
        <Author>Scott Seely</Author>
        <Copyright>2000</Copyright>
    </Book>
</Library>
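
Python's standard xml.etree.ElementTree parser does not validate against a DTD, but the content model for Book is simple enough to check by hand. The sketch below is only an illustration of what "Title, Author*, Copyright" means; the function name and the regular-expression trick are choices made for this example, not part of any DTD tool.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT Book ( Title, Author*, Copyright )> rewritten as a regular
# expression over the comma-terminated names of a Book's child elements.
BOOK_MODEL = re.compile(r"Title,(Author,)*Copyright,")


def book_matches_model(book):
    """Return True if the children of book appear in the order the DTD allows."""
    names = "".join(child.tag + "," for child in book)
    return BOOK_MODEL.fullmatch(names) is not None


library = ET.fromstring("""<Library>
    <Book>
        <Title>Green Eggs and Ham</Title>
        <Author>Dr. Seuss</Author>
        <Copyright>1957</Copyright>
    </Book>
    <Book>
        <Title>Windows Shell Programming</Title>
        <Copyright>2000</Copyright>
        <Author>Scott Seely</Author>
    </Book>
</Library>""")

# The second Book lists Copyright before Author, which the model forbids.
results = [book_matches_model(book) for book in library.findall("Book")]
print(results)
```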

A validating parser will load Library.dtd and use it to validate the contents of the document. This is all well and good, but wouldn't it be nice if we could specify more information than "this element contains character data"? You see, DTDs come from SGML. SGML primarily concerned itself with document publishing. As such, the print industry has been using it for years. Because they deal with print issues all the time, SGML provided ways to reproduce the same document in lots of different forms. Now that computing has embraced XML, the programmer types (i.e., you and me) wanted a way to express the characteristics of the data. A DTD can specify the number of instances of a piece of data and what a particular structure looks like. By extending an SGML dialect, I could even specify the characteristics of the data. The problem here is that every developer may come up with a different naming system. I also have a gripe with DTDs: they do not look like XML. For these and other reasons, the W3C eventually published the XML Schema recommendation. Here is the Library DTD defined as a schema:

<xsd:schema xmlns:xsd=
        "http://www.w3.org/2001/XMLSchema"
    targetNamespace=
        "http://www.scottseely.com/LibrarySchema.xml"
    xmlns:lib=
        "http://www.scottseely.com/LibrarySchema.xml"
    elementFormDefault="qualified">
    <xsd:element name="Library">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="Book" type="lib:Book"
                    minOccurs="0" maxOccurs="unbounded" />
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:complexType name="Book">
        <xsd:sequence>
            <xsd:element name="Title" type="xsd:string" />
            <xsd:element name="Author" type="xsd:string"
                minOccurs="0" maxOccurs="unbounded" />
            <xsd:element name="Copyright" type="xsd:integer" />
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

You would save the above as an XML file. To use the schema, simply reference the targetNamespace in your document like so:

<myLibrary:Library xmlns:myLibrary=
    "http://www.scottseely.com/LibrarySchema.xml">
    <myLibrary:Book>
        <myLibrary:Title>Green Eggs and Ham
        </myLibrary:Title>
        <myLibrary:Author>Dr. Seuss
        </myLibrary:Author>
        <myLibrary:Copyright>1957
        </myLibrary:Copyright>
    </myLibrary:Book>
    <myLibrary:Book>
        <myLibrary:Title>Windows Shell Programming
        </myLibrary:Title>
        <myLibrary:Author>Scott Seely
        </myLibrary:Author>
        <myLibrary:Copyright>2000
        </myLibrary:Copyright>
    </myLibrary:Book>
</myLibrary:Library>

Both the schema and the document use the text, xmlns. This attribute binds a prefix to the namespace identified by the indicated URI, and every element written with that prefix belongs to that namespace. This means that both the reader and writer of the XML document must agree on what the particular XML Namespace means. Without this agreement, the XML Schema will lose any potential value. When xmlns appears without a prefix, it declares a default namespace: all unprefixed elements inside the tag using the declaration are part of that namespace unless otherwise specified.

Facets

To aid with the definition and validation of data, XML Schema uses facets to define characteristics of a specific datatype. A facet defines an aspect of a value space. A "value space" is the set of all valid values for a given datatype. You use a facet to distinguish what makes one datatype different from another. The XML Schema specification defines two types of facets: fundamental and non-fundamental facets.

A fundamental facet is an abstract property that characterizes the values of a value space. These include the following facets:

  • Equal: Defines the notion of two values of the same datatype being equal. The following rules apply to this concept:
    1. For any two values (a, b), a is equal to b (denoted a=b) or a is not equal to b (a!=b).
    2. No pair of values (a, b) exists such that a=b and a!=b.
    3. For every valid value a, a=a.
    4. For any two values (a, b) in the value space, a=b if and only if b=a.
    5. For any three valid values (a, b, c), if a=b and b=c then a=c.
  • Order: This specifies a mathematical relation to set the total order of members in the value space. For every pair of values (a, b), their relationship is either a<b, b<a, or a=b. For every triple (a, b, c), if a<b and b<c then a<c.
  • Bounds: This simply states that a given value space may be bounded above or bounded below. If a value U exists such that for all values v in the value space the statement v<=U is true, U represents the upper bound of the value space (bounded above). If a value L exists such that for all values v in the value space the statement v>=L is true, L represents the lower bound of the value space (bounded below). If the datatype has both an upper and lower bound, then that datatype is bounded.
  • Cardinality: Some value spaces have a finite set of values. Others have an unlimited set of values. A datatype has the cardinality of its value space, which is either "finite" or "countably infinite".
  • Numeric: If the values of the datatype are quantities in any mathematical number system, then the datatype is numeric. Everything else is non-numeric.

The non-fundamental or constraining facets are optional properties that you can apply to a datatype to constrain its value space. The following facets do this for you:

  • length: This facet has a different meaning depending on the base type. If the type derives from string, length measures units of Unicode code points (i.e., characters). For binary datatypes this facet is measured in octets (8 bits) of binary data. List datatypes (e.g., NMTOKENS and IDREFS) use this facet to indicate the number of list items.
  • minLength: Sets the minimum number of units of length. The value of the facet must be a nonNegativeInteger.
  • maxLength: Sets the maximum number of units of length. The value of the facet must be a nonNegativeInteger.
  • pattern: This constrains the value space to values that match a regular expression defined by the pattern facet.
  • enumeration: Constrains the value space to a specified set of values. This does not impose an order on the resulting value space; any ordering comes from the enumeration's base type.
  • maxInclusive: States the upper bound for an ordered datatype. Inclusive means that the upper bound is also in the value space. For an upper bound U, all values v must be v<=U.
  • maxExclusive: States the upper bound for an ordered datatype. Exclusive means that the upper bound is not in the value space. For an upper bound U, all values v must be v<U.
  • minInclusive: States the lower bound for an ordered datatype. Inclusive means that the lower bound is also in the value space. For a lower bound L, all values v must be v>=L.
  • minExclusive: States the lower bound for an ordered datatype. Exclusive means that the lower bound is not in the value space. For a lower bound L, all values v must be v>L.
  • precision: Used for value types derived from decimal, this facet defines the maximum number of decimal digits. Its value must be a positiveInteger. (The final XML Schema Recommendation names this facet totalDigits.)
  • scale: Used for value types derived from decimal, this facet defines the maximum number of decimal digits in the fractional part of the value. Its value must be a nonNegativeInteger. (The final XML Schema Recommendation names this facet fractionDigits.)
  • encoding: Used to form the lexical space for datatypes derived from binary. Its value must be either hex or base64. If the value is hex, the value consists of the two hexadecimal digits needed to represent the octet code. For example, "20" is the hex value for the US-ASCII space character. If the value is base64, the binary stream must use the Base64 Content-Transfer-Encoding defined in Section 6.8 of RFC 2045.
  • duration: Set of values for datatypes derived from recurringDuration. Its value must be a timeDuration.
  • period: Set of values for datatypes used to define the period for datatypes derived from recurringDuration. Its value must be a timeDuration.

Using all of these facets you can constrain existing datatypes. This helps perform tasks such as data validation and verifying the overall "correctness" of an XML document.
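
To give a feel for how constraining facets narrow a value space, here is a minimal Python sketch. The function satisfies_facets and its keyword names are invented for this illustration; a real schema processor does this work for you, and XML Schema's regular-expression dialect differs in places from Python's, so treat the pattern check as an approximation.

```python
import re


def satisfies_facets(value, *, min_inclusive=None, max_inclusive=None,
                     pattern=None, enumeration=None, max_length=None):
    """Return True if value passes every facet that was supplied."""
    if min_inclusive is not None and not value >= min_inclusive:
        return False
    if max_inclusive is not None and not value <= max_inclusive:
        return False
    # Like the pattern facet, the whole lexical form must match.
    if pattern is not None and re.fullmatch(pattern, str(value)) is None:
        return False
    if enumeration is not None and value not in enumeration:
        return False
    if max_length is not None and len(str(value)) > max_length:
        return False
    return True


# A copyright year restricted to an assumed, purely illustrative range.
print(satisfies_facets(1957, min_inclusive=1450, max_inclusive=2100))
print(satisfies_facets(1200, min_inclusive=1450, max_inclusive=2100))

# An enumeration and a pattern applied to string values.
print(satisfies_facets("stone-tablet", enumeration={"soft-cover", "hard-cover"}))
print(satisfies_facets("0-13-014714-1", pattern=r"\d-\d{2}-\d{6}-\d"))
```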

Datatypes

Combined with facets, the XML Schema datatypes can help you give meaning to the items contained by your schema. For a comprehensive listing of all available data types, look at http://www.w3.org/TR/xmlschema-2/#built-in-datatypes.

XML Namespaces

We already saw these in use in the last section, XML Schemas. Simply put, namespaces define a set of unique names within a given context. A namespace can use any URI as long as that URI is unique. For example, the preceding document bound the prefix myLibrary to the namespace http://www.scottseely.com/LibrarySchema.xml. The URI, which here points at the schema file LibrarySchema.xml, uniquely identifies the namespace.

What does a namespace do for us? It allows us to create multiple elements with the same name (such as postOffice:address and memory:address). Putting these similar structures into unique namespaces helps prevent the concepts from clashing with each other and allows the computer to unequivocally determine which structure is being referenced. This same practice exists in C++, Java, C#, and a number of other languages. A number of arguments exist that are both for and against namespaces. Many of the arguments against namespaces boil down to the idea that namespaces are a solution in search of a problem. The arguments for them state that developers are better off when they do not have to re-architect an application because someone else used a function with the same name. With regards to the C language and pre-standardized C++, people avoided collisions with things such as standard library functions all the time. For better or worse, these same people also had to avoid collisions with names of functions supplied by various vendors. Often, the code supplied by these vendors would clash with functions written by the developer. In Java, the location of the package often defines the namespace. For example, you can create two classes named Foo and differentiate them by putting them in different packages (com.scottseely.foo is different from com.prenticehall.foo). This is a bit off the topic, but here is an example of how namespaces work in C++.

Developer code:

#include "someVendorHeader.h"

void someFunc() 
{
    // Code to do something
}

Inside of the vendor's header file, they have a function with the same signature as someFunc(), which means that the code will not compile. To fix this, the programmer can write this:

#include "someVendorHeader.h"

namespace myFuncs
{
    void someFunc() 
    {
        // Code to do something, even call the
        // vendor's function! The leading :: says
        // to use the function in the global namespace.
        ::someFunc();
    }
}

Problem solved! With the various DTDs and schemas being created, the creators of XML namespaces figured that they could learn from others and include similar functionality. Let's look back at the schema example and the lines that define the namespace:

<myLibrary:Library xmlns:myLibrary=
    "http://www.scottseely.com/LibrarySchema.xml">

Regardless of the name of the schema in LibrarySchema.xml, the prefix that stands for the enclosing namespace is myLibrary. The prefix could have easily been x, Bob, or yth443. By using a namespace, we make it possible to use many different schemas that define Book. Imagine that you are an online bookseller. All of your vendors ship you their catalogs via XML. Each vendor defines the Book element slightly differently. Because there is no standardization, you have to read in the various catalogs and normalize the data for your database. Namespaces can help you do this by putting each Book definition into a uniquely identified namespace. (Other examples abound: You could use namespaces to aggregate job databases, stock market data, or cooking recipes.)
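
One way to convince yourself that the name in front of the colon is only a local label is to parse the same element twice with different labels. The Python sketch below uses the standard xml.etree.ElementTree parser, which reports each element name as the namespace URI in braces followed by the local name.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.scottseely.com/LibrarySchema.xml"

# The same element, written with two different labels for the namespace.
first = ET.fromstring(f'<myLibrary:Library xmlns:myLibrary="{NAMESPACE}"/>')
second = ET.fromstring(f'<Bob:Library xmlns:Bob="{NAMESPACE}"/>')

print(first.tag)                # the URI in braces, then the local name
print(first.tag == second.tag)  # the label chosen by the author plays no part
```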

Namespaces also come in handy for creating self-documenting XML. If you are using schemas from many different sources, using namespaces will help the human reader know where the various bits of data came from. Within an XML document, a default namespace (one declared with xmlns and no prefix) remains active for the element declaring it and all elements contained by the declarer. Likewise, if an inner element declares a different default namespace, then it and all of its inner elements use the new namespace. To see this, consider the following example. The second book element and its children are in the inner namespace; everything else is in the outer namespace.

<library xmlns=
    "http://www.scottseely.com/library">
    <book>
        <title>The XML Handbook</title>
        <author>Charles F. Goldfarb</author>
        <author>Paul Prescod</author>
    </book>
    <book xmlns="http://www.phptr.com/book">
        <title>Windows Shell Programming</title>
        <writer>Scott Seely</writer>
    </book>
</library>

These scoping rules help out by reducing the verbosity of the XML document. An XML document may also mix and match namespaces within a single element. The scoping rules outlined above still apply—they just seem to get a bit more complex. Consider this example that matches up a book with some Library of Congress information.

<library xmlns=
    "http://www.scottseely.com/lib"
    xmlns:LOC=
        "http://www.libraryofcongress.gov/book">
    <book>
        <title>The XML Handbook</title>
        <author>Charles F. Goldfarb</author>
        <author>Paul Prescod</author>
        <LOC:ISBN>0-13-014714-1</LOC:ISBN>
    </book>
</library>

In the above example, the elements library, book, title, and author are all part of the default namespace, http://www.scottseely.com/lib. ISBN exists as a part of the namespace bound to the LOC prefix.
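
A program reading a mixed-namespace document has to name the namespace of each element it looks for. The following Python sketch shows one way to do that with xml.etree.ElementTree. The sample document declares the library namespace as the default (xmlns with no prefix) so that its unprefixed elements belong to it; the prefixes in the lookup table are local to the program and need not match the ones in the document.

```python
import xml.etree.ElementTree as ET

document = """<library xmlns="http://www.scottseely.com/lib"
    xmlns:LOC="http://www.libraryofcongress.gov/book">
    <book>
        <title>The XML Handbook</title>
        <LOC:ISBN>0-13-014714-1</LOC:ISBN>
    </book>
</library>"""

root = ET.fromstring(document)

# Prefix-to-URI table used only for searching in this program.
namespaces = {
    "lib": "http://www.scottseely.com/lib",
    "loc": "http://www.libraryofcongress.gov/book",
}

title = root.find("lib:book/lib:title", namespaces).text
isbn = root.find("lib:book/loc:ISBN", namespaces).text
print(title, isbn)
```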

Namespaces combined with schemas provide some great opportunities for document validation and ease of readability. The XML Namespace may reference a targetNamespace that is part of a XML Schema known to both the reader and writer of the XML document. Fortunately, this is not the only possibility. A namespace can be used to simply make the named element unique within the XML document.

XML Attributes

All of the XML documents presented in this chapter have used elements to present data. XML also supports attributes. We saw these used within the XML Schema examples. As stated earlier, elements require begin and end tags. Attributes do not. Instead, they are contained by the begin tag of an element. A given element can have one or more child elements of the same name. It can only have one attribute of any given name. The following XML is legal:

<Library>
    <Book title="Windows Shell Programming">
        <Author>Scott Seely</Author>
    </Book>
</Library>

The Book element has an attribute, title, which gives the title of the book. The XML expresses Author as a sub-element. This could have easily been expressed as another attribute and been 100% valid.

<Library>
    <Book title="Windows Shell Programming"
        author="Scott Seely" />
</Library>

How would you express a book with more than one author? You could try this:

<Library>
    <Book title="The XML Handbook" 
         author="Charles F. Goldfarb" 
         author="Paul Prescod">
    </Book>
</Library>

As I mentioned already, the above fragment is not well-formed: you cannot have two attributes with the same name on one element. You could achieve a similar effect by writing this:

<Library>
    <Book title="The XML Handbook">
         <author name="Charles F. Goldfarb" />
         <author name="Paul Prescod" />
    </Book>
</Library>
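
The difference between the last two fragments shows up as soon as a parser reads them. This Python sketch, using the standard xml.etree.ElementTree parser, feeds it both: the fragment with two author attributes is refused outright, while the version with author elements parses and yields both names.

```python
import xml.etree.ElementTree as ET

duplicate_attributes = """<Library>
    <Book title="The XML Handbook"
         author="Charles F. Goldfarb"
         author="Paul Prescod">
    </Book>
</Library>"""

author_elements = """<Library>
    <Book title="The XML Handbook">
         <author name="Charles F. Goldfarb" />
         <author name="Paul Prescod" />
    </Book>
</Library>"""

# Two attributes with the same name: the parser raises an error.
try:
    ET.fromstring(duplicate_attributes)
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("parser rejected the fragment:", error)

# Repeated child elements are fine, and each carries its own attribute.
library = ET.fromstring(author_elements)
names = [author.get("name") for author in library.iter("author")]
print(names)
```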

The author element uses a name attribute to contain the names of any writers associated with the Book. Because these are empty elements, the fragment uses the empty element notation: "/>". Attributes can be declared in three different ways:

  • Well-formed XML with no DTD or schema
  • Well-formed XML using a DTD
  • Well-formed XML using schema

The above examples use option 1. This works well for learning but poorly for production environments. As mentioned earlier, SOAP forbids the use of DTDs, so we will not investigate that option. This leaves us with option 3, schema. When creating attributes for an XML Schema, you use the attribute keyword. This word only has meaning within the schema namespace. You use attribute inside a type definition to declare an attribute carried by every element of that type. To create the book example using attributes, the schema would look something like this:

<xsd:schema xmlns:xsd=
        "http://www.w3.org/2001/XMLSchema"
    targetNamespace=
        "http://www.scottseely.com/LibrarySchema.xml"
    xmlns:lib=
        "http://www.scottseely.com/LibrarySchema.xml">
    <xsd:complexType name="Author">
        <xsd:attribute name="name" type="xsd:string" />
    </xsd:complexType>
    <xsd:complexType name="Book">
        <xsd:sequence>
            <xsd:element name="author" type="lib:Author"
                maxOccurs="unbounded" />
        </xsd:sequence>
        <xsd:attribute name="title" type="xsd:string" />
    </xsd:complexType>
</xsd:schema>

Looking at the title attribute definition, we see that it specifies the datatype (string) and the name of the attribute. Fairly easy, right? The full syntax for an attribute is:

<attribute
    default="default value"
    fixed="fixed value"
    form="{qualified | unqualified}"
    id="ID"
    name="NCName"
    ref="QName"
    type="QName"
    use="{optional | prohibited | required}" />

  • default: The default value for the attribute. This value must be legal. For example, enumerations can only use elements in the enumeration as the default value.
  • fixed: If the attribute is filled in, it must have the value specified by fixed. If the attribute is not filled in, then it takes the fixed value. A declaration cannot specify both default and fixed.
  • form: States whether or not the attribute itself must be explicitly prefixed in the XML document the attribute appears in.
  • id: This is a unique ID for the attribute within the XML Schema Document.
  • name: Names the attribute. A declaration must have either a name or a ref in order to be valid.
  • ref: This refers to another attribute declared elsewhere in the XML Schema Document. It allows one to reuse attributes in several type definitions.
  • type: The data type for the attribute.
  • use: Indicates how the attribute is to be used within the document. The values are descriptive enough.

For an example using several of these fields, let's add a new attribute, format, to the Book schema.

<xsd:schema xmlns:xsd=
        "http://www.w3.org/2001/XMLSchema"
    targetNamespace=
        "http://www.scottseely.com/BookSchema.xml"
    xmlns:lib=
        "http://www.scottseely.com/BookSchema.xml">
    <xsd:simpleType name="Format">
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="soft-cover" />
            <xsd:enumeration value="hard-cover" />
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:complexType name="Author">
        <xsd:attribute name="name" type="xsd:string" />
    </xsd:complexType>
    <xsd:complexType name="Book">
        <xsd:sequence>
            <xsd:element name="author" type="lib:Author"
                maxOccurs="unbounded" />
        </xsd:sequence>
        <xsd:attribute name="title" type="xsd:string" />
        <xsd:attribute name="format" type="lib:Format"
            default="soft-cover" use="optional" />
    </xsd:complexType>
    <xsd:element name="Book" type="lib:Book" />
</xsd:schema>

Using this schema for one title, we would have the following XML:

<lib:Book xmlns:lib=
    "http://www.scottseely.com/BookSchema.xml"
    title="The XML Handbook">
    <author name="Charles F. Goldfarb" />
    <author name="Paul Prescod" />
</lib:Book>

If a program requested the format attribute from the Book element, it should get back the value soft-cover. Viewing this XML document in Internet Explorer 5.5 yields the data shown in Figure 1.

Figure 1. Using Microsoft Internet Explorer to view XML documents

Internet Explorer will not flag invalid data, but it will flag properly (and improperly) structured data. For example, you could set the format attribute to "stone-tablet" and Internet Explorer would still display the document.
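
The default value and the enumeration only take effect when a schema-aware processor reads the document. A parser that never looks at the schema, such as Python's xml.etree.ElementTree, simply reports the format attribute as absent and accepts any value it is given. The sketch below assumes the program applies both rules itself; the function name and constants are made up for this example.

```python
import xml.etree.ElementTree as ET

# Values copied by hand from the schema's enumeration and default.
ALLOWED_FORMATS = {"soft-cover", "hard-cover"}
DEFAULT_FORMAT = "soft-cover"


def book_format(book):
    """Return the format of a Book element, applying the default by hand."""
    value = book.get("format", DEFAULT_FORMAT)
    if value not in ALLOWED_FORMATS:
        raise ValueError(f"format {value!r} is not in the enumeration")
    return value


plain = ET.fromstring('<Book title="The XML Handbook" />')
hard = ET.fromstring('<Book title="The XML Handbook" format="hard-cover" />')
odd = ET.fromstring('<Book title="The XML Handbook" format="stone-tablet" />')

print(plain.get("format"))  # None: the parser knows nothing about the schema
print(book_format(plain))   # the default, supplied by book_format
print(book_format(hard))

try:
    book_format(odd)
    flagged = False
except ValueError:
    flagged = True
print(flagged)
```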

Summary

This chapter presented just enough information to make SOAP accessible to you. Many Internet technologies use URIs to express locations and other concepts. You must understand how these are formed and what they mean in order to appreciate their usefulness when used by other markup languages and protocols. After discussing the basics, we took a quick look at XML. Since this language came onto the scene in late 1997, many new ideas have been layered on top of it. Besides XML Schemas and Namespaces we have also seen other technologies layered on top of XML. Among the proposals winding their way through the W3C approval process are:

  • XML Style Language (XSL): Specifies a way of converting documents from one format to another. XSL will let you convert documents between various schemas, generate text files, or create an HTML view of the data.
  • XML Schema: An improvement over DTDs that allows the author to specify data types, maximum and minimum values, enumerations, and other items.
  • XPointer: An extension and customization of XPath. (XPath is already a W3C recommendation. In W3C jargon, "recommendation" refers to the accepted, ratified standard.) For example, XPath is used within XSL to specify which element to transform.
  • XLink: Gives XML authors the ability to establish relations within documents as well as between them.
  • XML Query Language: Allows an external entity to query an XML document for specific data.

At the request of my reviewers, I have to point out that the above synopses are very limited descriptions of all you can do with the various W3C recommendations and their related implementations. Many of the specifications are fairly long. I would recommend visiting www.w3c.org to read the current overviews of the various technologies if one looks interesting to you.

Of course, there are many other ideas related to XML winding their way through the standards process, such as SOAP. While working with SOAP, you will find it useful to keep XML reference material handy. I went through a lot of effort to make sure this book stands on its own. Still, it is hard to cover XML in a chapter. Fortunately, a lot of good books exist. The best all-around book on the market that I have found is The XML Handbook by Charles F. Goldfarb and Paul Prescod. Mr. Goldfarb has been involved with SGML (and consequently XML) since its inception. If you have good financial resources you should also purchase the XML Developers Toolkit. The Toolkit contains three books at a reduced price. Even though Prentice Hall publishes these books, I do not recommend them just to keep my publisher happy. These truly are the best books I own regarding XML, and I went through a lot of books before I found these.

At this point you should understand enough about XML to make the SOAP specification readable. Let's get moving and cover the specification!