Putting XML to Work

This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Greater Office

This content is no longer actively maintained. It is provided as is, for anyone who may still be using these technologies, with no warranties or claims of accuracy with regard to the most recent product version or service release.

Parsing XML Data with MSXML

By Mike Gunderloy

Unless you've been on a long vacation lately, you've probably run across Extensible Markup Language (XML), which is being widely touted as the next great breakthrough for communication between applications. But you may be a bit fuzzy on how to use it in your own applications.

In brief, XML provides a text format for transmitting structured information. The good news is that you don't have to write your own parser for this format. Starting with Internet Explorer 4.01, Microsoft has shipped their own XML parser, MSXML, as a COM component. In this article, I'll show you how to use the MSXML parser from VBA to quickly and easily extract information from an XML document.

Why would you want to do this? The fact is that XML is one of this year's hot new terms in personal computing, and Microsoft is making a concerted push to incorporate it into all aspects of their product line. Other major software companies - from IBM to Oracle - have also produced XML tools and interfaces. With that much pressure behind XML as a standard, there's little doubt that you'll be faced with an XML data stream, sooner rather than later. Knowing how to read it quickly and easily will let you make the best use of your time in working with this data.

Understanding XML

XML is a huge topic, and one which very few people understand in depth. If you want to see the full standard, you need to go the World Wide Web Consortium (W3C) Web site at http://www.w3.org/Consortium/siteindex#X. A warning, though: The standard is very heavy going if you're not used to interpreting such legalistic documents. You might find the annotated version of the specification at http://xml.com/pub/axml/axmlintro.html more useful, or some of the articles at Microsoft's MSDN XML Developer Center (http://msdn.microsoft.com/en-us/xml/default.aspx).

Fortunately, if all you want to do is read and parse an XML file from someone else, and you can make the assumption that the document is valid XML, you don't have to know the entire standard. In fact, there are just a few key terms you need to understand to get started in the XML world. You can always pick up more later, but for now, just concentrate on these four ideas:

  • A document is a chunk of XML. You can think of an XML document as the equivalent of an HTML page. It's a bit of XML that's served up by some Web server, or other data feed, as a single chunk.
  • An XML declaration is a processing directive that identifies the document as being XML, and includes the XML version.
  • Elements are the building blocks of the document. If you're familiar with HTML, HTML tags are the rough equivalent of XML elements. One of the key distinctions is that in XML, new elements may be defined by any document.
  • Attributes describe some feature of an element.

Of course, I've left a lot out to produce this simple picture of XML. In particular, I'm ignoring the topic of Document Type Declarations (DTDs). A DTD is a section of XML that explains the rules for constructing a particular XML document, and it's important because it allows you to validate that an XML document is actually in the intended format. Since I'm assuming the source document comes from someone who knows how to construct XML, I can ignore this piece of the puzzle - at least for this article.

An XML Example

For this article, I'll be parsing the sample file shown in FIGURE 1, traffic.xml. This file tracks Web site visits to a fictitious Web site by country/region of origin. If you refer to the figure, you'll be able to identify all the parts of XML that I talked about in the previous section:

  • The document is the entire file.
  • The XML declaration is the first line in the file, <?xml version="1.0"?>. This shows that the current document is XML, and that it meets the standards of version 1.0 of the XML specification.
  • The elements of the document include SiteVisits, Country, TotalVisits, and LatestVisit. As you can see, each element is marked by a starting and ending tag, and elements can be nested.
  • CountryName is an attribute of the Country element.
<?xml version="1.0"?> 
      <SiteVisits> 
        <Country CountryName="USA"> 
          <TotalVisits>1348</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="UK"> 
          <TotalVisits>764</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Argentina"> 
          <TotalVisits>175</TotalVisits> 
          <LatestVisit>1/2/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Brazil"> 
          <TotalVisits>182</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Canada"> 
          <TotalVisits>688</TotalVisits> 
          <LatestVisit>1/3/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Denmark"> 
          <TotalVisits>204</TotalVisits> 
          <LatestVisit>1/1/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Germany"> 
          <TotalVisits>351</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Hong Kong"> 
          <TotalVisits>97</TotalVisits> 
          <LatestVisit>12/30/1999</LatestVisit> 
        </Country> 
        <Country CountryName="Ireland"> 
          <TotalVisits>522</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Malaysia"> 
          <TotalVisits>14</TotalVisits> 
          <LatestVisit>12/31/1999</LatestVisit> 
        </Country> 
        <Country CountryName="Netherlands"> 
          <TotalVisits>542</TotalVisits>
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="New Zealand"> 
          <TotalVisits>599</TotalVisits> 
          <LatestVisit>1/3/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Norway"> 
          <TotalVisits>452</TotalVisits> 
          <LatestVisit>1/3/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Scotland"> 
          <TotalVisits>538</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Sweden"> 
          <TotalVisits>422</TotalVisits> 
          <LatestVisit>1/2/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Wales"> 
          <TotalVisits>301</TotalVisits> 
          <LatestVisit>1/1/2000</LatestVisit> 
        </Country> 
        <Country CountryName="Zambia"> 
          <TotalVisits>42</TotalVisits> 
          <LatestVisit>1/4/2000</LatestVisit> 
        </Country> 
      </SiteVisits>

FIGURE 1: The example XML file, traffic.xml, as ASCII text

You can see that the traffic.xml file consists of markup (the mechanics of XML: angle brackets, element names, and so on) and data (the country/region names and numbers of visitors). Suppose someone sent you this XML document once a day, and your job was to extract the data and display it in Microsoft Excel? Because the structure is so simple, it's easy to see how you might start breaking it apart with VBA string functions, looking for angle brackets, element names, and data. You could build a special-purpose parser that understood the structure of this particular XML document, and feed the new document into your parser on a daily basis.

But there's a better way. Starting with Internet Explorer 4.01, Microsoft has been shipping a generalized XML parser with every Web browser. Even better, this generalized parser is available via COM. So there's no need to create your own!

The MSXML Object Model

The parser in question is called simply the Microsoft XML library, or MSXML. To understand the object model used by MSXML, it's helpful to think of an XML document, not as a nested set of elements, but as a tree. FIGURE 2 shows the traffic.xml document open in Microsoft XML Notepad, which uses the tree metaphor. (Microsoft XML Notepad is in beta as of this writing. You can download it at http://www.microsoft.com/downloads/details.aspx?familyid=72d6aa49-787d-4118-ba5f-4f30fe913628&displaylang=en.) Rather than showing the TotalVisits and LatestVisit elements as nested within the Country element, they're presented as child nodes on a tree of nodes. This tree metaphor is also implemented by the MSXML parser.


FIGURE 2: Another view of traffic.xml, i.e. as it appears in Microsoft XML Notepad

The MSXML library provides a wide range of objects that you can use to deal with the complexity of XML. Just as most of that complexity is beyond the scope of this article, most of the objects are unnecessary for the simple job of parsing straightforward XML. In fact, you can do the job with only four objects:

  • The XMLDocument object represents an entire XML document.
  • The IXMLDOMNode object represents a single entity (node in the tree).
  • The IXMLDOMNodeList object represents a collection of child nodes for a particular entity.
  • The IXMLDOMNamedNodeMap object represents a collection of attributes for an entity.

You'll note that the naming of these objects is a bit peculiar. DOM stands for Document Object Model, a general notion that documents of any sort can be assigned to a particular object model. This library provides the particular object model for XML. The "I" stands for interface, as most of these objects are actually COM interfaces.

The table in FIGURE 3 shows some of the properties and methods of these four objects, including all of the ones that I'll be using in this article. For a complete list, refer to the XML Developer's Guide in the Web Services section of the Platform SDK, or just use the object browser built into VBA to explore the objects. In particular, I've omitted the rich set of methods concerned with modifying XML documents, since the example here is only designed to read an existing document.

Object

Member

Type

Description

XMLDocument

async

Property

Set to True to enable asynchronous loading

XMLDocument

attributes

Property

List of attributes for the root node of the document

XMLDocument

childNodes

Property

Returns a NodeList containing all the children of the root node

XMLDocument

documentElement

Property

Returns the root node itself

XMLDocument

getElementsByTag

Method

Returns a collection of elements having the specified tag

XMLDocument

hasChildNodes

Property

True if the root node has children

XMLDocument

load

Method

Loads an XML document from a disk file or URL

XMLDocument

loadXML

Method

Loads an XML document specified as a string

XMLDocument

url

Property

Returns the URL for the most recently loaded document

XMLDocument

validateOnParse

Property

Set to True to validate the XML as it's loaded

IXMLDOMNode

attributes

Property

List of attributes for this node

IXMLDOMNode

childNodes

Property

Returns a NodeList containing all the children of this node

IXMLDOMNode

hasChildNodes

Property

True if this node has children

IXMLDOMNode

nodeName

Property

The XML name of the node

IXMLDOMNode

nodeTypedValue

Property

Formatted value of the node

IXMLDOMNode

parentNode

Property

Pointer to the parent node of this node

IXMLDOMNode

parsed

Property

True if this node and its children have been parsed

IXMLDOMNode

text

Property

Unformatted text of this node

IXMLDOMNodeList

item

Property

Zero-based index to the members of this collection

IXMLDOMNodeList

length

Property

Count of members in this collection

IXMLDOMNamedNodeMap

getNamedItem

Method

Returns the attribute with the specified name (if any)

IXMLDOMNamedNodeMap

item

Property

Zero-based index to the members of this collection

IXMLDOMNamedNodeMap

length

Property

Count of members in this collection

FIGURE 3: Representative methods and properties of selected XML objects

Using MSXML

Now that you've seen the basic objects you need to read XML, it's time to see what you can do with them. FIGURE 4 shows a presentation of the data from traffic.xml in an Excel 2000 worksheet. Note that the worksheet really doesn't contain any more information than the original XML file; it's just been transformed into a format that's easier for human beings to grasp.

FIGURE 4: Data from traffic.xml file as an Excel 2000 worksheet.

The code to generate the worksheet and embedded chart is shown in Listing One. It's tied to the Load button on the worksheet. When the user clicks this button, the VBA code loads and parses an XML file, and uses it to generate the worksheet.

The procedure starts, of course, by declaring variables and setting up an error handler. Then it calls the Load method of the XMLDocument object. This sets up the association between the object model and the actual XML. In this particular case, I don't want the code to proceed until the XML is fully loaded, and I don't care about validating the XML (because I trust the XML supplier). So the applicable part of the code looks like this:

oDoc.async = False
      oDoc.validateOnParse = False
      fSuccess = oDoc.Load(ActiveWorkbook.Path 
      & "\traffic.xml") 

Note that in this example, the XML is being loaded from a disk file in the same folder as the worksheet itself. However, the Load method is a good deal more flexible than that; it can also accept a URL. In fact, the odds are good that you'll actually be using that method to retrieve XML from a Web server somewhere that a supplier is maintaining for you. So in a real example, the load line would look something like this:

fSuccess = oDoc.Load( _
        "http://TheirServer.com/traffic/traffic.xml") 

Once the XML document is loaded, the MSXML object model makes it simple to pick it apart into its component pieces. The documentElement property of the XMLDocument object returns an IXMLDOMNode object that represents the root node of the XML tree. This particular code then uses the childNodes property of that root node object with a ForEach loop to visit each of the top-level child nodes in turn. If you refer back to the XML in FIGURE 1, you'll see that this amounts to passing through the loop once for each country/region in the XML file.

The name of each country/region is available as an XML attribute. To get this attribute requires three steps. First, use the attributes property of the node to get the IXMLDOMNamedNodeMap object representing all the attributes of the node. Then, use the getNamedItem method of the node map to get the particular attribute we're interested in. This attribute is returned as a node. Finally, use the text property of the node to retrieve the country/region name:

SetoAttributes = oCountry.Attributes
      SetoCountryName = oAttributes.getNamedItem("CountryName") 
      ActiveSheet.Cells(intI, 1).Value = oCountryName.Text

Retrieving the visit information makes use of the fact that the properties and methods of nodes are recursive (representing the nested nature of XML). Given the country/region node, the code can walk through its children to find the nodes representing the total visits and the most recent visit. The only complication comes about in that the IXMLDOMNodeList object (the collection of nodes) doesn't support retrieving items by name. So the code has to step through all the children and decide what to do with each child based on its nodeName property:

ForEachoChild In oChildren
        If oChild.nodeName = "TotalVisits" Then
          ActiveSheet.Cells(intI, 2) =  oChild.nodeTypedValue
        EndIf
        If oChild.nodeName = "LatestVisit" Then
          ActiveSheet.Cells(intI, 3) = oChild.nodeTypedValue
        End If
      Next oChild

Once you understand these code snippets, you've grasped all you need to know to parse simple XML documents using the MSXML object model. There is other code in the VBA procedure in Listing One, but it's all Excel code, concerned with clearing the worksheet and generating the chart. The core XML parsing code is really quite simple.

XML in Your Future

If you're a VBA developer, it's very likely that there is XML somewhere in your future. Consider just a few recent moves from Microsoft:

  • Windows DNA uses XML as a pervasive protocol to communicate between different tiers of distributed applications.
  • Microsoft is the major supporter of the BizTalk initiative, which seeks to define standard XML documents for use in business.
  • The future Microsoft BizTalk Server and Microsoft Host Integration Server will help integrate XML with legacy and Web environments.
  • The XML-based SOAP protocol, submitted to the IETF for approval, will enable distributed COM and other object services via the Internet.
  • SQL Server 2000 will support importing and exporting data in XML formats.

Remember 1995, when just about every new Microsoft application included HTML support in one form or another? Well, the year 2000 is shaping up as the year for pervasive XML support in Microsoft products. While I'm not aware of any XML announcements directly related to VBA, I'd say it's a safe bet that the VBA and Office teams won't be left out of this initiative.

Fortunately, although XML can be complex, you can dodge most of the complexity if you're just using it for data interchange. I hope this article has convinced you that parsing XML is no big deal, and that the tools already exist to integrate this into your VBA applications more easily than many other types of data.

Mike Gunderloy (mailto:MikeG1@mcwtech.com) is a Senior Consultant with MCW Technologies, a Microsoft Solution Provider. He's also the author of Visual Basic Developer's Guide to ADO (SYBEX, 1999) and the forthcoming Visual Basic and VBA Developer's Guide to the Windows Installer (SYBEX, 2000).

Begin Listing One - The cmdLoad_Click Sub procedure

PrivateSubcmdLoad_Click()
        DimoDoc As MSXML.DOMDocument
        DimfSuccess AsBoolean
        DimoRoot As MSXML.IXMLDOMNode
        DimoCountry As MSXML.IXMLDOMNode
        DimoAttributes As MSXML.IXMLDOMNamedNodeMap
        Dim oCountryName AsMSXML.IXMLDOMNode
        DimoChildren As MSXML.IXMLDOMNodeList
        DimoChild As MSXML.IXMLDOMNode
        DimintI AsInteger
         On ErrorGoTo HandleErr
         SetoDoc = New MSXML.DOMDocument
         ' Load the  XML from disk, without validating it. Wait 
        ' for the load to finish before proceeding. 
        oDoc.async = False
        oDoc.validateOnParse = False
        fSuccess = oDoc.Load( _
          ActiveWorkbook.Path & "\traffic.xml") 
         ' If anything went wrong, quit now. 
        If NotfSuccess Then
          GoToExitHere
        EndIf
       
        ' Set up a row counter. 
        intI = 5
         ' Delete the previous information. 
        ActiveSheet.Cells(4, 1).CurrentRegion.ClearContents
        ActiveSheet.Shapes(2).Delete
        ' Create column headers. 
        ActiveSheet.Cells(4, 1) = "Country" 
        ActiveSheet.Cells(4, 2) = "Total Visits" 
        ActiveSheet.Cells(4, 3) = "Latest Visit" 
         ' Get the root of the XML tree. 
        Set oRoot = oDoc.documentElement
         ' Go through all children of the root. 
        ForEachoCountry InoRoot.childNodes
          ' Collect the attributes for this country/region. 
          SetoAttributes = oCountry.Attributes
          ' Extract the country/region name and
          ' place it on the worksheet. 
          SetoCountryName = _
            oAttributes.getNamedItem("CountryName") 
          ActiveSheet.Cells(intI, 1).Value = oCountryName.Text
          ' Go through all the children of the country/region node. 
          SetoChildren = oCountry.childNodes
          ForEachoChild InoChildren
            ' Get information from each child node to the sheet. 
            IfoChild.nodeName = "TotalVisits" Then
              ActiveSheet.Cells(intI, 2) = oChild.nodeTypedValue
            EndIf
            IfoChild.nodeName = "LatestVisit" Then
              ActiveSheet.Cells(intI, 3) = oChild.nodeTypedValue
            EndIf
          NextoChild
          intI = intI + 1
        Next oCountry
         ' Now for some eye candy; build a chart of the data. 
        Charts.Add
        WithActiveChart
          .ChartType = xl3DPieExploded
          .SetSourceData Source:=Sheets("Sheet1"). _
            Range("A5:B" & CStr(intI - 1)), PlotBy :=xlColumns
          .Location Where:=xlLocationAsObject, Name:="Sheet1" 
        End With
         ActiveChart.HasTitle = True
        ActiveChart.ChartTitle.Characters.Text =_
          "Web Site Visits" 
         ActiveSheet.Shapes(2).Top = 0
        ActiveSheet.Shapes(2).Left = 200
       ExitHere: 
        ExitSub
       HandleErr: 
        MsgBox "Error " & Err.Number & ": " & Err.Description
        ResumeExitHere
        Resume
      End Sub

End Listing One