Advanced Basics

Office 2007 Files and LINQ

Ken Getz

Code download available at:AdvancedBasics2008_03.exe(162 KB)

Contents

Getting Started
Investigating the Code
Wrapping It Up

I n the June 2006 installment of this column, I described how you can programmatically interact with the Microsoft® 2007 Office system Open XML File Formats using classes in the System.IO.Packaging and System.Xml namespaces (see msdn.microsoft.com/msdnmag/issues/06/06/AdvancedBasics). As I was writing that column, I was aware of some new technologies that would make my code simpler, but they weren't quite ready yet. Now that Visual Studio® 2008 has been released, I'll revisit the code using LINQ to XML and the Community Technology Preview (CTP) edition of the Microsoft SDK for Open XML Formats, which you'll need to download from https://www.microsoft.com/en-us/download/details.aspx?id=5124 in order to compile and test the code presented here. These two technologies can change the way in which you access the parts of the 2007 Office system Open XML document (created by Microsoft Word, Excel®, and PowerPoint®) when retrieving or modifying data.

As I described in the June 2006 column, in order to programmatically interact with the Office Open XML File Formats you must be able to navigate to the exact document part you require and then interact with the XML in that part once you retrieve a reference to it. Usually you use the System.IO.Packaging namespace to navigate through the file format's hierarchy of parts. Once you get to the part that you want to program against, you can use any existing XML technology to interact with it; generally, you'll either use the XmlDocument class or the XmlReader and XmlWriter classes.

The classes in the System.IO.Packaging namespace certainly make it possible to navigate the file format's hierarchy of parts, but because it's so general (it works with any Open XML File Format, not just 2007 Office system documents), the code you write using this API is generally more repetitive than you might like. To make this navigation easier, members of the Office team at Microsoft have been working on an SDK that encapsulates the generalized System.IO.Packaging namespace, providing an API that's focused directly on the 2007 Office system file formats. I decided to use this SDK, even though it's currently only available as a CTP, because it so greatly simplifies code that works with Office documents.

The 2007 Office system Open XML SDK only handles the navigation to the individual part of interest; it is not used for manipulating the part's XML content to retrieve or modify the document information. In the earlier column, I used the XmlDocument class to work with document properties. Now, you can use LINQ to XML, a new feature in the .NET Framework 3.5 (exposed in Visual Studio 2008) to make working with the XML content far easier.

For this column, I'll modify the WDSetCustomProperty code, which was one of the forty snippets I wrote when the 2007 Office system first launched (go.microsoft.com/fwlink/?LinkId=106943). Before investigating the code in this article, you might want to examine the original code snippet to see how it works. (You'll find a walkthrough of the code, plus a video demonstration, at msdn2.microsoft.com/bb308936.)

You should also be relatively comfortable with the 2007 Office system Open XML File Formats, so begin by reviewing my earlier column. The code related to the 2007 Office system Open XML SDK is fairly self-evident, as you'll see when you review the sample code for this article. (In order to use the SDK, you'll need to download and install it, and within your project, set a reference to the Microsoft.Office.DocumentFormat.OpenXml.dll assembly included in the SDK.) The final technology I'll cover here, LINQ to XML, is a broad topic on its own, so all I'll do is discuss how I've used it in the context of this column. Although much has already been written on this important subject, you may want to start with the video presentation "How Do I Get Started with LINQ to XML?" at msdn2.microsoft.com/bb466226.

In creating the code sample for this column, 90 percent of my time was devoted to XML namespace manipulation. Visual Basic® provides some amazing support for LINQ to XML in its latest version, but it's easy to get tangled up in namespace manipulation. The description of the feature at msdn2.microsoft.com/bb687701 will be useful to you—it certainly helped me work things out!

Getting Started

In order to set or retrieve Word 2007 custom document properties, you'll need to work with the docProps/custom.xml part. As you attempt to set a property, one of four situations can occur:

  • The custom property part may not yet exist, in which case, you must create it.
  • The custom property may not exist, in which case, you must create it.
  • The custom property may exist, and its type matches the type of the new value. In this case, you simply set the property's value.
  • The custom property may exist, but its type differs from the type of the new value. In this case, delete the property, and recreate it using the new type. (You could devise other means of resetting the type, but the code you'll see here follows the same path if the property doesn't yet exist, or it does exist and has the wrong type. Either way, the code creates a new element for the property).

A sample set of custom properties might look like Figure 1.This particular document contains four custom properties: Author (Ken), Birthdate (1/1/1956), Chapter (2), revisions (5). Note that the XML content is scoped to the default namespace:

Figure 1 Some Custom Properties

<?xml version="1.0" encoding="utf-8" standalone="yes"?> <Properties xmlns:vt="https://schemas.openxmlformats.org/officeDocument/2006 /docPropsVTypes" xmlns="https://schemas.openxmlformats.org/officeDocument/2006 /custom-properties"> <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Author"> <vt:lpwstr>Ken</vt:lpwstr> </property> <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Birthdate"> <vt:filetime>1956-01-01T06:00:00Z</vt:filetime> </property> <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Chapter"> <vt:i4>2</vt:i4> </property> <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="5" name="Revisions"> <vt:i4>5</vt:i4> </property> </Properties>

https://schemas.openxmlformats.org/officeDocument/2006/custom-properties

The content also includes a non-default namespace that it uses to describe the property types:

vt="https://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"

Any code you write that works with this part needs to respect and maintain those namespaces or Word 2007 simply won't load your document (a fact I learned the hard way, after many failed attempts.)

Each property element contains an fmtid attribute with the same GUID value ({D5CDD505-2E9C-101B-9397-08002B2CF9AE}). Each property also includes a pid attribute, which must be a numeric value, starting with 2, and increasing in value. Although Word 2007 allows you to skip values (you can have properties with pid values of 2, 3, and 5, for example), these values cannot be the same, and they must appear in numeric order. Finally, each property includes a name attribute containing the name of the property.

Within each property element, you must add an element that contains the value of the property. The element name, in each case, indicates the type of value the property contains. The sample code provides an enumeration with a value representing each of these types:

Public Enum PropertyTypes YesNo Text DateTime NumberInteger NumberDouble End Enum

In the XML content, these types correspond to bool, lpwstr, filetime, i4, and r8, respectively. Each of these type names belongs to the XML namespace which has been assigned to the alias vt.

In order to make it easy to interact with XML namespaces, Visual Basic overrides the behavior of the file-level Imports statement. You can specify abbreviations for namespaces at the top of a file and use those abbreviations within the file. The sample code uses the following Imports statements:

Imports System.IO Imports Packaging = Microsoft.Office.DocumentFormat.OpenXml.Packaging Imports <xmlns="https://schemas.openxmlformats.org/officeDocument/2006 /custom-properties"> Imports <xmlns:vt="https://schemas.openxmlformats.org/officeDocument /2006/docPropsVTypes">

I've used the Packaging alias so that it's clear within the code when I'm using code from the 2007 Office system Open XML SDK. The code also includes Imports statements for the default namespace and the namespace with the alias vt. (The fact that you can import namespaces in this manner makes coding easier, but it does force one limitation: you can define only one default namespace per file. If you want to include XML code that uses a different default namespace, you'll need to place that code in a separate file. Of course, because you can split a single class into multiple files by creating partial classes, this isn't a restriction except with regard to file management in your project.)

In order to set the document property, you'll need to create a method that can be called like this:

WDSetCustomProperty("C:\YourDocument.docx", "PropertyName", _ "PropertyValue", PropertyTypes.Text)

If the property value you supply doesn't match the PropertyTypes enumerated value you specified, the code should raise an exception and return without attempting to perform any operation.

Throughout the sample code, you'll see examples of features that are new in Visual Studio 2008. Among other features that I'll call out specifically, you'll see code that uses implicit type declaration. Rather than specifying a type for each variable, Visual Basic uses type inference to infer the type at compile time. This means that you can write code like this:

Using wdPackage = _ Packaging.WordprocessingDocument.Open(docName, True) ... End Using

The explicit declaration would look like this:

Using wdPackage As Packaging.WordprocessingDocument = _ Packaging.WordprocessingDocument.Open(docName, True) ' Code removed here. End Using

Not specifying the exact type doesn't detract from the readability of the code, and Visual Basic compiles both versions in the same way. (This feature is controlled by the Option Infer file-level option, which is set to On by default.)

Investigating the Code

Now I'll investigate the sample code, explaining each block. The entire WDSetCustomProperty procedure is available in the code download on the MSDN® Magazine Web site. You might want to refer to it as the text here walks through the fragments of the procedure. It begins by investigating the parameters, converting the input value as necessary, as shown in Figure 2.

Figure 2 Checking the Types and Performing Conversions

Dim propertyTypeName As String = Nothing Dim propertyValueString As String = Nothing ' Calculate the correct type: Select Case propertyType Case PropertyTypes.DateTime propertyTypeName = "filetime" ' Make sure you were passed a real date, ' and if so, format in the correct way. ' The date/time value passed in should ' represent a UTC date/time. If TypeOf (propertyValue) Is DateTime Then propertyValueString = String.Format("{0:s}Z", _ Convert.ToDateTime(propertyValue)) End If Case PropertyTypes.NumberInteger propertyTypeName = "i4" If TypeOf (propertyValue) Is Integer Then propertyValueString = propertyValue.ToString() End If Case PropertyTypes.NumberDouble propertyTypeName = "r8" If TypeOf propertyValue Is Double Then propertyValueString = propertyValue.ToString() End If Case PropertyTypes.Text propertyTypeName = "lpwstr" propertyValueString = propertyValue.ToString() Case PropertyTypes.YesNo propertyTypeName = "bool" If TypeOf propertyValue Is Boolean Then ' Must be lower case! propertyValueString = _ Convert.ToBoolean(propertyValue).ToString().ToLower() End If End Select If propertyValueString Is Nothing Then ' If the code wasn't able to convert the ' property to a valid value, throw an exception: Throw New InvalidDataException("Invalid parameter value.") End If

Based on the enumerated value you supply, indicating the type of the value, the code attempts to convert the Object value you've provided to the correct type. Note that in order for Word 2007 to accept the values, date/time values must be in Universal Time Coordinate (UTC) time (that is, based on Greenwich mean time) and Boolean values must be in lower case. If, after investigating the values, the propertyValueString variable is still Nothing, the code raises an exception back to the caller, ending the procedure.

Next, the code uses the Microsoft SDK for Open XML Formats to find the document part and the custom properties part within the package. Compare this code to the code from the original snippet, and you'll have to agree that the SDK makes it far easier to navigate within the document package:

Using wdPackage = _ Packaging.WordprocessingDocument.Open(docName, True) Dim documentPart = wdPackage.MainDocumentPart Dim customPropsPart = wdPackage.CustomFilePropertiesPart ... End Using

If the code is able to retrieve a reference to the custom properties part (that is, if the customPropsPart variable isn't Nothing), the code loads the contents of the part using the XDocument.Load method (see Figure 3), allowing the GetStream method of the part to retrieve the XML contents. If the part doesn't exist, the code uses the SDK to create the custom properties part. If the part does exist, the code also attempts to retrieve the first property node for which the name attribute matches the property name you've specified. Again, compare this to the original code snippet—you'll be amazed how much easier this technique is.

Figure 3 Find the Custom Properties Part and Set Properties

Dim customDoc As XDocument = Nothing Dim propertyNode As XElement = Nothing If customPropsPart IsNot Nothing Then ' Load the existing custom properties part: customDoc = XDocument.Load( _ New StreamReader(customPropsPart.GetStream())) propertyNode = _ (From node In customDoc.<Properties>.<property> _ Where node.@name = propertyName).FirstOrDefault() Else customPropsPart = wdPackage.AddCustomFilePropertiesPart End If

This code statement makes use of LINQ to XML and the Visual Basic axis properties to simplify the XML manipulation:

propertyNode = _ (From node In customDoc.<Properties>.<property> _ Where node.@name = propertyName).FirstOrDefault()

The "." (period) separator indicates a node one level beneath the parent (customDoc). The <> syntax indicates an element name. The @ syntax indicates an attribute name. These are specific Visual Basic features (in C#, you would need to use the Element or Attribute property of the XDocument object to get the same information). The code also uses the FirstOrDefault LINQ extension method. This method returns the first matching node, or the default value (Nothing) if it finds no match. You could also use the First or Single extension methods, but each of those raises an exception if no match is found. Here, the code handles the case in which no match was found without raising an exception.

After this code executes, customPropsPart contains either a reference to the custom properties part, or Nothing. Likewise, propertyNode contains either a reference to the XElement object corresponding to the property you want to modify, or Nothing if the property doesn't yet exist.

If the propertyNode variable isn't equal to Nothing, the code retrieves the node's first child element, using the First method (in this case, if the code can't find the first child element, it raises an exception—the property node must contain an element, and if it doesn't, the code can't continue because the document is damaged). If the name of the child element matches the type of the new property value, the code changes the value of the element. If not, the code removes the node and sets the propertyNode variable to Nothing. After this code runs, the propertyNode variable either contains the new value or is set to Nothing, as you see in Figure 4.

Figure 4 Setting the PropertyNode Variable Value

If propertyNode IsNot Nothing Then Dim propertyContentNode = _ propertyNode.Elements.First If propertyContentNode.Name.LocalName = _ propertyTypeName Then ' Type is the same, so just change the value. propertyContentNode.Value = propertyValue.ToString() Else ' Type is different, so delete the node. propertyNode.Remove() propertyNode = Nothing End If End If

If the propertyNode variable is Nothing, it's time to create the node and set its content. The code in Figure 5 starts out by creating the fmtid constant and calculates the pid value to use. To calculate the pid to use for the new element, the code retrieves a reference to the Properties element and then finds the last child of the element. If the last child exists, the code retrieves the child's pid value, increments it, and uses the new value. Note the various LINQ to XML features used in Figure 5: the First and LastOrDefault methods provide simple ways to access child elements, given specific criteria.

Figure 5 Creating the Node

If propertyNode Is Nothing Then Const fmtid As String = _ "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" Dim pid As String = "2" Dim propertiesNode As XElement = Nothing If customDoc IsNot Nothing Then propertiesNode = customDoc.<Properties>.First Dim lastChild = propertiesNode.Elements.LastOrDefault If lastChild IsNot Nothing Then pid = (Integer.Parse(lastChild.@pid) + 1).ToString() End If End If ... End If

Next, the code uses some features specific to Visual Basic to create the new property node:

Dim propertyTypeXName As XName = _ GetXmlNamespace(vt) + propertyTypeName propertyNode = _ <property fmtid=<%= fmtid %> pid=<%= pid %> name=<%= propertyName %>> <<%= propertyTypeXName %>><%= propertyValueString %></> </property>

In order to create the property node's value element, the code requires an XName object containing the full type name (vt:i4, for example). To create that XName object, you use the new GetXmlNamespace keyword. Supplying the namespace abbreviation that you created in the Imports statement enables GetXmlNamespace to retrieve the corresponding XML namespace. Visual Basic overloads the + operator when used with XML namespaces, so you can write a statement like this to create the full XName instance:

Dim propertyTypeXName As XName = _ GetXmlNamespace(vt) + propertyTypeName

Next, the code uses one of the best new features in Visual Basic—replacement tokens within XML literals. Visual Basic allows you to embed XML content directly in the code, so you can refer to and create XML without needing to directly program against the Document Object Model (DOM) objects. In this case, the code creates a new XML element:

propertyNode = _ <property fmtid=<%= fmtid %> pid=<%= pid %> name=<%= propertyName %>> <<%= propertyTypeXName %>><%= propertyValueString %></> </property>

After this statement executes, the propertyNode variable contains an XElement object containing XML that looks like this:

<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Chapter"> <vt:i4>2</vt:i4> </property>

Comparing the two code fragments, it's easy to see how Visual Basic makes the replacements: each time it encounters a replacement token (delimited with <%= %>) it replaces the named token with its value. The tokens can even contain expressions, as you'll see later. Note that the replacement token for an element name must contain an XName reference, not a string or some other value. (I have to thank Avner Aharoni, of the Visual Basic team, for helping me work out this little detail.)

The final goal, before saving the custom XML part back to the document, is to recreate the entire contents of the part. The code must create the container XML document, insert all the existing custom properties, and then append the new custom property.

It's possible to take the existing custom properties part and simply call the Add method of the Properties element to add a new property child element (and that's how I originally wrote the code). Doing so bypasses an important feature of Visual Basic namespace handling. When you use the Add method, Visual Basic can't determine how it should handle namespace factoring, resulting in the insertion of redundant namespace information. In this case, Visual Basic inserts an explicit default namespace reference. Although the XML it creates is perfectly valid, the XML parser in Word fails when it tries to load the custom properties part. If you use the Add method, you'll often find that you must explicitly remove redundant namespace references. (Thanks again to Avner, who explained this behavior to me.)

To avoid this problem, the sample code completely recreates the custom properties part. Using this technique allows Visual Basic to factor out redundant namespace references and creates exactly the XML that Word expects:

customDoc = _ <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <Properties xmlns:vt= "https://schemas.openxmlformats.org/officeDocument /2006/docPropsVTypes"> <%= If(propertiesNode IsNot Nothing, _ propertiesNode.Elements, Nothing) %> <%= propertyNode %> </Properties>

This code doesn't need to reference the default namespace; because Visual Basic already has the information about the default namespace from the Imports statements, it can infer that information. For any specific namespaces that the code requires, you'll need to include the information in the XML literal, as you just saw for the vt namespace.

This XML content includes a replacement token for all the existing property elements and uses the new If function in Visual Basic. The If function, unlike the older IIf function, doesn't have the side effect of the original; instead of evaluating both expressions and determining which to return, the If function evaluates its parameters from left to right. This bypasses an error that would have occurred in this case using the IIf function (if propertiesNode was Nothing, the IIf function would still have attempted to retrieve its Elements property, causing an exception). After outputting the existing property elements (if they exist), the code outputs the new property node. This results in customDoc containing the full XML content of the custom properties part.

There is one final challenge left—saving the part back to file storage. The XDocument class and the SDK provide the solution here. You can call the Save method of the XDocument object, customDoc, which allows you to specify a stream into which to write the XML content. Conveniently, the CustomFilePropertiesPart class provides a GetStream method, returning the stream corresponding to the part's storage within the package:

customDoc.Save(New StreamWriter( _ customPropsPart.GetStream( _ FileMode.Create, FileAccess.Write)))

Wrapping It Up

Although this seems like a lot of code to satisfy a rather simple goal, once you remove the comments and the code that formats the parameters, the code that actually performs the work with the document is relatively short. Compare it to the code in the original snippet, and you'll have to agree that using the Microsoft SDK for 2007 Office system Open XML Formats and LINQ to XML in Visual Basic dramatically simplifies the process of working with 2007 Office system Open XML documents.

If you search the Web for LINQ to XML, you'll find many documents to help get over the hurdle of working with XML using LINQ. Although not specific to Visual Basic, the document ".NET Language-Integrated Query for XML Data" (msdn2.microsoft.com/bb308960) covers much of what you'll need to know. You'll also find the Visual Basic team's blog (blogs.msdn.com/vbteam) to be a good source. I'd love to have the bandwidth to go back and revise all 40 of the original code snippets to use these new technologies, but it's not likely I'll have time to do that. You can, however, take what you've learned here and use it to modify any code you have written that interacts with the 2007 Office system Open XML File Formats. You'll certainly reduce the complexity and the volume of code in doing so.

Send your questions and comments for Ken to basics@microsoft.com.

Ken Getz is a Senior Consultant with MCW Technologies, and a courseware author for AppDev (www.appdev.com). He is coauthor of ASP .NET Developers Jumpstart, Access Developer's Handbook, and VBA Developer's Handbook, 2nd Edition. Reach him at keng@mcwtech.com.