XML to SQL: Using SQLXML Bulkload in the .NET Framework

 

Amar Nalla
Microsoft Corporation

June 2004

Applies to:
   Microsoft Visual Studio .NET
   Microsoft SQL Server
   Microsoft .NET Framework
   Microsoft SQLXML Bulkload

Summary: Amar Nalla showcases the Bulkload functionality available in SQLXML, which can be used to transform XML input into SQL data by building a .NET Framework-based application that shreds RSS feed data into a database. (11 printed pages)

Click here to download the code sample for this article.

Contents

Introduction
The Problem: Storing RSS Feeds in SQL Server
A Brief Introduction to SQLXML Annotated Schemas
Using Bulkload from Managed Code
Loading the Database Using SQLXML Bulkload
Bulkload Object Model
Conclusion

Introduction

XML has become the industry-wide standard for exchanging data across business systems. At the same time, the use of a relational backend as the data-store is well established. As developers, we are sometimes faced with the task of exposing existing relational data as XML to share it across multiple systems, or we are required to take input XML and shred the data into a relational database. Microsoft SQL Server 2000 has excellent built in support to achieve both tasks. In addition to the support inside the server, SQL Server 2000 has XML support at the middle-tier that can be used to perform the above tasks. The middle-tier based solution is enabled by using SQLXML. SQLXML 3.0, which is the current released version, is an add-on product that can be used along with SQL Server to enable extensive XML features. For an overview of the XML features provided by SQL Server 2000 and SQLXML, read Andrew Conrad's Survey of SQL Server 2000's XML Features.

**Note   **This column is based on the SQL Server XML functionality that is available in SQLXML 3.0 SP2. It is the third major release in the SQLXML web release series. SQLXML periodically adds new XML functionality to SQL Server 2000, to keep pace with the fast changing world of XML and in response to customer requests. To download the latest Web release, or to find more information on the new features offered in the XML for SQL Server Web Releases, see the SQLXML Developer Center.

This article takes an in-depth look at the SQLXML Bulkload functionality, by building an application that takes an input XML stream and shreds the information into SQL Server relational tables. This article also tries to answer some of the common questions that arise when using Bulkload in an enterprise application. Specifically, the article cover the following:

  • Transforming a complex hierarchy of XML into a single table.
  • Using Bulkload in a .NET application.
  • Using Streams as input data mechanism for Bulkload in .NET.

The Problem: Storing RSS Feeds in SQL Server

In this article, I will build an application that takes Really Simple Syndication (RSS) feeds and stores relevant information into a relational database. RSS is an XML format used for syndicating news and similar content from news sources. There are two aspects to solving this problem: first, we need a mechanism to map the input XML into relational tables and columns. The XML format is predefined, and we do not want to map into the database all the information in the feed. Second, we want to directly stream the input to our backend. Both the tasks can be achieved by using a mapping schema and the Bulkload functionality of SQLXML.

This problem is representative of a typical requirement in business systems, where the input data is in the form of XML but the backend store is a relational database. The input XML could be from web services, other XML messages from external business partners, or from entities within the company.

A Brief Introduction to SQLXML Annotated Schemas

XSD schemas are often used to describe the content and structure of XML data, so it is therefore possible to write an XSD schema that describes the structure of the RSS feeds. For our application we will not build an XSD schema for generic RSS feeds, but instead focus on a subset of XML structure that is commonly used by most feeds. The feed I will be using for this article is based on the Yahoo! News RSS feed. We need to annotate the XSD with relational metadata that SQLXML uses to achieve mapping from the input XML into the relational backend. Let's look the annotated schema we will use for our application.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sql="urn:schemas-microsoft-com:mapping-schema">
  <xsd:element name="rss" sql:is-constant="1">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="channel" sql:is-constant="1">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="title"  type="xsd:string" sql:mapped="false" />
              <xsd:element name="link" type="xsd:string" sql:mapped="false"/>
              <xsd:element name="description" type="xsd:string" sql:mapped="false"/>
              <xsd:element name="language" type="xsd:string" sql:mapped="false"/>
              <xsd:element name="lastBuildDate" type="xsd:string" sql:mapped="false"/>
              <xsd:element name="ttl" type="xsd:int" sql:mapped="false"/>
              <xsd:element name="image" sql:mapped="false">                          
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="title"  type="xsd:string" />   
                    <xsd:element name="width"  type="xsd:int" />
                    <xsd:element name="height"  type="xsd:int" />      
                    <xsd:element name="link"  type="xsd:string" />   
                    <xsd:element name="url"  type="xsd:string" />
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
              <xsd:element name="item" sql:relation="NewsFeed">                          
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="title"  type="xsd:string"/>
                    <xsd:element name="link"  type="xsd:string"/>
                    <xsd:element name="guid" sql:field="guid">
                      <xsd:complexType>
                        <xsd:simpleContent>
                          <xsd:extension base="xsd:string">
                            <xsd:attribute name="isPermaLink"  type="xsd:string" sql:field="IsPermanent"/>
                          </xsd:extension>
                        </xsd:simpleContent>
                      </xsd:complexType>
                    </xsd:element>
                    <xsd:element name="pubDate"  type="xsd:string" sql:field="PublicationDate" />
                    <xsd:element name="description"  type="xsd:string"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>   
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="version" type="xsd:string" sql:mapped="false"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

The above XSD has the relational-based annotations that enable the mapping. All SQLXML specific mapping annotations belong to the xmlns:sql="urn:schemas-microsoft-com:mapping-schema" namespace. The most important annotations are the sql:relation annotation on the "item" element, and the various sql:field annotations on its sub-elements. The sql:relation annotation maps the item element to the "NewsFeed" table, and the sql:field annotation maps the child elements to the columns of the table. It is also possible to map the attributes of an element to the columns in the table. In fact, the sql:field annotation is optional, as the table-element mapping is scoped to all the child elements and attributes for that element. In the above XSD schema, only the isPermaLink and the pubDate elements have a sql:field annotation, as the names of the corresponding column names are different. For all the other element names the default mapping applies; they are automatically mapped to the columns with the same name.

On the other hand, the RSS feed has many other elements, like "channel," "ttl," and "image," that we do not want to map to the relational backend. There are two ways to achieve this in SQLXML using the sql:is-constant or the sql:mapped annotations.

The sql:is-constant annotation is primarily used to create wrapper elements that are not mapped to the database. In our case, the "rss" and the "channel" elements are wrapper elements in the RSS feed, so they are annotated with the sql:is-constant annotation. There are some things to keep in mind when using the sql:is-constant annotation: first, it can only be used on complex-type elements; and second, if there are attributes on the constant element, they cannot be mapped to columns in the database.

I use the sql:mapped annotation to explicitly mark that we do not want to map a particular element or attribute to any entities in the relational backend. This has been done in a couple of places in the schema. I explicitly annotate the attribute "version" with sql:mapped="false", since it is part of the constant element. I also annotate all the simple-type child elements of the "channel" element with sql:mapped="false". Another place where I use this annotation is on the "image" element. Note that "image" is a complex-type and, by doing so, I now prevent the mapping of that element, along with all its child elements and attributes. The sql:mapped annotation can be used on both complex-type and simple-type elements. When used on complex-type elements, all the child elements and attributes that are scoped to that complex-type also get the value for the sql:mapped annotation.

The advantage of using a mapping schema to map the input XML to the database is that the same mapping can be used to query the database and generate various XML views of the data. To achieve this you can use the XPath over views functionality available in SQLXML.

Using Bulkload from Managed Code

SQLXML Bulkload enables the loading of input XML into the relational backend. Internally, it uses the SQL Server bcp process, and is the best mechanism to efficiently upload large input XML into the server. It is implemented as a COM object, and it uses SQLOLEDB providers. This makes it convenient to use when programming through ActiveX Data Objects (ADO), or when using scripting languages like VBScript. In my application I use Bulkload in a .NET application. There are two mechanisms available for achieving this functionality:

  • Use tlbimp.exe to import the COM type definitions into a runtime assembly, and then add the generated Interop assembly to the project.
  • Directly add a reference to the Bulkload dll (xblkld3.dll) in a Visual Studio.NET project. This creates an Interop assembly that can be used with the project.

After performing either of the above steps, using Bulkload becomes as simple as this.

...
SQLXMLBULKLOADLib.SQLXMLBulkLoad3Class objBL = new SQLXMLBULKLOADLib.SQLXMLBulkLoad3Class();
objBL.ConnectionString = "Provider=sqloledb;server=server;database=databaseName;integrated security=SSPI";
objBL.ErrorLogFile = "error.xml";
objBL.Execute ("schema.xml","data.xml");
...

Loading the Database Using SQLXML Bulkload

Now that I have shown the mapping process and given a brief idea of how Bulkload works, let's try to put the above gained knowledge to use. In the above code snippet, the input XML is read from a file on the local disk. For our application, the XML for the RSS feed is available via a URL. It is usually best to process it as a stream, since it might not be feasible to store the input XML first as a file and then load it later. The Execute method of Bulkload has 2 overloads; one of which takes a stream as input source for the data, while the other takes a data file. As it often happens, however, things are not as simple as they sound. In our case, we cannot simply use one of the managed Stream classes that derive from the System.IO.Stream namespace. Since Bulkload is a COM object, it expects the stream to be a native IStream interface, but none of the stream classes that derive from the System.IO.Stream class in .NET implement IStream. This causes our application to be slightly more involved, as we now have to be able to translate the input XML into a stream that is recognized by Bulkload.

I will show two alternate solutions for the above problem. For both the solutions, I use the UCOMIStream interface, which provides the Managed definition of the IStream Interface. In the first solution, I use a stream wrapper object that implements the UCOMIStream interface. Here is a sample implementation of the interface.

internal class UCOMStreamWrapper : UCOMIStream
{
Stream innerStream;
   String _file;

   public UCOMStreamWrapper(Stream inner, string file)
   {
      innerStream = inner;
      _file       = file;
   }
   public UCOMStreamWrapper(Stream inner)
   {
      innerStream = inner;
      _file = null;
   }
    
   public void Read(byte[] pv, int cb, IntPtr pcbRead)
   {
      Marshal.WriteInt32(pcbRead, innerStream.Read(pv, 0, cb));
   }

   public void Stat(out STATSTG pstatstg, int grfStatFlag)
   {
      DateTime curTime;
      long     curFileTime;

      pstatstg = new STATSTG();
      if (_file == null)
      {
         curTime = DateTime.Now;
      }
      else
      {
         if (grfStatFlag == 0)   // default
         {    
            pstatstg.pwcsName=_file;
         }
         else                    // noname
         {
            pstatstg.pwcsName=null;
         }
         curTime = File.GetLastWriteTime(_file);
      }
      curFileTime = curTime.ToFileTime();
      pstatstg.cbSize = innerStream.Length;
      pstatstg.type = 2;                  // STGM_READ
      pstatstg.grfMode = 0;               // STGM_READ
      pstatstg.grfLocksSupported = 2;     // LOC_EXCLUSIVE
      pstatstg.mtime.dwHighDateTime = (int) (curFileTime >> 32);
      pstatstg.mtime.dwLowDateTime  = (int) (curFileTime & 0xffffffff);
      pstatstg.atime = pstatstg.mtime;      
   }

   public void Clone(out UCOMIStream ppstm)
   {
      throw new NotImplementedException("UCOMIStream: Clone");
   }
   public void Commit(int grfCommitFlags)
   {
      throw new NotImplementedException("UCOMIStream: Commit");
   }
   public void CopyTo(UCOMIStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten)
   {
      throw new NotImplementedException("UCOMIStream: CopyTo");
   }
   public void LockRegion(long libOffset, long cb, int dwLockType)
   {
      throw new NotImplementedException("UCOMIStream: LockRegion");
   }
   public void Revert()
   {
      throw new NotImplementedException("UCOMIStream: Revert");
   }

   public void Seek(long dlibMove, int dwOrigin, IntPtr plibNewPosition)
   {
      long newPosition = 0;
      if (innerStream.CanSeek == true)
      {
         SeekOrigin origin = SeekOrigin.Begin;
         switch (dwOrigin)
         {
            case 1:
               origin = SeekOrigin.Current;
               break;
            case 2:
               origin = SeekOrigin.End;
               break;
         }
         newPosition = innerStream.Seek(dlibMove, origin);
      }
      if (plibNewPosition != (IntPtr)0)
      {
         Marshal.WriteInt64(plibNewPosition, newPosition);
      }         
   }

   public void SetSize(long libNewSize)
   {
      throw new NotImplementedException("UCOMIStream: SetSize");   
   }
   public void UnlockRegion(long libOffset, long cb, int dwLockType)
   {
      throw new NotImplementedException("UCOMIStream: UnlockRegion");
   }
   private void Write(byte[] buf, int offset, int len)
   {
      innerStream.Write(buf, offset, len);
   }
   public void Write(byte[] pv, int cb, IntPtr pcbWritten)
   {
      Write(pv, 0, cb);
      if (pcbWritten != (IntPtr)0)
      {
         Marshal.WriteInt32(pcbWritten, cb);
      }
   }      
}

As you can see from the above code snippet, this is not a full implementation of the UCOMIStream interface. Bulkload only needs the following methods to be implemented—Read, Write, Stat, and Seek. You can pass the above UCOMStreamWrapper class directly to the Execute method in Bulkload. The above solution is an efficient way to pass streaming data to Bulkload in a .NET application.

An alternate solution that does not require us to implement the UCOMIStream interface is shown below.

public const int MAXLENGTH = 1024;      
[DllImport("OLE32.DLL", EntryPoint="CreateStreamOnHGlobal")]
extern public static int CreateStreamOnHGlobal( int hGlobalMemHandle, bool fDeleteOnRelease,
                                                                      out UCOMIStream pOutStm);
[STAThread]
static void Main(string[] args)
{         
   UCOMIStream dataStream;         
   byte[] dataBytes = new byte[MAXLENGTH];
   WebRequest request = WebRequest.Create("http://rss.news.yahoo.com/rss/topstories");      
   WebResponse response = request.GetResponse();
   Stream responseStream = response.GetResponseStream();         
   CreateStreamOnHGlobal( 0, true, out dataStream );         
   int read = 0;
   //Read the response stream and write to the dataStream that can be sent to Bulkload
   while ((read = responseStream.Read(dataBytes, 0, MAXLENGTH)) > 0)
   {                     
      dataStream.Write(dataBytes, read, IntPtr.Zero);         
   }
   dataStream.Seek(0, 0, System.IntPtr.Zero);   
   //This creates the Bulkload object using Interop
   SQLXMLBULKLOADLib.SQLXMLBulkLoad3Class objBL = new SQLXMLBULKLOADLib.SQLXMLBulkLoad3Class();         
   //Location of the mapping schema
   string schemaName = "RSSFeedSchema.xml";
   objBL.ConnectionString = 
         "provider=SQLOLEDB;data source=localhost;database=SQLXMLMSDN;integrated security=SSPI";
   objBL.ErrorLogFile = "error.log";
   objBL.KeepIdentity = false;
   //Execute function performs the actual Bulkload
   objBL.Execute(schemaName, dataStream);
}

The first step is to use the Platform Invoke (PInvoke) functionality, which is available in the .NET Framework by using the DllImport attribute, to get access to the CreateStreamOnHGlobal function that is defined in OLE32.DLL. In the above code, I use the WebRequest and WebResponse classes, which are available in the System.Net namespace, to read the RSS stream. The next step is to create dataStream, which is an instance of the UCOMIStream. I pipe the response stream obtained from the URL into dataStream, and pass this stream to the Execute method in Bulkload.

The drawback of the above solution is that it not an efficient way to pass streams to Bulkload. The problem lies in the use of the CreateStreamOnHGlobal function, which caches the data in memory and imposes a performance hit.

In the downloadable code, I have provided a sample of each of the above solutions in separate projects.

It is important to note that if you are using ADO or a script language, you can directly pass the ADODB.Stream to Bulkload. The SQLXML documentation contains a sample of passing a stream in VBScript.

Bulkload Object Model

The Bulkload object model has many properties that provide additional flexibility and functionality. The ones I am using are the ErrorLogFile and KeepIdentity properties. If the ErrorLogFile property is specified, Bulkload will log any error and warning messages that might occur during the process in that file. The KeepIdentity property is used to define the behavior for identity columns. In our application the ID column is an identity column, and by setting KeepIdentity to false, Bulkload uses the identity values that are generated by SQL Server. On the other hand, if you want identity values to be used from the source data, KeepIdentity needs to be set to true, which is the default value.

Another interesting property that I am not using in my application, but which is very useful, is the SchemaGen property. This property can be used to generate backend tables in the database before the actual loading process. Bulkload uses the element, and the attribute names and type information from the XSD schema, to generate the appropriate tables and columns. SchemaGen does not create constraints for the tables, but if you set the SGUseID property to true, and the primary key is defined to have an XSD type of ID, it uses this information to generate the primary key. It also uses sql:relationship annotation in the mapping, to generate foreign key constraints. More information about all the properties used by Bulkload is available in the documentation that comes with the installation of SQLXML.

Conclusion

In this article I demonstrate the use of the Bulkload functionality, which is available in SQLXML, to populate a database when the input is in the form of XML. The relational data can be easily queried later on from the database by using technologies like ADO.NET, as illustrated by the sample application available in the download. Bulkload can be used both via classical ADO and script-based mechanisms. It can also be used in .NET through COM Interop. I also show the flexibility of using a mapping schema to transform a complex hierarchy of XML into a simple database schema.