Efficient Techniques for Modifying Large XML Files

 

By Dare Obasanjo
Microsoft Corporation

April 2004

Summary: Dare Obasanjo shows two techniques for efficiently updating or modifying large XML files such as log files and database dumps.

Contents

Introduction
Using XML Inclusion Techniques
Chaining an XmlReader to an XmlWriter
Acknowledgements

Introduction

As XML has become popular as a representation format for large sources of information, developers increasingly run into problems when editing large XML files. This is especially true for applications that process large log files and constantly append information to them. The most straightforward way to edit an XML file is to load it into an XmlDocument, modify the document in memory, and then save it back to disk. However, doing so means that the entire XML document has to be loaded into memory, which may be infeasible depending on the size of the document and the memory requirements of the application.
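For comparison, the straightforward in-memory approach described above can be sketched as follows. This is only an illustration: the class name InMemoryAppend and the inline sample document are assumptions made for the example, and a real application would call doc.Load() and doc.Save() on a file instead.

```csharp
using System;
using System.Xml;

public class InMemoryAppend {
  public static void Main(string[] args) {
    // Assumption: a tiny inline document stands in for a large log file on disk
    XmlDocument doc = new XmlDocument();
    doc.LoadXml("<logfile><event><ip>127.0.0.1</ip></event></logfile>");

    // Build a new <event> element and append it to the document element
    XmlElement evt = doc.CreateElement("event");
    XmlElement ip = doc.CreateElement("ip");
    ip.InnerText = "192.168.0.1";
    evt.AppendChild(ip);
    doc.DocumentElement.AppendChild(evt);

    // The whole tree lives in memory until it is serialized again
    Console.WriteLine(doc.OuterXml);
  }
}
```

Every node of the document is materialized before the single new element can be added, which is the cost the techniques in this article try to avoid.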

This article shows some alternative approaches to modifying an XML document that don't involve loading it into an XmlDocument instance.

Using XML Inclusion Techniques

The first approach is most useful for appending entries to an XML log file. Developers often need to append new entries to a log file without loading the whole document. Because of XML's well-formedness rules, this is hard to do with a plain file append: a well-formed document has exactly one root element, so anything written after the root element's end tag makes the log file malformed.

The technique involves two files. The first is a well-formed XML file, while the second is an XML fragment. The well-formed XML file includes the XML fragment using either an external entity declared in a DTD or an xi:include element. This way the file containing the XML fragment can be updated efficiently by simply appending to it, while processing is done using the including file. Examples of an including file and the included file are shown below:

Logfile.xml: 
<?xml version="1.0"?>
<!DOCTYPE logfile [
<!ENTITY events SYSTEM "logfile-entries.txt">
]>
<logfile>
&events;
</logfile>

Logfile-entries.txt:
<event>
 <ip>127.0.0.1</ip>
 <http_method>GET</http_method>
 <file>index.html</file>
 <date>2004-04-01T17:35:20.0656808-08:00</date>
</event>
<event>
 <ip>127.0.0.1</ip>
 <http_method>GET</http_method>
 <file>stylesheet.css</file>
 <date>2004-04-01T17:35:23.0656120-08:00</date>
 <referrer>http://www.example.com/index.html</referrer>
</event>
<event>
 <ip>127.0.0.1</ip>
 <http_method>GET</http_method>
 <file>logo.gif</file>
 <date>2004-04-01T17:35:25.238220-08:00</date>
 <referrer>http://www.example.com/index.html</referrer>
</event>

The logfile-entries.txt file contains an XML fragment and can be updated efficiently using typical file IO methods. The following code shows how an entry can be added to the XML log file by appending it to the end of the text file:

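using System;
using System.IO;
using System.Xml;

public class Test{
  public static void Main(string[] args){

    // Open the fragment file for appending; existing entries are never read
    StreamWriter sw = File.AppendText("logfile-entries.txt");
    XmlTextWriter xtw = new XmlTextWriter(sw);

    xtw.WriteStartElement("event");
    xtw.WriteElementString("ip", "192.168.0.1");
    xtw.WriteElementString("http_method", "POST");
    xtw.WriteElementString("file", "comments.aspx");
    xtw.WriteElementString("date", "1999-05-05T19:25:13.238220-08:00");
    // Close <event> explicitly rather than relying on Close() to do it
    xtw.WriteEndElement();

    // Closing the XmlTextWriter also closes the underlying StreamWriter
    xtw.Close();

  }
}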

Once the entries have been appended to the text file, they can be processed from the XML log file using traditional XML processing techniques. The following code uses XPath to iterate over the log events in logfile.xml, listing the files that were accessed and when they were accessed.

using System;
using System.Xml; 

public class Test2{
 
  public static void Main(string[] args){

    // The XmlValidatingReader is used here only so that the &events;
    // entity reference is expanded; no validation is performed.
    XmlValidatingReader vr =
      new XmlValidatingReader(new XmlTextReader("logfile.xml"));
    vr.ValidationType = ValidationType.None;
    vr.EntityHandling = EntityHandling.ExpandEntities;

    XmlDocument doc = new XmlDocument();
    doc.Load(vr);
    vr.Close();

    foreach(XmlElement element in doc.SelectNodes("//event")){

      // Look the children up by name rather than by position
      string file = element["file"].InnerText;
      string date = element["date"].InnerText;

      Console.WriteLine("{0} accessed at {1}", file, date);

    }
  }
} 

The code above results in the following output:

index.html accessed at 2004-04-01T17:35:20.0656808-08:00
stylesheet.css accessed at 2004-04-01T17:35:23.0656120-08:00
logo.gif accessed at 2004-04-01T17:35:25.238220-08:00
comments.aspx accessed at 1999-05-05T19:25:13.238220-08:00

Chaining an XmlReader to an XmlWriter

In certain cases one may want to perform more sophisticated manipulation of an XML file than merely appending elements to the root element. For example, one may want to filter out every entry in a log file that doesn't meet some particular criteria before archiving the log file. One approach would be to load the XML file into an XmlDocument and then select the events of interest via XPath. However, doing so involves loading the entire document into memory, which may be prohibitive if the document is large. Another option is to use XSLT for such tasks, but this suffers from the same problem as the XmlDocument approach since the entire XML document has to be in memory. Also, for developers unfamiliar with XSLT, there is a steep learning curve in figuring out how to use template matches properly.
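To make the trade-off concrete, the XmlDocument and XPath approach described above might look like the following minimal sketch. The class name XPathFilter and the inline sample data are assumptions made for illustration; the point is that the whole document must be loaded before any event can be selected.

```csharp
using System;
using System.Xml;

public class XPathFilter {
  public static void Main(string[] args) {
    // Assumption: a small inline document stands in for a large log file
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(
      "<logfile>" +
      "<event><ip>127.0.0.1</ip><file>index.html</file></event>" +
      "<event><ip>192.168.0.1</ip><file>comments.aspx</file></event>" +
      "</logfile>");

    // Select only the events that did not originate from localhost
    foreach (XmlElement evt in
             doc.SelectNodes("/logfile/event[ip != '127.0.0.1']")) {
      Console.WriteLine(evt["file"].InnerText);
    }
  }
}
```

With this sample data only comments.aspx is printed. The query itself is short, but its memory use grows with the size of the log.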

One approach to solving the problem of how to process a very large XML document is to read in the XML using an XmlReader and write it out as it is being read with an XmlWriter. With this approach the entire document is never in memory at once and more granular changes can be made to the XML than simply appending elements. The following code sample reads in the XML document from the previous section and saves it as an archive file after filtering out all the events whose ip element has the value "127.0.0.1".

using System;
using System.Xml; 
using System.IO;
using System.Text;

// Named Test3 because the previous sample already defines a class named Test2
public class Test3{
  static string ipKey;
  static string httpMethodKey;
  static string fileKey; 
  static string dateKey;
  static string referrerKey; 

  // Copies every attribute of the current element from the reader to the writer
  public static void WriteAttributes(XmlReader reader, XmlWriter writer){

    if(reader.MoveToFirstAttribute()){
      do{
        writer.WriteAttributeString(reader.Prefix,
                                    reader.LocalName,
                                    reader.NamespaceURI,
                                    reader.Value);
      }while(reader.MoveToNextAttribute());
      reader.MoveToElement(); // return to the element that owns the attributes
    }
  }

  public static void WriteEvent(XmlWriter writer, string ip,
                                 string httpMethod, string file,
                                 string date, string referrer){
    
    writer.WriteStartElement("event"); 
    writer.WriteElementString("ip", ip);
    writer.WriteElementString("http_method", httpMethod);
    writer.WriteElementString("file", file);
    writer.WriteElementString("date", date);    
    if(referrer != null) writer.WriteElementString("referrer", referrer);
    writer.WriteEndElement(); 

  } 

  public static void ReadEvent(XmlReader reader, out string ip,
                              out string httpMethod, out string file,
                              out string date, out string referrer){

    ip = httpMethod = file = date = referrer = null; 

    // Loop until the end tag of the current <event> element is reached
    while( reader.Read() && reader.NodeType != XmlNodeType.EndElement){

      if (reader.NodeType == XmlNodeType.Element) {

        // ReadString() leaves the reader on the child's end tag, so the
        // Read() in the loop condition then advances to the next child.
        if(reader.Name == ipKey){
          ip = reader.ReadString();
        }else if(reader.Name == httpMethodKey){
          httpMethod = reader.ReadString();
        }else if(reader.Name == fileKey){
          file = reader.ReadString();
        }else if(reader.Name == dateKey){
          date = reader.ReadString();
        }else if(reader.Name == referrerKey){
          referrer = reader.ReadString();
        }
      }//if
    }//while
  }

  public static void Main(string[] args){
    string ip, httpMethod, file, date, referrer; 
    //setup XmlNameTable with strings we'll be using for comparisons
    XmlNameTable xnt = new NameTable(); 
    ipKey            = xnt.Add("ip"); 
    httpMethodKey    = xnt.Add("http_method"); 
    fileKey          = xnt.Add("file");
    dateKey          = xnt.Add("date");
    referrerKey      = xnt.Add("referrer");
    
    //load XmlTextReader using XmlNameTable above 
    XmlTextReader xr = new XmlTextReader("logfile.xml", xnt);
    xr.WhitespaceHandling = WhitespaceHandling.Significant;

    XmlValidatingReader vr = new XmlValidatingReader(xr);
    vr.ValidationType = ValidationType.None;
    vr.EntityHandling = EntityHandling.ExpandEntities; 


    StreamWriter sw =  
      new StreamWriter ("logfile-archive.xml", false, Encoding.UTF8 ); 
    XmlWriter xw    = new XmlTextWriter (sw);                 
    
    vr.MoveToContent(); // Move to document element   
    xw.WriteStartElement(vr.Prefix, vr.LocalName, vr.NamespaceURI);
    WriteAttributes(vr, xw);    
     
    vr.Read(); // Move to first <event> child of document element
    // Write out each event that isn't from 127.0.0.1 (localhost)
    do
    {
      ReadEvent(vr, out ip, out httpMethod, 
               out file, out date, out referrer);
      if(ip != "127.0.0.1"){ // != is safe even if the event had no <ip> child
        WriteEvent(xw,ip, httpMethod, file, date, referrer); 
      }
      vr.Read(); //move to next <event> element or end tag of <logfile>
    } while(vr.NodeType == XmlNodeType.Element);
     
    xw.WriteEndElement(); // close the document element explicitly
    Console.WriteLine("Done");

    vr.Close();
    xw.Close();
  }
}

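The code sample above writes the following content to the logfile-archive.xml file (indented here for readability; XmlTextWriter does not indent its output unless its Formatting property is set to Formatting.Indented):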

<logfile>
  <event>
    <ip>192.168.0.1</ip>
    <http_method>POST</http_method>
    <file>comments.aspx</file>
    <date>1999-05-05T19:25:13.238220-08:00</date>
  </event>
</logfile>

Besides chaining an XmlReader to an XmlWriter, the one interesting point in the above code is its use of a NameTable to improve the performance of text comparisons when checking the tag names of elements in the ReadEvent() method. Note that in C# the == operator applied to two string operands is a value comparison that returns immediately when both operands are the same object; to compare references only, cast both operands to object. The benefits of using this approach to checking the tag names of elements in the XmlReader are outlined in the MSDN documentation topic Object Comparison Using XmlNameTable with XmlReader.
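The reference comparison that a NameTable makes possible can be illustrated with the following minimal sketch. The class name NameTableSketch and the inline XML are assumptions made for this example; it relies on the reader returning element names as atomized strings from the NameTable it was constructed with.

```csharp
using System;
using System.IO;
using System.Xml;

public class NameTableSketch {
  public static void Main(string[] args) {
    NameTable nt = new NameTable();
    // Add() returns the atomized instance of the string held by the table
    object ipKey = nt.Add("ip");

    // Assumption: a one-event inline document stands in for the log file
    XmlTextReader reader = new XmlTextReader(
      new StringReader("<event><ip>192.168.0.1</ip></event>"), nt);

    while (reader.Read()) {
      if (reader.NodeType == XmlNodeType.Element) {
        // Both operands are typed as object, so == compares references
        // instead of comparing the strings character by character
        bool isIp = ((object)reader.LocalName == ipKey);
        Console.WriteLine("{0} matches the ip key: {1}",
                          reader.LocalName, isIp);
      }
    }
    reader.Close();
  }
}
```

Under that assumption the comparison should succeed only for the ip element, without examining the characters of either string.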

Acknowledgements

Thanks go to Martin Gudgin who inspired me to write this article by suggesting chaining an XmlReader to an XmlWriter as an answer to a question on editing large XML log files.