
Merging XML Files with XStreamingElement

Lee Humphries, 3 May 2010
Got a few Gigs of XML data you need to combine? Here's one way that won't blow your RAM.

Introduction

Let's face it: sometimes the amount of XML you need to work with just gets a little out of hand. I have around 3 GB of files a day that I need to merge and then process, and pulling all of that into memory just to merge it seems a bit much. Streaming must be the answer, and surely someone has done this before? Well, not that I could find, so here's what I came up with: a merge of multiple files built on XStreamingElement.

For my processing, I've assumed the following about the XML files I'm working with:

  • The files have various elements that nominally make up a header, a body, and a trailer (or footer).
  • Only one part of the above 'structure' for each file will be merged, i.e., the 'body'.
  • The 'body' is contained inside a single element in each file.
  • I don't need (or want) to know what elements are in the 'body'. In fact, I want to be able to merge the files with a minimum of knowledge about what is in them.
  • I won't be doing any reordering of elements. Each file gets processed in turn, and the contents of each file likewise get processed in turn.

Now all of the above may be different for you, but hopefully, you can take this code and modify it to suit your purpose.

Using the code

The project comes with some sample XML files to merge. Once you've downloaded the project, adjust the paths in the project files and within the program itself to match where you put it. You should then be able to compile and run and see the three test files merged into one.

When I first tried doing this, there was a particular point that I hadn't yet grasped, so to save you all some wasted time, here it is:

Normally, an XStreamingElement is 'built' by using an extension method that streams the source XML a node at a time using an XmlReader. This works well for filtering out nodes or reshaping existing nodes, but not for inserting nodes into an 'existing' node, which is what a merge needs. The node handed back by yield return is, by definition, not built yet (you've effectively only got a pointer to it), so there is nothing to insert into. Instead, you have to build a new version of that 'existing' node and put all of the original content plus the new stuff into it.
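For reference, the usual 'streaming axis' pattern can be sketched roughly as below. This is a minimal illustration rather than code from the download: StreamElements and Rewrap are hypothetical names, and the source is a TextReader instead of a file so the example stays self-contained.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingAxisSketch
{
    // Hypothetical helper: lazily yields each element called 'elementName',
    // so only one such element is held in memory at a time.
    public static IEnumerable<XElement> StreamElements(TextReader source, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(source))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the
                    // reader on the node that follows it, so no Read() here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Wraps the streamed elements in a new root. Nothing is read from the
    // source until the XStreamingElement is serialised.
    public static string Rewrap(string xml, string elementName, string newRootName)
    {
        XStreamingElement output = new XStreamingElement(
            newRootName, StreamElements(new StringReader(xml), elementName));
        return output.ToString(SaveOptions.DisableFormatting);
    }
}
```

Note that the filtered elements can only be wrapped in a brand new parent. There is no point at which the original parent exists as an object that more content could be added to.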

And here's how I did it:

// Get 'collections' of the attributes and elements we want to combine together
IEnumerable<XAttribute> rootAttr = FileMergeAttributeStreamAxis(localFiles.First(), rootName);
IEnumerable<XElement> headerElem = 
            FileHeaderStreamAxis(localFiles.First(), rootName, mergeName);
IEnumerable<XAttribute> mergeAttr = 
            FileMergeAttributeStreamAxis(localFiles.First(), mergeName);
IEnumerable<XElement> mergeElem = FileMergeElementStreamAxis(localFiles, mergeName);
IEnumerable<XElement> trailerElem = FileTrailerStreamAxis(localFiles.Last(), mergeName);

// Now piece them all together in our new XStreamingElement
// - note the internal XStreamingElement
XStreamingElement mergeElement = new XStreamingElement(rootName, rootAttr, headerElem,
     new XStreamingElement(mergeName, mergeAttr, mergeElem),
     trailerElem);

// Write it all to disk
mergeElement.Save(folder + mergeFileName);

This builds up the new 'merged' file out of the pieces of the source files, as follows:

  • root element (new)
    • attributes for root element (taken from first file)
    • 'header' elements (taken from first file), anything we find before the element we're merging
    • merged element (new)
      • attributes for merged element (taken from first file)
      • contents of merged element (taken from all files)
    • 'trailer' elements (taken from the last file), anything we find after the element we're merging

You'll also notice the use of multiple extension methods, each one with a different responsibility. Initially, I tried to get one extension method to do everything, but it just wouldn't work (because you can't insert into an already existing element). However, using multiple calls to multiple extension methods works just fine.
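As an illustration of one of those single-purpose methods, a minimal attribute reader could look something like the sketch below. StreamAttributes is a hypothetical stand-in and not the FileMergeAttributeStreamAxis from the download; it takes a TextReader rather than a file name so that it can be tried without any files on disk.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class AttributeAxisSketch
{
    // Hypothetical helper: yields the attributes of the first element
    // called 'elementName', then stops reading.
    public static IEnumerable<XAttribute> StreamAttributes(TextReader source, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != elementName)
                    continue;

                if (reader.MoveToFirstAttribute())
                {
                    do
                    {
                        // Unprefixed attributes (including a default xmlns
                        // declaration) belong to no namespace in LINQ to XML.
                        XNamespace ns = reader.Prefix.Length == 0
                            ? XNamespace.None
                            : XNamespace.Get(reader.NamespaceURI);
                        yield return new XAttribute(ns + reader.LocalName, reader.Value);
                    } while (reader.MoveToNextAttribute());
                }

                // Only the first matching element is of interest.
                yield break;
            }
        }
    }
}
```

Because the result is a lazy sequence, it can be passed straight into the XStreamingElement constructor alongside the element sequences.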

If your needs in terms of merging documents are different, then the statement above that constructs the XStreamingElement is where you should start. For instance, suppose you want the 'header' and 'trailer' to come from the first and last files as before, but the 'body' of each file is not contained within a single element. Instead, it's just every element after some element in the 'header' and before some element in the 'trailer'. In that case, you may need something like:

IEnumerable<XAttribute> rootAttr = FileMergeAttributeStreamAxis(localFiles.First(), rootName);
IEnumerable<XElement> headerElem = 
            FileHeaderStreamAxis(localFiles.First(), rootName, mergeName);
IEnumerable<XElement> mergeElem = FileRangeElementStreamAxis(localFiles, mergeName);
IEnumerable<XElement> trailerElem = FileTrailerStreamAxis(localFiles.Last(), mergeName);

XStreamingElement mergeElement = new XStreamingElement(rootName, rootAttr, 
                                     headerElem, mergeElem, trailerElem);

And you'd also need to create FileRangeElementStreamAxis to pull together every element after the last of the header elements and before the first of the trailer elements.
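One possible shape for such a method is sketched below. It is only a starting point built on assumptions of my own: the signature is hypothetical and takes two names (the last header element and the first trailer element) instead of the single mergeName used in the call above, and it assumes the header, body, and trailer elements are all direct children of the root.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class RangeAxisSketch
{
    // Hypothetical sketch: for each file, yields every child of the root that
    // comes after 'lastHeaderName' and before 'firstTrailerName'.
    public static IEnumerable<XElement> FileRangeElementStreamAxis(
        IEnumerable<string> mergeFiles, string lastHeaderName, string firstTrailerName)
    {
        foreach (string mergeFile in mergeFiles)
        {
            using (XmlReader reader = XmlReader.Create(mergeFile))
            {
                bool inBody = false;
                reader.MoveToContent();   // root element, depth 0
                reader.Read();            // step inside the root

                while (!reader.EOF)
                {
                    if (reader.NodeType != XmlNodeType.Element || reader.Depth != 1)
                    {
                        reader.Read();    // whitespace, comments, end tags
                    }
                    else if (reader.Name == firstTrailerName)
                    {
                        break;            // trailer reached, done with this file
                    }
                    else if (inBody)
                    {
                        // ReadFrom advances the reader past the element
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        // Still in the header: the body starts after this
                        // element if it is the last header element.
                        inBody = reader.Name == lastHeaderName;
                        reader.Skip();
                    }
                }
            }
        }
    }
}
```

If a file has no header at all, or the header element names vary between files, the inBody test would need to change to suit.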

Here's the extension method that does the actual 'merging', plus a further method that makes getting the contents of the element to be merged easier.

/// <summary>
/// Read through each of the files until the node with
/// 'mergeElementName' is found, then read all of its element nodes
/// </summary>
/// <param name="mergeFiles">Names of files to read through</param>
/// <param name="mergeElementName">Name of node from which to obtain the elements</param>
/// <returns>All element nodes found in the node named
///             'mergeElementName' within each file</returns>
private static IEnumerable<XElement> 
        FileMergeElementStreamAxis(IEnumerable<string> mergeFiles, string mergeElementName)
{
  foreach (string mergeFile in mergeFiles)
  {
    using (XmlReader reader = XmlReader.Create(mergeFile))
    {
      XmlReader subReader = FileMergeElementReader(reader, mergeElementName);

      if (subReader != null)
      {
        do
        {
          // Test if this is an element and it is not the merge element
          // - if not the merge element then by definition we will be 'in' the merge element
          if (subReader.NodeType == XmlNodeType.Element && subReader.Name != mergeElementName)
          {
            XElement el = XElement.ReadFrom(subReader) as XElement;
            if (el != null)
              yield return el;
          }
          else
            subReader.Read();
        } while (!subReader.EOF);

        subReader.Close();
      }
      reader.Close();
    }
  }
}

/// <summary>
/// This method returns a subtree reader positioned
/// on the contents of the mergeElementName element.
/// It is the responsibility of the calling method to close both the reader passed in
/// and the reader that is returned by this method.
/// </summary>
/// <param name="reader">The reader we will use to try
///            and find the 'mergeElementName' node</param>
/// <param name="mergeElementName">Name of node
///            for which we want to read its subtree</param>
/// <returns>A reader positioned on the subtree identified
///            by 'mergeElementName' or null</returns>
private static XmlReader FileMergeElementReader(XmlReader reader, 
                         string mergeElementName)
{
  XmlReader subReader = null;

  if (reader == null || string.IsNullOrEmpty(mergeElementName))
    return subReader;

  do
  {
    if (reader.NodeType == XmlNodeType.Element && reader.Name == mergeElementName)
    {
      subReader = reader.ReadSubtree();
      break;
    }
    else
      reader.Read();
  } while (!reader.EOF);

  return subReader;
}

This particular extension method (plus its helper) is actually just three loops: a foreach to process each file in turn, a do-while loop in the helper method to get to the element that we'll be merging, and finally, a do-while loop back in the main method to actually do the merge.

The trick here is to use the ReadSubtree method of the XmlReader. This gets us the entire element to be merged as something independent from the main XmlReader. Once we're finished with the subtree reader and have closed it, the main XmlReader is left conveniently sitting on the end of the element we're merging, and in our case, that means job done.
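To see that positioning for yourself, a small check along the following lines should do. It is a minimal sketch with hypothetical names (ReadSubtreeSketch, DescribePositionAfterSubtree) and works on an in-memory string rather than a file.

```csharp
using System.IO;
using System.Xml;

public static class ReadSubtreeSketch
{
    // Hypothetical helper: reads the subtree of the first 'elementName'
    // element, closes the subtree reader, and reports where the main
    // reader has been left.
    public static string DescribePositionAfterSubtree(string xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing(elementName))
                return "not found";

            using (XmlReader subReader = reader.ReadSubtree())
            {
                // Consume the subtree; a real merge would yield elements here.
                while (subReader.Read()) { }
            }

            // The subtree reader is closed, so the main reader can be used again.
            return reader.NodeType + ":" + reader.Name;
        }
    }
}
```

For the input `<doc><head/><body><x/></body><tail/></doc>` and the element name "body", this returns "EndElement:body", so the next Read() on the main reader moves on to whatever follows the merged element.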

History

  • 16-April-2010 - First cut.
  • 04-May-2010 - *New* and *Improved* code - works much better with less weird bugs.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Lee Humphries
Founder md8n
Australia
If it ain't broke - that can be arranged.

Last Updated 3 May 2010
Article Copyright 2010 by Lee Humphries