XML Document Splitter Class

Detlef Grohs

1 Jul 2003

An article demonstrating how to use the XmlTextReader class to split large XML files.

Introduction

This article demonstrates how to use the XmlTextReader and XmlTextWriter classes to split XML streams. The included XMLDocumentSplitter class reads an XML stream from a TextReader passed in by the caller and invokes an XMLDocumentSplitHandler delegate for each portion of the stream. Parameters on the XMLDocumentSplitter class determine whether the stream is split by element count or by size. A VB.NET Windows project, Splitter, is also provided; it shows how to use the XMLDocumentSplitter class and offers an easy way to split XML files.

Background

This class was developed to help solve a problem with large document processing in BizTalk 2002. BizTalk 2002 cannot process documents larger than 4 megabytes, and some of the XML documents that we needed to process could exceed this limit. The BizTalk 2002 .NET Toolkit offered a solution in the form of a custom pipeline that receives the large document and splits it into smaller documents to be submitted to BizTalk. The example provided used the XmlTextReader class to read the XML elements, but it built the XML document sections that were submitted back to BizTalk by hand. I thought that this manual creation of the output XML nodes was unnecessarily complicated, since the text for each element can be retrieved with the ReadOuterXml method. I decided to rewrite the code as a separate class so that I could use it in several other solutions that I was working on. Later, I started looking at the XmlTextWriter class in the .NET Framework and decided that some of the processing I was doing with attributes, and with the nodes before the first element, could be handled better by the Write* methods it provides. I rewrote the Split method to use the XmlTextWriter class to generate the output XML documents.

Using the code

The following code shows how the XMLDocumentSplitter class is used within the Splitter test project. By default, the XMLDocumentSplitter class is set to SplitTypes.ByDocumentSize with a size of 512,000 (characters, since the Split method compares it against the length of the generated text). The btnSplit_Click method sets SplitType and SplitSize from the radio buttons and text fields in the dialog. It then calls the Split method with a StreamReader opened on the source document and with the SplitHandler method passed in as the delegate.

Private m_XMLSplitter As New XMLDocumentSplitter()

Private Sub btnSplit_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnSplit.Click
    ' Pick the split mode and threshold from the dialog controls
    If rbSplitByDocumentSize.Checked Then
        m_XMLSplitter.SplitType = _
            XMLDocumentSplitter.SplitTypes.ByDocumentSize
        m_XMLSplitter.SplitSize = CType(txtDocumentSize.Text, Long)
    Else
        m_XMLSplitter.SplitType = _
            XMLDocumentSplitter.SplitTypes.ByElementCount
        m_XMLSplitter.SplitSize = CType(txtElementCount.Text, Long)
    End If

    ' SplitHandler is called once for each portion of the source document
    m_XMLSplitter.Split( _
        New System.IO.StreamReader(txtSourceDocument.Text), _
        AddressOf SplitHandler)

    MessageBox.Show(String.Format("Split '{0}' into {1} files.", _
        txtSourceDocument.Text, m_XMLSplitter.SplitCount))
End Sub

Private Sub SplitHandler(ByVal count As Long, ByVal document As String)
    ' The path is concatenated as-is, so it must end with a separator
    Dim stream_writer As New System.IO.StreamWriter( _
        txtDestinationPath.Text & _
        String.Format(txtDestinationFilenamePattern.Text, count))
    stream_writer.Write(document)
    stream_writer.Close()
End Sub

The SplitHandler delegate is called each time a sufficient portion of the source XML document has been processed. count is the number of portions already handed off, so it is 0 on the first call. document is the XML text of the current portion, as a string. SplitHandler simply creates a new StreamWriter from the destination path and the destination filename pattern and writes document to it.
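
As a minimal sketch of how the filename pattern and count combine, the helper below formats the pattern with String.Format and joins it to the destination folder. BuildSplitPath, the "output" folder, and the "Split{0:000}.xml" pattern are assumptions made for illustration; they are not part of the Splitter project. Path.Combine is used here in place of the plain concatenation in SplitHandler, so the folder does not need a trailing separator.

```vbnet
Imports System
Imports System.IO

Module FilenamePatternSketch
    ' Hypothetical helper, not part of the Splitter project:
    ' builds the output path for one portion of the split document.
    Function BuildSplitPath(ByVal destinationPath As String, _
                            ByVal filenamePattern As String, _
                            ByVal count As Long) As String
        ' Path.Combine adds a directory separator only when one is missing
        Return Path.Combine(destinationPath, _
                            String.Format(filenamePattern, count))
    End Function

    Sub Main()
        ' "Split{0:000}.xml" is an assumed pattern; {0} receives the count
        For count As Long = 0 To 2
            Console.WriteLine(BuildSplitPath("output", "Split{0:000}.xml", count))
        Next
    End Sub
End Module
```

A zero-padded format such as {0:000} keeps the generated files in order when they are sorted by name.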

The remaining code in Splitter.vb is in support of the GUI and is not important to this discussion.

Here is the source XML document that was used to test the Splitter application and the XMLDocumentSplitter class.

XMLSplitter.xml

Here are the resultant XML files that were created when the application was run with 'Split by Element Count' and 'Element Count' = 1:

XMLDocumentSplitter Class

The XMLDocumentSplitter class is small. It has a handful of properties that control how XML documents are split and that report the result of a split. The bulk of the functionality is in the Split method, which is where I will concentrate the discussion.

A TextReader object is passed to the Split method along with an XMLDocumentSplitHandler delegate, which is called for each of the smaller XML documents the method generates. I used a StringBuilder inside the StringWriter passed to the XmlTextWriter so that I could control the underlying string data. The reason becomes evident later, when the XML document needs to be reset back to just the header portion.

Public Sub Split(ByVal text_reader As TextReader, _
                 ByVal handler As XMLDocumentSplitHandler)
    Dim document_header As String, document_footer As String
    Dim element_counter As Long = 0
    m_SplitCount = 0 : m_TotalSplitSizes = 0
    ' First time through is never an empty document
    Dim empty_document As Boolean = False
    Dim string_builder As New System.Text.StringBuilder
    Dim xml_text_writer As New _
         XmlTextWriter(New StringWriter(string_builder))
    Dim xml_text_reader As New XmlTextReader(text_reader)
    xml_text_reader.Read() ' prime the pump

Now capture all of the items before the root element; they form the header of each sub-document. document_header and document_footer are initialized here for later use: the header resets the string_builder object, and the footer caps each XML document. We cannot use the xml_text_writer object to write the end element, because it would then pop the root element from its internal stack and would not allow any more nodes to be written. Manually appending document_footer accomplishes the same thing without affecting the xml_text_writer object.

    ' Capture all of the items before the first element...
    While xml_text_reader.NodeType <> XmlNodeType.Element
        xml_text_writer.WriteNode(xml_text_reader, True)
    End While

    ' Prepare the document header and footer sections for use later...
    xml_text_writer.WriteStartElement(xml_text_reader.Name)
    xml_text_writer.WriteAttributes(xml_text_reader, True)
    ' Must close this manually...
    document_header = string_builder.ToString & ">"
    ' Create close element manually...
    document_footer = vbCrLf & "</" & xml_text_reader.Name & ">"
    xml_text_reader.Read() ' Skip past the root node...
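
The header capture can be tried in isolation. The sketch below repeats the same calls on a small in-memory document and returns the root start tag, including the manually appended ">". CaptureHeader and the sample orders document are assumptions made for illustration; they are not part of the XMLDocumentSplitter class.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module HeaderCaptureSketch
    ' Hypothetical helper: returns everything up to and including the root
    ' start tag of xml, built the same way Split builds document_header.
    Function CaptureHeader(ByVal xml As String) As String
        Dim builder As New StringBuilder()
        Dim writer As New XmlTextWriter(New StringWriter(builder))
        Dim reader As New XmlTextReader(New StringReader(xml))

        reader.Read()
        ' Copy any nodes that precede the root element
        While reader.NodeType <> XmlNodeType.Element
            writer.WriteNode(reader, True)
        End While

        ' The writer leaves the start tag open until more content arrives,
        ' so the closing ">" has to be appended by hand.
        writer.WriteStartElement(reader.Name)
        writer.WriteAttributes(reader, True)
        Return builder.ToString() & ">"
    End Function

    Sub Main()
        Console.WriteLine(CaptureHeader("<orders region=""EU""><order/></orders>"))
    End Sub
End Module
```

Because the writer has not yet emitted the closing ">" at this point, the first portion receives it from the writer itself, while every later portion receives it from document_header.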

We should now be at the first child node of the root element. Iterate through the child nodes, copying each one to the xml_text_writer object. White space and other ignorable nodes are not counted as elements. Once m_SplitSize has been reached or exceeded, by either ByDocumentSize or ByElementCount, we call the handler with m_SplitCount and the XML document as a string. We then increment and reset the counters as appropriate, and reset the string_builder to the document_header text in preparation for the next elements.

    While Not xml_text_reader.EOF
        ' Only count the nodes that interest us...
        If Not IgnorableNodeType(xml_text_reader.NodeType) Then
            element_counter += 1 : empty_document = False
        End If
        ' Copy everything from the reader to the writer
        xml_text_writer.WriteNode(xml_text_reader, True)

        If (m_SplitType = SplitTypes.ByDocumentSize AndAlso _
          (string_builder.Length - document_header.Length) _
          >= m_SplitSize) OrElse _
          (m_SplitType = SplitTypes.ByElementCount AndAlso _
          element_counter >= m_SplitSize) Then

            handler(m_SplitCount, string_builder.ToString & document_footer)
            ' Adjust the counters
            m_TotalSplitSizes += string_builder.Length
            m_SplitCount += 1 : element_counter = 0
            ' Reset the StringBuilder
            string_builder.Length = 0
            string_builder.Append(document_header)
            ' It is an empty document again...
            empty_document = True
        End If
    End While
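
The split test in the loop can also be read as a small pure function, which makes the two modes easier to check on their own. This is only a sketch: ShouldSplit and its parameter names are assumptions, and the enum here merely mirrors the two SplitTypes values named in the article.

```vbnet
Imports System

Module SplitDecisionSketch
    ' Mirrors the two modes named in the article
    Enum SplitTypes
        ByDocumentSize
        ByElementCount
    End Enum

    ' Hypothetical helper: True when the current portion should be handed off.
    ' bodyLength is the length of the generated text, excluding the header.
    Function ShouldSplit(ByVal splitType As SplitTypes, _
                         ByVal splitSize As Long, _
                         ByVal bodyLength As Long, _
                         ByVal elementCount As Long) As Boolean
        If splitType = SplitTypes.ByDocumentSize Then
            Return bodyLength >= splitSize
        End If
        Return elementCount >= splitSize
    End Function

    Sub Main()
        Console.WriteLine(ShouldSplit(SplitTypes.ByElementCount, 2, 0, 1))
        Console.WriteLine(ShouldSplit(SplitTypes.ByElementCount, 2, 0, 2))
        Console.WriteLine(ShouldSplit(SplitTypes.ByDocumentSize, 512000, 600000, 1))
    End Sub
End Module
```

Since the test runs only after a whole child node has been copied, a portion split by size can be larger than the threshold by up to one node.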

This final piece of code handles the last, partial portion of the XML document. If we have streamed one or more relevant nodes to the xml_text_writer since the last split and the xml_text_reader has hit EOF, we need to output the text remaining in the string_builder object as one more XML document.

    If Not empty_document Then
        handler(m_SplitCount, string_builder.ToString & document_footer)
        ' Adjust the counters
        m_TotalSplitSizes += string_builder.Length
        m_SplitCount += 1
    End If
End Sub

The best way to understand how the XmlTextReader and XmlTextWriter classes interact is to step through the Split method in a debugger as it pulls data from the xml_text_reader object and writes to the xml_text_writer object.

Points of interest

The XmlTextReader class should be used when you do not wish to read an entire XML document into memory at once, whether because of memory or time constraints. It is based on a pull model for processing XML streams: the caller asks for the next node when it is ready for it, which makes the class flexible and powerful.
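
To illustrate the pull model, here is a minimal sketch that counts the elements in a stream one node at a time, without building the document in memory. CountElements and the sample XML are assumptions made for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module PullModelSketch
    ' Hypothetical helper: counts element start tags without building a DOM.
    Function CountElements(ByVal source As TextReader) As Integer
        Dim reader As New XmlTextReader(source)
        Dim count As Integer = 0
        ' Each Read() call pulls exactly one node from the stream
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element Then
                count += 1
            End If
        End While
        reader.Close()
        Return count
    End Function

    Sub Main()
        Dim xml As String = "<orders><order id=""1""/><order id=""2""/></orders>"
        Console.WriteLine(CountElements(New StringReader(xml)))
    End Sub
End Module
```

The same loop works unchanged on a StreamReader opened on a large file, because only the current node is held at any time.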

The XmlTextWriter class should be used to generate XML documents instead of writing strings to a stream by hand. It is more robust and flexible than formatting the XML tags yourself, and it is a lot simpler than using the XmlDocument class to generate an XML document.
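
As a small sketch of that robustness, the helper below builds a tiny document with XmlTextWriter and lets the writer escape reserved characters in the text. BuildOrder and the element and attribute names are assumptions made for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module WriterSketch
    ' Hypothetical helper: builds a tiny document with XmlTextWriter.
    Function BuildOrder(ByVal region As String, ByVal note As String) As String
        Dim builder As New StringBuilder()
        Dim writer As New XmlTextWriter(New StringWriter(builder))
        writer.WriteStartElement("order")
        writer.WriteAttributeString("region", region)
        ' Reserved characters in the text are escaped by the writer
        writer.WriteElementString("note", note)
        writer.WriteEndElement()
        writer.Close()
        Return builder.ToString()
    End Function

    Sub Main()
        Console.WriteLine(BuildOrder("EU", "size < 4 MB & growing"))
    End Sub
End Module
```

Writing the same text with plain string concatenation would leave the "<" and "&" unescaped and produce a document that is not well formed.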

History

June 20th, 2003 - Initial release of source code and article.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
