
Introduction
This article demonstrates how to use the XmlTextReader and XmlTextWriter classes to split XML streams. The included XMLDocumentSplitter class reads an XML stream from a TextReader object passed to it and calls an XMLDocumentSplitHandler delegate for each portion of the stream. The stream can be split by element count or by size, as specified through properties of the XMLDocumentSplitter class. A Splitter VB.NET Windows project is provided that shows how to use the XMLDocumentSplitter class and offers an easy way to split XML files.
Background
This class was developed to help solve a problem with large document processing in BizTalk 2002. BizTalk 2002 cannot process documents larger than 4 megabytes, and some of the XML documents we needed to process could exceed this limit. A solution was found in the BizTalk 2002 .NET Toolkit in the form of a custom pipeline that receives the large document and splits it into smaller documents to be submitted to BizTalk. The example provided used the XmlTextReader class to read the XML elements, but it manually created the XML document sections that were submitted back to BizTalk. I thought that manually creating the output XML nodes was unnecessarily complicated, since the text for each element can be retrieved with the ReadOuterXml method. I decided to rewrite this code as a separate class so that I could use it in several other solutions I was working on. Later I started looking at the XmlTextWriter class in the .NET Framework and decided that some of the processing I was doing with attributes, and with nodes before the first element, could be handled better by the Write* methods it provides. I rewrote the Split method to use the XmlTextWriter class to generate the output XML documents.
Using the code
The following code shows how the XMLDocumentSplitter class is used within the Splitter test project. By default the XMLDocumentSplitter class is set to SplitTypes.ByDocumentSize with a size of 512,000. The btnSplit_Click method sets SplitType and SplitSize based on the radio buttons and text fields in the dialog. The Split method is then called with a StreamReader initialized to the source document and with the SplitHandler method passed in as the delegate.
Private m_XMLSplitter As New XMLDocumentSplitter()

Private Sub btnSplit_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnSplit.Click
    ' Choose the split mode and threshold from the dialog controls.
    If rbSplitByDocumentSize.Checked Then
        m_XMLSplitter.SplitType = XMLDocumentSplitter.SplitTypes.ByDocumentSize
        m_XMLSplitter.SplitSize = CType(txtDocumentSize.Text, Long)
    Else
        m_XMLSplitter.SplitType = XMLDocumentSplitter.SplitTypes.ByElementCount
        m_XMLSplitter.SplitSize = CType(txtElementCount.Text, Long)
    End If

    ' Split the source document; SplitHandler is called once per portion.
    m_XMLSplitter.Split(New System.IO.StreamReader(txtSourceDocument.Text), _
        AddressOf SplitHandler)
    MessageBox.Show(String.Format("Split '{0}' into {1} files.", _
        txtSourceDocument.Text, m_XMLSplitter.SplitCount))
End Sub

' Writes one portion to a file named from the destination pattern and count.
Private Sub SplitHandler(ByVal count As Long, ByVal document As String)
    Dim stream_writer As New System.IO.StreamWriter( _
        txtDestinationPath.Text & _
        String.Format(txtDestinationFilenamePattern.Text, count))
    stream_writer.Write(document)
    stream_writer.Close()
End Sub
The SplitHandler delegate is called each time a sufficient portion of the source XML document has been processed. count is the number of portions processed so far. document is the XML document, as a string, that has been pulled from the source document. SplitHandler simply creates a new StreamWriter based on the destination path and destination filename pattern and writes document to the stream.
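Because the handler is an ordinary delegate, any method with a matching signature can receive the portions. The sketch below assumes the delegate is declared as shown (the declaration is inferred from the SplitHandler signature, not taken from the class source) and uses a hypothetical CollectingHandler that keeps the portions in memory rather than writing files.

```vbnet
Imports System
Imports System.Collections.Generic

Module SplitHandlerSketch
    ' Assumed declaration, inferred from the SplitHandler signature in the article.
    Public Delegate Sub XMLDocumentSplitHandler(ByVal count As Long, ByVal document As String)

    ' Hypothetical store for the portions received so far.
    Public ReadOnly Portions As New List(Of String)

    ' Hypothetical handler: keeps each portion in memory instead of writing a file.
    Public Sub CollectingHandler(ByVal count As Long, ByVal document As String)
        Portions.Add(document)
    End Sub

    Sub Main()
        ' Stand-in for what Split would do: invoke the handler once per portion.
        Dim handler As XMLDocumentSplitHandler = AddressOf CollectingHandler
        handler(0, "<root><a/></root>")
        handler(1, "<root><b/></root>")
        Console.WriteLine(Portions.Count)
    End Sub
End Module
```

A handler like this can be convenient in unit tests, where writing files to disk is unnecessary.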
The remaining code in Splitter.vb is in support of the GUI and is not important to this discussion.
Here is a source XML document that was used to test the Splitter application and the XMLDocumentSplitter class.
XMLSplitter.xml
Here are the resultant XML files that were created when the application was run with 'Split by Element Count' and 'Element Count' = 1:
XMLDocumentSplitter Class
The XMLDocumentSplitter class is small. It has a few properties that control how XML documents are split and that report the result of a split. The bulk of the functionality is in the Split method, which is where I will concentrate the discussion.
A TextReader object is passed to the Split method along with an XMLDocumentSplitHandler delegate that is called for each of the smaller XML documents the method generates. I used a StringBuilder within the StringWriter passed to the XmlTextWriter so that I could control the underlying string data. This becomes important later, when the XML document needs to be reset back to just the header portion.
Public Sub Split(ByVal text_reader As TextReader, _
        ByVal handler As XMLDocumentSplitHandler)
    Dim document_header As String, document_footer As String
    Dim element_counter As Long = 0
    m_SplitCount = 0
    m_TotalSplitSizes = 0
    Dim empty_document As Boolean = False
    Dim string_builder As New System.Text.StringBuilder
    Dim xml_text_writer As New XmlTextWriter(New StringWriter(string_builder))
    Dim xml_text_reader As New XmlTextReader(text_reader)
    xml_text_reader.Read()
Next, capture everything before the root element as the header for each of the sub-documents. document_header and document_footer are initialized here for later use, both to reset the string_builder object and to cap the XML documents. We cannot use the xml_text_writer object to write the end element, because it would pop the root element from its internal stack and would not allow any more nodes to be written. Manually appending document_footer accomplishes the same thing without affecting the xml_text_writer object.
    ' Copy the XML declaration, comments, and so on, up to the root element.
    While xml_text_reader.NodeType <> XmlNodeType.Element
        xml_text_writer.WriteNode(xml_text_reader, True)
    End While
    ' Open the root element; the writer leaves its start tag unclosed for now.
    xml_text_writer.WriteStartElement(xml_text_reader.Name)
    xml_text_writer.WriteAttributes(xml_text_reader, True)
    document_header = string_builder.ToString & ">"
    document_footer = vbCrLf & "</" & xml_text_reader.Name & ">"
    xml_text_reader.Read()
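The header capture relies on XmlTextWriter leaving a start tag open until content or an end element is written, which is why ">" is appended by hand. A minimal sketch of that behaviour in isolation follows; the orders element, the region attribute, and the CaptureOpenStartTag function are made up for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module HeaderCaptureSketch
    ' Returns what the writer has emitted right after the start element
    ' and its attribute, before any content has been written.
    Function CaptureOpenStartTag() As String
        Dim string_builder As New StringBuilder()
        Dim writer As New XmlTextWriter(New StringWriter(string_builder))
        writer.WriteStartElement("orders")
        writer.WriteAttributeString("region", "west")
        writer.Flush()
        Return string_builder.ToString()
    End Function

    Sub Main()
        ' The captured text should not yet include the closing ">" of the start tag.
        Dim open_tag As String = CaptureOpenStartTag()
        Dim header As String = open_tag & ">"
        Dim footer As String = "</orders>"
        Console.WriteLine(open_tag)
        Console.WriteLine(header & footer)
    End Sub
End Module
```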
We should now be at the first child node of the root element. Iterate through the children, writing each one to the xml_text_writer object. White space is not counted as an element. Once m_SplitSize has been reached, under either ByDocumentSize or ByElementCount, we call the handler with m_SplitCount and the XML document as a string. We then increment and reset the counters as appropriate, and reset the string_builder to the document_header text in preparation for the next elements.
    While Not xml_text_reader.EOF
        If Not IgnorableNodeType(xml_text_reader.NodeType) Then
            element_counter += 1
            empty_document = False
            ' WriteNode copies the node and advances the reader past it.
            xml_text_writer.WriteNode(xml_text_reader, True)
            If (m_SplitType = SplitTypes.ByDocumentSize AndAlso _
                    (string_builder.Length - document_header.Length) >= m_SplitSize) OrElse _
                    (m_SplitType = SplitTypes.ByElementCount AndAlso _
                    element_counter >= m_SplitSize) Then
                ' Threshold reached: emit this portion, then reset to the header.
                handler(m_SplitCount, string_builder.ToString & document_footer)
                m_TotalSplitSizes += string_builder.Length
                m_SplitCount += 1
                element_counter = 0
                string_builder.Length = 0
                string_builder.Append(document_header)
                empty_document = True
            End If
        Else
            ' Assumption: the published listing omits this branch. Ignorable nodes
            ' must still be consumed, or the loop would never reach EOF.
            xml_text_reader.Read()
        End If
    End While
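The IgnorableNodeType helper is called in the loop but is not listed in the article. One plausible shape for it, purely as an assumption, treats white space and end-element nodes as ignorable so that they are neither counted nor copied to the writer:

```vbnet
Imports System
Imports System.Xml

Module IgnorableNodeTypeSketch
    ' Hypothetical helper: the real implementation is not shown in the article.
    Function IgnorableNodeType(ByVal node_type As XmlNodeType) As Boolean
        Select Case node_type
            Case XmlNodeType.Whitespace, _
                 XmlNodeType.SignificantWhitespace, _
                 XmlNodeType.EndElement, _
                 XmlNodeType.None
                ' White space and the root's closing tag are not sub-document content.
                Return True
            Case Else
                Return False
        End Select
    End Function

    Sub Main()
        Console.WriteLine(IgnorableNodeType(XmlNodeType.Whitespace))
        Console.WriteLine(IgnorableNodeType(XmlNodeType.Element))
    End Sub
End Module
```

Treating EndElement as ignorable keeps the root's closing tag away from the writer, which matches the article's choice to append the footer by hand.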
This piece of code handles the final partial document. If one or more relevant nodes have been streamed to the xml_text_writer since the last split and the xml_text_reader has reached EOF, the text in the string_builder object must be output as a partial XML document.
    If Not empty_document Then
        handler(m_SplitCount, string_builder.ToString & document_footer)
        m_TotalSplitSizes += string_builder.Length
        m_SplitCount += 1
    End If
End Sub
The best way to understand how the XmlTextReader and XmlTextWriter classes interact is to step through the Split method as it pulls data from the xml_text_reader object and writes it to the xml_text_writer object.
Points of interest
The XmlTextReader class should be used when you do not want to read an entire XML document into memory at once, whether because of memory or time constraints. It is based on a pull model for processing XML streams and is very flexible and powerful.
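As a small illustration of the pull model, the sketch below asks the reader for one node at a time and counts the elements it sees, without ever holding the whole document in memory. The CountElements function and the sample XML are invented for this example.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module PullModelSketch
    ' Counts element nodes by pulling one node at a time from the reader.
    Function CountElements(ByVal text_reader As TextReader) As Integer
        Dim reader As New XmlTextReader(text_reader)
        Dim count As Integer = 0
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element Then
                count += 1
            End If
        End While
        reader.Close()
        Return count
    End Function

    Sub Main()
        Dim xml As String = "<items><item id=""1""/><item id=""2""/></items>"
        ' One root element plus two children: prints 3.
        Console.WriteLine(CountElements(New StringReader(xml)))
    End Sub
End Module
```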
The XmlTextWriter class should be used to generate XML documents instead of writing strings to a stream by hand. It is more robust and flexible than formatting the XML tags yourself, and a lot simpler than using the XmlDocument class to generate an XML document.
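One concrete part of that robustness is escaping: XmlTextWriter encodes reserved characters such as & and < in attribute values and text, which hand-built strings easily get wrong. A minimal sketch, with a made-up BuildItem function and made-up element and attribute names:

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module WriterEscapingSketch
    ' Builds a single element; the writer escapes reserved characters itself.
    Function BuildItem(ByVal name As String, ByVal text As String) As String
        Dim string_writer As New StringWriter()
        Dim writer As New XmlTextWriter(string_writer)
        writer.WriteStartElement("item")
        writer.WriteAttributeString("name", name)
        writer.WriteString(text)
        writer.WriteEndElement()
        writer.Flush()
        Return string_writer.ToString()
    End Function

    Sub Main()
        ' The ampersand and the less-than sign come out as entity references.
        Console.WriteLine(BuildItem("Tom & Jerry", "1 < 2"))
    End Sub
End Module
```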
History
- June 20th, 2003 - Initial release of source code and article.