
Fig. 1
Introduction
Working with large XML files, meaning files of 25 MB and more, is not always an easy task. One approach is to split the XML document into smaller files and process each one separately. Splitting a file into arbitrary chunks is a no-brainer, but what if each partial file must be loadable on its own by an XML parser or DOM? Then every part has to end on a complete node, and the next part has to start at the next node.
Background
This article continues my previous one on how to deal with large XML documents: Large XML Files Processing and Indexing.
Using the code
Here is one way to do it. I tested the code with many different XML files and it works for most of them. You may get an error if your split size is too small. The result also depends on your XML formatting, because the code reads the file line by line and takes the first token of each line that contains a "<" as the node name.
Try the attached XML document as an example. I have also attached the results of this process: the .part1, .part2, and .part3 files.
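The naming pattern of the part files and a rough estimate of how many to expect can be sketched as follows. This is a minimal illustration rather than the article's code: SplitPlan, PartCount, and PartName are hypothetical names, and the real number of parts may differ slightly because parts are cut at node boundaries rather than at exact byte counts.

```csharp
using System;
using System.IO;

static class SplitPlan
{
    // Estimated number of parts for a file of fileSize bytes when each
    // part should hold about maxPartSize bytes.
    public static int PartCount(long fileSize, long maxPartSize)
    {
        if (maxPartSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxPartSize));
        return (int)Math.Ceiling((double)fileSize / maxPartSize);
    }

    // Name of the n-th part, following the pattern used in this article:
    // <file name> + ".part" + <n> + <extension>.
    public static string PartName(string fileName, int part)
    {
        return fileName + ".part" + part + Path.GetExtension(fileName);
    }
}
```

For a 60 MB file and a 25 MB split size, PartCount returns 3, and PartName("books.xml", 1) returns "books.xml.part1.xml".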
Here is how we split the file and how it works:
private void SplitFile() {
    ImportXMLDoc(false);
    nodePathDic.Clear();
    if (string.IsNullOrEmpty(filePath)) {
        MessageBox.Show("Select XML File to split");
        return;
    }
    FileInfo fi = new FileInfo(filePath);
    double origFileSize = (double)fi.Length;
    numOfNewFiles = Math.Ceiling(origFileSize / maxFileSplitSize);
    int fileCnt = 1;
    long writeFilePosition = 0;
    using (StreamReader sr = new StreamReader(filePath, Encoding.UTF8)) {
        // Running byte count of the source lines read so far.
        int pos = 0;
        string filePart = Application.StartupPath + "/" + fi.Name +
            ".part" + fileCnt + fi.Extension;
        StreamWriter sw = new StreamWriter(filePart, false);
        Regex rx = new Regex(@"<", RegexOptions.Compiled);
        string nodeName = string.Empty;
        do {
            string line = sr.ReadLine();
            if (line == null)
                break; // empty source file
            // The + 2 assumes CRLF line endings.
            pos += Encoding.UTF8.GetByteCount(line) + 2;
            // Only the first "<" on each line is indexed.
            Match mt = rx.Match(line);
            if (mt.Success) {
                // Node name = first space-delimited token of the line,
                // for example "<book" or "<catalog>".
                nodeName = line.Split(new char[] { ' ' },
                    StringSplitOptions.RemoveEmptyEntries)[0];
                if (!nodeName.Contains("?xml") &&
                    !nodePathDic.ContainsKey(pos + mt.Index)) {
                    nodePathDic.Add(pos + mt.Index, nodeName);
                }
            }
            sw.WriteLine(line);
            sw.Flush();
            writeFilePosition = sw.BaseStream.Position;
            // Split size reached: finish this part and start the next one.
            if (pos > maxFileSplitSize * fileCnt) {
                int lastNodeStartPosition = 0;
                string lastNodeName = string.Empty;
                string ln = string.Empty;
                // GetLastNode is not listed here; judging by its out
                // parameters, it supplies the full text, start position,
                // and name of the node at the cut.
                string completeLastNode = GetLastNode(filePath,
                    out lastNodeStartPosition, out lastNodeName);
                // Skip source lines up to the one containing that name.
                do {
                    ln = sr.ReadLine();
                    if (ln == null)
                        break;
                    pos += Encoding.UTF8.GetByteCount(ln) + 2;
                } while (!ln.Contains(lastNodeName));
                // Rewind the writer so the incomplete node is overwritten.
                long swPosition = (writeFilePosition -
                    (nodePathDic.Keys[nodePathDic.Count - 1] -
                    lastNodeStartPosition)) + 2;
                sw.BaseStream.Position = swPosition >= 0 ? swPosition : 0;
                sw.Write("\n");
                sw.WriteLine("<!-- End of " + Application.StartupPath + "/" +
                    fi.Name + ".part" + fileCnt + fi.Extension + ". " +
                    fileCnt + " out of " + numOfNewFiles + " -->");
                sw.WriteLine(completeLastNode + "\n\n");
                // Close the root node. Values[0] is the first token of the
                // root line, so this assumes a root tag without attributes.
                sw.WriteLine(nodePathDic.Values[0].Replace("<", "</"));
                filePart = Application.StartupPath + "/" + fi.Name +
                    ".part" + (++fileCnt) + fi.Extension;
                sw.Flush();
                sw.Close();
                // Open the next part and repeat the root start tag.
                sw = new StreamWriter(filePart, false);
                sw.WriteLine(nodePathDic.Values[0]);
                sw.WriteLine("<!-- Start of " + Application.StartupPath + "/" +
                    fi.Name + ".part" + fileCnt + fi.Extension + ". " +
                    fileCnt + " out of " + numOfNewFiles + " -->");
                sw.Flush();
            }
        } while (!sr.EndOfStream);
        sw.Flush();
        sw.Close();
        // sr is closed by the using block.
    }
}
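
SplitFile relies on several members that are not listed in this article (filePath, maxFileSplitSize, numOfNewFiles, nodePathDic, GetLastNode). Because the code indexes into nodePathDic.Keys and nodePathDic.Values by position, a SortedList<int, string> is a plausible type for nodePathDic; that is an inference from the listing, not something the article states. The sketch below uses a hypothetical NodeIndexDemo class and made-up offsets to show why a sorted collection fits: whatever the insertion order, Values[0] is the node with the smallest offset (the root), and Keys[Count - 1] is the largest offset recorded so far.

```csharp
using System;
using System.Collections.Generic;

static class NodeIndexDemo
{
    // Builds a small index of byte offset -> node name, added out of order
    // on purpose. The offsets and names are illustrative only.
    public static SortedList<int, string> Build()
    {
        var nodePathDic = new SortedList<int, string>();
        nodePathDic.Add(120, "<book");
        nodePathDic.Add(40, "<catalog>");
        nodePathDic.Add(300, "<book");
        return nodePathDic;
    }

    public static void Main()
    {
        SortedList<int, string> index = Build();
        // Entries are kept sorted by key, so positional access is stable.
        Console.WriteLine(index.Values[0]);             // root node
        Console.WriteLine(index.Keys[index.Count - 1]); // last offset
    }
}
```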

At the end of the run, you should have the same part files as the ones included in the attached Zip file.

Let’s take a look at the output of this process:
At the end of each file, note the "<!-- End of ..." comment line and the complete last node. I added the comment as a visual marker. It can also be used later to join the documents back together (that will be in my next article).

The next file will start where the last file ended.
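
The text written around each part boundary can be sketched as a few small helpers. This is an illustration only: PartMarkers and its methods are hypothetical names, and ClosingTagFor derives the closing tag from the element name rather than using the Replace("<", "</") call from the listing, so that it also works when the root start tag carries attributes.

```csharp
using System;

static class PartMarkers
{
    // Builds the closing tag for an opening tag or tag fragment such as
    // "<catalog>", "<catalog" or "<catalog version=\"1\">".
    public static string ClosingTagFor(string openingTag)
    {
        string inner = openingTag.Trim().TrimStart('<').TrimEnd('>', '/');
        int cut = inner.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
        string name = cut < 0 ? inner : inner.Substring(0, cut);
        return "</" + name + ">";
    }

    // Comment written at the end of a part, in the article's format.
    public static string EndComment(string partName, int part, int total)
    {
        return "<!-- End of " + partName + ". " + part +
               " out of " + total + " -->";
    }

    // Comment written at the start of a part, in the article's format.
    public static string StartComment(string partName, int part, int total)
    {
        return "<!-- Start of " + partName + ". " + part +
               " out of " + total + " -->";
    }
}
```

One caveat with marker comments: an XML comment may not contain "--", so a part name or path containing two consecutive hyphens would make the part malformed.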

Note
The root node is opened at the beginning and closed at the end of each document, so each output XML file should load in an XML DOM or in a tool like XMLSpy.
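One way to check this claim for a set of part files is to load each one into an XmlDocument and compare the root element names. The sketch below is a minimal check under that assumption; PartCheck and RootNames are hypothetical names, not part of the attached project.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class PartCheck
{
    // Loads every part into a DOM and returns the root element names in
    // order. XmlDocument.Load throws XmlException if a part is not
    // well-formed, so a normal return means every part parsed.
    public static List<string> RootNames(IEnumerable<string> partPaths)
    {
        var roots = new List<string>();
        foreach (string path in partPaths)
        {
            var doc = new XmlDocument();
            doc.Load(path);
            roots.Add(doc.DocumentElement.Name);
        }
        return roots;
    }
}
```

If the split worked, every entry in the returned list is the same root name as in the original document.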
Enjoy. If you have any questions, post them here or send me an email.
History
- Created on November 20, 2008.
- Jan 09, 2011: Changed the logic for ending the nodes of one file and starting the nodes of the next. This version is more robust, is based on .NET 4.0, and includes xsd.exe to generate the XML file schema.