Figure 9
Introduction
Processing an XML document through the DOM works well as long as the document is relatively small (less than 10-20 MB). The DOM loads the content into memory, validates it, and so on. But what do you do with large documents? Bigger documents use more memory and other resources, and performance suffers significantly. A tool like XML Spy will try to load the file into memory and validate it, and will eventually hang and run out of memory (the files tested were over 250 MB). Other XML parsers will just crash silently.

Background
You are probably having a similar issue, otherwise you would not be reading this. Here is a simple solution for getting the nodes of interest, or all nodes, from a very large XML document with only a small performance hit and almost no memory or CPU overhead.

This is what we do: we index the nodes of interest in memory or, as in this example, in a separate indexing document (*.xml.idx), and write the output nodes into another document (*.xml.sorted). We can then use the indexes for fast access to any node of interest. It is up to you to define the conditions and selection methods.

This is a very general example of processing Invoices for customers. The test file was over 420 MB, and it took 40-50 seconds on average to process it while writing another document with all nodes of interest. The processing time depends on your processor and the amount of memory on your box.

Key elements:

Using the Code
How to:
After the file is copied to the new directory, its size is shown in the form's text field (Figure 3).
Figure 3
Start the parsing process, which writes the *.idx and *.sorted files. A Regex checks each line for a match.

    using (FileStream fs = new FileStream(wokingCopy, FileMode.Open, FileAccess.Read))
    using (StreamReader sr = new StreamReader(wokingCopy, Encoding.UTF8)) {
        string parseText = txtNode.Text.Trim();
        // Matching expression for the node's opening tag. The user input is escaped, and a
        // delimiter is required after the name so "<Invoice" does not match "<InvoiceItem".
        Regex rx = new Regex(@"<" + Regex.Escape(parseText) + @"(?=[\s/>]|$)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int pos = 0;           // byte offset of the end of the current line
        int lastPosition = 0;  // byte offset of the start of the current line
        string line;
        // Read each line of the XML document as a regular text stream.
        while ((line = sr.ReadLine()) != null) {
            // Assumes CRLF line endings (2 extra bytes) and no byte order mark.
            pos += Encoding.UTF8.GetByteCount(line) + 2;
            foreach (Match mt in rx.Matches(line)) {
                // mt.Index counts characters; convert it to bytes before seeking.
                int startIndex = lastPosition + Encoding.UTF8.GetByteCount(line.Substring(0, mt.Index));
                ValidateXPathCondition(fs, startIndex);
            }
            lastPosition = pos;
        }
        sw.Close(); // sw writes the *.idx file (see SaveMatchedIndex)
    }
    WriteSortedDocument();
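The loop above derives byte offsets from line lengths, which depends on the file using two-byte line endings. As a possible alternative (not the method used in this article), the offsets can be found by scanning the raw bytes, so line endings and multi-byte characters do not matter. The class and method names below are hypothetical, matching is case-sensitive (unlike the Regex above), and the sketch does not skip tags that appear inside comments or CDATA sections.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TagScanner
{
    // Returns the byte offset of every "<tagName" that is followed by
    // whitespace, '/' or '>' (so "<Invoice" does not match "<InvoiceItem").
    public static List<long> FindTagOffsets(Stream stream, string tagName)
    {
        byte[] pattern = Encoding.UTF8.GetBytes("<" + tagName);
        var offsets = new List<long>();
        var buffered = new BufferedStream(stream, 1 << 16);
        long position = 0; // byte offset of the byte currently being examined
        int matched = 0;   // number of pattern bytes matched so far
        int b;
        while ((b = buffered.ReadByte()) != -1)
        {
            if (matched == pattern.Length)
            {
                // The whole pattern was seen; accept it only if a delimiter follows.
                if (b == ' ' || b == '\t' || b == '\r' || b == '\n' || b == '/' || b == '>')
                    offsets.Add(position - pattern.Length);
                matched = 0;
            }
            // '<' can only be the first pattern byte, so a mismatch restarts
            // the match either at 1 (current byte is '<') or at 0.
            if (b == pattern[matched]) matched++;
            else matched = (b == '<') ? 1 : 0;
            position++;
        }
        return offsets;
    }
}
```

Calling `TagScanner.FindTagOffsets(fs, "Invoice")` on an open FileStream would return offsets that can be passed to `fs.Seek` directly. The offsets are `long`, so files larger than 2 GB are not a problem.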
At the end, the processing time in seconds is displayed in the form's text field (Figure 4):
Figure 4
This is how you read the stream and store the information:

    private void WriteSortedDocument() {
        string rootName = txtNode.Text.Trim().ToUpper() + "S";
        using (FileStream fs = new FileStream(wokingCopy, FileMode.Open, FileAccess.Read))
        using (StreamWriter wr = new StreamWriter(wokingCopy + ".sorted", false)) {
            wr.WriteLine("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            wr.WriteLine("<" + rootName + ">");
            // indexList is sorted by key, so the nodes are written in key order.
            foreach (int key in indexList.Keys) {
                // Jump straight to the node's byte offset in the original file.
                fs.Seek(indexList[key], SeekOrigin.Begin);
                using (XmlReader reader = XmlReader.Create(fs)) {
                    reader.MoveToContent();
                    // Load only this node's subtree, never the whole document.
                    XmlDocument d = new XmlDocument();
                    d.Load(reader.ReadSubtree());
                    wr.WriteLine(d.InnerXml);
                }
            }
            wr.WriteLine("</" + rootName + ">");
        }
    }
    /// <summary>
    /// Writes the value and its index (byte offset) in the original file into the indexing file.
    /// </summary>
    /// <param name="value">Sort key of the matched node; must parse as an integer.</param>
    /// <param name="startIndex">Byte offset of the node's opening tag in the original file.</param>
    private void SaveMatchedIndex(string value, int startIndex) {
        sw.WriteLine(value + "\t" + startIndex);
        sw.Flush();
        // Add the new element to the sorted list. Sorting is by key.
        indexList.Add(Int32.Parse(value), startIndex);
    }
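Because SaveMatchedIndex writes one "value, tab, offset" pair per line, the index can be reloaded later without rescanning the large file. The following is a minimal sketch of such a loader, assuming that line format. The class and method names are hypothetical, offsets are stored as `long` rather than the `int` used above, and on duplicate keys the last entry wins instead of throwing.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class IndexFile
{
    // Parses "value<TAB>offset" lines into a list sorted by value.
    public static SortedList<int, long> Parse(TextReader reader)
    {
        var index = new SortedList<int, long>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split('\t');
            if (parts.Length != 2) continue; // skip blank or malformed lines
            int key = int.Parse(parts[0], CultureInfo.InvariantCulture);
            long offset = long.Parse(parts[1], CultureInfo.InvariantCulture);
            index[key] = offset; // the last entry wins on duplicate keys
        }
        return index;
    }

    // Convenience wrapper that reads the index from a file on disk.
    public static SortedList<int, long> Load(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return Parse(reader);
        }
    }
}
```

For example, `IndexFile.Load(wokingCopy + ".idx")` would rebuild the key-to-offset map, assuming the index file was written next to the working copy under that name.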
New files generated: Figure 5
Original XML File snippet: Figure 6
Part of index file *.xml.idx: Figure 7
Part of the *.sorted XML file, which was sorted by Invoice ID: Figure 8
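Once a byte offset is known, for example one taken from the *.idx file, a single node can be read back without loading the rest of the document. The sketch below follows the same Seek plus ReadSubtree approach as WriteSortedDocument. The class and method names are hypothetical, and it assumes the offset points exactly at the opening `<` of an element.

```csharp
using System.IO;
using System.Xml;

public static class NodeReader
{
    // Reads the element that starts at the given byte offset and returns its XML.
    public static string ReadNodeAt(Stream stream, long offset)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // By default, disposing the XmlReader leaves the underlying stream open,
        // so the same stream can be reused for the next lookup.
        using (XmlReader reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            // The subtree reader stops at the element's end tag, so nothing
            // after this node is parsed.
            using (XmlReader subtree = reader.ReadSubtree())
            {
                var doc = new XmlDocument();
                doc.Load(subtree);
                return doc.DocumentElement.OuterXml;
            }
        }
    }
}
```

With an open FileStream `fs` on the original file, `NodeReader.ReadNodeAt(fs, 311383)` would return the node stored at that index position, provided the offset is valid for that file.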
Test your indexes. Click the Read button to test index position 311383: it will return the selected node. See Figure 9 at the top of this article.

Points of Interest
If you have any comments or questions, please email me.

History