Introduction
TechFestXmlSolution is a tool that helps developers parse large XML documents. The application is written for one particular XML file: the book catalog that Project Gutenberg maintains in RDF format. It can search the file for a particular book ID, fetch books by index range, and search for books by text.
Performance and Scalability Issue
Processing a large XML document with a DOM object leads to high CPU, memory, and bandwidth utilization, because the DOM loads the entire document into memory before any node can be accessed.
Appropriate design decisions can address many XML-related performance issues early. The key is to choose the XML class that fits the job: here that means combining XmlReader with XmlDocument, and using the MoveToContent and Skip methods of XmlReader to avoid visiting nodes that are not needed. XmlReader can read files of up to 2 GB because it loads only 4 KB buffers into memory at a time.
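As a rough illustration of this combination, the sketch below streams through a document with XmlReader and hands only the matching element to XmlDocument. It is not the application's code: the element names catalog and book, the id attribute, and the FindBook helper are hypothetical and chosen only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingLookup
{
    // Returns the first <book> element whose id attribute matches, or null if none does.
    // Only the matching element is ever materialized as a DOM node.
    public static XmlNode FindBook(TextReader input, string id)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // step inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.Name == "book" && reader.GetAttribute("id") == id)
                    {
                        // Load just this subtree into the DOM.
                        return new XmlDocument().ReadNode(reader);
                    }
                    reader.Skip(); // jump over this element and all of its children
                }
                else
                {
                    reader.Read(); // move past end tags, text, and so on
                }
            }
        }
        return null;
    }
}

public static class LookupDemo
{
    public static void Main()
    {
        // Hypothetical sample data, far smaller than the real catalog.
        const string xml =
            "<catalog>" +
            "<book id='b1'><title>First</title></book>" +
            "<book id='b2'><title>Second</title></book>" +
            "</catalog>";
        XmlNode book = StreamingLookup.FindBook(new StringReader(xml), "b2");
        Console.WriteLine(book == null ? "not found" : book.OuterXml);
    }
}
```

Because the reader moves forward only, the memory cost is roughly that of one book element rather than the whole file, which is the trade-off this section argues for.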
Using the Code
In this application, we use XmlReader to move through the file node by node and an XmlDocument object to load only the part of the XML we are interested in. The XmlReader.MoveToContent() method checks whether the current node is a content node (non-white space text, CDATA, Element, EndElement, EntityReference, or EndEntity). If it is not, the reader skips ahead to the next content node or to the end of the file. Nodes of the following types are skipped: ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace. The XmlReader.Skip() method skips the children of the current node, which lets us pass over books we do not need to search. The following snippet shows the search by book ID:
while (!_xReader.EOF)
{
    // Stop at the etext element whose first attribute matches the requested ID,
    // or at the rdf:RDF root element.
    if ((_xReader.MoveToContent() == XmlNodeType.Element &&
         _xReader.Name == "pgterms:etext" &&
         _xReader.GetAttribute(0) == name) || _xReader.Name == "rdf:RDF")
    {
        if (_xReader.Name == "rdf:RDF")
        {
            Console.WriteLine(" Before finding the book CPU usage -> " + sampleCounter.CpuUsage);
            _xReader.Read();    // step into the root element
        }
        else
        {
            Console.WriteLine(" After finding the book CPU usage -> " + sampleCounter.CpuUsage);
            doc = new XmlDocument();
            XmlNode xnode = doc.ReadNode(_xReader);    // load only this book into the DOM
            if (xnode.Attributes.Count > 0)
            {
                book1.Initialise(xnode);
            }
            break;
        }
    }
    else
    {
        _xReader.Skip();    // skip the current node and its children
        _xReader.MoveToContent();
    }
}
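To see what MoveToContent and Skip do in isolation, here is a small self-contained sketch. The sample XML, the SkipDemo class, and the TopLevelChildNames helper are hypothetical and are not part of the application; the sketch only lists the direct children of the root element.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkipDemo
{
    // Lists the names of the root's direct child elements
    // without visiting any of their descendants.
    public static List<string> TopLevelChildNames(string xml)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Passes over the XML declaration, the leading comment,
            // and whitespace, and stops on the root element.
            reader.MoveToContent();
            reader.Read(); // step inside the root

            // MoveToContent ignores whitespace and comments between children;
            // the loop ends when it lands on the root's end tag.
            while (reader.MoveToContent() == XmlNodeType.Element)
            {
                names.Add(reader.Name);
                reader.Skip(); // jump past this element's whole subtree
            }
        }
        return names;
    }

    public static void Main()
    {
        const string xml =
            "<?xml version='1.0'?>\n" +
            "<!-- hypothetical sample data -->\n" +
            "<root>\n" +
            "  <a><deep><deeper/></deep></a>\n" +
            "  <!-- ignored -->\n" +
            "  <b/>\n" +
            "</root>";
        Console.WriteLine(string.Join(", ", TopLevelChildNames(xml)));
    }
}
```

For the sample above, the method collects a and b; the nested deep and deeper elements are never visited because Skip moves the reader straight past the end tag of a.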
Similarly, the application contains two more functions: one for getting books by index range and one for searching books by title, subject, publisher, and so on. While searching, the function also writes its output to a text file in the background. The path of that text file is set in App.config. The function for getting books by index is shown here:
public override List<Book> getBooks(long _startIndex, long _lastIndex)
{
    List<Book> bookList = new List<Book>(Convert.ToInt32(_lastIndex - _startIndex + 1));
    int _index = 0;
    sampleCounter.StartTime = DateTime.Now;
    XmlReader _xReader = openXmlFile(_fileName);
    try
    {
        while (!_xReader.EOF)
        {
            // Load an etext element only when its position falls inside the requested range.
            if ((_xReader.MoveToContent() == XmlNodeType.Element &&
                 _xReader.Name == "pgterms:etext" &&
                 _index > _startIndex - 2 && _index < _lastIndex) ||
                _xReader.Name == "rdf:RDF")
            {
                if (_xReader.Name == "rdf:RDF")
                {
                    _xReader.Read();    // step into the root element
                }
                else
                {
                    doc = new XmlDocument();
                    XmlNode xnode = doc.ReadNode(_xReader);
                    Book book1 = new Book();
                    if (xnode.Attributes.Count > 0)
                    {
                        book1.Initialise(xnode);
                        bookList.Add(book1);
                    }
                    _index++;
                }
            }
            else if (_index == _lastIndex)
            {
                break;    // the requested range has been read completely
            }
            else
            {
                // Count books that lie before the range, then skip the node and its children.
                if (_xReader.MoveToContent() == XmlNodeType.Element &&
                    _xReader.Name == "pgterms:etext")
                {
                    _index++;
                }
                _xReader.Skip();
                _xReader.MoveToContent();
            }
        }
    }
    catch (Exception ex)
    {
        // The handler body is missing from the original listing;
        // reporting the message here is an assumption.
        Console.WriteLine("Error while reading the XML file: " + ex.Message);
    }
    finally
    {
        _xReader.Close();
    }
    if (bookList.Count > 0)
    {
        Console.WriteLine("Total number of books are " + bookList.Count);
        Console.WriteLine(" CPU usage is " + sampleCounter.CpuUsage);
        Console.WriteLine("Time taken " + sampleCounter.TimeTaken);
        return bookList;
    }
    else
        return null;
}
Points to Remember
Check the schema of the XML file to see which kind of document the application can be used with.
To reuse these functions in your own application, you will need to adapt the GutenbergBookManager.cs class.
Make sure that the path of the XML file to be parsed is set in App.config. For more details, read the documents in the attached zip file.