Posted 27 Oct 2006

Parsing/Loading/Searching XML Document of Size ~ 1GB

Sandeep Akhare, 8 Aug 2008
A console application that provides an efficient way to parse large XML files by combining XmlReader with the DOM (XmlDocument).


TechFestXmlSolution is a tool that helps developers parse large XML documents. The application is written for one particular XML file, the catalog from Project Gutenberg, which maintains its list of books in RDF format. It can search the XML file for a particular book id, get books by index, and search for books by text.

Performance and Scalability Issue

Processing a large XML document with a DOM object causes high CPU, memory, and bandwidth utilization, because the DOM loads the whole document into memory before any of it can be queried.

Appropriate design decisions can help you address many XML-related performance issues early. Choose the XML class that fits the job, consider combining XmlReader and XmlDocument, and use the MoveToContent and Skip methods on the XmlReader. We can read files of up to 2 GB with XmlReader because it streams the file and holds only a 4 KB buffer in memory at a time.
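As a rough illustration of the streaming approach described above, the sketch below counts the direct children of the root element with a given name, using only MoveToContent, Skip, and Read, so no DOM is ever built. The class name StreamingSketch, the method CountChildren, and the sample XML are assumptions made for this example; they are not part of the attached solution.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingSketch
{
    // Counts the direct children of the root element that have the given name.
    // Only the reader's current node is held in memory at any time.
    public static int CountChildren(TextReader input, string elementName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();              // position on the root element
            if (reader.IsEmptyElement) return 0; // root has no children
            reader.Read();                       // step inside the root

            while (!reader.EOF)
            {
                if (reader.MoveToContent() == XmlNodeType.Element)
                {
                    if (reader.Name == elementName) count++;
                    reader.Skip();               // jump over the whole subtree
                }
                else
                {
                    reader.Read();               // text, end tags, end of file
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        // Hypothetical sample data standing in for a large file.
        string xml = "<catalog><book id=\"a\"><title>One</title></book>" +
                     "<!-- comment --><note/><book id=\"b\"/></catalog>";
        Console.WriteLine(CountChildren(new StringReader(xml), "book")); // prints 2
    }
}
```

For a real file, passing a StreamReader (or a file path to XmlReader.Create) in place of the StringReader should keep memory use flat regardless of file size, since each skipped subtree is discarded as it is read.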

Using the Code

In this application, we use XmlReader to walk through the file node by node and an XmlDocument object to load only the part of the XML that we need. The XmlReader.MoveToContent() method checks whether the current node is a content node (non-white space text, CDATA, Element, EndElement, EntityReference, or EndEntity). If the node is not a content node, the reader skips ahead to the next content node or to the end of the file. It skips over nodes of the following types: ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace. The XmlReader.Skip() method skips the children of the current node, which we use for nodes we do not have to search. See the snapshot of the code:

while (!_xReader.EOF)
{
    // Check the node type, the node name, and the id attribute to search for.
    // rdf:RDF is the root element.
    if ((_xReader.MoveToContent() == XmlNodeType.Element &&
         _xReader.Name == "pgterms:etext" &&
         _xReader.GetAttribute(0) == name) || _xReader.Name == "rdf:RDF")
    {
        if (_xReader.Name == "rdf:RDF")
        {
            // The root element: don't skip it, step inside and continue to read.
            Console.WriteLine(" Before finding the book  CPU usage  ->" +
                sampleCounter.CpuUsage);
            _xReader.Read();
        }
        else
        {
            // The node containing the id is found. Load only this node into the
            // document object, not the whole file, so its data is faster to get.
            Console.WriteLine(" After finding the book CPU usage  ->" +
                sampleCounter.CpuUsage);
            doc = new XmlDocument();
            // Get the node in memory; ReadNode also advances the reader past it.
            XmlNode xnode = doc.ReadNode(_xReader);
            // Check whether the element contains any attribute.
            if (xnode.Attributes.Count > 0)
            {
                // Call the Initialize method of class Book, which initializes
                // all of its variables, then print the whole book description.
                // (Both calls are missing from this listing; see the attached source.)
            }
        }
    }
    else
    {
        // Skip the whole node, as it is not of use.
        _xReader.Skip();
    }
}
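The search loop above can be tried in isolation with a sketch like the following, which runs the same MoveToContent / Skip / ReadNode pattern against a tiny in-memory document. The class FindByIdSketch, the method FindEtext, the title element, and the pgterms namespace URI (urn:example:pgterms, a placeholder) are assumptions for illustration only; the real catalog's schema is in the attached files.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FindByIdSketch
{
    // Returns the pgterms:etext element whose first attribute equals id,
    // loaded into its own small XmlDocument, or null if no element matches.
    public static XmlNode FindEtext(TextReader input, string id)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                XmlNodeType type = reader.MoveToContent();
                if (type == XmlNodeType.Element && reader.Name == "pgterms:etext" &&
                    reader.AttributeCount > 0 && reader.GetAttribute(0) == id)
                {
                    // Load only the matching node, never the whole file.
                    return new XmlDocument().ReadNode(reader);
                }
                if (type == XmlNodeType.Element && reader.Name != "rdf:RDF")
                    reader.Skip();   // not the node we want: jump over its subtree
                else
                    reader.Read();   // root element, end tags, text: keep reading
            }
        }
        return null;
    }

    // Hypothetical sample data; the pgterms URI is a placeholder.
    public const string Sample =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" " +
        "xmlns:pgterms=\"urn:example:pgterms\">" +
        "<pgterms:etext rdf:ID=\"etext1\"><title>First</title></pgterms:etext>" +
        "<pgterms:etext rdf:ID=\"etext2\"><title>Second</title></pgterms:etext>" +
        "</rdf:RDF>";

    public static void Main()
    {
        XmlNode node = FindEtext(new StringReader(Sample), "etext2");
        Console.WriteLine(node == null ? "not found" : node.InnerText); // prints Second
    }
}
```

The AttributeCount guard is an addition in this sketch: GetAttribute(int) throws if the element has no attribute at that index, which matters for documents where some etext elements carry no attributes.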

Similarly, the application contains two more functions: one gets books by index, and the other searches for books by title, subject, publisher, and so on. While searching, the function writes its output to a text file in the background; the path of that text file is given in App.config. The index-based function, getBooks, looks like this:

/// Function getBooks takes a start and an end index as arguments
/// and returns the books in that range as a list of type Book.
public override List<Book> getBooks(long _startIndex, long _lastIndex)
{
    // Create the List object with capacity for the number of books needed.
    List<Book> bookList = new List<Book>(
        Convert.ToInt32(_lastIndex - _startIndex + 1));
    // Local variable to count the number of books found so far.
    int _index = 0;
    sampleCounter.StartTime = DateTime.Now;
    XmlReader _xReader = openXmlFile(_fileName);
    try
    {
        // Loop until the end of the file.
        while (!_xReader.EOF)
        {
            // Match a book node whose index lies in the requested range,
            // or the root element.
            if ((_xReader.MoveToContent() == XmlNodeType.Element &&
                 _xReader.Name == "pgterms:etext" &&
                 _index > _startIndex - 2 && _index < _lastIndex) ||
                _xReader.Name == "rdf:RDF")
            {
                if (_xReader.Name == "rdf:RDF")
                {
                    // Root element: step inside and continue to read.
                    _xReader.Read();
                }
                else
                {
                    // The index is within the range. Get the whole node in
                    // memory so that searching becomes easy and productive.
                    doc = new XmlDocument();
                    XmlNode xnode = doc.ReadNode(_xReader);
                    // Create the instance of the book as a container.
                    Book book1 = new Book();
                    if (xnode.Attributes.Count > 0)
                    {
                        // Initialize and add to the list.
                        book1.Initialize(xnode); // argument assumed; see attached source
                        bookList.Add(book1);
                    }
                    // Increment, as a book is found.
                    _index++;
                }
            }
            else if (_index == _lastIndex)
            {
                // The last index is reached: stop reading.
                break;
            }
            else
            {
                // Count the books that come before the start index.
                if (_xReader.MoveToContent() == XmlNodeType.Element &&
                    _xReader.Name == "pgterms:etext")
                {
                    _index++;
                }
                // Skip unwanted nodes to make searching efficient.
                _xReader.Skip();
            }
        }
    }
    catch (Exception ex)
    {
        // Error
        Console.WriteLine(ex.Message);
    }
    finally { _xReader.Close(); }
    // Check whether the list contains any book or not.
    if (bookList.Count > 0)
    {
        Console.WriteLine("Total number of books are " + bookList.Count);
        Console.WriteLine(" CPU usage is " + sampleCounter.CpuUsage);
        Console.WriteLine("Time taken " + sampleCounter.TimeTaken);
        // Return the whole list of books.
        return bookList;
    }
    // Out of range returns null.
    return null;
}
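A self-contained variant of the range lookup might look like the sketch below. It is not the author's getBooks: it returns the first attribute value of each book in the 1-based range instead of Book objects, and the names RangeSketch and GetIds, the sample XML, and the placeholder pgterms namespace URI are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class RangeSketch
{
    // Returns the first attribute value of each pgterms:etext element whose
    // 1-based position lies between startIndex and lastIndex, inclusive.
    public static List<string> GetIds(TextReader input, long startIndex, long lastIndex)
    {
        var ids = new List<string>();
        if (startIndex < 1 || lastIndex < startIndex) return ids; // empty range
        long seen = 0; // number of etext elements encountered so far
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF && seen < lastIndex)
            {
                XmlNodeType type = reader.MoveToContent();
                if (type != XmlNodeType.Element || reader.Name == "rdf:RDF")
                {
                    reader.Read();   // root element, end tags, text
                    continue;
                }
                if (reader.Name == "pgterms:etext")
                {
                    seen++;
                    if (seen >= startIndex)
                    {
                        // In range: load just this node; ReadNode advances the reader.
                        XmlNode node = new XmlDocument().ReadNode(reader);
                        ids.Add(node.Attributes.Count > 0 ? node.Attributes[0].Value : "");
                        continue;
                    }
                }
                reader.Skip();       // before the range, or not a book: skip subtree
            }
        }
        return ids;
    }

    // Hypothetical sample data; the pgterms URI is a placeholder.
    public const string Sample =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" " +
        "xmlns:pgterms=\"urn:example:pgterms\">" +
        "<pgterms:etext rdf:ID=\"etext1\"/><pgterms:etext rdf:ID=\"etext2\"/>" +
        "<pgterms:etext rdf:ID=\"etext3\"/><pgterms:etext rdf:ID=\"etext4\"/>" +
        "</rdf:RDF>";

    public static void Main()
    {
        List<string> ids = GetIds(new StringReader(Sample), 2, 3);
        Console.WriteLine(string.Join(", ", ids)); // prints etext2, etext3
    }
}
```

Returning an empty list for an out-of-range request, rather than null as in the listing above, is a design choice of this sketch; it spares callers a null check but hides the difference between "no books in range" and "bad arguments".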

Points to Remember

Please see the schema of the XML file for the type of document the application can be used with.
If you want to use these functions in your own application, you will need to make some changes in the GutenbergBookManager.cs class.

Make sure that the path of the XML file you want to parse is given in App.config. For more details, read the documents in the attached zip file.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Sandeep Akhare
Software Developer (Senior)
United States
Graduated in Electronics and Telecommunication.
Has been working in a software company for the last 30 months.
Technologies/languages interested in:
1. ASP.NET 2.0
2. AJAX 1.0
3. C# 3.0
4. JavaScript


Article Copyright 2006 by Sandeep Akhare