Parsing/Loading/Searching XML Document of Size ~ 1GB

Sandeep Akhare

8 Aug 2008

A console application that provides an efficient way to parse large XML files by using XmlReader together with a DOM object (XmlDocument)

Introduction

TechFestXmlSolution is a tool that helps developers parse large XML documents. The application is written for one particular XML file: the list of books that Project Gutenberg maintains in RDF format. It can search the XML file for a particular book id, get books by index and search for books by text.

Performance and Scalability Issue

Processing a large XML document with a DOM object causes high CPU, memory and bandwidth utilization, because the DOM loads the entire document into memory.

Appropriate design decisions can help you address many XML-related performance issues early. Choose the XML class that fits the job, consider combining XmlReader with XmlDocument, and use the MoveToContent and Skip methods of XmlReader. We can read files of up to 2 GB using XmlReader, as it only loads 4 KB buffers into memory.

Using the Code

In this application, XmlReader walks through the file node by node, and an XmlDocument object loads only the part of the XML that is needed. The XmlReader.MoveToContent() method checks whether the current node is a content node (non-white space text, CDATA, Element, EndElement, EntityReference, or EndEntity). If it is not, the reader skips ahead to the next content node or to the end of the file. The node types it skips over are ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace. The XmlReader.Skip() method skips the children of the current node, which we use for nodes that do not have to be searched. See the snippet of the code:

while (!_xReader.EOF)
{
    // Check the node type and node name, and match the first attribute
    // (the book id) against the id to search for. rdf:RDF is the root element.
    if ((_xReader.MoveToContent() == XmlNodeType.Element &&
         _xReader.Name == "pgterms:etext" &&
         _xReader.GetAttribute(0) == name) || _xReader.Name == "rdf:RDF")
    {
        if (_xReader.Name == "rdf:RDF")
        {
            // Root element: do not skip it, step into it and continue to read.
            Console.WriteLine(" Before finding the book  CPU usage  ->" + sampleCounter.CpuUsage);
            _xReader.Read();
        }
        else
        {
            // The node containing the id is found. Load only this node into
            // an XmlDocument, not the whole file, so that its data can be
            // read faster.
            Console.WriteLine(" After finding the book CPU usage  ->" + sampleCounter.CpuUsage);
            doc = new XmlDocument();
            // Read the current node (with its children) into memory.
            XmlNode xnode = doc.ReadNode(_xReader);
            // Check whether the element contains any attribute.
            if (xnode.Attributes.Count > 0)
            {
                // Initialise() of class Book fills in all of its variables.
                book1.Initialise(xnode);
            }
            // The book is found, so stop reading.
            break;
        }
    }
    else
    {
        // Skip the whole node (with its children) as it is not of use.
        _xReader.Skip();
        _xReader.MoveToContent();
    }
}
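The same pattern can be tried in isolation. The following is only a minimal sketch, not the code from the attached project: the class name FindBookSketch, the method FindBook and the tiny sample document are hypothetical, and the sample is a heavily simplified stand-in for the real Gutenberg catalog. It streams through the input with XmlReader, skips every book that does not match, and loads only the matching node with XmlDocument.ReadNode.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FindBookSketch
{
    // Hypothetical, simplified sample; the real catalog has many more fields.
    public const string SampleRdf =
@"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:dc='http://purl.org/dc/elements/1.1/'
         xmlns:pgterms='http://www.gutenberg.org/rdfterms/'>
  <pgterms:etext rdf:ID='etext1'><dc:title>First Book</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID='etext2'><dc:title>Second Book</dc:title></pgterms:etext>
</rdf:RDF>";

    // Returns the pgterms:etext node whose rdf:ID equals id, or null if none matches.
    public static XmlNode FindBook(TextReader input, string id)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // step inside the root element
            while (!reader.EOF)
            {
                if (reader.MoveToContent() == XmlNodeType.Element)
                {
                    if (reader.Name == "pgterms:etext" &&
                        reader.GetAttribute("rdf:ID") == id)
                    {
                        // Only this one node is loaded into the DOM.
                        return new XmlDocument().ReadNode(reader);
                    }
                    // Not the book we want: skip it with all its children.
                    reader.Skip();
                }
                else
                {
                    // End tag of the root (or similar): just move on.
                    reader.Read();
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        XmlNode book = FindBook(new StringReader(SampleRdf), "etext2");
        Console.WriteLine(book == null ? "not found" : book.InnerText); // prints "Second Book"
    }
}
```

Looking up the attribute by its qualified name instead of by position is a choice made for this sketch; it keeps working if the element ever carries more than one attribute.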

Similarly, the application contains two more functions: one gets books by index and the other searches for books by title, subject, publisher, etc. While searching, the function also writes its output to a text file in the background. The path of that text file is given in App.config. The function that gets books by index looks like this:

/// Function getBooks takes the start and end index as arguments, reads the books
/// in that range and returns them in a list of type Book.
public override List<Book> getBooks(long _startIndex, long _lastIndex)
{
    // Create the list with capacity for the number of books needed.
    List<Book> bookList = new List<Book>(Convert.ToInt32(_lastIndex - _startIndex + 1));
    // Local variable to count the books passed so far.
    int _index = 0;
    sampleCounter.StartTime = DateTime.Now;
    XmlReader _xReader = openXmlFile(_fileName);
    try
    {
        // Loop until the end of the file.
        while (!_xReader.EOF)
        {
            // Take the node if it is a book inside the requested range,
            // or if it is the root element.
            if ((_xReader.MoveToContent() == XmlNodeType.Element &&
                 _xReader.Name == "pgterms:etext" &&
                 _index > _startIndex - 2 && _index < _lastIndex) ||
                _xReader.Name == "rdf:RDF")
            {
                if (_xReader.Name == "rdf:RDF")
                {
                    _xReader.Read();
                }
                else
                {
                    // The index lies between the start index and the last index.
                    // Load the whole node into memory so that reading its data
                    // becomes easy.
                    doc = new XmlDocument();
                    XmlNode xnode = doc.ReadNode(_xReader);
                    // Create an instance of Book as the container.
                    Book book1 = new Book();
                    if (xnode.Attributes.Count > 0)
                    {
                        // Initialise the book and add it to the list.
                        book1.Initialise(xnode);
                        bookList.Add(book1);
                    }
                    // Increment, as a book was found.
                    _index++;
                }
            }
            else if (_index == _lastIndex)
            {
                // The last index is reached, so stop reading.
                break;
            }
            else
            {
                // The start index is not reached yet: count the book if the
                // current node is one, then skip the node.
                if (_xReader.MoveToContent() == XmlNodeType.Element &&
                    _xReader.Name == "pgterms:etext")
                {
                    _index++;
                }
                // Skip unwanted nodes to keep the search efficient.
                _xReader.Skip();
                _xReader.MoveToContent();
            }
        }
    }
    catch (Exception ex)
    {
        // Report the error instead of silently ignoring it.
        Console.WriteLine("Error while reading the XML file: " + ex.Message);
    }
    finally
    {
        _xReader.Close();
    }

    // Check whether the list contains any book.
    if (bookList.Count > 0)
    {
        Console.WriteLine("Total number of books are " + bookList.Count);
        Console.WriteLine(" CPU usage is " + sampleCounter.CpuUsage);
        Console.WriteLine("Time taken " + sampleCounter.TimeTaken);
        // Return the whole list of books.
        return bookList;
    }
    // Out of range: return null.
    return null;
}
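The range logic above can be hard to follow, so here is a simplified, self-contained sketch of the same idea. It assumes that the indexes are 1-based and inclusive, which is how the conditions in getBooks appear to behave. The class BookRangeSketch, the method GetBooks and the three-book sample document are hypothetical and are not part of the attached project.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class BookRangeSketch
{
    // Hypothetical, simplified sample with three books.
    public const string SampleRdf =
@"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:dc='http://purl.org/dc/elements/1.1/'
         xmlns:pgterms='http://www.gutenberg.org/rdfterms/'>
  <pgterms:etext rdf:ID='etext1'><dc:title>First Book</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID='etext2'><dc:title>Second Book</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID='etext3'><dc:title>Third Book</dc:title></pgterms:etext>
</rdf:RDF>";

    // Returns the books numbered startIndex..lastIndex (1-based, inclusive).
    public static List<XmlNode> GetBooks(TextReader input, long startIndex, long lastIndex)
    {
        var books = new List<XmlNode>();
        var doc = new XmlDocument();
        long position = 0; // number of pgterms:etext elements seen so far

        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // step inside the root element
            while (!reader.EOF && position < lastIndex)
            {
                if (reader.MoveToContent() != XmlNodeType.Element)
                {
                    reader.Read();  // e.g. the end tag of the root
                }
                else if (reader.Name != "pgterms:etext")
                {
                    reader.Skip();  // some other element: not counted
                }
                else
                {
                    position++;
                    if (position >= startIndex)
                    {
                        // Inside the range: load just this node into the DOM.
                        books.Add(doc.ReadNode(reader));
                    }
                    else
                    {
                        // Before the range: count the book but do not load it.
                        reader.Skip();
                    }
                }
            }
        }
        return books;
    }

    public static void Main()
    {
        List<XmlNode> books = GetBooks(new StringReader(SampleRdf), 2, 3);
        Console.WriteLine("Books found: " + books.Count);
        foreach (XmlNode book in books)
        {
            Console.WriteLine(book.InnerText);
        }
    }
}
```

Unlike getBooks, this sketch returns an empty list rather than null when the range lies outside the file, so the caller does not need a null check.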

Points to Remember

Please see the schema of the XML file to find out the type of document with which the application can be used.
If you want to use these functions in your own application, you need to make some changes in the GutenbergBookManager.cs class.

Make sure that the path of the XML file you want to parse is given in App.config. For more details, read the documents in the attached zip file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)