Parsing/Loading/Searching an XML Document of Size ~1 GB

By Sandeep Akhare

A console application that provides an efficient way to parse large XML files using XmlReader together with a DOM object
C# (C# 2.0, C#), Windows, .NET, .NET 2.0, XAML, ASP.NET, VS2005, Visual Studio, Dev

Posted: 27 Oct 2006
Updated: 8 Aug 2008

Introduction

TechFestXmlSolution is a tool that helps developers parse large XML documents. The application is specific to one particular XML file: the list of books that Project Gutenberg (http://www.gutenberg.org/) maintains in RDF format. It supports three operations: searching the XML file for a particular book ID, getting books by index, and searching for books by text.

Performance and scalability issues:

Processing a large XML document with a DOM object causes high CPU, memory, and bandwidth utilization.
Appropriate design decisions can help you address many XML-related performance issues early: choose the appropriate XML class for the job, consider combining XmlReader and XmlDocument, and use the MoveToContent and Skip methods on the XmlReader. We can read files of up to 2 GB with XmlReader because it loads only 4 KB buffers into memory at a time.
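
As a rough illustration of the MoveToContent and Skip advice above, the following minimal sketch counts the direct children of the root element that have a given name, without ever loading a record into memory. The class name, method name, and the tiny inline catalog (including its namespace URIs) are made up for this example and are not part of TechFestXmlSolution.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingCountSketch
{
    // Hypothetical two-record sample standing in for the large RDF file.
    public const string SampleCatalog =
@"<rdf:RDF xmlns:rdf=""http://www.w3.org/1999/02/22-rdf-syntax-ns#""
         xmlns:dc=""http://purl.org/dc/elements/1.1/""
         xmlns:pgterms=""http://www.gutenberg.org/rdfterms/"">
  <pgterms:etext rdf:ID=""etext1""><dc:title>Sample Book One</dc:title></pgterms:etext>
  <!-- a comment between records -->
  <pgterms:etext rdf:ID=""etext2""><dc:title>Another Sample</dc:title></pgterms:etext>
</rdf:RDF>";

    // Counts the direct children of the root element whose qualified name matches.
    public static int CountTopLevel(TextReader input, string qualifiedName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // step inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.Name == qualifiedName)
                    {
                        count++;
                    }
                    // Skip jumps over the whole subtree and leaves the reader
                    // on the node after the end tag, so no extra Read is needed.
                    reader.Skip();
                }
                else
                {
                    // Whitespace, comments, and the closing root tag: just advance.
                    reader.Read();
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        int n = CountTopLevel(new StringReader(SampleCatalog), "pgterms:etext");
        Console.WriteLine(n); // prints 2
    }
}
```

Because Skip already advances the reader past the skipped element, calling Read immediately afterwards would silently drop the next node; that is why the loop calls exactly one of the two on each pass.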

 

Using the Code

In this application we use XmlReader to move through the file node by node and an XmlDocument object to load only part of the XML. The XmlReader.MoveToContent() method checks whether the current node is a content node (non-white space text, CDATA, Element, EndElement, EntityReference, or EndEntity). If the node is not a content node, the reader skips ahead to the next content node or to the end of the file. It skips over nodes of the following types: ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace. The XmlReader.Skip() method skips the children of the current node, which we do not need to search. The following code snippet shows this:

while (!_xReader.EOF)
{
    // Check the node type and node name, and match the attribute holding
    // the id to search for. rdf:RDF is the root element.
    if ((_xReader.MoveToContent() == XmlNodeType.Element
            && _xReader.Name == "pgterms:etext"
            && _xReader.HasAttributes
            && _xReader.GetAttribute(0) == name)
        || _xReader.Name == "rdf:RDF")
    {
        if (_xReader.Name == "rdf:RDF")
        {
            // Root element: do not skip it; step inside and continue reading
            Console.WriteLine(" Before finding the book  CPU usage  ->" + sampleCounter.CpuUsage);
            _xReader.Read();
        }
        else
        {
            // The node containing the id is found. Load only this node
            // (not the whole file) into a document object, so that
            // getting the data of the node is fast.
            Console.WriteLine(" After finding the book CPU usage  ->" + sampleCounter.CpuUsage);
            doc = new XmlDocument();
            // Get the node in memory
            XmlNode xnode = doc.ReadNode(_xReader);
            // Check whether the element contains any attributes
            if (xnode.Attributes.Count > 0)
            {
                // Initialise sets all the fields of the Book object
                book1.Initialise(xnode);
            }
            // The whole book description is printed after initialising
            break;
        }
    }
    else
    {
        // Skip the whole node, as it is of no use
        _xReader.Skip();
        _xReader.MoveToContent();
    }
}
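
The snippet above relies on fields and classes (sampleCounter, book1, Book) that are defined elsewhere in the project. For readers who want to try the pattern in isolation, here is a self-contained sketch of the same idea: stream to the record with the requested id, then load only that record into an XmlDocument. The class and method names, the use of the rdf:ID attribute name, and the inline sample data are assumptions made for this example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FindByIdSketch
{
    // Hypothetical sample data; the real catalog is far larger.
    public const string SampleCatalog =
@"<rdf:RDF xmlns:rdf=""http://www.w3.org/1999/02/22-rdf-syntax-ns#""
         xmlns:dc=""http://purl.org/dc/elements/1.1/""
         xmlns:pgterms=""http://www.gutenberg.org/rdfterms/"">
  <pgterms:etext rdf:ID=""etext1""><dc:title>Sample Book One</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID=""etext2""><dc:title>Another Sample</dc:title></pgterms:etext>
</rdf:RDF>";

    // Returns the matching pgterms:etext element, or null if the id is not found.
    public static XmlElement FindEtext(TextReader input, string id)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // root element
            reader.Read();          // first node inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "pgterms:etext"
                    && reader.GetAttribute("rdf:ID") == id)
                {
                    // Only this one record is materialized as a DOM node.
                    XmlDocument doc = new XmlDocument();
                    return (XmlElement)doc.ReadNode(reader);
                }
                if (reader.NodeType == XmlNodeType.Element)
                {
                    reader.Skip(); // not the record we want: jump over it
                }
                else
                {
                    reader.Read(); // whitespace, comments, end tags
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        XmlElement etext = FindEtext(new StringReader(SampleCatalog), "etext2");
        Console.WriteLine(etext["dc:title"].InnerText); // prints Another Sample
    }
}
```

Looking the attribute up by name rather than by position (GetAttribute(0) in the article's code) is a design choice of this sketch; it avoids depending on attribute order but assumes the id is stored in rdf:ID.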

Similarly, the application contains two more functions: one for getting books by index and one for searching for books by title, subject, publisher, etc. While searching, these functions also write their output to a text file in the background; the path of the text file is given in App.config.
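
The text search function is not listed in this article, so the following is only a sketch of how such a search could be built on the same streaming pattern, not the project's actual implementation. It assumes the title lives in a dc:title child element, matches case-insensitively, and writes hits to any TextWriter instead of reading a file path from App.config. All names and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TextSearchSketch
{
    // Hypothetical sample data for the example.
    public const string SampleCatalog =
@"<rdf:RDF xmlns:rdf=""http://www.w3.org/1999/02/22-rdf-syntax-ns#""
         xmlns:dc=""http://purl.org/dc/elements/1.1/""
         xmlns:pgterms=""http://www.gutenberg.org/rdfterms/"">
  <pgterms:etext rdf:ID=""etext1""><dc:title>Sample Book One</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID=""etext2""><dc:title>Another Sample</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID=""etext3""><dc:title>Third Title</dc:title></pgterms:etext>
</rdf:RDF>";

    // Returns the rdf:ID of every record whose dc:title contains searchText,
    // and writes one "id<TAB>title" line per hit to the log writer.
    public static List<string> FindIdsByTitle(TextReader input, string searchText, TextWriter log)
    {
        List<string> ids = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "pgterms:etext")
                {
                    // Load one record at a time; ReadNode leaves the reader
                    // positioned just after the record's end tag.
                    XmlDocument doc = new XmlDocument();
                    XmlElement etext = (XmlElement)doc.ReadNode(reader);
                    XmlElement title = etext["dc:title"];
                    if (title != null
                        && title.InnerText.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        string id = etext.GetAttribute("rdf:ID");
                        ids.Add(id);
                        log.WriteLine(id + "\t" + title.InnerText);
                    }
                }
                else if (reader.NodeType == XmlNodeType.Element)
                {
                    reader.Skip(); // unrelated element: skip its subtree
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return ids;
    }

    public static void Main()
    {
        StringWriter log = new StringWriter();
        List<string> ids = FindIdsByTitle(new StringReader(SampleCatalog), "sample", log);
        Console.WriteLine(ids.Count); // prints 2
        Console.Write(log.ToString());
    }
}
```

In a real run, the log writer could be a StreamWriter opened on the configured output path; memory use stays bounded by the size of a single record because each XmlDocument is discarded after it has been checked.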

    /// getBooks takes a start index and an end index (1-based, inclusive)
    /// and returns the books in that range as a list of type Book.
    public override List<Book> getBooks(long _startIndex, long _lastIndex)
    {
        // Create the list with capacity for the number of books requested
        List<Book> bookList = new List<Book>(Convert.ToInt32(_lastIndex - _startIndex + 1));
        // Local variable counting the books passed so far
        int _index = 0;
        sampleCounter.StartTime = DateTime.Now;
        XmlReader _xReader = openXmlFile(_fileName);
        try
        {
            // Loop until the end of the file
            while (!_xReader.EOF)
            {
                // Either a book inside the requested range or the rdf:RDF root element
                if ((_xReader.MoveToContent() == XmlNodeType.Element && _xReader.Name == "pgterms:etext"
                        && _index > _startIndex - 2 && _index < _lastIndex) || _xReader.Name == "rdf:RDF")
                {
                    if (_xReader.Name == "rdf:RDF")
                    {
                        _xReader.Read();
                    }
                    else
                    {
                        // The index is within the range: load the whole node
                        // into memory so that reading its data is easy
                        doc = new XmlDocument();
                        XmlNode xnode = doc.ReadNode(_xReader);
                        // Create a Book instance as the container
                        Book book1 = new Book();
                        if (xnode.Attributes.Count > 0)
                        {
                            // Initialise the book and add it to the list
                            book1.Initialise(xnode);
                            bookList.Add(book1);
                        }
                        // Increment, as a book has been found
                        _index++;
                    }
                }
                else if (_index == _lastIndex)
                {
                    // The last index has been reached
                    break;
                }
                else
                {
                    // The start index has not been reached yet: count books and
                    // skip unwanted nodes to keep the search efficient
                    if (_xReader.MoveToContent() == XmlNodeType.Element && _xReader.Name == "pgterms:etext")
                    {
                        _index++;
                    }
                    _xReader.Skip();
                    _xReader.MoveToContent();
                }
            }
        }
        catch (Exception ex)
        {
            // Report the error instead of silently swallowing it
            Console.WriteLine("Error while reading books: " + ex.Message);
        }
        finally
        {
            _xReader.Close();
        }
        // Check whether the list contains any books
        if (bookList.Count > 0)
        {
            Console.WriteLine("Total number of books are " + bookList.Count);
            Console.WriteLine(" CPU usage is " + sampleCounter.CpuUsage);
            Console.WriteLine("Time taken " + sampleCounter.TimeTaken);
            // Return the whole list of books
            return bookList;
        }
        // Out of range: return null
        return null;
    }

Points to Remember

Please see the schema of the XML file; the application can be used only with files of that type.
If you want to use these functions in your own application, make the necessary changes in the GutenbergBookManager.cs class.
Make sure that the path of the XML file to be parsed is given in App.config. For more details, read the documents in the zip file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Sandeep Akhare


Graduated in Electronics and Telecommunication.
Has been working in a software company for the last 30 months.
Technologies/languages of interest:
1. ASP.NET 2.0
2. AJAX 1.0
3. C# 3.0
4. JavaScript
Occupation: Software Developer (Senior)
Location: India

Copyright 2006 by Sandeep Akhare