Split large XML files into small files

By Slava Khristich

Read an XML document of any size and split it into smaller supporting files.
C#, XML, Windows, .NET, Visual Studio (VS.NET2003, VS2005, VS2008), Dev
Posted:20 Nov 2008
Fig. 1

Introduction

Working with large XML files is not always an easy task. I am referring to files of 25 MB and more. One approach to processing such large XML files is to split the XML document into smaller files. It is a no-brainer if you just want to split a file into multiple files, but what if each partial file needs to be individually accessible to an XML parser or DOM? Then you need to make sure that each smaller file ends with a complete node, and that the next file begins at the next node.
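To illustrate why the cut point matters, here is a minimal sketch (not part of the attached project) that checks whether a piece of text parses as a complete XML document. The class name PartCheck and the method IsWellFormed are hypothetical names chosen for this example; XmlReader is the standard .NET reader.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PartCheck
{
    // Hypothetical helper: returns true when xmlText parses as a single,
    // well-formed XML document, false when the parser rejects it.
    public static bool IsWellFormed(string xmlText)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xmlText)))
            {
                // Reading to the end forces a full parse of the document.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            // Raised, for example, when the text ends in the middle of a node.
            return false;
        }
    }
}
```

A part that stops in the middle of a node, such as "<root><item>1</item><item>2", is rejected, while "<root><item>1</item></root>" is accepted. That is the property each split file has to keep.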

Background

This is a continuation of my previous article on how to deal with large XML documents: Large XML Files Processing and Indexing.

Using the code

Here is an idea of how you can do it. I tested the code with many different XML files, and it works for the majority of them. You may get an error if your split size is too small. Because the code reads the document line by line, the result also depends on how your XML is formatted.

Try to use the attached XML document as an example. Also, I have attached the results of this process: the files .part1, .part2, .part3.

Here is how we split the file and how it works:

  • Run the EXE as in Fig. 1.
  • Select a file to split (a large XML file).
  • Split the file (Fig2.JPG). The application first loads the XML file and then splits it: it calls ImportXMLDoc(false) and then SplitFile().

/// <summary>
/// Split file based on max size in MB
/// </summary>
private void SplitFile() {
    ImportXMLDoc(false);
    nodePathDic.Clear();
    if (string.IsNullOrEmpty(filePath)) {
        MessageBox.Show("Select XML File to split");
        return;
    }

    FileInfo fi = new FileInfo(filePath);
    double origFileSize = (double)fi.Length;
    numOfNewFiles = Math.Ceiling(origFileSize / maxFileSplitSize);
    string filePart = Application.StartupPath + "/" + fi.Name + 
                      ".part1" + fi.Extension;
    int fileCnt = 1;
    long writeFilePosition = 0;

    using (StreamReader sr = new StreamReader(filePath, Encoding.UTF8)) {
        int pos = 0;
        filePart = Application.StartupPath + "/" + fi.Name + 
                   ".part" + fileCnt + fi.Extension;
        //Read each line in XML document as regular file stream.
        StreamWriter sw = new StreamWriter(filePart, false);

        Regex rx = new Regex(@"<", RegexOptions.Compiled | 
                             RegexOptions.IgnoreCase);
        string nodeName = string.Empty;
        do {

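            string line = sr.ReadLine();
            if (line == null)
                break; // Empty file or end of stream: nothing left to read.
            pos += Encoding.UTF8.GetByteCount(line) + 2;
            // 2 extra bytes for the end-of-line characters (assumes CRLF).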

            MatchCollection m = rx.Matches(line);
            //Save index of this node into dictionary
            foreach (Match mt in m) {
                nodeName = line.Split(' ').Length == 0 ? 
                           line.Substring(1, line.LastIndexOf('>') - 1) : 
                           line.Split(new char[] { ' ' }, 
                           StringSplitOptions.RemoveEmptyEntries)[0];
                if (!nodeName.Contains("?xml") && 
                    !nodePathDic.ContainsKey(pos + mt.Index)) {
                    nodePathDic.Add(pos + mt.Index, nodeName);
                }
                break;
            }

            sw.WriteLine(line);
            sw.Flush();
            writeFilePosition = sw.BaseStream.Position;

            //If we are at the size limit for the current part, get
            //the last node, write it to this file,
            //and create a new split file.
            if (pos > maxFileSplitSize * fileCnt) {
                int lastNodeStartPosition = 0;
                string lastNodeName = string.Empty;
                string ln = string.Empty;
                string completeLastNode = GetLastNode(filePath, 
                       out lastNodeStartPosition, out lastNodeName);

                //Some synchronization. TODO: needs to be optimized 
                //but it works "AS IS"
                do {
                    //Skip rest of the node....
                    ln = sr.ReadLine();
                    if (ln == null)
                        break;

                    pos += Encoding.UTF8.GetByteCount(ln) + 2;
                } while (!ln.Contains(lastNodeName));

                //Get position where we will begin to read again in our 
                //original XML file. We want to skip to the end of last 
                //complete node we wrote to the file.
                long swPosition = (writeFilePosition - 
                                  (nodePathDic.Keys[nodePathDic.Count - 1] - 
                                   lastNodeStartPosition)) + 2;
                sw.BaseStream.Position = swPosition >= 0 ? swPosition : 0;
                sw.Write("\n");
                sw.WriteLine("<!-- End of " + Application.StartupPath + "/" + 
                             fi.Name + ".part" + fileCnt + fi.Extension + ". " + 
                             fileCnt + " out of " + numOfNewFiles + " -->");

                sw.WriteLine(completeLastNode + "\n\n");
                sw.WriteLine(nodePathDic.Values[0].Replace("<", "</"));

                filePart = Application.StartupPath + "/" + fi.Name + 
                           ".part" + (++fileCnt) + fi.Extension;
                sw.Flush();
                sw.Close();

                sw = new StreamWriter(filePart, false);
                sw.WriteLine(nodePathDic.Values[0]);
                sw.WriteLine("<!-- Start of " + Application.StartupPath + "/" + 
                             fi.Name + ".part" + fileCnt + fi.Extension + ". " + 
                             fileCnt + " out of " + numOfNewFiles + " -->");
                sw.Flush();
            }
        } while (!sr.EndOfStream);

        //Clean up...
        sw.Flush();
        sw.Close();
        sr.Close();
    }
}
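The bookkeeping in the listing above rests on two small calculations: how many part files are expected, and how many bytes each line occupies in the original file. The following is a minimal sketch of both, written as standalone helpers. The names SplitMath, PartCount, and LineBytes are hypothetical and do not appear in the attached project.

```csharp
using System;
using System.Text;

public static class SplitMath
{
    // Number of part files needed for a file of fileBytes bytes,
    // mirroring Math.Ceiling(origFileSize / maxFileSplitSize) in the listing.
    public static int PartCount(long fileBytes, long maxPartBytes)
    {
        if (maxPartBytes <= 0)
            throw new ArgumentOutOfRangeException("maxPartBytes");
        return (int)Math.Ceiling((double)fileBytes / maxPartBytes);
    }

    // Bytes one line occupies on disk in UTF-8, including its terminator.
    // newlineBytes is 2 for CRLF files (what the listing assumes) and 1 for LF.
    public static long LineBytes(string line, int newlineBytes)
    {
        return Encoding.UTF8.GetByteCount(line) + newlineBytes;
    }
}
```

For example, SplitMath.PartCount(25 * 1024 * 1024, 10 * 1024 * 1024) gives 3 parts. Note that the listing always adds 2 bytes per line; for a file with LF-only line endings the running position would drift by one byte per line, which may be one reason the result depends on the formatting of the input.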

Fig3.JPG

At the end of the run, you should have the same part files as those included in the attached Zip file.

Fig4.JPG

Let’s take a look at the output of this process:

At the end of each file, note the "<!-- End of ... -->" comment line and the complete last node. I added the comment as a visual marker. It can also be used later to join the documents back together (that will be the subject of my next article).
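As a rough idea of how that marker could be read back later, here is a small sketch that parses the end-of-part comment in the format the splitter writes ("<!-- End of " + path + ". " + part number + " out of " + total + " -->"). The class PartMarker and its TryParse method are hypothetical names for this example and are not taken from the attached project or from the follow-up article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PartMarker
{
    // Matches a line such as:
    // <!-- End of C:/app/data.xml.part1.xml. 1 out of 3 -->
    private static readonly Regex EndMarker = new Regex(
        @"^<!-- End of (?<path>.+)\. (?<index>[0-9]+) out of (?<total>[0-9]+) -->$");

    // Extracts the part path, the part number, and the total number of parts.
    // Returns false when the line is not an end-of-part marker.
    public static bool TryParse(string line, out string path,
                                out int index, out int total)
    {
        path = null;
        index = 0;
        total = 0;

        Match m = EndMarker.Match(line.Trim());
        if (!m.Success)
            return false;

        path = m.Groups["path"].Value;
        index = int.Parse(m.Groups["index"].Value);
        total = int.Parse(m.Groups["total"].Value);
        return true;
    }
}
```

A joining tool could use the part number and total to confirm that no part is missing before stitching the files together.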

Fig5.JPG

The next file will start where the last file ended.

Fig6.JPG

Note

The root node is written at the beginning and at the end of each document, so each output XML file should be usable in an XML DOM or a tool like XMLSpy.
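The same guarantee (every part wrapped in the root node and ending on a complete node) can also be reached with a different technique from the line-based code in this article: letting XmlReader find the node boundaries. The sketch below is only an illustration under simplifying assumptions. It works on an in-memory string, measures size in characters rather than bytes, drops attributes on the root element, and ignores text placed directly under the root. NodeAwareSplit, Split, and maxChars are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class NodeAwareSplit
{
    // Groups the child elements of the root into parts of roughly maxChars
    // characters. Each part repeats the root element so it parses on its own.
    public static List<string> Split(string xml, int maxChars)
    {
        var parts = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();          // position on the root element
            string rootName = reader.Name;
            var body = new StringBuilder();

            reader.Read();                   // step inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadOuterXml returns the whole child element and
                    // leaves the reader on the node that follows it.
                    body.Append(reader.ReadOuterXml());
                    if (body.Length >= maxChars)
                    {
                        parts.Add(Wrap(rootName, body.ToString()));
                        body.Length = 0;
                    }
                }
                else
                {
                    reader.Read();           // skip whitespace, comments, end tags
                }
            }

            if (body.Length > 0)
                parts.Add(Wrap(rootName, body.ToString()));
        }
        return parts;
    }

    private static string Wrap(string rootName, string body)
    {
        return "<" + rootName + ">" + body + "</" + rootName + ">";
    }
}
```

Calling NodeAwareSplit.Split("<root><a>1</a><b>2</b><c>3</c></root>", 10) yields two parts, "<root><a>1</a><b>2</b></root>" and "<root><c>3</c></root>". Because a part is only closed after a complete child element, no node is ever cut in half, at the cost of parts that can exceed the limit by up to one node.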

Enjoy. If you have any questions, post them here or send me an email.

History

  • Created on November 20th, 2008.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Slava Khristich


Software development is my passion as well as photography.


If you got a sec stop by to see my photography work at http://sk68.com


Occupation: Software Developer (Senior)
Company: Tateeda Media Networks
Location: United States

Last Updated: 20 Nov 2008
Editor: Smitha Vijayan
Copyright 2008 by Slava Khristich
Everything else Copyright © CodeProject, 1999-2009