Split large XML files into small files

Slava Khristich, 9 Jan 2012 CPOL
Read an XML document of any size and split it into smaller supporting files.

Fig. 1

Introduction

Working with large XML files is not always an easy task. By large, I mean files of 25 MB and more. One approach to processing such files is to split the XML document into smaller files. That is a no-brainer if you just want to cut a file into multiple pieces, but what if each partial file must be readable on its own by an XML parser or DOM? Then you need to make sure that each smaller file ends with a complete node and that the next file begins with the next node.

Background

This is the continuation of my previous topic on how to deal with large XML documents: Large XML Files Processing and Indexing.

Using the code

Here is one way to do it. I tested the code with many different XML files, and it works for the majority of them. You may get an error if your split size is too small. The result also depends on how your XML is formatted, because the code reads the document line by line.
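For comparison, here is a minimal sketch of a variant that does not depend on line formatting. It is not the code from this article: it swaps the line-based StreamReader for XmlReader and XmlWriter, and it copies whole child elements of the root into each part. The names XmlSplitSketch, Split, and maxCharsPerPart are hypothetical, and the sketch assumes a root element without attributes or namespace prefixes (both would be dropped or rejected as written).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlSplitSketch
{
    // Copies complete child elements of the root into parts. A part is closed
    // once its text reaches maxCharsPerPart characters (hypothetical budget).
    // Every part repeats the root start and end tags, so it parses on its own.
    public static List<string> Split(string xml, int maxCharsPerPart)
    {
        var parts = new List<string>();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();            // position on the root element
            string rootName = reader.Name;     // assumption: no namespace prefix
            reader.Read();                     // step inside the root
            StringBuilder sb = null;
            XmlWriter writer = null;
            while (!reader.EOF)
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    reader.Read();             // skip whitespace, comments, end tags
                    continue;
                }
                if (writer == null)
                {
                    sb = new StringBuilder();
                    writer = XmlWriter.Create(sb, settings);
                    writer.WriteStartElement(rootName);
                }
                // Copies the whole element and moves the reader past its end tag.
                writer.WriteNode(reader, false);
                writer.Flush();
                if (sb.Length >= maxCharsPerPart)
                {
                    writer.WriteEndElement();  // close the repeated root
                    writer.Close();
                    parts.Add(sb.ToString());
                    writer = null;
                }
            }
            if (writer != null)                // last, partially filled part
            {
                writer.WriteEndElement();
                writer.Close();
                parts.Add(sb.ToString());
            }
        }
        return parts;
    }
}
```

A call such as XmlSplitSketch.Split("<root><a>1</a><b><c/></b><d>3</d></root>", 20) yields two parts, each of which loads into an XmlDocument. For real files you would stream from and to disk instead of using strings; strings keep the sketch short.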

Try the attached XML document as an example. I have also attached the results of this process: the files .part1, .part2, and .part3.

Here is how we split the file and how it works:

  • Run the EXE as in Fig. 1.
  • Select a file to split (a large XML file).

    Fig2.JPG

  • Split the file. SplitFile() first loads the selected XML file by calling ImportXMLDoc(false), then writes the parts:

/// <summary>
/// Split file based on max size in MB
/// </summary>
private void SplitFile() {
    ImportXMLDoc(false);
    nodePathDic.Clear();
    if (string.IsNullOrEmpty(filePath)) {
        MessageBox.Show("Select XML File to split");
        return;
    }

    FileInfo fi = new FileInfo(filePath);
    double origFileSize = (double)fi.Length;
    numOfNewFiles = Math.Ceiling(origFileSize / maxFileSplitSize);
    string filePart = Application.StartupPath + "/" + fi.Name + 
                      ".part1" + fi.Extension;
    int fileCnt = 1;
    long writeFilePosition = 0;

    using (StreamReader sr = new StreamReader(filePath, Encoding.UTF8)) {
        int pos = 0;
        filePart = Application.StartupPath + "/" + fi.Name + 
                   ".part" + fileCnt + fi.Extension;
        //Read each line in XML document as regular file stream.
        StreamWriter sw = new StreamWriter(filePart, false);

        Regex rx = new Regex(@"<", RegexOptions.Compiled | 
                             RegexOptions.IgnoreCase);
        string nodeName = string.Empty;
        do {

            string line = sr.ReadLine();
            if (line == null)
                break; // empty file or end of stream

            pos += Encoding.UTF8.GetByteCount(line) + 2;
            // 2 extra bytes for the end-of-line chars (assumes CR+LF).

            MatchCollection m = rx.Matches(line);
            //Save index of this node into dictionary.
            //Only the first '<' on the line is indexed.
            if (m.Count > 0) {
                //Node name is the first whitespace-delimited token,
                //e.g. "<item>", or "<item" when attributes follow.
                nodeName = line.Split(new char[] { ' ' }, 
                           StringSplitOptions.RemoveEmptyEntries)[0];
                if (!nodeName.Contains("?xml") && 
                    !nodePathDic.ContainsKey(pos + m[0].Index)) {
                    nodePathDic.Add(pos + m[0].Index, nodeName);
                }
            }

            sw.WriteLine(line);
            sw.Flush();
            writeFilePosition = sw.BaseStream.Position;

            //If we are at the size limit for the current part, 
            //get the last node, write it to this file, 
            //and create a new split file.
            if (pos > maxFileSplitSize * fileCnt) {
                int lastNodeStartPosition = 0;
                string lastNodeName = string.Empty;
                string ln = string.Empty;
                string completeLastNode = GetLastNode(filePath, 
                       out lastNodeStartPosition, out lastNodeName);

                //Some synchronization. TODO: needs to be optimized 
                //but it works "AS IS"
                do {
                    //Skip rest of the node....
                    ln = sr.ReadLine();
                    if (ln == null)
                        break;

                    pos += Encoding.UTF8.GetByteCount(ln) + 2;
                } while (!ln.Contains(lastNodeName));

                //Get position where we will begin to read again in our 
                //original XML file. We want to skip to the end of last 
                //complete node we wrote to the file.
                long swPosition = (writeFilePosition - 
                                  (nodePathDic.Keys[nodePathDic.Count - 1] - 
                                   lastNodeStartPosition)) + 2;
                sw.BaseStream.Position = swPosition >= 0 ? swPosition : 0;
                sw.Write("\n");
                sw.WriteLine("<!-- End of " + Application.StartupPath + "/" + 
                             fi.Name + ".part" + fileCnt + fi.Extension + ". " + 
                             fileCnt + " out of " + numOfNewFiles + " -->");

                sw.WriteLine(completeLastNode + "\n\n");
                sw.WriteLine(nodePathDic.Values[0].Replace("<", "</"));

                filePart = Application.StartupPath + "/" + fi.Name + 
                           ".part" + (++fileCnt) + fi.Extension;
                sw.Flush();
                sw.Close();

                sw = new StreamWriter(filePart, false);
                sw.WriteLine(nodePathDic.Values[0]);
                sw.WriteLine("<!-- Start of " + Application.StartupPath + "/" + 
                             fi.Name + ".part" + fileCnt + fi.Extension + ". " + 
                             fileCnt + " out of " + numOfNewFiles + " -->");
                sw.Flush();
            }
        } while (!sr.EndOfStream);

        //Clean up. The using block disposes sr.
        sw.Flush();
        sw.Close();
    }
}
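The listing relies on two pieces of arithmetic: the number of parts is the original size divided by the maximum part size, rounded up, and the read position advances by the UTF-8 byte count of each line plus the line terminator. The sketch below isolates both so they can be checked by hand. The class and method names are hypothetical, and the newlineBytes parameter is my addition: the listing hard-codes 2, which matches CR+LF line endings and drifts by one byte per line on LF-only files.

```csharp
using System;
using System.Text;

public static class SplitArithmeticSketch
{
    // Number of part files for a given original size and per-part limit,
    // both in bytes. Mirrors Math.Ceiling(origFileSize / maxFileSplitSize).
    public static int PartCount(long originalSizeBytes, long maxPartSizeBytes)
    {
        return (int)Math.Ceiling((double)originalSizeBytes / maxPartSizeBytes);
    }

    // Bytes one line occupies on disk: UTF-8 payload plus the line terminator
    // (2 for CR+LF, 1 for LF only).
    public static int LineByteLength(string line, int newlineBytes)
    {
        return Encoding.UTF8.GetByteCount(line) + newlineBytes;
    }
}
```

For example, a 25 MB file (26,214,400 bytes) with a 10 MB limit (10,485,760 bytes) gives PartCount = 3. A line such as "<näme>" is 6 characters but 7 bytes in UTF-8, so LineByteLength returns 9 with CR+LF; counting characters instead of bytes would misplace the split position.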

Fig3.JPG

At the end of the run, you should have the same part files that are included in the attached Zip file.

Fig4.JPG

Let’s take a look at the output of this process:

At the end of each file, note the "<!-- End of ... -->" comment line and the complete last node. I added the comment as a visual marker. It can also be used later to join the documents back together (that will be covered in my next article).

Fig5.JPG

The next file will start where the last file ended.

Fig6.JPG

Note

Root nodes are at the beginning and at the end of each document, so each output XML file should be usable in an XML DOM or a tool like XMLSpy.
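One way to confirm this for your own output is to stream each part through XmlReader and see whether it parses to the end. This is a minimal sketch rather than part of the attached project; PartCheckSketch and IsWellFormed are hypothetical names.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PartCheckSketch
{
    // Returns true when the input parses as well-formed XML from start to end.
    // Default reader settings prohibit DTDs, so a document with a DOCTYPE
    // is reported as not well-formed here.
    public static bool IsWellFormed(TextReader input)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(input))
            {
                while (reader.Read()) { }   // streaming read, no DOM is built
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For a part file on disk, pass File.OpenText(path) for each generated part. The listing above names parts as the original file name plus ".part" plus the part number plus the original extension, for example data.xml.part1.xml for an input called data.xml (an illustrative file name).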

Enjoy. If you have any questions, post them here or send me an email.

History

  • Created on November 20, 2008.
  • Jan 09, 2011: I changed the logic for ending the nodes of one file and starting the nodes of the next. This version is more robust, is based on .NET 4.0, and includes xsd.exe to generate the XML file schema.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Slava Khristich
Software Developer (Senior) Tateeda Media Networks
United States

Software development is my passion as well as photography.


If you have a second, stop by to see my photography work at http://sk68.com


Tateeda Media Network

Comments and Discussions

 
  • Bug: Special Char Read or Write (Ageng Dwi Prastyawan, 22-Mar-14 10:32)
  • Question: Split 8GB xml file (el0769428-Nov-13 2:47)
  • Answer: Re: Split 8GB xml file (Member 174238123-Jan-14 21:29)
  • Question: Message Automatically Removed (gupta.anil, 14-Nov-13 4:51)
  • Question: XML source that DOES NOT have newline characters (Anthony Concialdi, 2-Dec-11 12:42)
  • Question: Code (Lisa1211, 30-Nov-11 10:31)
  • Answer: Re: Code (Slava Khristich, 30-Nov-11 16:42)
  • General: Re: Code (Slava Khristich, 1-Dec-11 6:37)
  • General: Re: Code (Lisa1211, 28-Dec-11 12:01)
  • Question: swdish signs converts to junk (Member 842483222-Nov-11 11:04)
  • Answer: Re: swdish signs converts to junk (Slava Khristich, 28-Dec-11 13:50)
  • Question: Unable to find a version of the runtime to run this application (thomasKreiller, 7-Sep-11 13:50)
  • Answer: Re: Unable to find a version of the runtime to run this application (Slava Khristich, 7-Sep-11 15:12)
  • Question: How to split 3GB xml file? (sophie1228, 2-Sep-11 10:29)
  • Answer: Re: How to split 3GB xml file? (Slava Khristich, 7-Sep-11 15:09)
  • General: My vote of 1 (gonzalovm1, 20-Jul-11 12:20)
  • Answer: Re: My vote of 1 (Slava Khristich, 26-Jul-11 7:52)
  • General: xml splitter (yosiasz, 25-Mar-11 9:24)
  • General: Re: xml splitter (Slava Khristich, 25-Mar-11 10:24)
  • General: unable to find nodePathDic declration and GetLstNode function (gupta.anil, 9-Sep-10 20:08)
  • General: Exactly what I was looking for, but a small issue, OutOfMemoryException (Member 33486929-Dec-09 14:11)
  • General: vtd-xml is ideally suited for splitting xml doc (Jimmy Zhang, 21-Nov-09 14:03)
  • General: Nested XML (John Norrby, 11-Nov-09 1:15)
  • Question: How to download the demo ? (jl8313-Aug-09 20:50)
  • General: I am getting error while split. My file size is 120MB. (testsrini11-Apr-09 22:06)

    See the end of this message for details on invoking
    just-in-time (JIT) debugging instead of this dialog box.

    ************** Exception Text **************
    System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
    Parameter name: index
    at System.ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument argument, ExceptionResource resource)
    at System.Collections.Generic.SortedList`2.GetKey(Int32 index)
    at System.Collections.Generic.SortedList`2.KeyList.get_Item(Int32 index)
    at SplitXMLIntoFiles.Main.SplitFile()
    at SplitXMLIntoFiles.Main.btnSplit_Click(Object sender, EventArgs e)
    at System.Windows.Forms.Control.OnClick(EventArgs e)
    at System.Windows.Forms.Button.OnClick(EventArgs e)
    at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
    at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)

  • General: Link is broken (Member 1498348, 7-Apr-09 11:16)
  • General: Sorce code for this demo [modified] (Slava Khristich, 6-Apr-09 9:11)
  • General: Re: Sorce code for this demo [modified] (Lisa1211, 30-Nov-11 13:27)
  • General: Thanks (gmanunta8181, 10-Mar-09 6:03)
  • General: please help... (ashutoshctsk, 4-Mar-09 20:52)
  • General: Source Code (denisa, 4-Mar-09 5:16)
  • Question: Am I missing Something? (Bill Riehemann, 3-Mar-09 10:48)
  • General: Some missing functions (Rizwan Bashir, 24-Feb-09 3:56)
  • General: hi (hirunda, 18-Feb-09 17:05)
  • General: Question (KLKurakula, 4-Dec-08 7:30)
  • General: Re: Question (Slava Khristich, 10-Dec-08 11:32)
  • General: Re: Question (Slava Khristich, 10-Dec-08 11:46)
  • General: Innovative (ashu fouzdar, 25-Nov-08 1:16)
  • General: Interesting (Jose M. Menendez Poó, 20-Nov-08 17:12)

Last Updated 9 Jan 2012
Article Copyright 2008 by Slava Khristich