I am looking for the best way to parse over a million XML files ranging in size from 2KB to 10MB. In total, the files add up to somewhere in the neighborhood of 500GB. The application collects data from various nodes throughout each file and shoves it into a Postgres database schema. I wrote Python code using etree that did this a long time ago, when the number of XML files was much smaller, but now it takes close to a week to process. Any idea on the best strategy to scale this? If I could process these in a day or two it would be a huge improvement.

What I have tried:

Python
import os
import threading
import Queue
from xml.dom import minidom

class ParseJob(threading.Thread):
    def __init__(self, path_queue, stopper):
        super(ParseJob, self).__init__()
        self.path_queue = path_queue
        self.stopper = stopper

    def run(self):
        # Pull file paths off the queue until it is empty or the stop event is set.
        while not self.stopper.is_set():
            try:
                path = self.path_queue.get_nowait()
            except Queue.Empty:
                break
            try:
                with open(path, 'rb') as f:
                    xmldoc = minidom.parse(f)
                parseFunc(xmldoc)
            finally:
                # Mark the item done even if parsing fails, so queue.join() can return.
                self.path_queue.task_done()

def parseFunc(xmldoc):
    ## does all the parsing
    pass

def main():
    path_queue = Queue.Queue()
    xml_dir = ''  ## path to xml files
    for name in os.listdir(xml_dir):
        path_queue.put(os.path.join(xml_dir, name))
    stopper = threading.Event()
    num_workers = 8
    threads = list()
    for i in range(num_workers):
        job = ParseJob(path_queue, stopper)
        threads.append(job)
        job.start()
    path_queue.join()  # block until every queued file has been processed

main()
Comments
Patrice T 9-Feb-16 16:58pm    
With Python, tabs are part of the meaning of the code.
Never remove tabs from Python code.
Member 12316683 9-Feb-16 18:49pm    
I understand that, and apologize. I am still trying to get used to copying and pasting from my text editor to this site as it appeared the tabs were automatically removed. This is actually my first visit to codeproject.com.
Patrice T 9-Feb-16 18:58pm    
The way you copy/pasted your code in the solution was right. I got it from there.

I would probably have a component that manages a queue, loading it with the file paths to process. That component would hand out a path to a worker thread, which does all the work for that single file. You can have as many of these worker threads running as there are cores available in the CPU. When a worker is done with a file, it goes back to the queue and gets another one.

As the files are done being processed, it'd probably be a good idea to delete them or move them to some archive storage if the data is important enough. The thing you have to keep in mind when designing this is what happens if the system crashes or something during processing. How is the algorithm going to recover from an interrupted job?

It'll still take a long time to get through all the files, but you'll definitely cut down the job time to closer to a day depending on the number of cores you can keep busy.
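
A very rough sketch of that design might look like the following. The parse_one body and XML_DIR are only placeholders for your own parseFunc and file location, and I've used a multiprocessing pool rather than raw threads, since CPython's GIL would otherwise keep CPU-bound parsing on a single core:

Python
import os
import multiprocessing
from xml.etree import ElementTree

XML_DIR = ''  # placeholder: path to the XML files

def parse_one(path):
    # Worker: parse a single file and return (path, error) so the parent
    # process can record successes/failures for crash recovery.
    try:
        tree = ElementTree.parse(path)
        # ... pull out the nodes you need and write them to Postgres ...
        return (path, None)
    except Exception as exc:
        return (path, str(exc))

def main():
    paths = [os.path.join(XML_DIR, name)
             for name in os.listdir(XML_DIR) if name.endswith('.xml')]
    # One worker process per core; chunksize keeps queue overhead low
    # when there are a million paths to hand out.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    for path, error in pool.imap_unordered(parse_one, paths, chunksize=100):
        if error is None:
            pass  # e.g. move the file to an archive folder, or log it as done
        else:
            print('failed: %s (%s)' % (path, error))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()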
 
 
Comments
Member 12316683 9-Feb-16 16:54pm    
Thanks Dave. I updated my question with an attempt at multi-threading using a queue. I haven't yet implemented any crash handling, but this runs. After monitoring it for a while I definitely think there has been a performance increase, but I will wait and see if it runs all the way through before drawing conclusions.

If cutting that time short were so simple, why do you think there is an entire subject called "Data Structures and Algorithms[^]"? :-) The data is expected to grow, but you, as the database or system administrator, are expected to make sure the process keeps running the way it was meant to, not to let it keep grinding on for weeks.

There are many ways to cut that time down. I will give you the main points in a list, but I hope you will actually follow them, because there is no "other way": you are expected to follow these rules to improve the runtime.

1. Change the language! Python is not fast at all. Did you know that Python is an interpreted language? That makes it much slower.
- Use something like C++. It supports the same paradigm, and I am sure a prebuilt XML library is already available on CodeProject, GitHub or elsewhere.
2. Change how you arrange the data.
- How the data is laid out matters a great deal. Having a mix of small files and large files is also a problem: after each file, the program has to release the memory and load the next one. Find an alternative to having the data scattered across so many small chunks.
3. Increase the CPU speed. You don't want to do a supercomputer's job on a personal computer; that doesn't make any sense.
4. Think again!

The most important tool here is common sense. Your data spans 500GB. Why? And when you want to query the data, why do you need to query all of it? These are the things to consider when updating your data structures, your algorithms and, if necessary, your hardware.

Otherwise, the most you will shave off this runtime is about a day (which still leaves six!) and nothing more.
 
 
When it comes to speed optimization, the tool of choice is the profiler.
The profiler tells you how much time is spent in each part of your code and points out the bottlenecks.
From your piece of code, one can guess that most of the time is spent in parseFunc, the code you didn't show.
Even if the choice of language affects the efficiency of your code, the way you design your code has an even more dramatic effect.

To see whether the runtime can be improved, we need to know exactly what you do with each file and which queries you are answering.
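
For instance, a quick way to see where the time actually goes is to run parseFunc on a handful of representative files under cProfile. This is only a sketch; the parseFunc stub and the sample file names are stand-ins for your own code and data:

Python
import cProfile
import pstats
from xml.dom import minidom

def parseFunc(xmldoc):
    pass  # your real parsing code goes here

def profile_sample(paths):
    # Run the parser over a few representative files under the profiler.
    for path in paths:
        parseFunc(minidom.parse(path))

sample = ['sample1.xml', 'sample2.xml']  # a few typical small and large files
cProfile.run('profile_sample(sample)', 'parse.prof')
stats = pstats.Stats('parse.prof')
stats.sort_stats('cumulative').print_stats(20)  # the 20 biggest time sinks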
 