I have this code that reads and processes a text file line by line. The problem is that my text file has between 1.5 and 2 billion lines, and it is taking forever. Is there a way to process over 5 Million lines at the same time?

Python
from cryptotools.BTC.HD import check, WORDS



with open("input.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                print(mnemonic)
                with open("print.txt", "a") as i:
                    i.write(mnemonic)
                    i.write("\n")


What I have tried:

I have tried the following but keep getting TypeError: 'int' object is not iterable.

Python
import concurrent.futures as cf
from cryptotools.BTC.HD import check, WORDS

N_THREADS = 20
result = []

def doWork(data):
    for line in data:
        tripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):                
               result.append(mnemonic)

m_input = open("input.txt", "r")
lines = [line for line in m_input]
#the data for the threads will be here
#as a list of rows for each thread
m_data= { i: [] for i in range(0, N_THREADS)} 
for l, n in zip(lines, len(lines)):
    m_data[n%N_THREADS].append(l)
'''
If you have to trim the number of threads uncomment these lines
m_data= { k:v for k, v in m_data.items() if len(v) != 0}
N_THREADS = N_THREADS if len(m_data) > N_THREADS else len(m_data)
if(N_THREADS == 0): 
    exit()
'''
with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for d in m_data.keys():
        tp.submit(doWork, data_t[d])
    
#work done
output = open("print.txt", "w")
for item in result:
    output.write(f"{item}\n")
output.close()
Comments
Patrice T 20-Aug-21 14:28pm    
Try to tell us the complete error message, including the position in the source code.

Quote:
The problem is that my text file has between 1.5 and 2 billion lines, and it is taking forever.
The answer to this depends on the machine that you are using as well. If your machine is slow (a dual-core or quad-core, perhaps), then this will take time. Note that Python processes your code sequentially, one step after another.

Then, the next bottleneck happens here:

for line in a_file:
    stripped_line = line.strip()
    for word in WORDS:

The complexity of your algorithm is O(N * M), where N is the number of lines you have and M is the number of WORDS from your library. I would suggest bringing this down to at least O(N + M), and then trying for O(N); you can introduce a hashtable or something similar to decrease the time taken.
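
To make the hashtable idea concrete, here is a minimal sketch that memoizes the per-line work in a dict. It assumes the input contains many duplicate lines, which may or may not be true for your data; if every line is unique, the cache only costs memory and saves nothing.

Python
from cryptotools.BTC.HD import check, WORDS

# Cache: stripped line -> mnemonics that passed check() for that line.
# This only pays off if the same line appears more than once in input.txt.
seen = {}

def mnemonics_for(stripped_line):
    if stripped_line not in seen:
        found = []
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                found.append(mnemonic)
        seen[stripped_line] = found
    return seen[stripped_line]

with open("input.txt", "r") as a_file, open("print.txt", "a") as out:
    for line in a_file:
        for mnemonic in mnemonics_for(line.strip()):
            out.write(mnemonic + "\n")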

I really like what Solution 1 suggests: time your processing and see what can be improved, instead of trying out millions of records at once.

Quote:
Is there a way to process over 5 Million lines at the same time?
Of course, but they all require some "investment". The investment can be either money (you buy a bigger machine with more RAM and more CPU cores) or time, which you are already investing. The first approach can be taken with different designs of computing, like MapReduce[^]. What these distributed systems do is break the bigger problem down into subproblems and then solve those subproblems on different machines (or at least different cores).

But as you will find out, those pathways are more complex than just bringing the algorithm down to a suitable complexity range. :)
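
As a single-machine illustration of the same split-then-combine idea, here is a rough sketch using the standard multiprocessing module. It assumes cryptotools is importable in the worker processes, and the chunk size of 100,000 lines is only a guess; tune it to your memory.

Python
import multiprocessing as mp
from cryptotools.BTC.HD import check, WORDS

def process_chunk(lines):
    # "map" step: each worker processes one slice of the file
    hits = []
    for line in lines:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                hits.append(mnemonic)
    return hits

def chunks(path, size=100_000):
    # stream the file so the billions of lines never sit in RAM at once
    with open(path, "r") as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == size:
                yield buf
                buf = []
        if buf:
            yield buf

if __name__ == "__main__":
    with mp.Pool() as pool, open("print.txt", "w") as out:
        # "reduce" step: collect each worker's results and write them once
        for hits in pool.imap_unordered(process_chunk, chunks("input.txt")):
            out.writelines(m + "\n" for m in hits)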
 
Share this answer
 
The basic answer is "no".
Although it is possible to use multithreading in Python[^], there isn't a processor on the planet that can handle 5,000,000 concurrent threads, which is what you would need in order to "process over 5 Million lines at the same time".

Each thread needs a free core in order to run, and most processors (even GPUs) don't have a hundredth of that. In addition, each thread generally requires 1 MB of stack space just to run at all, so your threads would require a minimum of 5 TB of physical RAM installed (in addition to however much your processing may need to do its work)!

You may be able to speed processing up by multithreading (but keep the thread count down to at most the number of cores in your processor, preferably fewer), or by offloading the processing to a GPU[^], but you will never get the throughput you are asking for!
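
For reference, here is a minimal sketch of sizing the pool from the machine instead of hard-coding a thread count; os.cpu_count() stands in for "the number of cores in your processor", and the dummy do_work task is a placeholder, not part of the original code.

Python
import os
import concurrent.futures as cf

# Never run more workers than there are cores; fewer is often better.
N_WORKERS = min(20, os.cpu_count() or 1)

def do_work(chunk_id):
    # placeholder for the real per-chunk processing
    return f"chunk {chunk_id} done"

with cf.ThreadPoolExecutor(max_workers=N_WORKERS) as tp:
    for result in tp.map(do_work, range(N_WORKERS)):
        print(result)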
 
Share this answer
 
I'm not sure about doing it purely within your Python script, but if you can split the text into a number of chunks equal to your available threads, then you can use batch scripts to call an additional script that processes each chunk, which in turn can handle the re-combination of your full job.

For instance, you can create a small batch script with '.cmd' as the extension and simply make it run a given Python script:
BAT
setlocal enableextensions
python.exe "%~1" "%~2" > nul 2>&1


And then a '.bat' script to run it on each chunk concurrently:
BAT
@echo off
setlocal enableextensions
start /min call %~dp0runpython.cmd yourpythonscript.py chunk1.txt
start /min call %~dp0runpython.cmd yourpythonscript.py chunk2.txt


I have not tested this exact example, so you may possibly need to adjust the quoting if you experience odd behavior. Your best bet would be to leave out the quotes and use paths that don't contain any spaces.
The main points are the "start /min call *.cmd", which calls the given cmd script and immediately moves on to the next command without waiting for it to return, and the '> nul 2>&1', which forwards all command output to the void (output in the command prompt accounts for a *large* portion of many slowdowns when processing many items).
Alternatively, you can do '>chunk1_finished.txt' or similar to output to a file, which is just as fast.
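
The splitting step itself is not shown above; the following rough Python sketch deals input.txt out into chunk1.txt, chunk2.txt, ... so the file names line up with the .bat example (the chunk count of 2 is only an assumption to match it).

Python
# Split input.txt into N_CHUNKS files for the batch scripts above.
N_CHUNKS = 2

outputs = [open(f"chunk{i + 1}.txt", "w") for i in range(N_CHUNKS)]
try:
    with open("input.txt", "r") as src:
        for n, line in enumerate(src):
            outputs[n % N_CHUNKS].write(line)   # deal the lines out round-robin
finally:
    for f in outputs:
        f.close()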
 
Share this answer
 
Quote:
I have this code that reads and processes a text file line by line. The problem is that my text file has between 1.5 and 2 billion lines, and it is taking forever.

Yes, that is to be expected: processing "between 1.5 and 2 billion lines" takes a long time.
Quote:
Is there a way to process over 5 Million lines at the same time?

Probably, but you need to tell us what this text file is, what the processing does, and what you do with the result.
As far as I can see:
- Variable-length text lines stored in a flat text file are size efficient, but they are not thread friendly. To read a given line, you need to know the exact length of all the previous lines, and thus read them.
- Appending results to a flat text file is not thread friendly either. You will have to manage a sharing scheme to prevent two threads from writing to the file at the same time, and writes from the threads will be out of order, in case that matters (see the sketch below for one possible scheme).

Techniques exist to avoid those problems, but we need to know all the details of this data and how it is used.
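
To illustrate, here is a minimal sketch of one such sharing scheme: a single threading.Lock that serialises appends to the output file. The helper name is made up, and re-opening the file on every write is only to keep the example short.

Python
import threading

write_lock = threading.Lock()

def append_result(path, mnemonic):
    # Only one thread may append at a time; the order of lines is still arbitrary.
    with write_lock:
        with open(path, "a") as out:
            out.write(mnemonic + "\n")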
 
Share this answer
 
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
