I have this code that reads and processes a text file line by line. The problem is my text file has between 1.5 and 2 billion lines, and it is taking forever. Is there a way to process over 5 million lines at the same time?

Python
from cryptotools.BTC.HD import check, WORDS

with open("input.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                print(mnemonic)
                with open("print.txt", "a") as out_file:
                    out_file.write(mnemonic)
                    out_file.write("\n")


What I have tried:

I have tried the following, but keep getting TypeError: 'int' object is not iterable.

Python
import concurrent.futures as cf
from cryptotools.BTC.HD import check, WORDS

N_THREADS = 20
result = []

def doWork(data):
    for line in data:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):                
               result.append(mnemonic)

with open("input.txt", "r") as m_input:
    lines = m_input.readlines()
#the data for the threads will be here
#as a list of rows for each thread
m_data = {i: [] for i in range(N_THREADS)}
# zip(lines, len(lines)) raised the TypeError: len(lines) is a single int,
# which zip cannot iterate; enumerate pairs each line with its index
for n, l in enumerate(lines):
    m_data[n % N_THREADS].append(l)
'''
If you have to trim the number of threads uncomment these lines
m_data= { k:v for k, v in m_data.items() if len(v) != 0}
N_THREADS = N_THREADS if len(m_data) > N_THREADS else len(m_data)
if(N_THREADS == 0): 
    exit()
'''
with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for d in m_data.keys():
        tp.submit(doWork, m_data[d])  # data_t was undefined; the dict is named m_data
    
#work done
with open("print.txt", "w") as output:
    for item in result:
        output.write(f"{item}\n")
Posted
Updated 20-Aug-21 21:06pm
Comments
Patrice T 20-Aug-21 14:28pm
   
Please post the complete error message, including its position in the source code.

Quote:
problem is my text file has between 1,5-2 billion lines and it is taking forever.
The answer to this depends on the machine you are using as well. If your machine is slow (a dual-core or quad-core, perhaps), then this will take time. Note also that Python processes your code sequentially.

Then, the next bottleneck happens here:

for line in a_file:
    stripped_line = line.strip()
    for word in WORDS:

The complexity of your algorithm is O(N * M), where N is the number of lines you have and M is the number of WORDS from your library. I would suggest bringing this down to at least O(N + M), and then trying for O(N); you can introduce a hashtable or something similar to decrease the time taken.
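As a toy illustration of the hashtable idea (a minimal sketch; whether `check` can be rephrased this way depends on the cryptotools internals, so the word list below is a placeholder):

```python
# If the expensive per-line scan can be rephrased as a membership test,
# a set gives O(1) average lookups, so N lines cost O(N) instead of O(N * M).
valid_words = {"abandon", "ability", "able"}  # placeholder word list

lines = ["abandon", "zebra", "able"]
matches = [line for line in lines if line in valid_words]
print(matches)  # ['abandon', 'able']
```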

I really like what Solution 1 suggests: time your processing and see what can be improved, instead of throwing millions of records at it at once.

Quote:
Is there a way to process over 5 Million lines at the same time?
Of course, but they all require some "investment". The investment can either be in terms of money (you buy a bigger machine with more RAM and more CPU cores) or you invest time; which you are already doing. The first approach can be taken with different designs of computing, like MapReduce[^]. What these distributed systems do is they break the bigger problem down into subproblems, then solve these subproblems on different machines (or at least different cores).
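The decomposition can be sketched in plain Python (a toy word-count example, not a real MapReduce framework; in practice each map call would run on a separate machine or core):

```python
from functools import reduce

lines = ["alpha beta", "beta gamma", "alpha alpha"]

# Map phase: each "worker" turns its chunk into partial counts.
def map_chunk(chunk):
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Split the input into subproblems, one per (hypothetical) machine/core.
chunks = [lines[0:2], lines[2:3]]
partials = [map_chunk(c) for c in chunks]

# Reduce phase: merge the partial results into the final answer.
def merge(a, b):
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

totals = reduce(merge, partials, {})
print(totals)  # {'alpha': 3, 'beta': 2, 'gamma': 1}
```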

But as you will find out, those pathways are more complex than simply bringing the algorithm down to a suitable complexity range. :)
   
The basic answer is "no".
Although it is possible to use multithreading in Python[^], there isn't a processor on the planet that can handle 5,000,000 concurrent threads, which is what "process over 5 million lines at the same time" would require.

Each thread needs a free core in order to run, and most processors (even GPUs) don't have a hundredth of that. In addition, each thread generally requires 1MB of stack space just to run at all, so your threads would require a minimum of 5TB of physical RAM (in addition to however much your processing may need to do its work).

You may be able to speed processing up by multithreading (but keep the thread count down to at most the number of cores in your processor, preferably less), or by offloading the processing to a GPU[^] but you will never get the throughput you are asking for!
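A minimal sketch of that advice, capping the worker count at the number of cores with `ProcessPoolExecutor` (the per-line work here is a placeholder; the real `check`/`WORDS` logic would go in `process_chunk`):

```python
import concurrent.futures as cf
import os

def process_chunk(chunk):
    # Placeholder for the real per-line work (check/WORDS from cryptotools).
    return [line.upper() for line in chunk]

def chunks(seq, n):
    """Split seq into at most n roughly equal slices."""
    k = max(1, len(seq) // n)
    return [seq[i:i + k] for i in range(0, len(seq), k)]

if __name__ == "__main__":
    lines = ["one", "two", "three", "four"]
    # never spawn more workers than cores (or than chunks of work)
    n_workers = min(os.cpu_count() or 1, len(lines))
    results = []
    with cf.ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order when collecting results
        for part in pool.map(process_chunk, chunks(lines, n_workers)):
            results.extend(part)
    print(results)
```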
   
I'm not sure about doing it purely within your Python script, but if you can split the text into a number of chunks equal to your available threads, you can use batch scripts to launch an additional script for processing each chunk, which in turn can handle recombining the full job.

for instance, if you create a small batch script with '.cmd' as the extension, and simply make it run a given python script:
<pre lang="BAT">setlocal enableextensions
REM %~1 is the Python script, %~2 is the chunk file passed by the caller
start /min python.exe "%~1" "%~2" >NUL 2>&1


And then a '.bat' script to run it on each chunk concurrently:
BAT
@echo off
setlocal enableextensions
start /min call %~dp0runpython.cmd yourpythonscript.py chunk1.txt
start /min call %~dp0runpython.cmd yourpythonscript.py chunk2.txt


I have not tested this exact example, so you may possibly need to adjust the quoting if you see odd behavior; the safest bet is to leave out the quotes and use paths that don't contain spaces.
The main points are the "start /min call *.cmd", which calls the given cmd script and immediately moves on to the next command without waiting for it to return, and the output redirection: note that '2>&1' alone only merges stderr into stdout, so to actually discard the output you want '>NUL 2>&1' (printing output to the command prompt accounts for a *large* portion of many slowdowns when processing many items).
Alternatively, you can redirect with '>chunk1_finished.txt' or similar to send the output to a file, which is just as fast.
   
Quote:
I have this code that reads and processes text file line by line, problem is my text file has between 1,5-2 billion lines and it is taking forever.

Yes, that is to be expected: processing "between 1.5 and 2 billion lines" takes a long time.
Quote:
Is there a way to process over 5 Million lines at the same time?

Probably, but you need to tell us what this text file is, what the processing does, and what you do with the result.
As far as I can see:
- Variable-length text lines stored in a flat text file are size-efficient but not thread-friendly. To read a given line, you need to know the exact length of all previous lines, and thus read them all first.
- Appending results to a flat text file is not thread-friendly either. You would have to manage a sharing scheme to prevent two threads from writing to the file at the same time, and writes from the threads will arrive out of order, in case that matters.

Techniques exist to avoid these problems, but we need to know all the details of this data and its usage.
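For what it's worth, one such technique can be sketched as follows, assuming lines may be processed out of order: split the file into byte ranges, then let each worker seek to its range and realign on the next newline, so no line is ever split between workers:

```python
import os

def chunk_offsets(path, n_workers):
    """Compute (start, end) byte ranges aligned to line boundaries."""
    size = os.path.getsize(path)
    step = size // n_workers
    offsets = []
    with open(path, "rb") as f:
        start = 0
        for i in range(1, n_workers):
            f.seek(i * step)          # jump near the chunk boundary...
            f.readline()              # ...then skip ahead to the next newline
            end = f.tell()
            offsets.append((start, end))
            start = end
        offsets.append((start, size))
    return offsets

def read_range(path, start, end):
    """Each worker reads only its own byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

Each (start, end) pair can then be handed to a separate process; no worker has to read the lines before its own range, which avoids the "read all previous lines" cost described above.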
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
