Quote: problem is my text file has between 1,5-2 billion lines and it is taking forever.

The answer to this depends on the machine you are using as well. If your machine is slow (a dual-core or quad-core, perhaps), then this will take time. Note also that Python processes your code sequentially, one line at a time.

Then, the next bottleneck happens here:

```python
for line in a_file:
    stripped_line = line.strip()
    for word in WORDS:
```

The complexity of your algorithm is O(N * M), where N is the number of lines in your file and M is the number of entries in `WORDS`. I would suggest bringing this down to at least O(N + M), and then trying for O(N): you can introduce a hash table (a Python `set` or `dict`) so that each membership check takes constant time on average, instead of scanning the whole word list for every line. I also really like what Solution 1 suggests: time your process first and see what can be improved, instead of throwing millions of records at it at once.
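To make the hash-table idea concrete, here is a minimal sketch. The `WORDS` values and the in-memory file are stand-ins for your actual data; the point is that `token in WORDS` against a `set` is an O(1) average-time lookup, so the whole pass over the file becomes O(N):

```python
import io

# Hypothetical stand-ins for your data. Making WORDS a set (hash table)
# means each membership test is O(1) on average, not a scan of the list.
WORDS = {"alpha", "beta", "gamma"}

# In your real script this would be open("your_file.txt"); an in-memory
# file keeps the sketch self-contained and runnable.
a_file = io.StringIO("alpha delta\nbeta beta\nepsilon\n")

matches = 0
for line in a_file:
    stripped_line = line.strip()
    # Check each token of the line against the set, instead of looping
    # over every entry of WORDS for every line.
    for token in stripped_line.split():
        if token in WORDS:
            matches += 1
```

With a list, the inner check would cost O(M) per line; with the set it is effectively constant, which is exactly the O(N * M) to O(N) reduction described above.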

Quote: Is there a way to process over 5 Million lines at the same time?

Of course, but they all require some "investment". The investment can be either money (*you buy a bigger machine with more RAM and more CPU cores*) or time (*which you are already doing*). The first approach can be scaled further with distributed-computing designs such as MapReduce. What these distributed systems do is break the bigger problem down into subproblems, then solve those subproblems on different machines (or at least on different cores).
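On a single machine you can get a taste of the same split-then-combine idea with Python's standard `multiprocessing` module. This is only a sketch under assumptions: the `WORDS` set, the sample lines, and the two-worker split are all made up for illustration; real code would feed each worker a slice of the big file rather than an in-memory list.

```python
from multiprocessing import Pool

# Hypothetical word list; a set gives O(1) average membership tests.
WORDS = {"alpha", "beta", "gamma"}

def count_matches(chunk_of_lines):
    # Each worker counts matches in its own slice, independently
    # of the others (the "map" step).
    total = 0
    for line in chunk_of_lines:
        for token in line.strip().split():
            if token in WORDS:
                total += 1
    return total

if __name__ == "__main__":
    lines = ["alpha delta", "beta beta", "epsilon", "gamma alpha"]
    # Split the work into one chunk per worker.
    chunks = [lines[i::2] for i in range(2)]
    with Pool(2) as pool:
        partial_counts = pool.map(count_matches, chunks)
    # Combine the partial results (the "reduce" step).
    print(sum(partial_counts))
```

The same shape scales up: MapReduce-style systems just run the map step on many machines and merge the partial results, instead of two local processes.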

But as you will find out, those pathways are more complex than simply bringing your algorithm down to a suitable complexity class. :)