I'm new both to the language, and performance-sensitive algorithms as well, so it's been really challenging so far, which is a good thing I guess.

Anyway, the task was the following:

I have x letters (24 in this case) and a list of words.
I'm supposed to find all the combinations of words ("sentences") that use up the given letters, with no letter remaining.

Update: A word can appear more than once in a sentence, for example "the the the the" is perfectly fine if the letters allow it.

For example:

Input:

Given letters:
IFSNIHLSTA

And the list of words:
nice
guys
finish
last

(The maximum number of words we're looking for is irrelevant in this case.)

The output is:
finish last
last finish


I'm pretty sure there are some existing algorithms I could use, but I don't know any at the moment, so I decided to write one from scratch, and as you could guess, it ended up being so slow that it's unusable.

What I have tried:

The very first thing to do was obviously to cut down the number of words, so I filtered out the ones that can't possibly be used, as sketched below.
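
A minimal sketch of that filtering step (just an illustration of the idea, assuming everything has been lower-cased first; collections.Counter makes the "does this word fit into the letters" check easy):

Python
from collections import Counter

def filter_words(words, letters):
    """Keep only words that can be built from the given letters."""
    available = Counter(letters)
    usable = []
    for word in words:
        need = Counter(word)
        # a word is only usable if it doesn't need more of any letter
        # than the pool actually contains
        if all(available[ch] >= cnt for ch, cnt in need.items()):
            usable.append(word)
    return usable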

Then the first main idea was to take a word, use up the letters for it, then take the next word and use the remaining letters, and so on until all the letters are used up or no more words can be formed. If all the letters are used up, then hurray, we got a sentence.

Repeating the above process for all the words, and again for each word (after a valid one), will in the end get us all the combinations, right?
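
In Python, that idea could look roughly like this (a minimal sketch of the approach described above, not the actual pastebin code; it subtracts each candidate word's letters from what's left and recurses until nothing remains):

Python
from collections import Counter

def find_sentences(words, remaining, sentence, results):
    """remaining is a Counter of unused letters; sentence holds the words chosen so far."""
    if not remaining:                      # every letter used up: we found a sentence
        results.append(" ".join(sentence))
        return
    for word in words:
        need = Counter(word)
        # only recurse if the word fits into the remaining letters
        if all(remaining[ch] >= cnt for ch, cnt in need.items()):
            # Counter subtraction drops letters whose count reaches zero
            find_sentences(words, remaining - need, sentence + [word], results)

results = []
find_sentences(["nice", "guys", "finish", "last"], Counter("ifsnihlsta"), [], results)
print(results)                             # ['finish last', 'last finish']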

The problem with it is the speed: finding sentences containing a single word requires iterating through the list once, but two-word sentences need another iteration after each word in the worst case, and if we're looking for 6-word sentences, then oh boy... you get the idea.

So I looked into ways to make it efficient enough to be usable (if I had been successful, I wouldn't be writing this):

I figured that since I'm only looking for the combinations, the order doesn't matter, so I have far fewer words to start with (if I'm looking for sentences of 6 words, then I only have to look at 1/6 of the words for the first iteration).
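
One way to exploit that, just as a sketch of the idea (not my actual code), is to pass a start index into the recursion so every sentence comes out in one canonical order; this also removes the permutation duplicates mentioned further down:

Python
from collections import Counter

def find_combinations(words, remaining, start, sentence, results):
    """Only consider words at index >= start, so each multiset of words
    is generated exactly once (e.g. "finish last" but never "last finish")."""
    if not remaining:
        results.append(" ".join(sentence))
        return
    for i in range(start, len(words)):
        need = Counter(words[i])
        if all(remaining[ch] >= cnt for ch, cnt in need.items()):
            # pass i rather than i + 1 so the same word may still be
            # reused, e.g. "the the the the"
            find_combinations(words, remaining - need, i, sentence + [words[i]], results)

If the ordered variants are still wanted in the output, each combination can be permuted afterwards, which is far cheaper than searching for every ordering separately.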

The implementation is really spaghetti code, as a day ago I didn't even know what Python looked like; sorry about that.

Anyway, it follows the logic written above, the difference being that it splits the words up and starts a process for each CPU core.

Since it is around 200 lines, I put it on pastebin instead of pasting it here:
Code on pastebin[^]
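
A trimmed sketch of that split (hypothetical, not the pastebin code; it reuses the find_sentences sketch from above and hands each first word to a worker via multiprocessing.Pool):

Python
from collections import Counter
from multiprocessing import Pool, cpu_count

WORDS = ["nice", "guys", "finish", "last"]   # the already-filtered word list
LETTERS = Counter("ifsnihlsta")              # the given letters, lower-cased

def search_from(first_word):
    """Find all sentences that start with first_word (single-threaded)."""
    need = Counter(first_word)
    if any(LETTERS[ch] < cnt for ch, cnt in need.items()):
        return []
    results = []
    find_sentences(WORDS, LETTERS - need, [first_word], results)  # sketch from above
    return results

if __name__ == "__main__":
    with Pool(cpu_count()) as pool:
        sentences = [s for chunk in pool.map(search_from, WORDS) for s in chunk]
    print(sentences)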

I have 2 kinds of questions:

Theoretical:
How can I improve the algorithm, or what other, more efficient algorithms could I use? (And what am I missing?)

Update: There are still some duplicates in the output, and I'll need to figure out a way to remove them.

About performance itself:
Even though each process runs on its own CPU core, and there's no disk I/O going on, the CPUs aren't at 100% usage. Why could that be?

Also, would I get further with CUDA or OpenCL? If yes, how?

Thanks in advance
Posted
Updated 29-Dec-17 2:14am
v3
Comments
Patrice T 28-Dec-17 16:23pm    
Show an example input/output.
fecka97 28-Dec-17 16:49pm    
An example I just put together:

Input:

Given letters: "IFSNIHLSTA"

And the list of words:

nice
guys
finish
last

(The maximum number of words we're looking for is irrelevant in this case.)

The output is:

"finish last"
"last finish"

These words together use the exact letters given.

Gonna update the question as well.
Patrice T 28-Dec-17 23:08pm    
How large is the list of words?
Is it reused, or is it a new list every time?
If reused, can you store additional information with the words?
fecka97 29-Dec-17 5:00am    
I'm using a list of "all" English words every time, which is around ~500k words, but it gets filtered at the beginning depending on the letters; it's usually at most 9-10k words afterwards (the fewer letters, the fewer words I have to work with). I iterate over this same filtered list for each word.

Right now it's just a list (or array, if you like) of strings, but storing more data with the words shouldn't be a problem.

Also I forgot to mention that one word can occur more than once in a sentence, so "the the the the" could absolutely be a possibility if the letters allow it.

Also, in the current version duplicates are a problem; for example, "asd bsd csd" and "asd csd bsd" are considered duplicates. Increasing the number of words looked through (... listSize/3, listSize/2, listSize/1) would solve it, but that won't work if the maximum number of words in a sentence isn't known, and it won't allow the same word to appear more than once.
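
One simple way to drop them (just an illustration with made-up variable names, all_sentences being the raw output list) would be to reduce each sentence to a sorted tuple of its words and keep a set of the ones already seen:

Python
seen = set()
unique_sentences = []
for sentence in all_sentences:          # all_sentences: the raw list of "word word ..." strings
    key = tuple(sorted(sentence.split()))
    if key not in seen:                 # "asd bsd csd" and "asd csd bsd" share the same key
        seen.add(key)
        unique_sentences.append(sentence)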
Patrice T 29-Dec-17 5:14am    
Use "Improve question" to update your question, so that everyone can pay attention to this information.

1 solution

General answer: when you think about performance, optimization, or speeding things up,
the first thing to do is to think about profiling.
26.4. The Python Profilers — Python 2.7.14 documentation[^]
python - How can you profile a script? - Stack Overflow[^]
The profiler is a tool that will show you the time spent in your code and where it is spent. It also tells you how many times a given part is executed.
The place that consumes the most time is the first place to look at when optimizing.
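
For example, a minimal way to profile the script with the standard-library cProfile module (main() and profile.out are just placeholder names):

Python
import cProfile
import pstats

# run the whole search under the profiler and dump the raw stats to a file
cProfile.run("main()", "profile.out")

# print the 20 entries that account for the most cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)

The same works from the command line: python -m cProfile -s cumulative solver.py (where solver.py is whatever your script is called).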
And everywhere, you have to ask yourself:
- Why am I doing this?
- Is it really necessary?
- Can I do it another way which is less resource hungry?

You have a file of ~500k words, and this file is more or less constant. This means that any work that can be done early and saved in the file will be a huge saving.
Python
letters.find(char.upper())

For example, if you ensure that the file of words and the letters use the same letter case, you don't have to worry about it when you process the file.
This code scans letters once for every letter of every word, but at this point letters is constant, and it is easy to know in advance whether a given letter is part of letters or not.
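
For example, since letters never changes during a run, it can be turned into a Counter once up front, and every check becomes a dictionary lookup instead of a string scan (a sketch, assuming letters is the same variable as in the snippet above and everything is lower-cased beforehand):

Python
from collections import Counter

available = Counter(letters)                 # built once, before the words are processed

def word_fits(word):
    """True if the word only needs letters (and counts) that are available."""
    need = Counter(word)
    return all(available[ch] >= cnt for ch, cnt in need.items())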

Advice: do not use threading while you optimize; if processing the file takes too long, use a reduced one for testing purposes.
 
 
Comments
fecka97 29-Dec-17 9:38am    
Thank you! That profiler, and your advice already helped me a lot.

I guess I will go through the code and simplify it as much as I can. Now that I know that it actually works, I don't need the logger, and the prints everywhere.

Also, could you give hints about other approaches if you have any in mind? At the moment I can only refine this one, but I'm afraid that in the end it will still require too many iterations to be usable (that's why I was thinking about OpenCL and more threads).

I'd mark your answer as the solution, as it helped me a ton; however, I don't know if I should, as this is more of an open discussion, and I'm interested in others' opinions as well if they have something to add.
Patrice T 29-Dec-17 12:47pm    
There is no general advice beyond what I told you.
The techniques to apply depend heavily on your actual code and on what the profiler says.
When you ask for help, the size of the data matters; it helps in understanding what the program does.
You can close the question by accepting my solution.
fecka97 30-Dec-17 5:49am    
Alright, I took a smaller dataset that took approximately 10 seconds to process; after a few optimizations I managed to get it down to 8 seconds, and using PyPy it's actually just 0.9 seconds.

Still not good enough for larger data, but I'm slowly getting there, and there is a lot of stuff left to reimplement and optimize.

Anyway, thanks again for your help. I realized that the rest is just solving the challenge, which is up to me, so I'll close the question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


