I'm new both to the language, and performance-sensitive algorithms as well, so it's been really challenging so far, which is a good thing I guess.

Anyway, the task was the following:

I have x letters (24 in this case) and a list of words.
I'm supposed to find all the combinations of words ("sentences") that use up the given letters, with no letter remaining.

Update: A word can appear more than once in a sentence, for example "the the the the" is perfectly fine if the letters allow it.

For example:

Input:

Given letters:
IFSNIHLSTA

And the list of words:
nice
guys
finish
last

(The maximum number of words we're looking for is irrelevant in this case.)

The output is:
finish last
last finish


I'm pretty sure there are some existing algorithms I could use, but I don't know any at the moment, so I decided to write one from scratch, and as you could guess, it ended up being so slow that it's unusable.

What I have tried:

The very first thing to do was obviously to cut down the number of words, so I filtered out the ones that can't possibly be used, as sketched below.
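
A minimal sketch of that filtering step (just an illustration of the idea, assuming everything has been lower-cased first; collections.Counter makes the "does this word fit into the letters" check easy):

Python
from collections import Counter

def filter_words(words, letters):
    """Keep only words that can be built from the given letters."""
    available = Counter(letters)
    usable = []
    for word in words:
        need = Counter(word)
        # a word is only usable if it doesn't need more of any letter
        # than the pool actually contains
        if all(available[ch] >= cnt for ch, cnt in need.items()):
            usable.append(word)
    return usable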

Then the first main idea was to take a word, use up the letters for it, then take the next word and use the remaining letters, and so on until all the letters are used up or no more words can be formed. If all the letters are used up, then hurray, we got a sentence.

Repeating the above process for all the words, and again for each word (after a valid one), will in the end get us all the combinations, right?
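
In Python, that idea could look roughly like this (a minimal sketch of the approach described above, not the actual pastebin code; it subtracts each candidate word's letters from what's left and recurses until nothing remains):

Python
from collections import Counter

def find_sentences(words, remaining, sentence, results):
    """remaining is a Counter of unused letters; sentence holds the words chosen so far."""
    if not remaining:                      # every letter used up: we found a sentence
        results.append(" ".join(sentence))
        return
    for word in words:
        need = Counter(word)
        # only recurse if the word fits into the remaining letters
        if all(remaining[ch] >= cnt for ch, cnt in need.items()):
            # Counter subtraction drops letters whose count reaches zero
            find_sentences(words, remaining - need, sentence + [word], results)

results = []
find_sentences(["nice", "guys", "finish", "last"], Counter("ifsnihlsta"), [], results)
print(results)                             # ['finish last', 'last finish']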

The problem with it is the speed: finding sentences containing a single word requires iterating through the list once, but two-word sentences need another iteration after each word in the worst case, and if we're looking for 6-word sentences, then oh boy... you get the idea.

So I looked into ways to make it efficient enough to be usable (if I had been successful, I wouldn't be writing this):

I figured that since I'm only looking for the combinations, the order doesn't matter, so I have far fewer words to start with (if I'm looking for sentences of 6 words, then I only have to look at 1/6 of the words for the first iteration).
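
One way to exploit that, just as a sketch of the idea (not my actual code), is to pass a start index into the recursion so every sentence comes out in one canonical order; this also removes the permutation duplicates mentioned further down:

Python
from collections import Counter

def find_combinations(words, remaining, start, sentence, results):
    """Only consider words at index >= start, so each multiset of words
    is generated exactly once (e.g. "finish last" but never "last finish")."""
    if not remaining:
        results.append(" ".join(sentence))
        return
    for i in range(start, len(words)):
        need = Counter(words[i])
        if all(remaining[ch] >= cnt for ch, cnt in need.items()):
            # pass i rather than i + 1 so the same word may still be
            # reused, e.g. "the the the the"
            find_combinations(words, remaining - need, i, sentence + [words[i]], results)

If the ordered variants are still wanted in the output, each combination can be permuted afterwards, which is far cheaper than searching for every ordering separately.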

The implementation is really spaghetti code, as a day ago I didn't even know what Python looked like; sorry about that.

Anyway, it follows the logic written above, the difference being that it splits the words up and starts a process for each CPU core.

Since it is around 200 lines, I put it on pastebin instead of pasting it here:
Code on pastebin[^]
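
A trimmed sketch of that split (hypothetical, not the pastebin code; it reuses the find_sentences sketch from above and hands each first word to a worker via multiprocessing.Pool):

Python
from collections import Counter
from multiprocessing import Pool, cpu_count

WORDS = ["nice", "guys", "finish", "last"]   # the already-filtered word list
LETTERS = Counter("ifsnihlsta")              # the given letters, lower-cased

def search_from(first_word):
    """Find all sentences that start with first_word (single-threaded)."""
    need = Counter(first_word)
    if any(LETTERS[ch] < cnt for ch, cnt in need.items()):
        return []
    results = []
    find_sentences(WORDS, LETTERS - need, [first_word], results)  # sketch from above
    return results

if __name__ == "__main__":
    with Pool(cpu_count()) as pool:
        sentences = [s for chunk in pool.map(search_from, WORDS) for s in chunk]
    print(sentences)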

I have 2 kinds of questions:

Theoretical:
How can I improve the algorithm, or what other, more efficient algorithms could I use? (And what am I missing?)

Update: There are still some duplicates in the output, and I'll need to figure out a way to remove them.

About performance itself:
Even though each process runs on its own CPU core, and there's no disk I/O going on, the CPUs aren't at 100% usage. Why could that be?

Also, would I get further with CUDA or OpenCL? If yes, how?

Thanks in advance
Posted
Updated 29-Dec-17 2:14am
v3
Comments
Patrice T 28-Dec-17 16:23pm    
Show an example input/output.
fecka97 28-Dec-17 16:49pm    
An example I just put together:

Input:

Given letters: "IFSNIHLSTA"

And the list of words:

nice
guys
finish
last

(The maximum number of words we're looking for is irrelevant in this case.)

The output is:

"finish last"
"last finish"

These words together use the exact letters given.

Gonna update the question as well.
Patrice T 28-Dec-17 23:08pm    
How large is the list of words?
Is it reused, or is it a new list every time?
If reused, can you store additional information with the words?
fecka97 29-Dec-17 5:00am    
I'm using a list of "all" English words every time, which is around ~500k words, but it gets filtered at the beginning depending on the letters; it's usually at most 9-10k words afterwards (the fewer letters, the fewer words I have to work with). I iterate over this same filtered list for each word.

Right now it's just a list (or array, if you like) of strings, but storing more data with the words shouldn't be a problem.

Also I forgot to mention that one word can occur more than once in a sentence, so "the the the the" could absolutely be a possibility if the letters allow it.

Also, in the current version duplicates are a problem; for example, "asd bsd csd" and "asd csd bsd" are considered duplicates. Increasing the number of words looked through (... listSize/3, listSize/2, listSize/1) would solve it, but that won't work if the maximum number of words in a sentence isn't known, and it won't allow the same word to appear more than once.
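
One simple way to drop them (just an illustration with made-up variable names, all_sentences being the raw output list) would be to reduce each sentence to a sorted tuple of its words and keep a set of the ones already seen:

Python
seen = set()
unique_sentences = []
for sentence in all_sentences:          # all_sentences: the raw list of "word word ..." strings
    key = tuple(sorted(sentence.split()))
    if key not in seen:                 # "asd bsd csd" and "asd csd bsd" share the same key
        seen.add(key)
        unique_sentences.append(sentence)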
Patrice T 29-Dec-17 5:14am    
Use "Improve question" to update your question, so that everyone can pay attention to this information.

1 solution

General answer: when you think about performance, optimization, or speeding things up,
the first thing to do is to think about profiling.
26.4. The Python Profilers — Python 2.7.14 documentation[^]
python - How can you profile a script? - Stack Overflow[^]
The profiler is a tool that will show you the time spent in your code and where it is spent. It also tells you how many times a given part is executed.
The place that consumes the most time is the first place to look at when optimizing.
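
For example, a minimal way to profile the script with the standard-library cProfile module (main() and profile.out are just placeholder names):

Python
import cProfile
import pstats

# run the whole search under the profiler and dump the raw stats to a file
cProfile.run("main()", "profile.out")

# print the 20 entries that account for the most cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)

The same works from the command line: python -m cProfile -s cumulative solver.py (where solver.py is whatever your script is called).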
And everywhere, you have to ask yourself:
- Why am I doing this?
- Is it really necessary?
- Can I do it another way which is less resource hungry?

You have a file of ~500k words, and this file is more or less constant. This means that any work that can be done early and saved in the file will be a huge saving.
Python
letters.find(char.upper())

For example, if you ensure that the file of words and the letters use the same letter case, you don't have to worry about it when you process the file.
This code scans letters once for every letter of every word, but at this point letters is constant, and it is easy to know in advance whether a given letter is part of letters or not.
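
For example, since letters never changes during a run, it can be turned into a Counter once up front, and every check becomes a dictionary lookup instead of a string scan (a sketch, assuming letters is the same variable as in the snippet above and everything is lower-cased beforehand):

Python
from collections import Counter

available = Counter(letters)                 # built once, before the words are processed

def word_fits(word):
    """True if the word only needs letters (and counts) that are available."""
    need = Counter(word)
    return all(available[ch] >= cnt for ch, cnt in need.items())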

Advice: do not use threading while you optimize; if processing the file takes too long, use a reduced one for testing purposes.
 
 
Comments
fecka97 29-Dec-17 9:38am    
Thank you! That profiler, and your advice already helped me a lot.

I guess I will go through the code and simplify it as much as I can. Now that I know that it actually works, I don't need the logger, and the prints everywhere.

Also, could you give hints about other approaches if you have any in mind? At the moment I can only refine this one, but I'm afraid that in the end it will still require too many iterations to be usable (that's why I was thinking about OpenCL and more threads).

I'd mark your answer as the solution, as it helped me a ton; however, I don't know if I should, as this is more of an open discussion, and I'm interested in others' opinions as well if they have something to add.
Patrice T 29-Dec-17 12:47pm    
There is no general advice beyond what I told you.
The techniques to apply depend heavily on your actual code and on what the profiler says.
When you ask for help, the size of the data matters; it helps in understanding what the program does.
You can close the question by accepting my solution.
fecka97 30-Dec-17 5:49am    
Alright, I took a smaller dataset that took approximately 10 seconds to process; after a few optimizations I managed to get it down to 8 seconds, and using PyPy it's actually just 0.9 seconds.

Still not good enough for larger data, but I'm slowly getting there, and there is a lot of stuff left to reimplement and optimize.

Anyway, thanks again for your help. I realized that the rest is just solving the challenge, which is up to me, so I'll close the question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


