Python imagehash doesn't play well with ThreadPoolExecutor

Question

I want to compute several hashes for an image and, to speed this up, compute them all in parallel. My method hashImage returns a dictionary. Expected result: every key in the dictionary has a value other than None. Actual result: random keys end up with None as their value. Sometimes ret["dhash10"] is None, other times ret["averageHash10"] is None, and on yet another run a different key is None.

Python

import time
from PIL import Image
import imagehash
import hashlib
import concurrent.futures
import numpy
import os

class ImageHasher:
    def __init__(self):
        print("__init__")
        pass

    def computeHash(self, image, hashFunction, hashSize):
        try:
            return hashFunction(image, hashSize)
        except:
            return None
        
    def compute_md5_hash(self, filename):
        with open(filename, "rb") as file:
            content = file.read()
            hash_object = hashlib.md5(content)
            return hash_object.hexdigest()

    def hashImage(self, imagePath):
        ret = {}
        try:
            with concurrent.futures.ThreadPoolExecutor() as executor:
                with Image.open(imagePath) as image:
                    # Submit the tasks to the executor
                    future_dhash10 = executor.submit(self.computeHash, image, imagehash.dhash, 10)
                    future_averageHash10 = executor.submit(self.computeHash, image, imagehash.average_hash, 10)
                    future_phash10 = executor.submit(self.computeHash, image, imagehash.phash, 10)
                    future_whash16 = executor.submit(self.computeHash, image, imagehash.whash, 16)
                    future_phashSimple8 = executor.submit(self.computeHash, image, imagehash.phash_simple, 8)
                    future_md5 = executor.submit(self.compute_md5_hash, imagePath)

                    # Wait for all tasks to complete
                    concurrent.futures.wait(
                        [future_dhash10, future_averageHash10, future_phash10, future_whash16, future_phashSimple8, future_md5]
                    )

                    # Store the results in the dictionary
                    ret["dhash10"] = future_dhash10.result()
                    ret["averageHash10"] = future_averageHash10.result()
                    ret["phash10"] = future_phash10.result()
                    ret["whash16"] = future_whash16.result()
                    ret["phashSimple8"] = future_phashSimple8.result()
                    ret["md5"] = future_md5.result()
        except Exception as e:
            print(f"Uh oh!! {e}")
        return ret
    
    def processImage(self):
        result = self.hashImage('result_image.jpg')
        count = 0
        # repeat until something fails.
        while result["dhash10"] != None and result["averageHash10"] != None and result["phash10"] != None and result["whash16"] != None  and result["phashSimple8"] != None and result["md5"] != None:
            result = self.hashImage('result_image.jpg')
            count+=1
        print("count: " + str(count))
        # print keys with missing value
        self.print_keys_with_none_values(result)
        return result
    
    def print_keys_with_none_values(self, dictionary):
        print("The following key(s) are None:")
        for key, value in dictionary.items():
            if value is None:
                print(key)

if __name__ == '__main__':
    # generate test data
    if not os.path.exists('result_image.jpg'):
        imarray = numpy.random.rand(3000, 4000,3) * 255
        im = Image.fromarray(imarray.astype('uint8')).convert('RGB')
        im.save('result_image.jpg',quality=95)
    ih = ImageHasher()
    start_time = time.time()
    result = ih.processImage()
    end_time = time.time()
    print("Time taken:", end_time - start_time)
    #print(result)

What I have tried:

I tried asking ChatGPT 3.5, but it went in circles with suggestions that didn't help. One was to use multiprocessing, which did not work: it failed with UnpickleableError. It also suggested keeping hashImage free of threading and calling hashImage itself from multiple threads. That didn't work either, because I later insert the hashes into an SQLite database (code removed from this post for simplicity) and that does not support threads, so the only option is to make hashImage itself threaded.

Posted 29-Jul-23 12:29pm

rain-13

Comments

Richard MacCutchan 30-Jul-23 3:57am

You have two except clauses, both of which throw away the details of the problem. Change them both to capture and display the reason for the exception, and you should be on the way to finding out what is going wrong.
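
A minimal sketch of what this suggestion could look like for computeHash, written here as a standalone function: catch Exception instead of using a bare except, and print the full traceback before returning None. The name compute_hash and the choice of traceback.print_exc() are illustrative, not part of the original code.

```python
import traceback


def compute_hash(image, hash_function, hash_size):
    """Run one hash function and report why it failed instead of hiding it."""
    try:
        return hash_function(image, hash_size)
    except Exception:
        # Print the full traceback to stderr so the real cause is visible,
        # then keep the original behaviour of returning None on failure.
        traceback.print_exc()
        return None
```

The caller still receives None for a failed hash, but the reason now shows up on stderr.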

rain-13 30-Jul-23 5:33am

Thanks. That helped. How did I not think of it?
It seems that either PIL or imagehash is not thread safe.
__init__
Traceback (most recent call last):
  File "c:\code\threading_test.py", line 17, in computeHash
    return hashFunction(image, hashSize)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\imagehash\__init__.py", line 248, in average_hash
    image = image.convert('L').resize((hash_size, hash_size), ANTIALIAS)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 933, in convert
    self.load()
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\ImageFile.py", line 266, in load
    raise OSError(msg)
OSError: image file is truncated (167 bytes not processed)

To solve this, I had to create image_copies = [image.copy() for _ in range(4)] and pass the copies instead.

This is the changed part; the rest of the code is unchanged:

                    image_copies = [image.copy() for _ in range(4)]
                    future_dhash10 = executor.submit(self.computeHash, image, imagehash.dhash, 10)
                    future_averageHash10 = executor.submit(self.computeHash, image_copies[0], imagehash.average_hash, 10)
                    future_phash10 = executor.submit(self.computeHash, image_copies[1], imagehash.phash, 10)
                    future_whash16 = executor.submit(self.computeHash, image_copies[2], imagehash.whash, 16)
                    future_phashSimple8 = executor.submit(self.computeHash, image_copies[3], imagehash.phash_simple, 8)
                    future_md5 = executor.submit(self.compute_md5_hash, imagePath)
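
The traceback ends in self.load() inside convert(): Image.open() is lazy, so the worker threads were apparently all triggering the file read on the same image at once. image.copy() loads the pixel data before copying, which is presumably why passing copies helps. A lighter variant, sketched below on the assumption that an already loaded image can safely be read from several threads, is to call image.load() once before submitting anything. The helper name hash_in_threads and the jobs dictionary are hypothetical.

```python
import concurrent.futures


def hash_in_threads(image, jobs):
    """Compute several hashes of one image in a thread pool.

    image: an opened PIL image (anything with a load() method works here).
    jobs:  maps a result key to a (hash_function, hash_size) pair.
    """
    # Decode the pixel data once, in the calling thread, so the worker
    # threads never trigger the lazy file read at the same time.
    image.load()
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {
            key: executor.submit(hash_function, image, hash_size)
            for key, (hash_function, hash_size) in jobs.items()
        }
        # result() re-raises any exception from the worker thread.
        return {key: future.result() for key, future in futures.items()}
```

With the code from the question, this might be called as hash_in_threads(image, {"dhash10": (imagehash.dhash, 10), "phash10": (imagehash.phash, 10)}) inside the with Image.open(imagePath) as image: block.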

Now the next question is: why isn't it using 100% CPU? It uses about 12 to 16% instead. I am not sure whether I should continue this discussion here and post the solution once everything is done, or post this solution under "Add your solution here" and start a new topic about getting 100% CPU utilization.
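
One possible explanation, offered as an assumption rather than a measurement: 12 to 16% is roughly one core on a six- to eight-core machine, which is what Python threads often achieve for CPU-bound work because of the global interpreter lock. If that is the cause, separate processes may help. The sketch below uses concurrent.futures.ProcessPoolExecutor and passes only plain values (the file path and the name of the hash function) to the workers, so the arguments stay picklable. The helper names are hypothetical, hashes are returned as strings, and every worker decodes the file again, so whether this is actually faster has to be measured.

```python
import concurrent.futures
import hashlib


def md5_of_file(path):
    """Return the MD5 hex digest of the file's bytes."""
    with open(path, "rb") as file:
        return hashlib.md5(file.read()).hexdigest()


def perceptual_hash(path, function_name, hash_size):
    """Open the image inside the worker and return the hash as a string."""
    # Imported here so that each worker process loads the libraries itself.
    from PIL import Image
    import imagehash

    hash_function = getattr(imagehash, function_name)  # e.g. "dhash"
    with Image.open(path) as image:
        return str(hash_function(image, hash_size))


def hash_image_in_processes(path, jobs):
    """jobs maps a result key to a (function_name, hash_size) pair."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = {
            key: executor.submit(perceptual_hash, path, function_name, hash_size)
            for key, (function_name, hash_size) in jobs.items()
        }
        futures["md5"] = executor.submit(md5_of_file, path)
        return {key: future.result() for key, future in futures.items()}
```

On Windows the call to hash_image_in_processes has to sit under if __name__ == '__main__':, as the test code in the question already does, because worker processes re-import the main module.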


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)