I want to compute several hashes for an image. To speed this up, I want to compute all of the hashes in parallel. I have a method called hashImage that returns a dictionary. The expected result is that every dictionary key has a value other than None. The actual result is that random keys have None as their value: sometimes ret["dhash10"] is None, other times ret["averageHash10"] is None, and on another run something else might be None.

Python
import time
from PIL import Image
import imagehash
import hashlib
import concurrent.futures
import numpy
import os

class ImageHasher:
    def __init__(self):
        print("__init__")
        pass

    def computeHash(self, image, hashFunction, hashSize):
        try:
            return hashFunction(image, hashSize)
        except:
            return None
        
    def compute_md5_hash(self, filename):
        with open(filename, "rb") as file:
            content = file.read()
            hash_object = hashlib.md5(content)
            return hash_object.hexdigest()

    def hashImage(self, imagePath):
        ret = {}
        try:
            with concurrent.futures.ThreadPoolExecutor() as executor:
                with Image.open(imagePath) as image:
                    # Submit the tasks to the executor
                    future_dhash10 = executor.submit(self.computeHash, image, imagehash.dhash, 10)
                    future_averageHash10 = executor.submit(self.computeHash, image, imagehash.average_hash, 10)
                    future_phash10 = executor.submit(self.computeHash, image, imagehash.phash, 10)
                    future_whash16 = executor.submit(self.computeHash, image, imagehash.whash, 16)
                    future_phashSimple8 = executor.submit(self.computeHash, image, imagehash.phash_simple, 8)
                    future_md5 = executor.submit(self.compute_md5_hash, imagePath)

                    # Wait for all tasks to complete
                    concurrent.futures.wait(
                        [future_dhash10, future_averageHash10, future_phash10, future_whash16, future_phashSimple8, future_md5]
                    )

                    # Store the results in the dictionary
                    ret["dhash10"] = future_dhash10.result()
                    ret["averageHash10"] = future_averageHash10.result()
                    ret["phash10"] = future_phash10.result()
                    ret["whash16"] = future_whash16.result()
                    ret["phashSimple8"] = future_phashSimple8.result()
                    ret["md5"] = future_md5.result()
        except Exception as e:
            print(f"Uh oh!! {e}")
        return ret
    
    def processImage(self):
        result = self.hashImage('result_image.jpg')
        count = 0
        # repeat until something fails.
        while result["dhash10"] != None and result["averageHash10"] != None and result["phash10"] != None and result["whash16"] != None  and result["phashSimple8"] != None and result["md5"] != None:
            result = self.hashImage('result_image.jpg')
            count+=1
        print("count: " + str(count))
        # print keys with missing value
        self.print_keys_with_none_values(result)
        return result
    
    def print_keys_with_none_values(self, dictionary):
        print("The following key(s) are None:")
        for key, value in dictionary.items():
            if value is None:
                print(key)

if __name__ == '__main__':
    # generate test data
    if not os.path.exists('result_image.jpg'):
        imarray = numpy.random.rand(3000, 4000,3) * 255
        im = Image.fromarray(imarray.astype('uint8')).convert('RGB')
        im.save('result_image.jpg',quality=95)
    ih = ImageHasher()
    start_time = time.time()
    result = ih.processImage()
    end_time = time.time()
    print("Time taken:", end_time - start_time)
    #print(result)


What I have tried:

I tried asking ChatGPT 3.5, but it went in circles with suggestions that didn't help. One suggestion was to use multiprocessing, but that did not work: it failed with UnpickleableError. It also suggested keeping hashImage free of threading and calling hashImage from multiple threads instead. That didn't work either, because I later insert the hashes into an SQLite database (code removed from this post to keep it simple), and it does not support threads, so the only option is to make hashImage itself threaded.
Comments
Richard MacCutchan 30-Jul-23 3:57am    
You have two except clauses, both of which throw away the details of the problem. Change them both to capture and display the reason for the exception, and you should be on the way to finding out what is going wrong.
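A minimal sketch of what that suggestion could look like for computeHash, written here as a standalone function. The name compute_hash and the logger name are illustrative, not from the original post; the idea is only that the exception is logged with its traceback before None is returned.

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("image_hasher")  # hypothetical logger name


def compute_hash(image, hash_function, hash_size):
    """Call hash_function(image, hash_size); log the full traceback on failure."""
    try:
        return hash_function(image, hash_size)
    except Exception:
        # logger.exception records the message plus the active traceback,
        # so the real cause is visible instead of a silent None.
        name = getattr(hash_function, "__name__", repr(hash_function))
        logger.exception("%s failed", name)
        return None
```

Catching Exception rather than using a bare except also lets KeyboardInterrupt and SystemExit propagate, which is usually what is wanted in a long-running loop.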
rain-13 30-Jul-23 5:33am    
Thanks. That helped. How did I not think of it?
It seems that either PIL or imagehash is not thread-safe.
__init__
Traceback (most recent call last):
  File "c:\code\threading_test.py", line 17, in computeHash
    return hashFunction(image, hashSize)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\imagehash\__init__.py", line 248, in average_hash
    image = image.convert('L').resize((hash_size, hash_size), ANTIALIAS)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 933, in convert
    self.load()
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\ImageFile.py", line 266, in load
    raise OSError(msg)
OSError: image file is truncated (167 bytes not processed)

To solve this, I had to create image_copies = [image.copy() for _ in range(4)] and pass those copies to the tasks instead of the shared image.

This is the changed part; the rest of the code is unchanged:
                    image_copies = [image.copy() for _ in range(4)]
                    future_dhash10 = executor.submit(self.computeHash, image, imagehash.dhash, 10)
                    future_averageHash10 = executor.submit(self.computeHash, image_copies[0], imagehash.average_hash, 10)
                    future_phash10 = executor.submit(self.computeHash, image_copies[1], imagehash.phash, 10)
                    future_whash16 = executor.submit(self.computeHash, image_copies[2], imagehash.whash, 16)
                    future_phashSimple8 = executor.submit(self.computeHash, image_copies[3], imagehash.phash_simple, 8)
                    future_md5 = executor.submit(self.compute_md5_hash, imagePath)
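The same idea can be sketched as a small generic helper that gives every task its own copy of the shared object before any thread starts. The helper name run_on_copies and the shape of the tasks dictionary are assumptions made for illustration; the only requirement is that the source object has a copy() method, as PIL images do.

```python
import concurrent.futures


def run_on_copies(source, tasks):
    """Run each task on its own copy of `source` in a thread pool.

    `tasks` maps a result key to a (function, argument) pair; each function
    is called as function(copy_of_source, argument).
    """
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # source.copy() is evaluated here, in the calling thread, so the
        # copies are made one after another before any worker touches them.
        futures = {
            name: executor.submit(func, source.copy(), arg)
            for name, (func, arg) in tasks.items()
        }
        # result() blocks until the task is done and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}
```

With the code from the question, a call might look like run_on_copies(image, {"dhash10": (imagehash.dhash, 10), "averageHash10": (imagehash.average_hash, 10)}), made inside the with Image.open(imagePath) block.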


Now the next question is: why is it not using 100% CPU? It is using about 12 to 16% instead. I am not sure whether I should continue this discussion here and post the solution once everything is done, or post this solution under "Add your solution here" and start a new topic about getting 100% CPU utilization.
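One likely explanation, assuming standard CPython: the global interpreter lock lets only one thread run Python bytecode at a time, so CPU-bound threads often add up to roughly one core. Using more cores generally means separate processes. The sketch below assumes the earlier UnpickleableError came from sending an unpicklable object, such as an open image, between processes; it passes only the file path (a plain string) and lets each worker open the file itself. The function names are hypothetical, and two hashlib digests stand in for the image hashes to keep the example dependency-free.

```python
import concurrent.futures
import hashlib


def md5_of_file(path):
    """Runs in a worker process: open the file there and hash its bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def sha256_of_file(path):
    """Second stand-in worker; an image hash function would open the image here."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def hash_in_processes(path, workers=(md5_of_file, sha256_of_file)):
    """Submit each worker with the path only and collect results by function name."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Module-level functions and strings can be pickled, so they can be
        # sent to the worker processes without error.
        futures = {fn.__name__: executor.submit(fn, path) for fn in workers}
        return {name: future.result() for name, future in futures.items()}
```

A call such as hash_in_processes("result_image.jpg") should be made under an if __name__ == '__main__': guard, which multiprocessing requires on Windows. The results come back to the parent process, so the database insert can still happen in a single thread there. Whether this is faster depends on how the cost of starting processes and decoding the image several times compares with the hashing itself.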

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


