Currently my signature database, in text format, is over one million lines, which is why my scan speeds have decreased. I am sure I also need to implement more efficient code to achieve better scan speeds, but my question is: how might I scan using an MD5 or CRC32 hash byte by byte, so that it reads only the first few bytes before it knows whether it is a match and can then quickly move on to the next file? Also, how might I do a binary search, or scan with that method?
Updated 30-Aug-11 17:37pm
v2
Comments
Sergey Alexandrovich Kryukov 31-Aug-11 0:34am    
Scan of what..?
--SA
Dale 2012 31-Aug-11 1:09am    
Virus signatures in a text file. I would like to know how to use them more efficiently so that my scan speeds will increase. What method would you use, and how might I learn to apply it? Byte or binary lookup?
Dale 2012 31-Aug-11 1:11am    
My virus scanner is slowing down because my virus definition list is over one million entries, and my recursion method is probably a bit weak too, so I am looking for some examples of how to achieve faster scanning times by applying some kind of algorithm to my recursive search. Please help.
Dale 2012 31-Aug-11 2:50am    
I saw a small example such as

(Hash ex) b11b177e7244624410406a8b26430648
(byte ex) 0xb1 0x1b 0x17 0x7e 0x72 0x44 0x62 0x44 0x10 and so forth

Can I write code that will match the bytes and, if the first byte or the first two bytes do not match, go on to the next file?

1 solution

MD5 and SHA hashes do not work like that - they are a complete summary across all the bytes in the file.
What you are trying to do is like adding the first three integers of a 1000 integer sequence and expecting to be able to tell something about the resulting total from that - it just doesn't work.
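As a rough illustration of the point above (a sketch in Python, not code from the thread; the function name `md5_of_stream` and the sample data are made up for the example), two inputs that share their first 999 bytes still produce completely different MD5 digests, so nothing about the digest can be decided after reading only the first few bytes:

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=64 * 1024):
    """Return the hex MD5 digest of everything readable from a binary stream.

    The data is fed to the hash in chunks, but the digest is only
    meaningful once every byte has been processed.
    """
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Hypothetical sample data: identical for 999 bytes, different in the last one.
original = b"A" * 999 + b"X"
modified = b"A" * 999 + b"Y"

hash_original = md5_of_stream(io.BytesIO(original))
hash_modified = md5_of_stream(io.BytesIO(modified))

# The two digests differ even though the inputs start identically.
print(hash_original)
print(hash_modified)
print(hash_original != hash_modified)
```

Reading in chunks keeps memory use flat for large files, but it does not allow an early exit: the whole stream still has to be hashed before the result can be compared with a signature.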
Comments
Dale 2012 31-Aug-11 4:08am    
OK, if not by byte, then can I reference the lines of hashes by binary? If not, will you give me some idea of which methods to use to make scanning of large files fast and small files even faster, or of the best way to use hash virus definitions without slowing down scan times?
OriginalGriff 31-Aug-11 4:14am    
No - a hash does not work on bit lines, the way an XOR does - it looks at the whole data stream.
Speed up? Off the top of my head, I don't know if you can make significant inroads without changing to C++ or possibly assembler.
Dale 2012 31-Aug-11 7:36am    
Thank you for your response. Maybe I can leave you with a post that was given on my last question concerning my problem, if you would care to clarify in layman's terms what he may be talking about:

So, you're reading your entire hash file tables on every file you're "scanning"? No wonder it's so slow.

You load the hash tables ONCE and keep all that data in an internal table, probably sorted by hash code to make lookups faster.
Posted 24 Jul '11
Dave Kreskowiak


I believe that this is the right track, so what I have done is create a few text boxes that load all the contents of my md5.txt and crc32.txt, which is what I thought he meant by an internal table. I am not sure how it is sorted by hash code, or how to sort it by hash to make lookups faster.
OriginalGriff 31-Aug-11 8:46am    
What Dave is talking about is loading the hash tables from your MD5 and CRC32 files (why are you using two hashes? Particularly when CRC32 is, well, crap nowadays) once, and once only, and retaining the hash tables in memory for the life of your app.
If you are indeed reading them each time you want to check a file, then that could contribute quite a bit to any delays.
In addition, if you are using File.ReadAllBytes to read each individual file and check it against the hash table, it might be worth your considering changing to a native routine which handles that without reallocating buffer space unless the existing space is too small - re-use could help speed things up, because it removes the need for the Garbage Collector to shuffle things around.
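A minimal sketch of the "load once, keep sorted, look up quickly" idea discussed above, written in Python for brevity rather than in the poster's own language. The helper names `load_signatures` and `is_known` and the sample lines are assumptions made for the example; in a real scanner the lines would come from the signature file, read a single time at startup:

```python
import bisect


def load_signatures(lines):
    """Normalise, de-duplicate and sort signature lines (done once at startup)."""
    return sorted({line.strip().lower() for line in lines if line.strip()})


def is_known(sorted_signatures, file_hash):
    """Binary-search a sorted list of hex digests for file_hash."""
    key = file_hash.lower()
    index = bisect.bisect_left(sorted_signatures, key)
    return index < len(sorted_signatures) and sorted_signatures[index] == key


# Hypothetical stand-in for the contents of a signature file such as md5.txt.
SAMPLE_LINES = [
    "b11b177e7244624410406a8b26430648\n",  # example hash quoted in the thread
    "D41D8CD98F00B204E9800998ECF8427E\n",  # MD5 of empty input, upper case
    "\n",                                  # blank lines are skipped
]

# Built once and kept in memory for the life of the program.
SIGNATURES = load_signatures(SAMPLE_LINES)

# Alternative: a hash-based set gives constant-time average lookups
# and needs no sorting at all.
SIGNATURE_SET = frozenset(SIGNATURES)

print(is_known(SIGNATURES, "B11B177E7244624410406A8B26430648"))
print("d41d8cd98f00b204e9800998ecf8427e" in SIGNATURE_SET)
```

With about one million sorted entries, a binary search needs roughly 20 comparisons per lookup, compared with up to a million for a linear scan of the text. A set (or a dictionary, or a hash set in other languages) is usually simpler still, since it removes the need to keep the data sorted.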
Dale 2012 2-Sep-11 2:12am    
What is re-use? I would like to be able to implement this in my scan procedure, as what you're saying makes sense. Thanks in advance.
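One possible reading of "re-use", sketched in Python as an illustration only (the names `BUFFER`, `VIEW` and `md5_with_reused_buffer` are made up for the example): allocate a single read buffer once and fill it again for every file, instead of allocating a fresh array the size of each file. In .NET the comparable approach would be to read each file through a stream into one pre-allocated byte array rather than calling File.ReadAllBytes per file.

```python
import hashlib
import io

# Allocated once and shared by every call below.
BUFFER = bytearray(64 * 1024)
VIEW = memoryview(BUFFER)


def md5_with_reused_buffer(stream):
    """Hash a binary stream, reading into the shared buffer each time."""
    digest = hashlib.md5()
    while True:
        count = stream.readinto(BUFFER)  # overwrites the same memory
        if not count:
            break
        digest.update(VIEW[:count])      # hash only the bytes just read
    return digest.hexdigest()


# Hypothetical stand-ins for files of different sizes.
small_file = b"hello"
large_file = b"x" * 200_000  # larger than the buffer, so several reads

for data in (small_file, large_file):
    print(md5_with_reused_buffer(io.BytesIO(data)))
```

Because the same buffer serves every file, the scanner does not create one large temporary array per file, which is the garbage-collection pressure the comment above refers to. How much this helps in practice would need to be measured.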

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)