Click here to Skip to main content
14,934,114 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Hello people.
I could not throw the question properly the first time.
My question is...what is the best way to check a 16 digit number against a list of 100 million+ numbers which are in a text file for duplicates? The numbers are generated by some other program which i cannot control.
One conventional way is to load the numbers in an array and compare each number with the target number in a loop. But this process is very time consuming and memory hungry. Is there any better way to do this?
Thanks in advance.
Posted
Updated 3-Dec-10 2:34am
v4
Comments
Dr.Walt Fair, PE 2-Dec-10 0:22am
   
How often do the numbers need to be checked? Are the numbers in the text file in sorted order?

The text file size just to store the numbers would be over 1.6 GB if saved in ASCII. Is that correct?
Pallab_GT 3-Dec-10 9:21am
   
Hi.Thanks for your reply.
The numbers are not sorted in any order. They need to be checked 2-3 times a month.
Yes you r right about the file size.
Manfred Rudolf Bihy 3-Dec-10 9:44am
   
If they're being checked 2-3 times a month I'd go for the merge sort approach. It may not be the fasted method but it will work with files and record numbers even larger than those you mentioned.
fjdiewornncalwe 3-Dec-10 11:20am
   
Please be careful with the answer you have selected as the correct one. You have selected one below that is completely incorrect to what you are trying to do? WTH!!!

Hi Pallab,

using file based merge sort would be the way to go for you.
Please read this blog entry:

http://splinter.com.au/blog/?p=142

This merge sort works by breaking one big file into smaller chunks.
These are conventionally sorted and written to disk. Then in a final
operation these chunks are merged into one big sorted file. This way
you can keep the memory footprint quite small even for very big files.
The larges piece of information to hold in memory at a time is the size
of one of the chunks the original was broken into.
(Pretty neat isn't it: This algorithm is from way back in the 60/70's when main memory was a very costly thing.)

Modification:

Forgot this in my original answer. After the file based merge sort is done you'll have a sorted file with maybe some duplicates. So you open this file scan through it line by line always remembering the last unique entry. If the next line read contains the same number as the last one it is not output to the final file. Lather, rinse, repeat ...
End of Modification


Cheers,


Manfred
   
v3
Comments
fjdiewornncalwe 3-Dec-10 10:18am
   
Nice... I like this...
You could load them a in groups of 10000 numbers, compare, and if not found, load the next 10000. The only things that should cause the loop to exit would be if the number was found, or if the end of the file was reached.

You didn't specify the exact format of the file, so I'm assuming that each number is on its own line.

Another thing you could do is find the largest number that exists in the big file, and keep it in a separate file. Then, you could first check to see if the new number is larger than the one in your single-number file, and if it is, you're gold, and you can add it to the big file. If not, then you can run through the big file to see if it's already in there.

BTW, sorting the contents of the file probably won't help because you can't assume that the other application will be adding them in any kind of order, and you'll end up constantly sorting the file before conducting your search. Not very efficient.


   
v5
Comments
fjdiewornncalwe 3-Dec-10 10:19am
   
Logically, I agree with you, but performance would be dreadful not because of your algorithm, but because of the volume of data being dealt with.
For starters, I'd want to get the guy who thought it was good idea to store 100+ million numbers in a text file in a corner and kick him a few times.

After I got out of jail, I would probably transfer the numbers into a simple db so that I could take advantage of the databases search algorithms.
   
Comments
Pallab_GT 3-Dec-10 9:14am
   
Thanks for your reply Marcus. I do agree with you on kicking the concerned person.
Using database is not an option.
Thanks again.
fjdiewornncalwe 3-Dec-10 10:17am
   
Can you do anything with the flat file, or do you have to use it as is. ie. Could you pre-process the file into some sort of sub-file sorted range thingy which would allow you to make searches less intensive, or is the flat file constantly changing? ie... Look at Manfred's post above. That's what I'm talking about here.
Well , My logic to accoplish the same will be.

1. Either use LINQ or T-SQL for finding distinct ones.
2. Loop through Millions of record you have and check
  1. If it is in distict then let it be as it is
  2. If it isn't in distinct then keep very first and delete remaining.

This was my solution at that time i needed to find and delete duplicates from 50,000 records.

Try it.

Please vote and Accept Answer if it Helped.
   
v2
Comments
Pallab_GT 3-Dec-10 9:15am
   
Thanks for your reply. But using a database is not an option.
Hiren solanki 3-Dec-10 9:18am
   
Then probably load all the records in datatable and apply LINQ, If it's hard for datatable then use of making partially datatable for records and move on the same way.
Manfred Rudolf Bihy 3-Dec-10 9:42am
   
I seriously doubt that doing this with LINQ but without a DBMS would work with 100 million+ 16 digit numbers.
Use hashing algorithm for this. Calculate hash for each file you have. It will give you a 16 byte string. Whenever the two hashed strings match, it means that the files are duplicate. Google for hashing algorithm and you may get lots of responses.
   
Comments
fjdiewornncalwe 3-Dec-10 8:46am
   
By the sound of the question, all of the numbers are in a massive, huge, crazy-a** text file. If each one is in a separate file, then your method would be very effective, but I'm not sure if you save a lot of cycles with hashing tiny files as opposed to opening and checking the contents.
i think u may use some logic to divide them by some prefix, such like date+time, then the probability of duplicates will be decreased.
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)




CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900