Click here to Skip to main content
15,896,423 members
Please Sign up or sign in to vote.
5.00/5 (1 vote)
See more:
I have three files F1,F2 and F3.
Each file contains 10 million records (all numbers).
Now i have to remove duplicates from all these files eg number N1 should be unique in all the files .
I have approached this by loading each file in differcent hashset .Since memory out of exception is thrown if i store all records in a single hashset .
Is there any way by which i can check unique values in all the three hashSets ?
Or is there any better way to solve this problem without using any database ?
Posted

Hi,

Nice question,

Solution :
http://msdn.microsoft.com/en-us/library/dd997372.aspx[^]

Better Performance[^]

Memory Map File.

I have some other solution if you interested :)

[Update]

You concern is to Save memory right ?

You can use Split file and search for unique, but the problem is when your searching for unique numbers comes to end it may cross the memory limit.

So its better to first sort then search,

For best performance,

Fallow these Steps :

1. Split File in small parts.
2. Write a function which will number from each file and delete it from that file.
3. Start separate threads and pass each file to that function.
4. Create some data block files like 1 - 1000,1001-2000,2001-3000...
5. Put each number to its block ( but before putting it just check is that number already exists or not) if not exists then inset.
6. When your code will traverse all the file you will already have the unique in that data files.
7. Now marge all data blocks files to a single file.

Tips.
1. Don't use more that 10 threads.
2. Initially split each file's in 10 to 20 KB size.
3. Don't create data blocks more than 3000 ( You can use serialization [Binary] but just for process don't hold it in memory. )

If you don't understand any part then feel free to ask.

Sorry for let reply
 
Share this answer
 
v4
Comments
prejval2006 2-Jan-13 5:31am    
can you provide other solutions as well it will be very helpful
Suvabrata Roy 2-Jan-13 6:23am    
Please refer to the update block :)
Zoltán Zörgő 2-Jan-13 8:03am    
Looks much like what I suggested...
Suvabrata Roy 2-Jan-13 8:48am    
No, this is totally different idea.
Because I have create some file for indexing the data in small manner, using these approach you will never have any issue with memory capacity.
Zoltán Zörgő 2-Jan-13 9:05am    
As you wish.
Good one!
There was a technique used at the time, when punch-cards were used as storage. Those machines had really not much of memory :)
The UNIQUE "operator" can be treated as transitive. I suggest, you split the files, and treat them as a pyramid. Make files that are have let's say only 1 million record, but it's up to you.
F1-0...F1-9<br />
F2-0...F2-9<br />
F3-0...F3-9

Do an unique search 0U-10 = U(F1-0) (where U is the function returns that only unique records) on all separately and save them, than you will have:
0U-10...0U-19, 0U-20...0U-29, 0U-30...0U-39
all having only unique records.
Than merge some keeping the limit in mind, and do the search:
1U10-11-12-13 = U(0U1-10 + 0U1-11 + 0U1-12 + 0U1-13)<br />
1U14-15-16 = U(0U-14 + 0U-15 + 0U-16)<br />
1U17-18-19 = U(0U-17 + 0U-18 + 0U-19)<br />
1U20-21-22-23 = U(0U-20 + 0U-21 + 0U-22 + 0U-23)<br />
...<br />

And so on until you have the final result. This way you will have to load only small portions in the memory.
 
Share this answer
 
v5
From your post it's not clear exactly where the memory limit happens.
If you have text files where each record consists of one number only, than this should not cause any problem to load into a list or array of int. I estimate for 10 million numbers less than 100 MB memory per list, which is managable.
Once you have these data in memory, sort them and iterate over the three arrays to get the unique values.
E.g.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication
{
    public class NumData
    {
        public class Result
        {
            public int Count { get; private set; }
            public int Value { get; private set; }
            private void Set(int value, int count) { Value = value; Count = count; }
            public Result(int value, int count) { Set(value, count); }
            /// <summary>
            /// Get the min value of the current position of all (valid) iterators,
            /// move the min iterators by one position,
            /// and count the number of iterators with that min value.
            /// </summary>
            /// <remarks>The iterators must point to sorted data that have *no* multiple identical entries.</remarks>
            /// <returns>
            /// true = valid result (Value = current value, Count = number of iterators with that value),
            /// false = no iterator data left
            /// </returns>
            public bool Move(params Iterator[] cursors)
            {
                if (cursors == null || cursors.Length == 0 || !cursors.Any(c => c.IsValid)) return false;
                int min = cursors.Where(c => c.IsValid).Min(c => c.Curr);
                int count = 0;
                foreach (var cursor in cursors.Where(c => c.IsValid))
                {
                    if (cursor.Curr == min)
                    {
                        count++;
                        cursor.Next();
                    }
                }
                Set(min, count);
                return true;
            }
        }

        public class Iterator
        {
            /// <summary>Iterator Fasade to handle end of data gracefully (IsValid property).</summary>
            public Iterator(List<int> list)
	        {
                _items = list.GetEnumerator();
                IsValid = false;
	        }
            private IEnumerator<int> _items;
            public bool IsValid { get; private set; }
            public bool Next() { IsValid = _items.MoveNext(); return IsValid; }
            public int Curr { get { if (!IsValid) throw new InvalidOperationException("Beyond end of iteration"); return _items.Current; } }
        }

        private List<int> _list;
        public NumData(string path) 
            : this()
        {
            Console.WriteLine("...Load...");
            using (TextReader reader = File.OpenText(path))
            {
                while (Add(reader.ReadLine())) {}
            }
            SortUnique();
        }

        public NumData()
        {
            _list = new List<int>();
        }

        public bool Add(string s)
        {
            int n;
            if (string.IsNullOrEmpty(s) || !int.TryParse(s, out n)) return false;
            Add(n);
            return true;
        }

        public void Add(int n)
        {
            _list.Add(n);
        }

        public void SortUnique()
        {
            Console.Write("...Sort-Unique {0} values ...", _list.Count);
            if (_list.Count > 0)
            {
                // sort ...
                _list.Sort();

                // ... and remove duplicates
                List<int> list = new List<int>();
                int last = _list[0];
                list.Add(last);
                foreach (int value in _list)
                {
                    if (last != value)
                    {
                        last = value;
                        list.Add(value);
                    }
                }
                _list = list;
            }
            Console.WriteLine(" results in {0} values", _list.Count);
        }

        public Iterator Cursor { get { return new Iterator(_list); } }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // get iterators and get move to the first value each
            var pos1 = new NumData("..\\F1.txt").Cursor;
            var pos2 = new NumData("..\\F2.txt").Cursor;
            var pos3 = new NumData("..\\F3.txt").Cursor;
            pos1.Next();
            pos2.Next();
            pos3.Next();
            // prepare results
            var result = new NumData();
            var res = new NumData.Result(0, 0);

            // iterate concurrently over all sort-unique lists and report each value with its count of occurance
            int i = 0;
            while (res.Move(pos1, pos2, pos3))
            {
                if ((res.Count == 1)) result.Add(res.Value);
                if (++i % 1000000 == 0) Console.WriteLine("...{0}", i);
            }

            // show results
            var pos = result.Cursor;
            while (pos.Next())
            {
                Console.WriteLine("{0}", pos.Curr);
            }
            Console.WriteLine("Done (press any key to exit...)");
            Console.ReadKey();
        }

    }
}


Cheers
Andi
 
Share this answer
 
You can also try reading chunk of buffers instead of loading whole file in memory. Have a look StreamReader [^] class.

Read the part of the file and perform your comparisons and move the buffer to read next set of chunk and so on.
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900