How to fetch distinct values in multiple text files ?

Question

5.00/5 (1 vote)

See more:

I have three files F1,F2 and F3.
Each file contains 10 million records (all numbers).
Now i have to remove duplicates from all these files eg number N1 should be unique in all the files .
I have approached this by loading each file in differcent hashset .Since memory out of exception is thrown if i store all records in a single hashset .
Is there any way by which i can check unique values in all the three hashSets ?
Or is there any better way to solve this problem without using any database ?

Posted 1-Jan-13 23:10pm

prejval2006

Add a Solution

4 solutions

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

*Suvabrata Roy* · Answer 1 · 2013-01-01T23:24:00

Hi,

Nice question,

Solution :
http://msdn.microsoft.com/en-us/library/dd997372.aspx[^]

Better Performance[^]

Memory Map File.

I have some other solution if you interested :)

[Update]

You concern is to Save memory right ?

You can use Split file and search for unique, but the problem is when your searching for unique numbers comes to end it may cross the memory limit.

So its better to first sort then search,

For best performance,

Fallow these Steps :

1. Split File in small parts.
2. Write a function which will number from each file and delete it from that file.
3. Start separate threads and pass each file to that function.
4. Create some data block files like 1 - 1000,1001-2000,2001-3000...
5. Put each number to its block ( but before putting it just check is that number already exists or not) if not exists then inset.
6. When your code will traverse all the file you will already have the unique in that data files.
7. Now marge all data blocks files to a single file.

Tips.
1. Don't use more that 10 threads.
2. Initially split each file's in 10 to 20 KB size.
3. Don't create data blocks more than 3000 ( You can use serialization [Binary] but just for process don't hold it in memory. )

If you don't understand any part then feel free to ask.

Sorry for let reply

**Zoltán Zörgő** · Answer 2 · 2013-01-01T23:31:00

Good one!
There was a technique used at the time, when punch-cards were used as storage. Those machines had really not much of memory :)
The UNIQUE "operator" can be treated as transitive. I suggest, you split the files, and treat them as a pyramid. Make files that are have let's say only 1 million record, but it's up to you.

F1-0...F1-9<br />
F2-0...F2-9<br />
F3-0...F3-9

Do an unique search 0U-10 = U(F1-0) (where U is the function returns that only unique records) on all separately and save them, than you will have:
0U-10...0U-19, 0U-20...0U-29, 0U-30...0U-39
all having only unique records.
Than merge some keeping the limit in mind, and do the search:

1U10-11-12-13 = U(0U1-10 + 0U1-11 + 0U1-12 + 0U1-13)<br />
1U14-15-16 = U(0U-14 + 0U-15 + 0U-16)<br />
1U17-18-19 = U(0U-17 + 0U-18 + 0U-19)<br />
1U20-21-22-23 = U(0U-20 + 0U-21 + 0U-22 + 0U-23)<br />
...<br />

And so on until you have the final result. This way you will have to load only small portions in the memory.

Andreas Gieriet · Answer 3 · 2013-01-02T09:12:00

From your post it's not clear exactly where the memory limit happens.
If you have text files where each record consists of one number only, than this should not cause any problem to load into a list or array of int. I estimate for 10 million numbers less than 100 MB memory per list, which is managable.
Once you have these data in memory, sort them and iterate over the three arrays to get the unique values.
E.g.

C#

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication
{
    public class NumData
    {
        public class Result
        {
            public int Count { get; private set; }
            public int Value { get; private set; }
            private void Set(int value, int count) { Value = value; Count = count; }
            public Result(int value, int count) { Set(value, count); }
            /// <summary>
            /// Get the min value of the current position of all (valid) iterators,
            /// move the min iterators by one position,
            /// and count the number of iterators with that min value.
            /// </summary>
            /// <remarks>The iterators must point to sorted data that have *no* multiple identical entries.</remarks>
            /// <returns>
            /// true = valid result (Value = current value, Count = number of iterators with that value),
            /// false = no iterator data left
            /// </returns>
            public bool Move(params Iterator[] cursors)
            {
                if (cursors == null || cursors.Length == 0 || !cursors.Any(c => c.IsValid)) return false;
                int min = cursors.Where(c => c.IsValid).Min(c => c.Curr);
                int count = 0;
                foreach (var cursor in cursors.Where(c => c.IsValid))
                {
                    if (cursor.Curr == min)
                    {
                        count++;
                        cursor.Next();
                    }
                }
                Set(min, count);
                return true;
            }
        }

        public class Iterator
        {
            /// <summary>Iterator Fasade to handle end of data gracefully (IsValid property).</summary>
            public Iterator(List<int> list)
	        {
                _items = list.GetEnumerator();
                IsValid = false;
	        }
            private IEnumerator<int> _items;
            public bool IsValid { get; private set; }
            public bool Next() { IsValid = _items.MoveNext(); return IsValid; }
            public int Curr { get { if (!IsValid) throw new InvalidOperationException("Beyond end of iteration"); return _items.Current; } }
        }

        private List<int> _list;
        public NumData(string path) 
            : this()
        {
            Console.WriteLine("...Load...");
            using (TextReader reader = File.OpenText(path))
            {
                while (Add(reader.ReadLine())) {}
            }
            SortUnique();
        }

        public NumData()
        {
            _list = new List<int>();
        }

        public bool Add(string s)
        {
            int n;
            if (string.IsNullOrEmpty(s) || !int.TryParse(s, out n)) return false;
            Add(n);
            return true;
        }

        public void Add(int n)
        {
            _list.Add(n);
        }

        public void SortUnique()
        {
            Console.Write("...Sort-Unique {0} values ...", _list.Count);
            if (_list.Count > 0)
            {
                // sort ...
                _list.Sort();

                // ... and remove duplicates
                List<int> list = new List<int>();
                int last = _list[0];
                list.Add(last);
                foreach (int value in _list)
                {
                    if (last != value)
                    {
                        last = value;
                        list.Add(value);
                    }
                }
                _list = list;
            }
            Console.WriteLine(" results in {0} values", _list.Count);
        }

        public Iterator Cursor { get { return new Iterator(_list); } }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // get iterators and get move to the first value each
            var pos1 = new NumData("..\\F1.txt").Cursor;
            var pos2 = new NumData("..\\F2.txt").Cursor;
            var pos3 = new NumData("..\\F3.txt").Cursor;
            pos1.Next();
            pos2.Next();
            pos3.Next();
            // prepare results
            var result = new NumData();
            var res = new NumData.Result(0, 0);

            // iterate concurrently over all sort-unique lists and report each value with its count of occurance
            int i = 0;
            while (res.Move(pos1, pos2, pos3))
            {
                if ((res.Count == 1)) result.Add(res.Value);
                if (++i % 1000000 == 0) Console.WriteLine("...{0}", i);
            }

            // show results
            var pos = result.Cursor;
            while (pos.Next())
            {
                Console.WriteLine("{0}", pos.Curr);
            }
            Console.WriteLine("Done (press any key to exit...)");
            Console.ReadKey();
        }

    }
}

Cheers
Andi

Jibesh · Answer 4 · 2013-01-02T00:17:00

Solution 3

You can also try reading chunk of buffers instead of loading whole file in memory. Have a look StreamReader [^] class.

Read the part of the file and perform your comparisons and move the buffer to read next set of chunk and so on.

Posted 2-Jan-13 0:17am

Jibesh

How to fetch distinct values in multiple text files ?

4 solutions

Solution 1

Solution 2

Solution 4

Solution 3

Add your solution here

Preview 0