I have a large CSV file with ~10 crore (about 100 million) records. I need to find duplicates based on the first column of each record. How can I do this efficiently with less RAM?

What I have tried:

I tried this using a hash map, but it didn't work: it took too much time and memory, and I got a warning: "thread starvation or clock leap detected".
Posted
Updated 23-Jun-23 1:08am
Comments
Richard MacCutchan 23-Jun-23 5:21am    
With that many records (10 crore = 100 million) you will probably need to break it down into smaller chunks and process each independently.
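
As a rough sketch of that chunking idea (the bucket count and file names below are illustrative assumptions, not from the original post): hash the first column to pick a bucket file so that identical keys always land in the same bucket, then deduplicate each bucket on its own with a small HashSet.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;

public class ChunkedDuplicateFinder {
    static final int NUM_BUCKETS = 64; // tune so each bucket's keys fit in RAM

    public static void main(String[] args) throws IOException {
        String filename = "your_file_to_read_from.csv"; // placeholder input file

        // Pass 1: partition lines into bucket files by the hash of the first column.
        BufferedWriter[] buckets = new BufferedWriter[NUM_BUCKETS];
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".csv"));
        }
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                String key = line.split(",", 2)[0];
                // floorMod keeps the index non-negative even for negative hash codes
                int bucket = Math.floorMod(key.hashCode(), NUM_BUCKETS);
                buckets[bucket].write(line);
                buckets[bucket].newLine();
            }
        }
        for (BufferedWriter w : buckets) {
            w.close();
        }

        // Pass 2: duplicates can only occur within a single bucket, so each
        // bucket is deduplicated independently with a small HashSet.
        for (int i = 0; i < NUM_BUCKETS; i++) {
            HashSet<String> seen = new HashSet<>();
            try (BufferedReader br = new BufferedReader(new FileReader("bucket_" + i + ".csv"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String key = line.split(",", 2)[0];
                    if (!seen.add(key)) {
                        System.out.println("Duplicate found: " + line);
                    }
                }
            }
        }
    }
}

With 64 buckets, each pass-2 HashSet only has to hold roughly 1/64th of the unique keys at any one time.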

1 solution

Without code supplied it is a bit of a shot in the dark; we used something similar to the example below.
1) A lot of tips on reducing memory consumption can be found at Reducing memory consumption in Java[^]
2) A great article from Oracle with tips on tuning your code to be more efficient: Tuning For a Small Memory Footprint[^]
3) Using the right garbage collector: Minimize Java Memory Usage with the Right Garbage Collector[^]
4) A tool to test and rectify your Java performance: Rapidly Optimize Java Performance[^]

With the issue you describe, I would do the following:
a) Read the file line by line.
b) Split each line into columns.
c) Extract the value from the first column to use for the duplicate check.
d) Check if the value already exists in a HashSet.
e) If the value is already in the HashSet, perform your action (deleting the record, logging it, etc.).
f) If the value is not a duplicate, add it to the HashSet.
g) Repeat steps b-f for each line in your file.
h) For garbage collection, see below.

The above stores only the unique first-column values in memory, rather than the whole file, which keeps RAM usage down.
If you want to be explicit about garbage collection in your code, you can request it with the 'System.gc()' method.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class CheckForDuplicatesInFirstColumn {
    public static void main(String[] args) {
        String filename = "your_file_to_read_from.csv";
        HashSet<String> uniqueValues = new HashSet<>();

        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split only at the first comma; we only need the first column (the key)
                String value = line.split(",", 2)[0];

                // HashSet.add() returns false when the value is already present,
                // so a single call both checks for and records the key
                if (!uniqueValues.add(value)) {
                    // Duplicate found, do something here
                    System.out.println("Duplicate found: " + line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Request garbage collection after processing the file.
        // Keep in mind that System.gc() is only a hint; when collection
        // actually runs is up to your JVM, so it should be used sparingly.
        System.gc();
    }
}
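
If even the HashSet of unique keys is too large for 10 crore rows, one variation (my own assumption, not part of the solution above) is to store a 64-bit hash of each key instead of the key string itself; a boxed Long costs far less memory than a long CSV key, at the price of a tiny chance that two different keys share a hash.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class CheckForDuplicatesByHash {
    // FNV-1a-style 64-bit hash over the string's chars, used so the set
    // stores 8-byte Longs instead of full key strings
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        String filename = "your_file_to_read_from.csv"; // placeholder input file
        HashSet<Long> seenHashes = new HashSet<>();

        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                long keyHash = fnv1a64(line.split(",", 2)[0]);
                // add() returns false if this hash was already seen
                if (!seenHashes.add(keyHash)) {
                    System.out.println("Possible duplicate: " + line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

With ~100 million keys, the overall odds of a spurious 64-bit collision are on the order of 10^-4, which is usually acceptable for a duplicate report; if not, fall back to storing the full keys.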
 
 
Comments
CPallini 23-Jun-23 7:18am    
5.
Andre Oosthuizen 23-Jun-23 7:18am    
Thanks!
