I have a large CSV file with ~10 crore (about 100 million) records. I need to find duplicates based on the first column of each record. How can I do this efficiently with less RAM?

What I have tried:

I tried this using a hash map, but it didn't work: it took too much time and memory, and I got a warning: "thread starvation or clock leap detected".
Posted
Updated 23-Jun-23 1:08am
Comments
Richard MacCutchan 23-Jun-23 5:21am    
With that many records (10 crore = 100 million) you will probably need to break it down into smaller chunks and process each independently.
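
As a rough sketch of that chunking idea (the bucket count and file names below are illustrative assumptions, not from the original post): hash the first column to pick a bucket file so that identical keys always land in the same bucket, then deduplicate each bucket on its own with a small HashSet.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;

public class ChunkedDuplicateFinder {
    static final int NUM_BUCKETS = 64; // tune so each bucket's keys fit in RAM

    public static void main(String[] args) throws IOException {
        String filename = "your_file_to_read_from.csv"; // placeholder input file

        // Pass 1: partition lines into bucket files by the hash of the first column.
        BufferedWriter[] buckets = new BufferedWriter[NUM_BUCKETS];
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".csv"));
        }
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                String key = line.split(",", 2)[0];
                // floorMod keeps the index non-negative even for negative hash codes
                int bucket = Math.floorMod(key.hashCode(), NUM_BUCKETS);
                buckets[bucket].write(line);
                buckets[bucket].newLine();
            }
        }
        for (BufferedWriter w : buckets) {
            w.close();
        }

        // Pass 2: duplicates can only occur within a single bucket, so each
        // bucket is deduplicated independently with a small HashSet.
        for (int i = 0; i < NUM_BUCKETS; i++) {
            HashSet<String> seen = new HashSet<>();
            try (BufferedReader br = new BufferedReader(new FileReader("bucket_" + i + ".csv"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String key = line.split(",", 2)[0];
                    if (!seen.add(key)) {
                        System.out.println("Duplicate found: " + line);
                    }
                }
            }
        }
    }
}

With 64 buckets, each pass-2 HashSet only has to hold roughly 1/64th of the unique keys at any one time.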

1 solution

Without code supplied it is a bit of a shot in the dark; we used something similar to the example below.
1) A lot of tips on reducing memory consumption can be found at Reducing memory consumption in Java[^]
2) A great article from Oracle with tips on tuning your code to be more efficient: Tuning For a Small Memory Footprint[^]
3) Using the right garbage collector: Minimize Java Memory Usage with the Right Garbage Collector[^]
4) A tool to test and rectify your Java performance: Rapidly Optimize Java Performance[^]

With the issue you describe, I would do the following:
a) Read the file line by line.
b) Split each line into columns.
c) Extract the value from the first column to use for the duplicate check.
d) Check if the value already exists in a HashSet.
e) If the value is already in the HashSet, perform your action (deleting the record, logging it, etc.).
f) If the value is not a duplicate, add it to the HashSet.
g) Repeat steps b-f for each line in your file.
h) For garbage collection, see below.

The above stores only the unique first-column values in memory, rather than the whole file, which keeps RAM usage down.
If you want to be explicit about garbage collection in your code, you can request it with the 'System.gc()' method.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class CheckForDuplicatesInFirstColumn {
    public static void main(String[] args) {
        String filename = "your_file_to_read_from.csv";
        HashSet<String> uniqueValues = new HashSet<>();

        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split only at the first comma; we only need the first column (the key)
                String value = line.split(",", 2)[0];

                // HashSet.add() returns false when the value is already present,
                // so a single call both checks for and records the key
                if (!uniqueValues.add(value)) {
                    // Duplicate found, do something here
                    System.out.println("Duplicate found: " + line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Request garbage collection after processing the file.
        // Keep in mind that System.gc() is only a hint; when collection
        // actually runs is up to your JVM, so it should be used sparingly.
        System.gc();
    }
}
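
If even the HashSet of unique keys is too large for 10 crore rows, one variation (my own assumption, not part of the solution above) is to store a 64-bit hash of each key instead of the key string itself; a boxed Long costs far less memory than a long CSV key, at the price of a tiny chance that two different keys share a hash.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class CheckForDuplicatesByHash {
    // FNV-1a-style 64-bit hash over the string's chars, used so the set
    // stores 8-byte Longs instead of full key strings
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        String filename = "your_file_to_read_from.csv"; // placeholder input file
        HashSet<Long> seenHashes = new HashSet<>();

        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                long keyHash = fnv1a64(line.split(",", 2)[0]);
                // add() returns false if this hash was already seen
                if (!seenHashes.add(keyHash)) {
                    System.out.println("Possible duplicate: " + line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

With ~100 million keys, the overall odds of a spurious 64-bit collision are on the order of 10^-4, which is usually acceptable for a duplicate report; if not, fall back to storing the full keys.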
 
 
Comments
CPallini 23-Jun-23 7:18am    
5.
Andre Oosthuizen 23-Jun-23 7:18am    
Thanks!
