
Welcome to the Lounge

   

For discussing anything related to a software developer's life. It is not for programming questions.

The Lounge is rated Safe For Work. If you're about to post something inappropriate for a shared office environment, then don't post it. No ads, no abuse, and no programming questions. Trolling (political, climate, religious or whatever) will result in your account being removed.

 
General: Re: Update on Yesterday post
Mark_Wallace, 12-Feb-20 23:52
General: Re: Update on Yesterday post
OriginalGriff, 13-Feb-20 0:57
General: Re: Update on Yesterday post
dan!sh, 13-Feb-20 1:19
General: Re: Update on Yesterday post
dandy72, 13-Feb-20 4:55
General: Re: Update on Yesterday post
dan!sh, 13-Feb-20 5:09
General: Re: Update on Yesterday post
dandy72, 13-Feb-20 5:54
General: Re: Update on Yesterday post
dandy72, 13-Feb-20 4:59
Question: Using machine learning to de-duplicate data
Rohit_bhat, 12-Feb-20 20:55
I have the following problem and think machine learning might help, but I'm not certain it fits my use case.

I have completed a machine learning course and have a data set of around a hundred million records of customer data (names, addresses, emails, phone numbers, etc.). I would like to find a way to clean this data and identify possible duplicates in it.

Most of the data was entered manually through an external system with no validation, so many of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.

For instance, we might have five different entries for a customer John Doe, each with different contact details.

We also have the opposite case, where records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants enter a random email address, so many different customer profiles end up sharing the same address. The same applies to phone numbers, addresses, etc.
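To make the shared-placeholder problem concrete, here is a minimal Python sketch that flags field values used by suspiciously many profiles, so they can be excluded as match keys. The sample records, the field names, the helper `find_overused_values` and the `max_profiles` threshold are all assumptions for illustration, not part of the real data set.

```python
from collections import defaultdict

# Hypothetical sample records; field names are assumed for illustration.
records = [
    {"id": 1, "name": "John Doe", "email": "john.doe@example.com"},
    {"id": 2, "name": "Jon Doe", "email": "John.Doe@example.com "},
    {"id": 3, "name": "Alice Smith", "email": "none@example.com"},
    {"id": 4, "name": "Bob Jones", "email": "none@example.com"},
    {"id": 5, "name": "Carol White", "email": "none@example.com"},
]


def normalize_email(email):
    """Trim whitespace and lowercase so trivial variants compare equal."""
    return email.strip().lower()


def find_overused_values(records, field, normalize, max_profiles):
    """Return normalized values of `field` found on more than `max_profiles` records."""
    ids_by_value = defaultdict(set)
    for record in records:
        value = normalize(record[field])
        if value:  # skip empty values
            ids_by_value[value].add(record["id"])
    return {
        value: ids
        for value, ids in ids_by_value.items()
        if len(ids) > max_profiles
    }


# The threshold of 2 is an arbitrary assumption; a real cut-off would need tuning.
overused = find_overused_values(records, "email", normalize_email, max_profiles=2)
print(overused)
```

A value flagged this way is not proof of a placeholder (one customer could genuinely have many profiles), so it is probably best treated as a hint that the field should not be trusted on its own for those records.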

All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop) and maybe HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure it will be of any real value). But the more I read about it, the more confused I am about how it would work in my case. For starters, I'm not sure which category this problem falls into, so I don't know what kind of algorithm to use: a clustering algorithm or a classification algorithm? And of course, certain rules will have to define what makes a profile unique, i.e. which fields to compare.
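One possible framing, offered as a sketch and not a recommendation: this kind of problem is commonly called record linkage or entity resolution, and it tends to combine both ideas. Deciding whether two records refer to the same customer is a pairwise decision (rule-based here, or a classifier if labelled pairs exist), and grouping the matched pairs into profiles is a clustering step. The Python below is a toy illustration using only the standard library; the sample records, field names, the 0.85 threshold and the helper names are assumptions. Because comparing every pair of a hundred million records is infeasible, it only compares records that share a blocking key (email or phone).

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; field names are assumed for illustration.
records = [
    {"id": 1, "name": "John Doe", "email": "john.doe@example.com", "phone": "555-0100"},
    {"id": 2, "name": "Jon Doe", "email": "john.doe@example.com", "phone": "555-0199"},
    {"id": 3, "name": "John  DOE", "email": "jdoe@example.org", "phone": "555-0100"},
    {"id": 4, "name": "Alice Smith", "email": "alice@example.com", "phone": "555-0142"},
]


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def name_similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_pairs(records):
    """Blocking: only pair up records that share an email or a phone number."""
    blocks = defaultdict(list)
    for r in records:
        blocks[("email", r["email"])].append(r["id"])
        blocks[("phone", r["phone"])].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cluster(records, name_threshold=0.85):
    """Group record ids into profiles by merging matching candidate pairs."""
    by_id = {r["id"]: r for r in records}
    parent = {rid: rid for rid in by_id}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in sorted(candidate_pairs(records)):
        # Rule (an assumption): shared contact detail plus similar name.
        if name_similarity(by_id[a]["name"], by_id[b]["name"]) >= name_threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for rid in sorted(by_id):
        groups[find(rid)].append(rid)
    return sorted(groups.values())


profiles = cluster(records)
print(profiles)
```

Note that records 2 and 3 share no contact detail and are only grouped through record 1. Whether such transitive merges are acceptable is exactly the kind of uniqueness rule that would have to be decided up front.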

modified 19-Feb-20 8:45am.

Answer: Re: Using machine learning to de-duplicate data
OriginalGriff, 12-Feb-20 21:30
Answer: Re: Using machine learning to de-duplicate data
the goat in your machine, 12-Feb-20 21:49
Answer: Re: Using machine learning to de-duplicate data
Mark_Wallace, 12-Feb-20 22:13
Answer: Re: Using machine learning to de-duplicate data
musefan, 12-Feb-20 22:34
Answer: Re: Using machine learning to de-duplicate data
GuyThiebaut, 13-Feb-20 2:13
Answer: Re: Using machine learning to de-duplicate data
dandy72, 13-Feb-20 4:41
General: Man only?
Kornfeld Eliyahu Peter, 12-Feb-20 19:52
General: Re: Man only?
OriginalGriff, 12-Feb-20 20:57
General: When?
Kornfeld Eliyahu Peter, 12-Feb-20 19:49
General: Re: When?
Sander Rossel, 12-Feb-20 22:39
Joke: Found on Internet (Popular)
honey the codewitch, 12-Feb-20 10:39
General: Re: Found on Internet
Super Lloyd, 12-Feb-20 12:45
General: Re: Found on Internet
Amarnath S, 12-Feb-20 21:19
General: Re: Found on Internet
OriginalGriff, 12-Feb-20 21:31
General: Re: Found on Internet
kalberts, 13-Feb-20 0:44
General: Re: Found on Internet
Jörgen Andersson, 12-Feb-20 23:37
General: Re: Found on Internet
Amarnath S, 12-Feb-20 23:42
