I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have completed a machine learning course and have a data set of around a million records of customer data (names, addresses, emails, phones, etc.). I would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have cases where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will use a random email address, resulting in many different customer profiles using the same email address. The same applies to phones, addresses, etc.
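As a rough illustration of the shared-email case just described, here is a minimal Python sketch that flags field values used by suspiciously many different names, so they can be ignored as match keys. The record layout, the `find_placeholder_values` helper and the `max_distinct_names` cut-off are all assumptions made up for this example, not anything from the actual system.

```python
from collections import defaultdict


def find_placeholder_values(records, field, max_distinct_names=3):
    """Return values of `field` that are shared by more than
    `max_distinct_names` distinct customer names.

    The cut-off is an assumption; it would need tuning on real data.
    """
    names_by_value = defaultdict(set)
    for rec in records:
        value = (rec.get(field) or "").strip().lower()
        if value:  # skip empty / missing values
            names_by_value[value].add(rec["name"].strip().lower())
    return {
        value
        for value, names in names_by_value.items()
        if len(names) > max_distinct_names
    }


# Hypothetical sample data.
records = [
    {"name": "John Doe", "email": "none@example.com"},
    {"name": "Jane Roe", "email": "none@example.com"},
    {"name": "Ann Lee", "email": "none@example.com"},
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "Jon Doe", "email": "john.doe@example.com"},
]

placeholders = find_placeholder_values(records, "email", max_distinct_names=2)
print(placeholders)
```

An email shared by only two similar names is left alone, since that is exactly the kind of value that may point at a real duplicate.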
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as a machine learning platform (since this is a Java shop) and maybe use HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure if it will be of any real value). But the more I read about it, the more confused I am as to how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure what category this problem falls into. Can I use a clustering algorithm or a classification algorithm? And of course, certain rules will have to be used as to what constitutes a profile's uniqueness, i.e. which fields.
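On the clustering-versus-classification question, one common way to frame this kind of problem (usually called record linkage or entity resolution) uses both: decide for each candidate pair of records whether it is a match, then group the matched pairs into clusters. The sketch below shows only the grouping step, using a small union-find in Python; the `cluster_matches` name and the sample IDs are made up for illustration, and how the matched pairs are produced is left open.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record IDs into clusters, given pairs already judged to match.

    Records connected through a chain of matches end up in the same cluster.
    """
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical record IDs and match decisions.
groups = cluster_matches([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(groups)
```

Note that chaining can pull records together that were never compared directly (1 and 3 above), which may or may not be what you want for customer profiles.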
modified 19-Feb-20 8:45am.
|
Have you tried turning it off and back on again?
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
AntiTwitter: @DalekDave is now a follower!
|
You forgot to add that there may be 50 different people with the same name, i.e. Mohamm John Doe (often just the one name, sometimes spelled differently, sometimes father and son sharing the same name...), and the birthdate is sometimes unknown so it's 1-Jan-yyyy (and not always the same year for the same customer because they just don't know for sure). People may have moved, so address [or even locality] is not telling.
Only 1 million (or is that just the sample?), well over 6 million I thought.
LOL, been there, ... ran away AFAP.
|
Is it too much effort to use a query?
I wanna be a eunuchs developer! Pass me a bread knife!
|
Rohit_bhat wrote: a lot of our customers have ended up with more than one profile in our DB
How do you know that? What is it about those records that makes you know for sure that they are the same person? Work that out, then apply that logic to the automated process.
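To make "work that out, then apply that logic" a bit more concrete, here is a minimal Python sketch of a hand-written duplicate rule. The rule itself (similar name plus a shared non-placeholder email or a shared phone), the 0.85 threshold, the field names and the `PLACEHOLDER_EMAILS` set are all assumptions for illustration; the real criteria have to come from whoever knows the data.

```python
from difflib import SequenceMatcher

# Hypothetical list of known dummy emails entered by consultants.
PLACEHOLDER_EMAILS = {"none@example.com"}


def normalize(text):
    """Lower-case and collapse whitespace; None becomes an empty string."""
    return " ".join((text or "").lower().split())


def digits_only(phone):
    """Keep only the digits so '555-0100' and '(555) 0100' compare equal."""
    return "".join(ch for ch in (phone or "") if ch.isdigit())


def is_probable_duplicate(a, b, name_threshold=0.85):
    """Assumed rule: names must be similar AND the records must share
    either a real (non-placeholder) email or a phone number."""
    name_score = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])
    ).ratio()
    if name_score < name_threshold:
        return False

    email_a, email_b = normalize(a.get("email")), normalize(b.get("email"))
    same_email = (
        bool(email_a) and email_a == email_b and email_a not in PLACEHOLDER_EMAILS
    )

    phone_a, phone_b = digits_only(a.get("phone")), digits_only(b.get("phone"))
    same_phone = bool(phone_a) and phone_a == phone_b

    return same_email or same_phone


# Hypothetical records.
rec_a = {"name": "John Doe", "email": "none@example.com", "phone": "555-0100"}
rec_b = {"name": "Jon Doe", "email": "jon@example.com", "phone": "(555) 0100"}
rec_c = {"name": "Jane Roe", "email": "none@example.com", "phone": "555-0199"}
rec_d = {"name": "John Doe", "email": "none@example.com", "phone": ""}

print(is_probable_duplicate(rec_a, rec_b))  # similar name, same phone
print(is_probable_duplicate(rec_a, rec_c))  # shared dummy email only
print(is_probable_duplicate(rec_a, rec_d))  # same name, but only a dummy email in common
```

Comparing every record against every other one will not scale to a large table, so in practice you would first narrow down candidate pairs (for example by querying for similar names) and only run the rule on those.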
|
If you have done a machine learning course then you will have a basic understanding of statistical modelling.
Call me simple, but surely all you need to do is create a training dataset and apply that dataset to a number of different models until you get the results you are expecting.
You can then apply this model to your live dataset and see what results you get.
I feel like I am patronising you by explaining the above, having myself spent only about 30 minutes learning about machine learning, so I am sure you know much more than me on this subject and can see the flaws in my suggestions.
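For what it's worth, the "training dataset, then model" idea can be sketched in a few lines of Python. This is deliberately the smallest possible stand-in for a model: a single similarity score per pair and a cut-off fitted on hand-labelled pairs. The labelled pairs, the `name_similarity` feature and the `fit_threshold` helper are all invented for illustration; a real attempt would use more features and a proper library.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(scored_pairs):
    """Pick the cut-off that labels the most training pairs correctly.

    `scored_pairs` is a list of (score, is_duplicate) tuples.
    A pair is predicted to be a duplicate when score >= threshold.
    """
    best_threshold, best_correct = 1.0, -1
    for threshold in sorted({score for score, _ in scored_pairs}):
        correct = sum(
            (score >= threshold) == label for score, label in scored_pairs
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical hand-labelled training pairs: (name 1, name 2, is_duplicate).
training_pairs = [
    ("John Doe", "Jon Doe", True),
    ("Katherine Hill", "Catherine Hill", True),
    ("John Doe", "Jane Roe", False),
    ("Mary Smith", "Mark Smyth", False),
]

scored = [(name_similarity(a, b), label) for a, b, label in training_pairs]
threshold = fit_threshold(scored)

# Apply the fitted cut-off to a pair that was not in the training set.
prediction = name_similarity("Jon Smith", "John Smith") >= threshold
print(threshold, prediction)
```

The hard part is not the fitting but the labelling: someone still has to decide, for a few hundred or thousand pairs, which ones really are the same customer.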
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
My personal take on this sort of thing. And please don't take this the wrong way.
Can you, as a human being, with a human brain, look at any two such records and define some criteria by which you can decide what's a duplicate and what's not...? And then correctly decide what to do about the situation?
If you can't, then I'm afraid this is another case of "machine learning" being presented as a panacea to solve all of humanity's problems.
As I was saying just a few days ago in some unrelated thread...this is how "AI" and "big data" ends up showing me ads for articles I've just purchased, after the purchase was made...
|
For me it was:
Sept 2010: I don't know what the hell I'm doing.
Jan 2011: Read about design patterns and SOLID.
Feb 2011: I AM A GOD!
I now wield a +5 magical keyboard and a mouse that has automatic protection from evil management bugs
|
So a German, an Englishman and an Irishman were all in Saudi Arabia, sharing a smuggled crate of booze, when they were arrested by Saudi police. The mere possession of alcohol is a severe offence in Saudi Arabia, so they were all sentenced to death!
However, after many months and with the help of very good lawyers, they were able to appeal their sentences down to 20 lashes each of the whip.
As they were preparing for their punishment, the Sheikh announced: “As it is my first wife’s birthday today, she has asked me to allow each of you one wish before your whipping.”
The German was first in line; after thinking for a bit he said, “Please tie a pillow to my back.” This was done, but after only 10 lashes the whip had shredded the pillow. When the punishment was done the German had to be carried away bleeding and crying in pain.
The Englishman was next up. After watching the German in horror he asked, “Please tie two pillows to my back.” This time it took 15 lashes, but once again the pillows were shredded, and the Englishman was led away bleeding and whimpering in pain.
The Irishman was the last one up, but before he could say anything, the Sheikh turned to him and said: “You are from the most beautiful part of the world I have ever seen. Because of this, you may have two wishes!”
“Thank you, your Most Royal and Merciful highness,” the Irishman replied. “In recognition of your kindness, my first wish is that you give me not 20, but 100 lashes.”
“Not only are you an honorable man from a beautiful island, you are also very brave,” the Sheikh said with admiration. “If 100 lashes is what you desire, then so be it. And your second wish?”
And the Irishman said, “Tie the Englishman to my back.”
Real programmers use butterflies
|
True story!
Though one could ask.. were they really drinking together, hey?!
|
And I thought that the Irish were also English.
|
Go to Ireland, stand in a pub, and declare that!
|
I heard from this American who returned from a visit to Cardiff with a swollen left eye. How did it happen? Well, he had been to this pub, where he had been chatting with these three girls, and asked them something like "Are you local English girls, from around here?" The girls sneered back: "That is Wales, you ignorant fool!" So he told them he was sorry, and rephrased his question: "I am so sorry! So, are you three whales from around here?"
|
Our geography textbooks taught us only one thing - the British Isles.
|
I can imagine. Our geography textbooks teach us way too little about India (or the rest of the world, for that matter).
When I was studying, we got an Indian exchange student at my student home.
Our discussions on more or less anything were quite an eye-opener as to exactly how much was missing from our mutual knowledge.
|
Don't let them hear you say that!
|
Monday is shiny and looks so cool, and at first blush it seemed so much better than JIRA. Well, frankly, anything is better than JIRA, so it still wins that.
But the shiny is starting to wear off, mainly because the UX is awkward. They could make things either a bit more obvious, a bit easier to do, or even possible to do. My current peeve is that once I create a filter, I can't edit the filter name. WTF?
Still, some aspects of the UX are freaking awesome - fast and responsive, search is great, filtering is great, and the UI is clean.
Now, on the positive side, they are making constant improvements, like adding sub-items. Of course, I look at something like Monday and I wonder, how could you go to production and have a useful tool without supporting sub-items? It's a mystery to me.
Still, I feel a bit embarrassed. The glitter and makeup fooled me.
|
Marc Clifton wrote: The glitter and makeup fooled me.
I tested it out for the same reason. I can't remember specifically what it was, but there were several basic things that it could not do so we don't use it. More hype than substance from what I could tell.
Social Media - A platform that makes it easier for the crazies to find each other.
Everyone is born right handed. Only the strongest overcome it.
Fight for left-handed rights and hand equality.
|
(No comment on the product itself...)
While it's a great domain name, given that nobody likes Mondays...does that make a good product name...?
|
dandy72 wrote: given that nobody likes Mondays
Actually - I find the first 5 days after the weekend to be the toughest to get through.
If you can't laugh at yourself - ask me and I will do it for you.
|
The man's got a point.
|