Forename-Title Profiling
Introduction...
Introduction
This script shows how data in the table can be cleansed/enhanced using other records in the same table. It is essential that you have enough records in the database to get correct results. In this example, we will see how to fix values inTitles
field of table that contains contact details basing on popular Forename-Title
matches.
Using the Code
WithCREATE TABLE
statement, we find popular Forename-Gender combinations in our table and store them in Forename_Gender
table.
-------------------------------------------------------------
-- Create table with all forename-gender combinations --
-------------------------------------------------------------
DROP TABLE Forename_Gender;
CREATE TABLE Forename_Gender
AS
SELECT
Forenames,
Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
'Mrs', 'F', 'Ms', 'F','Mistress','F', 'U') AS Gender,
COUNT(*) AS Amount
FROM Contacts
WHERE
NOT Forenames LIKE '% %'
AND LENGTH(Forenames)>2
GROUP BY
Forenames,
Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
'Mrs', 'F', 'Ms', 'F', 'Mistress', 'F', 'U')
/
-- Delete matches where gender is Unknown
DELETE FROM Forename_Gender WHERE Gender='U'
/
Then we need to remove records where the same forename can be given to both boy and girl. I have used 1% threshold, i.e. if both male and female genders have more than 1% of records (amount) in the forename group, then this Forename-Gender combination cannot be used and will be deleted from Forename_Gender
table.
-------------------------------------------------------------
-- Delete records where forename can be both
-- male and female
-------------------------------------------------------------
DELETE FROM forename_gender f
WHERE EXISTS
(SELECT * FROM forename_gender WHERE
Forenames=f.Forenames
AND Gender<>f.Gender
AND AMOUNT*100>f.amount -- there is more than 1%
-- of people with different gender and same forename.
-- NOTE: you must change 100 to other number to use
-- different threshold. I.e. 50 will mean that 2% of records
-- can have other Gender without being taken into account.
)
/
Then I delete unpopular forenames - less than 5 records in the table.
-------------------------------------------------------------
-- Need more than 5 records having same Gender
-------------------------------------------------------------
DELETE FROM forename_gender
WHERE Amount<5
/
Check what will be the result of update:
-- This statement returns results that show which records will be
-- affected by the update.
SELECT
C.Forenames, -- Forename of the contact
Count(C.Forenames) As Number_To_Update, -- Records will be updated
DECODE(Gender,'M','Mr','F','Ms','Mr/Ms') as New_Title, -- New Title
AVG(FG.Amount) AS Number_In_DB -- Number of records in DB that
-- support this match
FROM Contacts C Inner Join
Forename_Gender FG ON C.Forenames=FG.Forenames
WHERE
UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
GROUP BY C.Forenames,DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
ORDER BY AVG(FG.Amount) DESC
/
And finally we can update Title
field with values from Forename_Gender
.
-------------------------------------------------------------
-- UPDATE Contacts.Title fields where it is not set
-- or set to Mr/Ms
-------------------------------------------------------------
-- !!! This code is going to change data in your table.
-- !!! Test this code untill you are confident in results.
UPDATE Contacts C
SET
Title = (SELECT DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
FROM Forename_Gender
WHERE C.Forenames=Forenames)
-- SELECT COUNT(*) FROM Contacts C
WHERE
UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
AND EXISTS (SELECT *
FROM Forename_Gender
WHERE C.Forenames=Forenames)
/
Do More
I have done a very similar exercise on addresses table to fix Towns and Counties (UK customer base) using popular Postcode area - town combinations. Also this is a way of profiling your records, so you can use match tables (i.e.Forname_Gender
, Postcode_Town
) to find erroneous records in your tables.