
Introduction
Over time I have compiled a database of roughly six thousand unique first
names, each with the gender usually associated with it. The names
in this database are primarily English, but it also contains names from other
nationalities such as German, French and Russian. I have used this database on
a number of occasions for tasks such as data entry validation and data extrapolation.
I believe other people could benefit from this database and its worker class,
so I decided to post it to Code Project.
Why would anyone need this?
Gender is a very common dimension in data marts. A database project I
worked on (some time ago) had a database of about 1.2 million names, addresses
and phone numbers. If the client had wanted gender for the names on this
list, it would have been unavailable because that data was not collected with the
list. The only way to get gender was to contact these people directly
(unreasonable) or to estimate the gender from each individual's first
name.
Another example is a data entry application where the data had been entered
poorly. I used this database to cross-match the
gender entered by the user against the approximate gender determined from the
database. Roughly 15% of the data entered required
review, and nearly 10% was actually incorrect.
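The cross-match described above can be sketched as follows. This is only an illustration of the idea, not the code used on that project: the EntryRecord struct, the FindGenderMismatches function and the lookup callback are all hypothetical names, and the sketch assumes that a lookup result of 'U' (unknown) should not flag a record.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record type: one row produced by the data entry application.
struct EntryRecord {
    std::string firstName;
    char enteredGender;  // 'M', 'F' or 'U', as typed by the operator
};

// Returns the indexes of records whose entered gender disagrees with the
// gender suggested by the lookup. Records for which the lookup returns 'U'
// are skipped, because the list has no opinion on them (an assumption).
std::vector<std::size_t> FindGenderMismatches(
        const std::vector<EntryRecord>& records,
        const std::function<char(const std::string&)>& lookup) {
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < records.size(); ++i) {
        const char suggested = lookup(records[i].firstName);
        if (suggested != 'U' && suggested != records[i].enteredGender) {
            flagged.push_back(i);  // needs manual review
        }
    }
    return flagged;
}
```

A flagged record is only a candidate for review. As the figures above show, not every flagged record turns out to be wrong.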
What is this?
I am including with this project 3 items.
- An MS Access database (NameDB.mdb) with a single table (FRST_NM_GNDR)
which contains approximately 6,000 names and associated genders.
- The source code for a class (CFPSGenderizer) which loads this database and
provides a simple API for looking up a name and returning the associated
gender.
- A demo project which demonstrates how to use the CFPSGenderizer class.
Where did these names come from?
The names in this database have been collected from 3 primary sources:
1) a customer's database, 2) freely available web site downloads, and 3) the Social
Security Administration's web site (ssa.gov). There are no license
requirements for using these names, nor are there any warranties as to the
accuracy of the name/gender associations.
How accurate is the list?
Who knows! From the few times I have used the list in verifiable
scenarios, it appeared to be at least 85% accurate for the names it
contains. This means, of course, that as many as 15% of the associations
could be wrong. I do not use this data when high precision is needed, only for
cross-verification and data extrapolation.
How to use this class?
- Add the FPSGenderizer.cpp and FPSGenderizer.h files to your project.
- Create an instance of the CFPSGenderizer class in your program at
an appropriate location. The class must be initialized through the
Load function, so plan your implementation to perform this step
only once if possible.
- Call one of the overloaded Load functions to load the list from a
database or serialized file.
- Call the CFPSGenderizer::Genderize function and pass in an LPCTSTR
containing the first name you want to genderize. It returns a char
which is either 'M' (Male), 'F' (Female) or 'U' (Unknown). This
function returns 'U' for names not on the list as well as for names on
the list explicitly associated with 'U'.
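To make the return-value contract concrete, here is a minimal stand-in that mimics it. It is not the CFPSGenderizer source: the NameGenderLookup class, its Add function and the MakeSampleLookup helper are hypothetical, the table lives in memory instead of being loaded from NameDB.mdb, std::string replaces LPCTSTR to keep the sketch portable, and the case-insensitive matching is an assumption. The three sample entries are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real class: same 'M'/'F'/'U' contract,
// but backed by an in-memory map rather than the Access database.
class NameGenderLookup {
public:
    // Registers a name with gender 'M', 'F' or 'U'.
    void Add(const std::string& name, char gender) {
        assert(gender == 'M' || gender == 'F' || gender == 'U');
        table_[Normalize(name)] = gender;
    }

    // Returns the stored gender, or 'U' when the name is not listed.
    char Genderize(const std::string& name) const {
        const auto it = table_.find(Normalize(name));
        return it == table_.end() ? 'U' : it->second;
    }

private:
    // Upper-cases the name so that lookups ignore case (an assumption).
    static std::string Normalize(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        return s;
    }

    std::unordered_map<std::string, char> table_;
};

// Builds a tiny sample table; the entries are illustrative only.
inline NameGenderLookup MakeSampleLookup() {
    NameGenderLookup lookup;
    lookup.Add("John", 'M');
    lookup.Add("Mary", 'F');
    lookup.Add("Pat", 'U');
    return lookup;
}
```

Note that a caller cannot tell a name that is missing from the table apart from one explicitly stored as 'U'. Both come back as 'U', which matches the behaviour described above.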
Future Development?
As my job requires, I will update the database by adding names and
changing the associations of names already on the list. I also plan to
incorporate an edit-distance and metaphone algorithm (see my earlier Spell
Checker app) to find suggestions for a name and then suggest a gender based
on the frequency of male/female/unknown genders among those suggestions.
Before I release this enhancement, though, I need to test the results to see
if they are even remotely reliable.
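One way such a suggestion step might look is sketched below. It is a guess at the general shape, not the planned implementation: it uses plain Levenshtein edit distance only (no metaphone), the EditDistance and SuggestGender names and the maxDistance parameter are hypothetical, and it assumes a simple majority vote in which 'U' entries do not vote and ties yield 'U'.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Classic dynamic-programming Levenshtein distance: insertions, deletions
// and substitutions each cost 1. Only two rows are kept in memory.
std::size_t EditDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Suggests a gender for an unlisted name by majority vote among the known
// names within maxDistance edits. Entries marked 'U' do not vote, and a
// tie (including no neighbours at all) yields 'U'.
char SuggestGender(const std::string& name,
                   const std::vector<std::pair<std::string, char>>& known,
                   std::size_t maxDistance) {
    int male = 0;
    int female = 0;
    for (const auto& entry : known) {
        if (EditDistance(name, entry.first) <= maxDistance) {
            if (entry.second == 'M') ++male;
            else if (entry.second == 'F') ++female;
        }
    }
    if (male > female) return 'M';
    if (female > male) return 'F';
    return 'U';
}
```

Whether a vote like this is reliable enough to use is exactly the open question raised above, so any threshold would need to be checked against verified data first.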