
Name genderization

22 Apr 2002
Extrapolate the gender of a person based on their first name

Sample app

Introduction

Over time I have compiled a database of roughly six thousand unique first names, each with the gender usually associated with it. The names are primarily English, but the database also contains names from other languages such as German, French, and Russian. I have used this database on a number of occasions for tasks such as data entry validation and data extrapolation. I believe other people could benefit from this database and its worker class, so I decided to post it to Code Project.

Why would anyone need this?

Gender is a very common dimension in data marts. A database project I worked on some time ago had about 1.2 million names, addresses and phone numbers. If the client had wanted gender for the names on this list, it would have been unavailable because that data was not collected with the list. The only ways to get gender were to contact these people directly (unreasonable) or to infer the approximate gender from each individual's first name.

Another example is a data entry application where the data entry person did a poor job of entering the data. I used this database to cross-match the gender entered by the user against the approximate gender determined from the name. Roughly 15% of the entered data required review, and nearly 10% was actually incorrect.
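The cross-match described above could be sketched roughly as follows. This is only an illustration, not the code used on that project: the `EntryRecord` layout and the `FindRecordsToReview` helper are hypothetical, the name list is represented by a plain `std::unordered_map`, and names are assumed to be matched exactly as stored.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record layout; the article does not show the real one.
struct EntryRecord {
    std::string firstName;
    char enteredGender;  // 'M' or 'F' as keyed in by the operator
};

// Return the indices of records whose entered gender disagrees with the
// gender the name list suggests. Names that are missing from the list, or
// listed as 'U', are not flagged because the list has no opinion on them.
std::vector<std::size_t> FindRecordsToReview(
    const std::vector<EntryRecord>& records,
    const std::unordered_map<std::string, char>& nameToGender) {
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < records.size(); ++i) {
        auto it = nameToGender.find(records[i].firstName);
        if (it == nameToGender.end() || it->second == 'U') continue;
        if (it->second != records[i].enteredGender) flagged.push_back(i);
    }
    return flagged;
}
```

Flagged records are candidates for manual review rather than automatic correction, since the list itself is only approximately right.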

What is this?

This project includes three items:

  1. An MS Access database (NameDB.mdb) with a single table (FRST_NM_GNDR) which contains approximately 6,000 names and associated genders.
  2. The source code for a class (CFPSGenderizer) which loads this database and provides a simple API for looking up a name and returning the associated gender.
  3. A demo project which demonstrates how to use the CFPSGenderizer class.

Where did these names come from?

The names in this database were collected from three primary sources: 1) a customer's database, 2) freely available web site downloads, and 3) the Social Security Administration's web site (ssa.gov). There are no license requirements for using these names, nor are there any warranties as to the accuracy of the name/gender associations.

How accurate is the list?

Who knows! In the few cases where I could verify the results, the list appeared to be at least 85% accurate for the names it contains. This means, of course, that it could be as much as 15% inaccurate. I do not use this data when high precision is needed, only for cross-verification and data extrapolation.

How to use this class?

  1. Add the FPSGenderizer.cpp and FPSGenderizer.h files to your project.
  2. Create an instance of the CFPSGenderizer class at an appropriate location in your program.  The class must be initialized through the Load function, so plan on performing this step only once if possible.
  3. Call one of the overloaded Load functions to load the list from a database or serialized file.
  4. Call the CFPSGenderizer::Genderize function and pass in an LPCTSTR containing a first name you want to genderize.  It will return a char which will either be 'M' (Male), 'F' (Female) or 'U' (Unknown).  This function will return 'U' for names not on the list as well as for names on the list explicitly associated with 'U'.
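The lookup contract in step 4 can be illustrated with a small stand-in class. This is not the CFPSGenderizer source, and it does not reproduce the Load overloads, whose signatures are not shown here. `NameGenderTable` and its `Add` method are hypothetical, it uses `std::string` instead of `LPCTSTR` so that it compiles without MFC, and the case-insensitive, whitespace-trimming lookup is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the lookup behaviour described above.
// It only mirrors the documented contract: 'M', 'F' or 'U', with 'U'
// returned for names that are not in the table.
class NameGenderTable {
public:
    // Register a name with its gender code ('M', 'F' or 'U').
    void Add(const std::string& name, char gender) {
        table_[Normalize(name)] = gender;
    }

    // Return the stored code, or 'U' when the name is not in the table.
    char Genderize(const std::string& firstName) const {
        auto it = table_.find(Normalize(firstName));
        return it == table_.end() ? 'U' : it->second;
    }

private:
    // Assumption: lookups ignore case and surrounding whitespace.
    static std::string Normalize(const std::string& s) {
        const auto isSpace = [](unsigned char c) { return std::isspace(c) != 0; };
        auto first = std::find_if_not(s.begin(), s.end(), isSpace);
        auto last = std::find_if_not(s.rbegin(), s.rend(), isSpace).base();
        std::string out = first < last ? std::string(first, last) : std::string();
        std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        return out;
    }

    std::unordered_map<std::string, char> table_;
};
```

Because 'U' covers both "not on the list" and "listed as unknown", a caller that needs to tell the two apart would need an extra query that this sketch does not provide.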

Future Development?

As my job requires, I will keep updating the database by adding names and changing the associations of names already on the list. I also plan to incorporate an edit-distance and metaphone algorithm (see my earlier Spell Checker app) to find suggestions for a name that is not on the list, and then to suggest a gender based on how often those suggestions are male, female or unknown. Before I release this enhancement, though, I need to test the results to see if they are even remotely reliable.
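One way the edit-distance half of that idea might look is sketched below. It is an untested guess at the approach, not the planned implementation: it uses plain Levenshtein distance, leaves metaphone out entirely, and counts only male and female votes. `EditDistance`, `SuggestGender` and the `maxDistance` parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Classic Levenshtein edit distance using two rolling rows.
std::size_t EditDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Suggest a gender by majority vote among listed names that lie within
// maxDistance edits of the given name. Returns 'U' when there are no
// such neighbours or when the male and female votes tie.
char SuggestGender(const std::string& name,
                   const std::unordered_map<std::string, char>& nameToGender,
                   std::size_t maxDistance) {
    int male = 0, female = 0;
    for (const auto& entry : nameToGender) {
        if (EditDistance(name, entry.first) > maxDistance) continue;
        if (entry.second == 'M') ++male;
        else if (entry.second == 'F') ++female;
    }
    if (male > female) return 'M';
    if (female > male) return 'F';
    return 'U';
}
```

The scan compares the input against every listed name, which is acceptable for a few thousand entries. Whether the votes are reliable is exactly the open question raised above.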

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.