
Name genderization

Matt Gullett, 22 Apr 2002
Extrapolate the gender of a person based on their first name

Sample app

Introduction

Over time I have compiled a database of roughly six thousand unique first names along with the gender usually associated with each name.  The names are primarily English, but the database also contains names from other nationalities such as German, French and Russian.  I have used this database on a number of occasions for tasks such as data entry validation and data extrapolation.  I believe other people could benefit from this database and its worker class, so I decided to post it to Code Project.

Why would anyone need this?

Gender is a very common dimension in data marts.  A database project I worked on (some time ago) had a database of about 1.2 million names, addresses and phone numbers.  If the client had wanted gender for the names on this list, it would have been unavailable because that data was not collected with the list.  The only ways to get gender would have been to contact these people directly (unreasonable) or to derive an approximate gender from each individual's first name.

Another example is a data entry application where the data entry person did a poor job of entering the data.  I used this database to cross-check the gender entered by the user against the approximate gender determined from the name.  Roughly 15% of the entered data required review, and nearly 10% was actually incorrect.
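A minimal sketch of that kind of cross-check is shown here.  It is not taken from the original application: `Record`, `GenderTable` and `flagForReview` are hypothetical names, and a plain in-memory map stands in for the name database.  It also assumes names are compared exactly as typed.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record as keyed in by a data entry operator.
struct Record {
    std::string firstName;
    char enteredGender;  // 'M', 'F' or 'U'
};

// Hypothetical stand-in for the name database: first name -> usual gender.
using GenderTable = std::unordered_map<std::string, char>;

// Returns the indexes of records whose entered gender disagrees with the
// gender the table suggests.  Names missing from the table, or mapped to
// 'U', are not flagged because the table has no opinion on them.
std::vector<std::size_t> flagForReview(const std::vector<Record>& records,
                                       const GenderTable& table) {
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < records.size(); ++i) {
        const auto it = table.find(records[i].firstName);
        if (it == table.end() || it->second == 'U') continue;
        if (it->second != records[i].enteredGender) flagged.push_back(i);
    }
    return flagged;
}
```

A flagged record is only a candidate for review, not proof of an error, since the name/gender associations are themselves approximate.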

What is this?

This project includes three items:

  1. An MS Access database (NameDB.mdb) with a single table (FRST_NM_GNDR) which contains approximately 6,000 names and associated genders.
  2. The source code for a class (CFPSGenderizer) which loads this database and provides a simple API for looking up a name and returning the associated gender.
  3. A demo project which demonstrates how to use the CFPSGenderizer class.

Where did these names come from?

The names in this database were collected from three primary sources: 1) a customer's database, 2) freely available web site downloads, and 3) the Social Security Administration's web site (ssa.gov).  There are no license requirements for using these names, nor are there any warranties as to the accuracy of the name/gender associations.

How accurate is the list?

Who knows!  In the few verifiable scenarios where I have used the list, it appeared to be at least 85% accurate for the names it contains.  This means, of course, that as many as 15% of the associations could be wrong.  I do not use this data when high precision is needed, only for cross-verification and data extrapolation.

How to use this class?

  1. Add the FPSGenderizer.cpp and FPSGenderizer.h files to your project.
  2. Create an instance of the CFPSGenderizer class at an appropriate location in your program.  The class must be initialized through the Load function before use, so plan to perform this step only once if possible.
  3. Call one of the overloaded Load functions to load the list from a database or a serialized file.
  4. Call the CFPSGenderizer::Genderize function and pass in an LPCTSTR containing the first name you want to genderize.  It returns a char that is either 'M' (Male), 'F' (Female) or 'U' (Unknown).  This function returns 'U' for names not on the list as well as for names on the list explicitly associated with 'U'.
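The parameters of the Load overloads are not listed here, so rather than guess at them, the following sketch illustrates the load-once, look-up-many pattern with a hypothetical portable stand-in.  `SimpleGenderizer` is not the CFPSGenderizer class from the download; the "name,gender" text format and the case-insensitive matching are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for feeding Load from memory
#include <string>
#include <unordered_map>

// Hypothetical stand-in, NOT the CFPSGenderizer class from the download.
class SimpleGenderizer {
public:
    // Assumed input format: one "name,gender" pair per line, e.g. "Matt,M".
    // Malformed lines are skipped.  Returns the number of names loaded.
    std::size_t Load(std::istream& in) {
        names_.clear();
        std::string line;
        while (std::getline(in, line)) {
            const std::size_t comma = line.find(',');
            if (comma == std::string::npos || comma == 0 ||
                comma + 1 >= line.size()) {
                continue;
            }
            const char g = static_cast<char>(
                std::toupper(static_cast<unsigned char>(line[comma + 1])));
            if (g != 'M' && g != 'F' && g != 'U') continue;
            names_[Normalize(line.substr(0, comma))] = g;
        }
        return names_.size();
    }

    // Returns 'M', 'F' or 'U'.  As described above, 'U' covers both names
    // that are not on the list and names explicitly associated with 'U'.
    char Genderize(const std::string& firstName) const {
        const auto it = names_.find(Normalize(firstName));
        return it == names_.end() ? 'U' : it->second;
    }

private:
    // Lower-cases the name so lookups ignore case (an assumption).
    static std::string Normalize(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    std::unordered_map<std::string, char> names_;
};
```

With this stand-in, a caller would construct one `SimpleGenderizer`, call `Load` once (for example on a `std::istringstream` or a file stream), and then call `Genderize` for each first name.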

Future Development?

As my job requires, I will keep updating the database by adding names and changing the associations of names already on the list.  I also plan to incorporate an edit-distance and metaphone algorithm (see my earlier Spell Checker app) to find suggestions for a name and then suggest a gender based on how frequently the suggestions are male, female or unknown.  Before I release this enhancement, though, I need to test the results to see if they are even remotely reliable.
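As a rough illustration of the edit-distance half of that idea, the sketch below lets every known name within a small Levenshtein distance of the input vote for its gender.  This is only one possible reading of the plan, not the planned implementation: the metaphone step is left out, 'U' entries do not vote, and `editDistance`, `suggestGender` and the `maxDistance` threshold are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Classic Levenshtein distance: insertions, deletions and substitutions
// each cost 1.  Uses two rolling rows instead of a full matrix.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Every table entry within maxDistance of `name` votes for its gender.
// Returns 'M' or 'F' for a strict majority, otherwise 'U'.
char suggestGender(const std::string& name,
                   const std::unordered_map<std::string, char>& table,
                   std::size_t maxDistance) {
    std::size_t male = 0, female = 0;
    for (const auto& entry : table) {
        if (editDistance(name, entry.first) > maxDistance) continue;
        if (entry.second == 'M') ++male;
        else if (entry.second == 'F') ++female;
    }
    if (male > female) return 'M';
    if (female > male) return 'F';
    return 'U';
}
```

This scans the whole table for every lookup, which is acceptable for a few thousand names but would need an index for much larger lists.  Whether such votes are reliable is exactly the open question raised above.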

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Matt Gullett
Web Developer
United States

Comments and Discussions

 
Ambiguous names (Claudius Mokler, 24-Apr-02 0:06)
Unfortunately, there are lots of ambiguous names:
Jean (m in French, f in English - Jean Reno and Jean Harlow)
Andrea (m in Italian, f in German)
Kim (f in English, m in Korean - Kim Basinger and Kim Il Sung)
René or Renée (mostly m in German, f in French)
Michel
 
... just to name a few.
 
Short forms or nick-names are even worse:
Harry
Vic
Jo(e)
Mandy (mostly f, but I happened to see this name used for m)
Billy
Frankie
Charlie
Jackie
 
Or misspelt names:
Meikel, Maikel (GDR variations of Michael)
Every female French name with the final 'e' omitted (GDR)
(Jeannette -> Jeannett or *yuck* Janett)
Anastacia (popular variation of Anastasia)
 
These misspelt names are mostly gender-specific, but the misspellings make maintaining a database much more difficult.
 
Regionally used names:
Maike (used in northern Germany, female)
Wastl (Bavarian, used in southern Germany, might be male)
 


Last Updated 23 Apr 2002
Article Copyright 2002 by Matt Gullett