Name genderization

By Matt Gullett, 22 Apr 2002
 

Sample app

Introduction

Over time I have compiled a database of roughly six thousand unique first names, each with the gender usually associated with it.  The names are primarily English, but the database also contains names of other nationalities such as German, French and Russian.  I have used this database on a number of occasions for tasks such as data entry validation and data extrapolation.  I believe other people could benefit from this database and its worker class, so I decided to post it to Code Project.

Why would anyone need this?

Gender is a very common dimension in data marts.  A database project I worked on (some time ago) had a database of about 1.2 million names, addresses and phone numbers.  If the client had wanted gender for the names on this list, it would have been unavailable because that data was not collected with the list.  The only ways to get gender were to contact these people directly (unreasonable) or to infer the approximate gender from each individual's first name.

Another example is a data entry application where the data entry person did a poor job of entering the data.  I used this database to cross-match the gender entered by the user against the approximate gender determined through the database.  As a result, roughly 15% of the data entered required review, and nearly 10% was actually incorrect.
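
The cross-match described above can be sketched roughly as follows. This is a minimal illustration and not the code from that project: the `Entry` struct, the `lookupGender` and `flagForReview` helpers, the in-memory `std::map` table and the case-insensitive matching are all assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record shape; the field names are illustrative only.
struct Entry {
    std::string firstName;
    char enteredGender;  // 'M', 'F' or 'U' as typed by the operator
};

// Upper-case a name so lookups ignore case (an assumption of this sketch).
std::string upperCase(std::string s) {
    for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Stand-in lookup: returns 'U' when the name is not in the table.
char lookupGender(const std::map<std::string, char>& table, const std::string& name) {
    const auto it = table.find(upperCase(name));
    return it == table.end() ? 'U' : it->second;
}

// Return the indexes of entries whose typed gender differs from the table.
// Names the table marks 'U' (or lacks) are never flagged: they give no evidence.
std::vector<std::size_t> flagForReview(const std::map<std::string, char>& table,
                                       const std::vector<Entry>& entries) {
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        const char expected = lookupGender(table, entries[i].firstName);
        if (expected != 'U' && expected != entries[i].enteredGender) flagged.push_back(i);
    }
    return flagged;
}
```

A flagged index only means "worth a second look". As the figures above suggest, some flagged records turn out to be correct once a person reviews them.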

What is this?

I am including with this project 3 items.

  1. An MS Access database (NameDB.mdb) with a single table (FRST_NM_GNDR) which contains approximately 6,000 names and associated genders.
  2. The source code for a class (CFPSGenderizer) which loads this database and provides a simple API for looking up a name and returning the associated gender.
  3. A demo project which demonstrates how to use the CFPSGenderizer class.

Where did these names come from?

The names in this database have been collected from 3 primary sources: 1) a customer's database, 2) freely available web site downloads, and 3) the Social Security Administration's web site (ssa.gov).  There are no license requirements for using these names, nor are there any warranties as to the accuracy of the name/gender associations.

How accurate is the list?

Who knows!  In the few verifiable scenarios where I have used the list, it appeared to be at least 85% accurate for the names it contains.  This means, of course, that it could be as much as 15% inaccurate.  I do not use this data when high precision is needed, only for cross-verification and data extrapolation.

How to use this class?

  1. Add the FPSGenderizer.cpp and FPSGenderizer.h files to your project.
  2. Create an instance of the CFPSGenderizer class at an appropriate location in your program.  The class must be initialized through the Load function, so plan on performing this step only once if possible.
  3. Call one of the overloaded Load functions to load the list from a database or serialized file.
  4. Call the CFPSGenderizer::Genderize function and pass in an LPCTSTR containing a first name you want to genderize.  It will return a char which will either be 'M' (Male), 'F' (Female) or 'U' (Unknown).  This function will return 'U' for names not on the list as well as for names on the list explicitly associated with 'U'.
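
The steps above amount to a load-once, look-up-many pattern. Because the signatures of the Load overloads are not listed in this article, the sketch below does not call the real class. Instead it uses a hypothetical stand-in, `GenderizerSketch`, that mirrors only the documented contract of Genderize ('M', 'F' or 'U', with 'U' also returned for names not on the list). The in-memory `Load`, the use of `std::string` in place of LPCTSTR, and the case-insensitive matching are assumptions of the sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for illustration only; this is NOT the CFPSGenderizer source.
class GenderizerSketch {
public:
    // Load the whole list once; later lookups reuse the in-memory table.
    void Load(const std::vector<std::pair<std::string, char>>& rows) {
        table_.clear();
        for (const auto& row : rows) {
            assert(row.second == 'M' || row.second == 'F' || row.second == 'U');
            table_[Normalize(row.first)] = row.second;
        }
    }

    // Returns 'M', 'F' or 'U'. 'U' covers both "not on the list" and
    // "on the list but explicitly associated with 'U'".
    char Genderize(const std::string& firstName) const {
        const auto it = table_.find(Normalize(firstName));
        return it == table_.end() ? 'U' : it->second;
    }

private:
    // Case-insensitive matching is an assumption of this sketch.
    static std::string Normalize(std::string s) {
        for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return s;
    }

    std::unordered_map<std::string, char> table_;
};
```

With rows such as `{"Matt", 'M'}` and `{"Kim", 'U'}` loaded (illustrative data, not taken from NameDB.mdb), `Genderize("matt")` yields 'M', while both `Genderize("Kim")` and a name that is absent from the table yield 'U'. A caller therefore cannot tell "unknown name" from "known but neutral name" through the return value alone.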

Future Development?

As my job requires, I will be updating the database by adding names and changing the associations of the names on the list.  I also plan to incorporate an edit-distance and metaphone algorithm (see my earlier Spell Checker app) to find suggestions for a name and then suggest a gender based on the frequency of male/female/unknown genders among those suggestions.  Before I release this enhancement, though, I need to test the results to see if they are even remotely reliable.
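
To make the planned enhancement more concrete, here is a minimal sketch of the edit-distance half of the idea only (no metaphone). It is not the released implementation, which does not exist yet: `editDistance` is a plain Levenshtein distance, and `suggestGender`, its `maxDistance` parameter and the simple majority vote are assumptions chosen for illustration. Names are assumed to be normalized to one case already.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Classic Levenshtein distance using two rolling rows.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Vote among known names within maxDistance edits of the query.
// Returns 'U' when there are no neighbours or the vote is tied.
char suggestGender(const std::map<std::string, char>& table,
                   const std::string& name, std::size_t maxDistance) {
    int male = 0, female = 0;
    for (const auto& entry : table) {
        if (editDistance(name, entry.first) > maxDistance) continue;
        if (entry.second == 'M') ++male;
        else if (entry.second == 'F') ++female;
    }
    if (male == female) return 'U';
    return male > female ? 'M' : 'F';
}
```

This scans the whole table for every query, which is acceptable for a list of a few thousand names. Whether the votes are reliable is exactly the open question raised above.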

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Matt Gullett
Web Developer
United States

Comments and Discussions

General | Thankyou | member | Member 9274584 | 18 Jul '12 - 19:34
Just wanted to say thanks for the database, been needing a lot of names with associated gender and this is perfect.
Question | Gender Data [modified] | member | sm jacoby | 10 Feb '12 - 3:26
I know this is really old content, but in a sense it's a bit timeless. My company appends gender to records from not just name but also prefix, suffix, and title, and finds conflicts within existing data (https://www.precisionb2b.com/contact-data/classification/gender/Offer-Gender-Conflicts.shtm - we'll even do 100,000 from your database for free just to introduce ourselves).
 
I bumped the names here against our system and can point out some definite errors which anyone using this would want to address. The following names were declared male when they should have been female:
 
Alfa
Cecile
Haydee
Margo
Veronika
Llewellyn
Hoda
Heba
Heide
 
These were declared female when they should have been male:
 
Glynn
Milan
Thom
Van
Timmy
Rowan
Sheridan
Madan
Issa
Ganesh
Ewan
Demetris
Burl
Carmine
Augustine
 
Of course some names vary in their male/female probability based on language/location, like Hani in this file, which is male in Arabic and generally female also. I saw many declared as male or female that might not be, depending on where they are or whether they immigrated from another country. Simone, Andrea, and Laurence are examples of other names one would probably code totally wrong without considering location. Even there it comes down to how cautious you want to be, but declaring Andrea to be female always is not at all cautious. PrecisionB2B uses language-based male/female probabilities for all gender enrichment. Hope this helps.

modified 10 Feb '12 - 9:52.

General | Better sex... | member | politico | 7 Aug '06 - 9:42
I'd add a couple of notes of caution here.
 
1) The distribution of gender for names not on Matt's list is almost certainly more female than male. The reason is that females are much more likely to have distinctive names than males. In fact, according to some queries I've run on well-defined sex codes, about 63% of all the first names in my table are female. This is important because anyone who assigns gender to the most common names, as Matt has done, is much more likely to have left females unassigned than males.
 
So by all means use Matt's genderizer, but be very careful about inferring gender on the data that has no code assigned. In other words, if you join a table of customers' first names to Matt's table and 85% of the records match and are 60% male, you can be confident that the 'unknown gender' 15% of the records are considerably more female than 40%.
 
2) I've found a few questionable codings that I'll list here. The name is followed by my estimation of gender. Matt's code is the opposite.
 
firstName sexCode
JAN F
FRANKIE F
OLLIE F
MERLE M
TOMMIE F
ELIA F
VAN M
HAYDEE F
CECILE F
CARY M
CAREY M
MARGO F
DENNY M
TIMMY M
COY M
BURL M
EDA F
BARRON M
SHEDRICK M
VERONIKA F
MAJ M
TRINI F
GANESH M
AINSLEY M
HEIDE F
 
None of this should be construed as anything other than making Matt's excellent work better, and I very much appreciate his generosity in sharing.
 
Jim Carson
General | just wanted to say thanks | member | shoi17 Dec '03 - 15:18
My app is in good ole FoxPro DOS,
but the mdb file was just what I wanted. It doesn't take much to turn it into a dbf and hey, it's in use.
 
So thanks a lot for the data
 
Steve (male)
General | Ambiguous names | member | Claudius Mokler | 24 Apr '02 - 0:06
Unfortunately, there are lots of ambiguous names:
Jean (m in French, f in English - Jean Reno and Jean Harlow)
Andrea (m in Italian, f in German)
Kim (f in English, m in Korean - Kim Basinger and Kim Il Sung)
René or Renée (mostly m in German, f in French)
Michel
 
... just to name a few.
 
Short forms or nick-names are even worse:
Harry
Vic
Jo(e)
Mandy (mostly f, but I happened to see this name used for m)
Billy
Frankie
Charlie
Jackie
 
Or mis-spelt names:
Meikel, Maikel (GDR variations of Michael)
Every female French name with omitted final 'e' (GDR)
(Jeannette -> Jeannett or *yuck* Janett)
Anastacia (popular variation of Anastasia)
 
These mis-spelt names are mostly gender-specific, but their mis-spellingness makes keeping a database much more difficult.
 
Regionally used names:
Maike (that one is used in northern Germany and is female)
Wastl (Bavarian, used in southern Germany, might be male)
 

General | Re: Ambiguous names | member | Matt Gullett | 24 Apr '02 - 1:30
Thanks for the feedback!
 
You are correct, of course. Many names are gender neutral and cannot be assigned a specific gender. The current database contains about 300 gender-neutral names (gender = 'U'). This issue makes it impossible to genderize names with 100% accuracy.
 
100% accuracy is often not needed, though. For many data marts and data validation routines 85%+ is better than nothing.
 
Also, one of the improvements I would like to make to my database (if I can find the data) would be to include the frequency of usage of each name, i.e., how many times a name has been used as male and how many times as female. This way I can compute a propensity score and return an approximate weighted result.
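
A rough sketch of what such a propensity score could look like, assuming usage counts were available. The `NameCounts` struct, `maleShare`, `weightedGender` and the `threshold` parameter are hypothetical; the current database has no frequency data, so this only illustrates the idea.

```cpp
#include <cassert>

// Hypothetical usage counts for one name; the published database has no such columns.
struct NameCounts {
    unsigned long male;
    unsigned long female;
};

// Share of recorded uses that were male, or -1.0 when the name was never seen.
double maleShare(const NameCounts& c) {
    const unsigned long total = c.male + c.female;
    return total == 0 ? -1.0 : static_cast<double>(c.male) / static_cast<double>(total);
}

// Return 'M' or 'F' only when one side reaches the threshold, otherwise 'U'.
char weightedGender(const NameCounts& c, double threshold) {
    assert(threshold > 0.5 && threshold <= 1.0);
    const double share = maleShare(c);
    if (share < 0.0) return 'U';
    if (share >= threshold) return 'M';
    if (1.0 - share >= threshold) return 'F';
    return 'U';
}
```

With a threshold of 0.85, a name recorded 90 times as male and 10 times as female comes back as 'M', while a 60/40 split stays 'U'. The threshold is a knob for trading coverage against accuracy.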
 
Another issue which you rightly identify is nicknames. Nicknames do tend to be more neutral than specific. Here again, though, at least for my uses, 85% accuracy is good enough.
 
The mis-spelling issue is also a concern. My current solution is not to worry about it, since I have attained the level of accuracy I need. However, in the future I intend to implement a Metaphone and EditDistance algorithm to help find incorrectly spelled names and to genderize a name based on the total frequency of male/female within the suggestion list. (As I said in my article, though, I am not confident that this will yield good results.)
 
I appreciate your feedback and look forward to hearing from you again.
 
Thanks,
 
Matt Gullett
General | Re: Ambiguous names | member | Philippe Lhoste | 2 May '02 - 23:17
I appreciate your article (even if I have no use for it :) ) because it is honest (no 100% accuracy promised) and it explains well why it is needed (I wondered).
 
FYI, there are not many ambiguous names in French with the same spelling.
I mainly recall:
Claude (m), Camille (f), Dominique (u)
The gender given is, as far as I know, the most frequent.
 
Some variants have little difference, at least when spoken, like René/Renée, Frédéric/Frédérique, Fabien/Fabienne.
 
Note that now, French people can create any first name they want. It used to be much more restrictive in the past (only calendar and historic names).
We don't see many names created from scratch, but a lot of variants in spelling, to stand out...
Eg. a regular spelling was Alain, now we see Allain, Alin, Alyn, etc. Phonetic rules can help here.
 
Regards.
 
--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--
Philippe Lhoste (Paris -- France)
Professional programmer and amateur artist
http://jove.prohosting.com/~philho/

Last Updated 23 Apr 2002
Article Copyright 2002 by Matt Gullett
Everything else Copyright © CodeProject, 1999-2013