This week I very quickly put together a simply database of some fifty tables and about three hundred columns all told, with every table having at least one foreign key. Then I noticed some spelling mistakes and or typos in some table and column names, and this inspired me to develop a tool that can perform a spelling check on all identifiers used in all database objects.
This is no mean feat. My rought draft strategy is to
1. Build a list of all identifiers in the database. To start with, I will limit this to table, column, view, and key and index names. 2. Use patterns and heuristics to split identifiers into word parts and ignore parts such as 'FK_' etc. 3. Check spellings of split identifiers, and those not split, and mark possible errors. 4. Where an error is e.g. found and corrected in a column name, correct all other occurrences of that column name in the DB.
I know none of these is simple, but I think this could make an interesting hobby project and an article or two, if I start simple and incrementally make the app more intelligent.
What warnings, criticisms, suggestions, or general comments do you have?
Lobster Thermidor aux crevettes with a Mornay sauce, served in a Provençale manner with shallots and aubergines, garnished with truffle pate, brandy and a fried egg on top and Spam - Monty Python Spam Sketch
I'll be using a set of algorithms to split names in various ways, hopefully into separate English words, as my example above would do quite nicely. Then I'll have a configurable set of standard non- or semi-English tokens that can be checked for deviation from their definitions, and good old ignore lists.
I would check for consistency first -- if a token is repeated many times, then it is probably what you meant and a token with a small Levenshtein Distance from it is probably misspelled.
On one project I decided that (by golly) I was going to spell "threshold" with three Hs (threshhold), as it's pronounced! I later changed it back, but before doing so I'd want the dozen "threshhold"s to outweigh an outlying "threshold".
Likewise, an incorrectly named item could be a valid word.