Over and over and over again, I see database designs using serial or auto increment columns, using numerical values. With databases becoming larger every day, terabyte sizes and larger, these kinds of fields for row identifiers are simply going to run out of values. Additionally, many database designers haven't seen fit to learn the lessons that programmers learned long ago, like standard interfaces are a good thing. Finally, whether or not you believe normalization is a good thing, a certain amount is required to make sure that the data is self integral. My goal with this article is to setup two basic types of tables, using uniqueidentifiers (also known as a
guid) as the row identifier columns, fixing a number of problems along the way, while providing consistency in the database design.
As a programmer, the concepts of inheritance, polymorphism and encapsulation are very familiar to me. I like the ability to use base objects to contain references to inherited objects, after a while, you get used to this ability. Consistency is the name of the game with inheritance and polymorphism. It surprises me that these concepts aren't usually taught to database designers. I couldn't believe my eyes at the mess some of the DBAs were making of databases that I was forced to use. For one of my contracts, the worst example was 3 different tables that were meant to hold similar data:
- an id for a
name which was nullable and non-unique.
- an id for a
Not one of the tables had any constraints to enforce that the
type existed, or contained a constraint to enforce that the
type ids were valid to be used in that table. One of the tables actually didn't contain the
type id at all, the explanation was "it is assumed to be X". The most annoying thing was that the columns meant to hold the same data were named the same thing in 2 of the tables, but not in the third. Thus, in order to write a query to pull back in a uniform manner for a single administration page that could reused, it required radically different SQL be special cased and hardcoded for each, including hardcoding the missing value. To make matters worse at that point, the database server used didn't support stored procedures, so this SQL had to be stored in the application. Finally, several of the tables did not have a row identifier column, the DBAs stated that a compound key was good enough.
When I started designing a new database to work with this database and fill some of these tables, one of the requirements was a familiar one, keep track of who and when created the row, as well as who modified the row and when. So I will be using this as another few columns to add to my basetable.
Hard to Detect Bad Data and Replication Problems
There is an error, something I call "misplaced ids" that is a major problem that I see in databases that use numeric id columns. Let me explain, suppose I have 3 tables "
LicensePlate" and "
Vehicle LicensePlate" that ties the 2 together using the id columns.
|ID ||Plate Number |
|2 ||Not REAL |
|3 ||Bad One |
|7 ||Fake 3 || ||
|ID ||Vehicle Identification Number |
|4 ||1238454 |
|3 ||0985463 |
|7 ||7345684 |
A nearly undetectable error in the data could be shown here if I want the vehicle with the VIN of "7345684" to have the
LicensePlate of "Bad One". If due to a programming error, the numbers get reversed, I could get the vehicle with the VIN of "0985463" (id:3) associated with
LicensePlate with the "Plate Number" "Fake 3" (id:7), and there is no way to detect it, short of looking through the data by hand. This can happen even with foreign key constraints, because the numbers are valid, even though it's obviously not what was intended.
If, rather than using numerics, we use uniqueidentifiers, this problem is solved. It also prevents the issue I have seen creep up where replication could not be done because the autoincrement id conflicted between the 2 sites. (One site would insert a row that would get an ID of "7" while the other site would insert another row that also got the ID of "7", now the rows can't be merged because the 2 rows are different, but have the same ID).
As a matter of preference, I always use a row identifier on all database tables, and I always call it "ID". This is really nice because I know what the columns will be for joins. I also know that multi-column joins will never be necessary, and because of the reasons above I always make it a uuid I know what the datatype of the column will be.
Creating the Base Tables
Using SQL Server I didn't have access to table inheritance, but it is supported in other database servers so I thought I would share how to use it to simplify creating tables in a uniform way.
So here is the DDL to create my basetable:
CREATE TABLE public.basetable
"ID" uuid NOT NULL DEFAULT dbo.newid(),
"CreatedBy" character varying(50) NOT NULL,
"CreatedOn" timestamp with time zone NOT NULL DEFAULT now(),
"UpdatedBy" character varying(50) NOT NULL,
"UpdatedOn" timestamp with time zone NOT NULL DEFAULT now()
When using pgAdmin III the quotes are required to keep the case, other tools may not require the quotes to stay case sensitive.
If you have uuid-ossp installed, you can use
uuid_generate_v4() rather than my previously documented newid.
This table is the base for all tables, but what about more "entity" type tables. A surprising amount of "entity" type tables have 2 things in common, a human meaningful name, and an expanded description. So I will inherit the base table, and add those 2 columns.
CREATE TABLE public.entitybasetable
"Name" character varying(255) NOT NULL,
"Description" character varying(512) NOT NULL,
CONSTRAINT entitybasetable_pkey PRIMARY KEY ("Name")
With these 2 tables in place, I can start making entity and cross reference tables.
Using the Code
So here are 3 tables, Users, the table for the users of website.
CREATE TABLE "Users"
"password" bytea NOT NULL,
salt bytea NOT NULL,
CONSTRAINT "pk_Users" PRIMARY KEY ("Name"),
CONSTRAINT "uix_Users" UNIQUE ("ID")
ALTER TABLE "Users" OWNER TO postgres;
The next table is
CREATE TABLE "Roles"
CONSTRAINT "Roles_pkey" PRIMARY KEY ("Name"),
CONSTRAINT "uix_Roles" UNIQUE ("ID")
ALTER TABLE "Roles" OWNER TO postgres;
This is a great example of why to use inherited tables, no new columns, just a unique table to hold unique data so I can enforce referential integrity.
Finally a cross reference table to put users into a role, and make sure that a user cannot be in more than a single role at a time.
CREATE TABLE "UsersInRoles"
"Roles_ID" uuid NOT NULL,
"Users_ID" uuid NOT NULL,
CONSTRAINT "pk_UsersInRoles" PRIMARY KEY ("Roles_ID", "Users_ID"),
CONSTRAINT "fk_Roles" FOREIGN KEY ("Roles_ID")
REFERENCES "Roles" ("ID") MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT "fk_Users" FOREIGN KEY ("Users_ID")
REFERENCES "Users" ("ID") MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT "uix_UniqueUsers" UNIQUE ("Users_ID")
ALTER TABLE "UsersInRoles" OWNER TO postgres;
COMMENT ON CONSTRAINT "uix_UniqueUsers"
IS 'Enforce that a user can only be in a single role';
I always name reference columns as "
Tablename" an underscore then "
ID". So "
Users_ID" is supposed to be a foreign key to "
Users", and its unique "
ID" column. It's easy to see, even if the foreign key somehow got dropped.
Points of Interest
At this point, we have addressed any possibility of putting an ID from the wrong table into a referencing column. Because the chances of the same value coming up twice (assuming a version 4 uuid) is 1 in roughly 2122, while that isn't a guaranteed unique number, it's pretty darn close.
We have also made sure that there is a uniformity in table design, each table has a unique row identifier, that is a
uuid, that is named "
ID". Every table has the 4 columns we need to track changes, and they are always named the same thing, have the same definitions, and they have the same defaults.
As a programmer, I can now implement a handful of interfaces or base objects to inherit all the business objects from. This is useful for doing things like creating simple SQL for checking if a record exists or has been modified.
For those that don't believe in foreign key constraints, because "it slows down the inserts" or some such excuse, personally, I would much rather let the database enforce the constraints making sure my data is always valid, preventing programming errors or data errors and deal with a slightly slower database.
Yeah, I hear all the detractors saying that "
guids are just long strings" or "the overhead is too high". I have seen this point of view held to so passionately that people create their own incrementers with numbers and letters, avoiding special characters, because these same people avoid parameters. The most horrid part is I have seen it done in code, so nothing but a handful of programs could generate the ids, OUCH! It pains me to think about it.
In truth, assuming the binary representation of the
uuid being held, rather than the string representation, it only requires 16 bytes to store, and you don't have to worry about case sensitivity. In contrast, a case-insensitive string that contains only a-z and 0-9 that could contain as many possibilities would take 24 characters to hold which depending on your database encoding could easily be 48 bytes (assuming an encoding like UTF-16), so using a string to do the work in an attempt to somehow get the same features is counter-productive. Besides many databases today handle the
uuid type natively (even Microsoft Access), so it is less expensive to use than you might think, though I won't even pretend to say that it is as efficient as a 32bit or 64bit integer.
It's unfortunate that more database servers don't support inheritance.