I have experienced two situations recently regarding the use of artificial (a.k.a. surrogate) versus natural database table keys. I decided to post about this because I truly do not understand the proliferation of artificial keys. It seems to me that immediately reaching for an artificial key is cutting corners. I am hoping to dispel some of the myths I have heard about artificial keys, make some suggestions for when to use them and when not to use them, and do this all without offending too many folks.
Firstly, just about every table in your database should include a key column. Notice that I did not write “every” table...I wrote “just about every” table. Keys provide a number of benefits to database design and maintenance. They are the primary mechanism through which normalization is accomplished and thereby assist with database design. They help eliminate duplications, thereby helping with database maintenance and data integrity. These are true whether the key is artificial or natural.
Artificial keys are virtually meaningless when describing the data. This should be a sufficient reason to avoid them. If you still need more convincing, consider this: artificial keys do not actually prevent duplication and thereby decrease the level of data integrity.
Take, for example, a table that stores email addresses. Email addresses are an interesting data element because they are unique in the solar system and yet they can and do change often. Now let's assume you are recording contact information about team members. If an artificial key is used, then the same team member may be listed multiple times, each entry with its own data. Now imagine trying to maintain that contact's information. Which record is the correct record? Which record gets updated? Which gets deleted?
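The duplication problem can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table names (contact_surrogate, contact_natural), the column names, and the sample address are all made up for the example, and it assumes the artificial key is the only constraint on the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Artificial key only: the generated id is unique, but the email is not constrained.
conn.execute("""
    CREATE TABLE contact_surrogate (
        contact_id INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        full_name  TEXT NOT NULL
    )
""")
insert_surrogate = "INSERT INTO contact_surrogate (email, full_name) VALUES (?, ?)"
conn.execute(insert_surrogate, ("pat@example.com", "Pat Smith"))
conn.execute(insert_surrogate, ("pat@example.com", "Patricia Smith"))  # accepted

surrogate_rows = conn.execute(
    "SELECT COUNT(*) FROM contact_surrogate WHERE email = ?",
    ("pat@example.com",),
).fetchone()[0]

# Natural key: the email itself is the primary key, so a second row is rejected.
conn.execute("""
    CREATE TABLE contact_natural (
        email     TEXT PRIMARY KEY,
        full_name TEXT NOT NULL
    )
""")
insert_natural = "INSERT INTO contact_natural (email, full_name) VALUES (?, ?)"
conn.execute(insert_natural, ("pat@example.com", "Pat Smith"))
try:
    conn.execute(insert_natural, ("pat@example.com", "Patricia Smith"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(surrogate_rows, duplicate_rejected)
```

In the first table the same team member ends up with two records and two different names; in the second, the database refuses the duplicate outright.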
A counterargument to the example above is that using data that can change often (such as an email address) as a primary key makes it difficult to maintain the data because the key field is volatile. This is an interesting argument because it implies that key fields are somehow treated differently from non-key fields on updates and deletes.
The reality is that cascading updates and deletes work with both key and non-key fields. Continuing with the above example, let’s relate the contact table to a table managing project resource assignments. Now let’s assume that a team member changes their name and, therefore, their email address changes. Utilizing the “Cascade Update Related Fields” option (or whatever it’s called in your database management system) ensures that the database management system makes the appropriate updates to related child tables when the key field in the parent table changes.
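Here is a rough sketch of that cascade, again using Python's sqlite3 module. In SQL terms the option is typically written as ON UPDATE CASCADE on the foreign key. The contact and assignment tables, their columns, and the sample values are assumptions for illustration, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and their cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE contact (
        email     TEXT PRIMARY KEY,
        full_name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE assignment (
        project_name  TEXT NOT NULL,
        contact_email TEXT NOT NULL
            REFERENCES contact(email)
            ON UPDATE CASCADE
            ON DELETE CASCADE,
        PRIMARY KEY (project_name, contact_email)
    )
""")

conn.execute(
    "INSERT INTO contact VALUES (?, ?)",
    ("pat.smith@example.com", "Pat Smith"),
)
conn.execute(
    "INSERT INTO assignment VALUES (?, ?)",
    ("Website Redesign", "pat.smith@example.com"),
)

# The team member changes their name, so the natural key changes too.
conn.execute(
    "UPDATE contact SET email = ?, full_name = ? WHERE email = ?",
    ("pat.jones@example.com", "Pat Jones", "pat.smith@example.com"),
)

# The child row follows the parent without any extra statement.
assigned_emails = [
    row[0] for row in conn.execute("SELECT contact_email FROM assignment")
]
print(assigned_emails)
```

Only the parent row is updated by hand; the database management system rewrites the matching value in the assignment table.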
Given all of this information, is it ever necessary to use an artificial key? Absolutely! Natural keys are identified because they are a “natural” part of the domain entity being modeled. Email addresses, for example, are a natural part of contact information and they are unique. Phone numbers are also a natural part of contact information but they are not unique.
An example of appropriate artificial key use can be found when recording project information. Projects are typically identified by a name or title but this can change. If you analyze the project domain you may find a composite key (given your organizational structure) but this is highly unlikely. Almost every attribute of a project is volatile except some non-unique attributes such as start date. Given this scenario, it is appropriate and wise to utilize an artificial key.
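A minimal sketch of the project case, assuming a hypothetical project table with a generated project_id, might look like this (Python's sqlite3 module again; the titles and date are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No stable, unique natural attribute exists, so the key is generated.
conn.execute("""
    CREATE TABLE project (
        project_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        start_date TEXT NOT NULL
    )
""")

cursor = conn.execute(
    "INSERT INTO project (title, start_date) VALUES (?, ?)",
    ("Website Redesign", "2024-01-15"),
)
project_id = cursor.lastrowid  # the artificial key assigned by the database

# The title is volatile; renaming the project leaves its identity untouched.
conn.execute(
    "UPDATE project SET title = ? WHERE project_id = ?",
    ("Customer Portal Relaunch", project_id),
)

title = conn.execute(
    "SELECT title FROM project WHERE project_id = ?",
    (project_id,),
).fetchone()[0]
print(project_id, title)
```

The identifier carries no meaning about the project, which is acceptable here precisely because nothing meaningful about a project is both unique and stable.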
I hope that this article shows that I am not against using artificial keys and, in fact, I encourage their use…appropriately. I don’t think they should be the default; they should be considered a last option. Data engineers and database designers should ALWAYS look for a natural key first. I welcome your feedback in this discussion. Look for an upcoming article regarding the use of composite keys.