The Lanham Factor

Balancing the Technology-Business Equation

  Home  |   Contact  |   Syndication    |   Login
  123 Posts | 2 Stories | 100 Comments | 109 Trackbacks

News

Article Categories

Archives

Post Categories

Image Galleries

BLOGS

Companies

My Articles

I have experienced two situations recently regarding the use of artificial (a.k.a. surrogate) versus natural database table keys.  I decided to post about this because I truly do not understand their proliferation.  It seems to me that immediately using an artificial key is cutting corners.  I am hoping to dispel some of the myths I heard about artificial keys, make some suggestions for when to use them and when to not use them, and do this all without offending too many folks.

Firstly, just about every table in your database should include a key column.  Notice that I did not write “every” table...I wrote “just about every” table.  Keys provide a number of benefits to database design and maintenance.  They are the primary mechanism through which normalization is accomplished and thereby assist with database design.  They help eliminate duplications, thereby helping with database maintenance and data integrity.  These are true whether the key is artificial or natural.

Artificial keys are virtually meaningless when describing the data.  This should be a sufficient reason to avoid them.  If you still need more convincing, consider this.  Artificial keys do not actually prevent duplication and, thereby, reduce decrease the level of data integrity. 

Take, for example, a table that stores email addresses.  Email addresses are an interesting data element because they are unique in the solar system and yet they can and do change often.  Now let's assume you are recording contact information about team members.  If an artificial key is used, then the same team member may be listed multiple times, each entry with its own data.  Now imagine trying to maintain that contact's information.  Which record is the correct record?  Which record gets updated?  Which gets deleted?

A counter argument to the example above is that using data that can change often (such as an email address) as a primary key constraint makes it difficult to maintain the data because the key field is volatile.  This is an interesting argument because it implies that key fields are somehow treated differently from non-key fields on updates and deletes.  

The reality is that cascading updates and deletes works with both key and non-key fields.  Continuing with the above example, let’s relate the contact table to a table managing project resource assignments.  Now let’s assume that a team member changes their name and, therefore, their email address changes.  Utilizing the “Cascade Update Related Fields” option (or whatever it’s called in your database management system) ensures that the database management system makes appropriate updates to related, child tables when the key field in the parent table changes.

Given all of this information, is it ever necessary to use an artificial key?  Absolutely!  Natural keys are identified because they are a “natural” part of the domain entity being modeled.  Email addresses, for example, are a natural part of contact information and they are unique.  Phone numbers are also a natural part of contact information but they are not unique.  

An example of appropriate artificial key use can be found when recording project information.  Projects are typically identified by a name or title but this can change.  If you analyze the project domain you may find a composite key (given your organizational structure) but this is highly unlikely.  Almost every attribute of a project is volatile except some non-unique attributes such as start date.  Given this scenario, it is appropriate and wise to utilize an artificial key.

I hope that this article shows that I am not against using artificial keys and, in fact, I encourage their use…appropriately.  I don’t think they should be the default and should be considered as a last option.  Data engineers and database designers should ALWAYS look for a natural key first.  I welcome your feedback in this discussion.  Look for an upcoming article regarding the use of composite keys.

posted on Saturday, November 05, 2005 11:10 AM

Feedback

# re: Artificial Database Keys 11/5/2005 3:11 PM Ann Archer fan
B,

How about using an artificial primary key with a unique index and/or application code? I understand what you're saying about cascading updates/deletes, but let's say we're talking about the resource table, the pool of allowable resources, and you make email the primary key I think you (as well as anyone that will ever work on the project) really have to understand the impacts and go-forward decision making. I'd prefer using an artifical key, such as employee serial number and enforce uniqueness on the email field with an index or application code (as a business rule), where for example an email address can exist on one and only one active employee record.

I do agree that there isn't a right/wrong answer in all cases and you can feel free to do it wrong. : ) (just pulling your leg B)

Ann Archer fan

# re: Artificial Database Keys 11/6/2005 6:20 AM Codesailor
Good points and I agree with your implication that some strategic thinking is required. You seem to be referring more to an enterprise application than a stand-alone app and that means my example was bad. If you are integrating with an HRIS (or any other system for that matter) then you should probably (definitely?) use their key. If you *know* of any future integration efforts that may cause your chosen key to "break" those efforts (such as your scenario) then perhaps an artificial key is for you.

I also like your thought of requiring field/column uniqueness even if it's not a primary key field. That's a really good suggestion. I understand where you're coming from but it would help me to see a specific example to better understand your position. Ultimately, it seems that we both agree that good data modeling and engineering techniques are required to justify the decision...either way.

Thanks for the feedback and keep it coming!

# re: Artificial Database Keys 11/6/2005 10:45 AM Blogus Maximus
Good article, nice to see you posting again.

C.

# re: Artificial Database Keys 11/27/2005 8:11 AM Sean Smith
Good Article - I have some serious reservations about the following point:

Referential Integrity (RI) and Artificial Keys (AK) - In your article you asserted that AK use could result in a loss of RI - but I don't feel you explained how.

I am a member of the other (pro AK) camp, and I'll separate my reasons into logical reasons, and physical reasons.

Logical Reasons
Using Natural Keys as keys in your database does not separate relationship fields from data fields. Natural Keys are interesting, because thay can serve as both. Your solution requiring cascading updates and deletes illustrates the problem with this. If this one column changes it could result in a long series of database updates. Multiple threads causing cascading updates in a transactional, multithreaded application could easily result in a deadlock situation. Email addresses is also user entered data - and for that reason should not be trusted to be accurate. For that reason I would have reservations enforcing uniqueness, let alone using it as a key.

Physical Reasons
Tables are typically clustered by primary key, this is logical, since that table should generally be referenced by the primary key, and clustering index seek speeds this up. Each entry in a non clustered index refers to the clustered key value when referring to the row it points to.
Natural Keys are typically text based columns (email address, license number, business partner#). If a table is clustered by a natural key then every non clustered index of that table will be larger, and slower to access.

Example
I'll use your same contacts example, assuming that the average width of an email address is 32 bytes and that the table contains 100,000 contacts.

Each index on this table will contain 3.2MB of data, just to refer to the clustered index.

Using an Int32 Artifical Key as the primary key, and therefore the cluster key would require only 200KB.

The storage impact of this are nothing in comparison to the processing impact. If you scale this to an even larger table your database server will become overloaded just loading an index on the table.

# re: Artificial Database Keys 11/28/2005 8:23 AM Codesailor
Sean,

GREAT FEEDBACK! I had not considered those points. I think it's interesting that you use the fact that natural keys are both relational and data fields as a reason to NOT use them and I think that's the reason they should be used!

I definitely see your point with deadlocks and such. That's a great point. On that note, I want to re-emphasize that I think they should be considered first and heavily but it's not appropriate for every situation.

Supporting your argument and supporting my thought that NK's are not always appropriate, you make a great point in your Physical argument. I had not considered database size or performance. I really am focused on a more pure analysis & design approach and on maintenance and extensibility.

Thanks for your great feedback AND for making me see some another perspective!

B

# re: Artificial Database Keys 11/28/2005 9:39 PM Sean Smith
Codesailor, I'm happy to contribute.

Let me take another shot at expressing my logical argument... I conceed that it is largely a matter of taste - but I do feel that more experienced data architects will have tastes aligned towards the use of artificial keys.

We both agreed that Natural Keys behaive both as data, and as key values. The question is, which do they resemble most? I would assert that Natural Keys are data first, and keys only when convenient...

Take the contacts table example - a contact could have multiple email addresses over time, and potentially at once. If the email address serves as the PK, there is no way to represent this 1 to Many relationship accurately. Additionally, a single email addresses could potentially refer to multiple contacts (A group box, or a box associated with a role that is filled by different people over time). The point I am trying to illustrate is, that Natural Keys accomodate business rules that are defined external to your application, and that you may not be aware of in any way. A Natural Key that *should* be unique may have periods of overlap or conflict (a renewed or recycled license number for instance). If the relationships in your data rely on these fields, you may find that some of the data has relationships that are unresolvable - expecially if your not enforcing uniqueness with constraints. If you are enforcing uniqueness, you may find those constraints becoming obstacles (what if two contacts swap email addresses?).

I would argue that a key should be constant once it is set.
In my previous post I described a database layer threading problem... Now I am going to describe a business layer issue... Suppose two different threads are working on the same row within the business layer both threads read the data at the same time - Thread A creates a transaction, changes the natural key value and saves - the database cascades this change to child entities and commits the transaction. After Thread A is finished, the user working on Thread B changes the contacts first name and attempts to save - however, the row has effectively been "moved" from underneath it. The Primary Key value it has for the current row no longer exists in the database (or worse, it refers to a different row!).

Ideally, Keys are easier to rely on when they are constant and unmoving, and data should be assumed updateable... I would argue that a key value should be meaningless outside of its use as a key.


# re: Artificial Database Keys 11/30/2005 1:16 AM Codesailor
This is really great! You've made some more really solid arguments for using artificial keys. You certainly identify some issues I had not consider with respect to email addresses. Thanks for your continued posts and for enlightening me!

I have often fealt that I'm one of those software guys that give DBA's a headache. Now I know why! :)


# re: Artificial Database Keys 2/11/2008 3:52 AM Ashwin
Sean Smith argues for the case of artificial keys by giving examples where the primary key is expected to change....I still believe that you can use natural keys as primary keys for those examples where the primary key will not change.

Secondly he argues that well he is using less data with his approach. If I added a artificial primary key + the other columns that I have as part of my table....then thats more data to store not less.



# Keys are not 'a column' 7/18/2008 8:31 AM Leandro G Faria Corcete DUTRA
When you say 'just about every table in your database should include a key column', you make a few fundamental conceptual mistakes.

First, a key can be composed of more than one attribute.

Second, a relation (AKA table) can have more than one (so-called candidate) key, one of which can be choosen as the primary one; the other, alternate keys can be declared as unique constraints.

Third, all relations properly modelled do contain at least one natural key. The issue is the convenience of creating a surrogate key, but the declaration of at least one natural candidate key must never be forgotten.

Post Feedback

Title:
Name:
Email: (never displayed)
Url:
Comments: 
Please add 4 and 4 and type the answer here: