Key Value Pairs in Database design

There was an interesting thread on the SQLDownUnder mailing list last week about the use of Key Value Pairs in database design. One of the Guru's on the list Jeremy Huppatz made a few posts which covered off pretty much all the issues I was thinking about.

One of the key points that Jeremy made was this one:

Remember that relational databases are designed with 3 key goals - reliable data entry, stable consistent storage and rapid retrieval. By using KVP's you throw the rapid retrieval capability straight out of the window.

Working mainly in the BI field might mean that I can have a tendency to focus more on the retrieval side of things, but I think it is good practice with any system to keep both the inputs and outputs in mind while building it.

Rather than paraphrasing Jeremy's post, I contacted him and asked if he would mind if I reproduced it here:

KVP's look elegant for data capture but they make reporting a nightmare.
If you're going to use KVP's for data capture, strongly consider running triggers or a second-phase ETL process to transform KVP data into an entity-based schema.
Try to avoid reporting against big heaps of KVPs.
If you have no other option, try to ensure that your KVP rows have as many additional lookup keys as possible to allow queries against your heaps to be heavily filterable on the KVP categories, subtypes, etc.

Also consider the fact that with KVP's you might effectively be building some optimizable mathematical structures such as Directed Acyclic Graphs or Bidirectional Acyclic Graphs, and consider the extra data-field considerations that may be required.

Before doing anything else, however, I'd strongly recommend doing a full logical model and looking at what common properties on your sets of objects suggest standard ERD implementations and relationships. KVP's are most often used as a short-circuit to the data analysis process and are rarely an easy solution to support and maintain in the long term - especially when you start connecting enterprise systems together.

Not only did Jeremy post the above points, he then went on to work through a simple example of what happens when you try to report off a KVP structure.

One of the big problems with reports is that you end up building lots of logical subqueries when building reports.

e.g. Consider a fairly typical KVP table defined as follows:

CREATE TABLE KVP_Table (record_id dec (17,0) identity (1,1) NOT NULL PRIMARY KEY
, key_id int NOT NULL
, category_id int NOT NULL
, sub_category_id int NOT NULL
, key_name nvarchar(60) NOT NULL
, key_value nvarchar(max) NOT NULL)

Let's also assume that key_id, category_id and sub_category_id are foreign keys to lookup tables with appropriate category names that can be used for filtering, and that record_id is used as a foreign key by other tables that refer to our KVP heap.

Let's say for the sake of argument that our application is a medication administration database and we want to look for records where a specific medication specification has been administered.

So, when looking for something general like medication name like Ibuprofen, we need:

SELECT record_idFROM KVP_TABLEWHERE key_name = 'Medication Name'
AND key_value = 'Ibuprofen'

Should come back pretty quickly - this is the trivial case.

Let's say we wanted to find out something using more than one filter.

With the table as specified above, we have no way to correlate KVP's with a broader context. So... lets add an event_id column to the table (make it int NOT NULL) to allow correlation of medication administration events with the corresponding KVPs.

Then we might want to find a list of event ID's for which our medication matches a compound KVP definition for medication name, dosage level and administration route.

Let's say:

SELECT event_idFROM KVP_TABLEWHERE key_name = 'Route' AND key_value = 'Oral: Capsule'
AND event_id in (
      SELECT event_id      FROM KVP_TABLE      WHERE key_name = 'Dosage' AND key_value = '50mg'
      AND event_id in (
          SELECT event_id          FROM KVP_TABLE         WHERE key_name = 'Medication Name' AND key_value = 'Ibuprofen'
         )     ))

You can see where this starts getting messy. This could also be implemented as a join as below:

SELECT e1.event_id
FROM KVP_TABLE e1    INNER JOIN KVP_TABLE e2 ON e1.event_id = e2.event_ID
        INNER JOIN KVP_TABLE e3 ON e2.event_id = e2.event_id WHERE e1.key_name = 'Route'    AND e1.key_value = 'Oral: Capsule'
    AND e2.key_name = 'Dosage'    AND e2.key_value = '50mg'
    AND e3.key_name = 'Medication Name'    AND e3.key_value = 'Ibuprofen'

However, when you want records that match a "NOT IN" query specification (i.e. you want to screen out records that actually MATCH the fields specified - this is more common than you'd think) you end up with the JOIN query being no longer viable - the subquery becomes the optimal approach.

Consider the impact of a single "NOT IN" clause in a report that has 5 levels of filtering on a table with 20 million rows. What sort of response times would you expect to see? IMO, you're kidding yourself if you'd expect anything under 40-50 seconds. More realistic would be 2-3 minutes - if you're lucky.

If you're looking to deploy this in SQL Server 2005, you might have some joy using the native XML datatype to encode your key value pairs, but searching for them is still going to be painful, even if you use the native XML indexing available in the latest DBMS release.

Anyway - you get the idea... I thought it might be better to illustrate the problem than just to say "Don't do it." Hopefully this gives you an idea of why the red pill might be a bad idea.

The other part of the thread that I also had an interest in came about when it was mentioned that reporting should not be too much of an issue as it was planned to do it all through OLAP cubes. Below was Jeremy's response to this proposal,

However, you still have the issue that a given logical record spans multiple rows in the KVP values table, and this makes filtering a pain.

From a cube perspective, many of your facts are going to be built as KVPs - you need an additional view that does a pivot summary of each of the KVP values relevant to a specific "event" into a fact table with the appropriate correlations to dimensional members. From a query processing perspective, this rapidly becomes a nightmare. Expect much longer cube-builds going down this route than using relational tables where the fact table doesn't have to be computationally derived.

With respect to cube dimensions, you're going to run into issues with dimension definitions as you're going to have to build a view defining each set of key-names to correlate the members for each level of your dimension. This is doable, but is also fairly hard to maintain. Adding a new property KVP to a record means doing a full process of your dimension and cube to reflect it in your reporting environment.

Relational data can speed up dimension and cube builds by several orders of magnitude, as you don't need to build an abstraction layer that defines the dimension memberships and levels required by each individual KVP. You simply define your dimensions directly from the relational schema.

I understand the attraction of KVPs from a web developers standpoint, and from the perspective of someone wanting to do a lot of SET/GET-based OOP, but once you start doing more complex method-based coding (e.g.

complex business rule validation which might update multiple instance/class properties in a single invokation), the benefits of KVPs are quickly eroded.

This of course should not be taken to mean that you should never use KVPs. This post is merely meant to highlight the issues involved with KVPs so that you can weigh up the issues involved. KVP tables are often used for storing things like settings (which I have done myself) or for storing extension attributes (I understand Microsoft CRM employs KVP tables for this purpose.

New on Geeks with Blogs