Scott Muc

Another .Net Developer Named Scott

  Home  |   Contact  |   Syndication    |   Login
  29 Posts | 0 Stories | 27 Comments | 4 Trackbacks

News

Archives

Post Categories

Blogging Tools

Personal

Work Related

There are many situations where you'll have aggregate functions in your queries. Usually you want to avoid these aggregations in an OLTP application in order to decrease resource consumption. As an example I'll be using the following simplified schema to show what the problem is and a solution that I often use to deal with this issue.

What we have is a table containing articles and another table to track each time an article is read. The schema has been simplified to focus on the problem, but imagine that a table like NewsArticles having many more columns and relations with other tables.

CREATE TABLE NewsArticles (
    INT ArticleId IDENTITY(1, 1) NOT NULL,
    NTEXT ArticleText,
    PRIMARY KEY (ArticleId)
)
GO
CREATE TABLE NewsArticleHits (
    SHORTDATETIME ReadDate DEFAULT GETDATE(),
    INT ArticleId NOT NULL
)
GO
CREATE CLUSTERED INDEX IDX_NewsArticleHits_ArticleId
    ON NewsArticleHits(ArticleId)

So what is the problem? Let's say we have a web application where when an article is read a row is inserted into the NewsArticleHits table. For some reason we want to display how many times the article has been read on the front end. To do this we would have a query like the following:

SELECT A.ArticleText, COUNT(H.TrackId) AS HitCount
FROM NewsArticles A LEFT OUTER JOIN NewsArticleHits H
        ON A.ArticleId = H.ArticleId
WHERE A.ArticleId = @ArticleId -- Of course we use parameterized queries

This query should reveal my decision for the CLUSTERED INDEX. There's also an off by one bug because it will return a count of 1 even if no hits have occurred. This setup will work fine for a very small system. Just guessing here, but this query should be fast enough for almost all but the most popular articles. Obviously Slashdot wouldn't be able to use a setup like this to see a posts stats.

There's another issue that's caused by the choice of the CLUSTERED INDEX. If you want to perform date range queries you're going to see full table scans on the table.

What I've chosen to do is to replace the schema with the following:

CREATE TABLE NewsArticles (
    INT ArticleId IDENTITY(1, 1) NOT NULL,
    NTEXT ArticleText,
    INT HitCount DEFAULT 0 NOT NULL,
    PRIMARY KEY (ArticleId)
)
GO
CREATE TABLE NewsArticleHits (
    SHORTDATETIME ReadDate DEFAULT GETDATE(),
    INT ArticleId NOT NULL
)
GO
CREATE CLUSTERED INDEX IDX_NewsArticleHits_ReadDate
    ON NewsArticleHits(ReadDate)

I've added a column to NewsArticles to hold the hit counter. This column will have to be computed and updated. I use SQL Server and don't plan on leaving that platform anytime soon so I decided to use TRIGGERs to keep that column up to date. I've also changed the CLUSTERED INDEX on the NewsArticleHits table to allow for date range queries. With the computed hit column there's no need to JOIN the NewsArticleHits table so I won't bother to put an INDEX on the ArticleId column.

If you ever search for implementations of TRIGGERs used you may find some rather naive implementations:

CREATE TRIGGER trig_Increment_NewsArticles_HitCount
ON NewsArticleHits AFTER INSERT
AS
UPDATE     NewsArticles
SET HitCount = HitCount + 1
WHERE ArticleId = inserted.TrackId

Unfortunately that TRIGGER won't be accurate if you happen want to do multiple INSERTs at once. This could easily be the case if the hit updates are performed in batches. Make sure you don't use the above TRIGGER and use something more like the following:

CREATE TRIGGER trig_Increment_NewsArticles_HitCount
ON NewsArticleHits AFTER INSERT
AS
UPDATE     NewsArticles
SET
    NewsArticles.HitCount = NewsArticles.HitCount + C.HitCount
FROM
    (SELECT ArticleId, COUNT(*) AS HitCount
    FROM inserted
    GROUP BY ArticleId) AS C
WHERE
    WHERE ArticleId = C.ArticleId

And there you go! So far I have this kind of implementation in a production database where the hit table has over 17 million rows. I'll leave the DELETE TRIGGER as an exercise for the reader. Well... mainly because our system won't be deleting rows from this database. You may have noticed that I did not make the ArticleId a FOREIGN KEY to the NewsArticle table. This is for a couple reasons: If the article gets deleted we don't necessarily want to lose the stats, and CASCADING DELETES on these kind of tables can be killer performance wise.

I would love to hear other ways people have tackled this problem. Would you prefer to do this on the application level?

posted on Tuesday, June 05, 2007 8:02 PM