Fervent Coder

Coding Towards Utopia...by Rob Reynolds
posts - 278 , comments - 431 , trackbacks - 0

My Links


Rob Reynolds

Subscribe to Fervent Coder RSS
Subscribe to Fervent Coder by Email

About Me

I manage several open source projects. Need...
   ...package management for Windows?
   ...automated builds?
   ...database change management (migrations)?
   ...your application to check email?
   ...a monitoring utility?

I also write for

Like what you are reading? Want to buy me a cup of coffee?
PayPal - The safer, easier way to pay online!


Post Categories


Data Warehousing, Data Mart Dimensions and Fact Table Growth

NOTE: This is up for discussion. I am not an expert in Data Warehousing (yet).  I hear from my Data Modeler that we are mostly Kimball in our approach, but I think that only applies to the Data Marts.  When I think of a Data Warehouse, I am thinking of a central place to store historical information.  Then we have a concept of Data Marts, which are separate databases of information in a slightly denormalized fashion (this is where you hear about Facts, Dimensions, and Aggregates).

For Data Warehousing we have chosen a path that is sort of a centralized history for all business domains.  We keep that somewhat normalized (which kind of follows Bill Inmon's approach), with the added 4th dimension of duration.  Then we publish out to Data Marts to be consumed by applications/reporting tools/etc. So we were talking about a few things today with data warehousing and we came up with a few ideas.

The diagram below shows some of the concepts we have been talking about in my workplace.  Transactional tables are 3NF and only hold information as of now. The Data Warehouse holds that information as snapshots at specified moments in time.  Data Marts use the star schema with facts and dimensional tables. The data marts are dimensionally modeled like Ralph Kimball's approach to data warehousing.


When looking at the difference between Ralph Kimball and Bill Inmon, the article on data warehousing in wikipedia describes some of the differences.  Another article sums it up somewhat:

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

Granularity is the idea of to what level you put things into a warehouse.  A good definition can be the "level of detail at which information is viewed or described." If you think about a book, a table of contents gets you to a chapter, an index is more granular because it can get you to the paragraph. Granularity can be events (major milestones, certain events, minor events, all events), time (years, months, weeks, days, hours, minutes, seconds), etc...


FactTableGrowth When thinking about data marts, we buy heavily into the star schema approach. This is the concept of fact tables with dimensions attached to them like stars.  The facts are units of measure. With the star schema, granularity determines the growth of your fact tables over a period of time.  For instance, if you have a fact table that has an aggregate per day, then the fact table may grow at a rate of one per day.  If you add in a customer dimension, you multiply the granularity of the dimensional tables to determine the growth of your fact table.  For instance, if you have 300 unique customers in your customer dimension and your time granularity is once per day, your fact table may grow by up to 300 rows per day (1 time per day X 300 customers).  You are probably saying, that makes perfect sense, so what?  So if you change your time granularity to hourly, how much growth could you see per day? 7200 (24 X 300) rows per day. 

So what if you add in another level of granularity - say location?  How many locations could there be?  If a customer can only be at one location over the granularity of time, then that would be a granularity factor of one no matter how many customers you have.  If a customer can go to more than one location per day, then we say that we can have up to 600 rows per day. You can see how this gets a little tricky, but lets say our customer is a truck driver and they have two locations for each period.  Where they start from (from the prior reporting point) and where they are as of that point. Now what happens when you go hourly?  300 customers x 2 locations x 24 hours in a day = 14,400 rows per day.

As you are starting to see, granularity is one of the most important factors if not the most important when thinking about data marts and their growth.

Print | posted on Thursday, February 7, 2008 11:28 PM | Filed Under [ DataWarehousing ]



# re: Data Warehousing, Data Mart Dimensions and Fact Table Growth

I am addressing what I mean by Type 1 vs. Type 2 (All Types that include the concept of history) here: http://geekswithblogs.net/robz/archive/2008/02/08/data-marts-slowly-changing-dimensions.aspx
2/8/2008 1:06 PM | Robz
Comments have been closed on this topic.

Powered by: