In the
previous post, I have stated the scenario and the things we have to look at before making any decisions. Knowing the right questions to ask, we get the right answers that will matter to our design.
In my case, for the 4 things I listed in the previous post, the answers are below:
1. How often are the dimension tables updated? Does it keep history or not?
--Dimensions get updated whenever there is an update. Therefore 3 columns in all dimension table, start date, end date and history. History is flag either 'Y' or 'N' to indicate whether it is still active as of the current time stamp. The history is kept and new update comes as new insert
2. How often are data loaded into LSP_PWE_Config_Fact?
The configuration data are updated once a day at 11: 00 pm. There is currently a datekey column in the fact table that gives the date and hour. However, because it is only updated once a day, so before the next update we will have datekey as ''201211112300" for all of the records.
3. How often are data loaded into Pseudowire Fact or any performance fact?
The statistic record are at hourly level. In other words, the datekey for fact table is having 1 hour increment: 201211110000, 201211110200 ...... 201211112300 etc.
4. What does date key in date dimension and fact table have?
As indicated above. The datekey in Date Dimension also contains hours.
Remember the 2 dashboard requirements I mentioned before? I will just mention it again here:
Overview report queries: LSP Dim --> LSP_PWE_CONFIG_F--> Pseudowire Dim
In this case, LSP_PWE_CONFIG_F is more of an mapping table that join the 2 dimensions when they are N:N relationship.
Performance reports queries: LSP Dim --> LSP_PWE_CONFIG_F--> Pseudowire Dim --> Pseudowire Fact
Now how will I go above creating such reports? How would I design the data model for this?
First of all, I have to check the date against fact table to avoid full table scan, moreover, to ensure we are getting the right data at right time at the hourly or daily level, we have to join date Dimension to fact table.
What about configuration fact?
We also need to look at the date in the configuration fact table to ensure that we are picking the dimension objects that are active during the date range we pass from the filter. Does that mean we will have a model like this?
It looks like it will work but it has an issue. Suppose we have a list of LSP and Pseudowires that started on Nov 10 at 1:am, and it's never ended. The performance of these objects are constantly updated in fact table every hour and everyday. Now, the user goes into the system and select the date as Nov 11 from 5am to 7am. Fair enough, the user wants to see the performance of these objects between 5am to 7am on Nov 11th knowing that these values are recorded in the fact table.
However, based on the above model, no data will be returned, which is wrong. The reason why no data will be return is that the datekey from day dimension in this case will be 2011110500 ....2011110700, which will be passed to both LSP PWE Config fact and Pseudowire Fact. Now Pseudowire fact obviously will have data for these keys because records are updated every hour, the same won't be true for LSP PWE Config fact. Because of the fact that the data are updated once a day at 11pm, which is 23rd hour. So all of the keys in that table will be date||2300. This means, unless user's filter values includes 11pm's hour, the report will always be empty.
So how are we going to solve this problem?
There are several options we can think of from modeling's perspective. I will just list 2 here:
1. Substring the date key in LSP PWE Configuration fact so that the join will only look at the date part, but not hour part. In this case, the report will be correct since this table updates once a day
2. Add a start date, end date including time stamps and history column in the LSP PWE Config fact just like in Dimension tables. Instead of joining LSP PWE Config fact to date dimension, we will remove this join. Instead, we will use start date, end date from this table to receive filter values as well as hours. The date will be filtered based on directly passing filter values into this table, instead of through datekey.
Now obviously, the first approach sounds much easier. Yet, it has a drawback. Although it is going to work after making the change to check only date, not hours, it has a limitation. When the system decides to change the data collection frequency of this table from once a day to twice a day, this design is not going to be flexible enough to accommodate that. Because now we do need to look at hours, then what to do? From an architect's point of point, I need to take these potential changes into consideration in every design. So I can't just provide short term solutions.
Therefore, after some deliberation, I have decided to explore the second approach.
Stay tuned