Sunday, August 1, 2010

CDC Technique for dimension table which is based on a multi-table query

By: Milind Zodge

In the Data warehousing project you have dimension and fact tables. Usually, if data is coming from a single table, we can use the approach what I have presented in the last article "Change data capture for Oracle 9i database without adding triggers on the source table".
There are also plenty of other options available like CDC, using timestamp etc. However the problems comes when you have a dimension table which is constructed based on a multi-table query. In this case none of the above approach can work directly.
Overview
Consider a case of Sales Representative dimension, this dimension is based several attributes like area, login etc. These attributes are coming from different tables. Now we will see what we can use to have an incremental update of this table.
The examples shown in this article is for Oracle database however same concept can be used for other database engines.

Step 1: Creating a Function which will return hash value
We will be using a hash value technique to compare the rows. Well we really have one more option, compare each field and see if any one of them is changed and that way determine the changed row.

However hash value method is faster than the above approach and code also become manageable with less conditional statements. Both methods are same though.
Create a function such that it will read a value as a text parameter and will return a hash value for it.

e.g.
FUNCTION dim_hashvalue (p_input_str VARCHAR2) RETURN VARCHAR2 IS
l_str VARCHAR2(20);
BEGIN
l_str := dbms_obfuscation_toolkit.md5(input_string => p_input_str);
RETURN l_str;
END dim_hashvalue;

Step 2: Add a new column in the dimension table to hold a hashvalue
Create a new column "hashvalue" in the dimension table. And update its value by using the above created function using the required columns.
Make sure you use the same set of columns and in the same sequence in the ETL logic to create a hash value for a new row.

Step 3: Write ETL code
In the ETL code read the records from this multi-table SQL in a cursor loop. For each record find out the hash key value. Get the old hash key value by selecting the record from the dimension table using a key. If no record exist then insert the record. If record exists compare these tow hash keys if it is different update the record otherwise skip it.

Conclusion
This way you can achieve change data capture for a multi-table select statement query used for dimension table.

No comments: