Home > View Post

SSIS and reusing tasks within packages

To a traditional developer, who has gotten use to concepts like refactoring and automated testing, SSIS can feel like a very dark place.

It's very difficult to avoid cutting and pasting code, even within the same package, which makes them very brittle to change.

Take this simple example where we consider a slowly changing dimension (and, for the purposes of this example we're not using the "Slowly Changing Dimension" task provided out of the box). In Data Warehousing we typically consider changes in two ways:

  • Type I - these changes merely overwrite the existing data. We're not interested in maintaining a history of this particular attribute.
  • Type II - these changes insert a new copy of the record, such that we still have the old attributes to maintain a history.

To demonstrate the concept, imagine we have a Products table in our Data Warehouse as shown below, where we consider the ProductName attribute to be Type I (we don't need to maintain a history) and the Price attribute to by Type II (we DO need to maintain a history).

ProductIdProductNamePriceEffectiveFromEffectiveTo
6Soopa Hair Gel£1.102004-01-01NULL

So, if somebody changes the name of the product it will be picked up in our ETL (Extract, Transform and Load) processes and our Product table in the DataWarehouse would look like this (simplified):

ProductIdProductNamePriceEffectiveFromEffectiveTo
6SoopaDoopa Hair Gel£1.102004-01-01NULL

However, if the price is changed, the table will look like this:

ProductIdProductNamePriceEffectiveFromEffectiveTo
6SoopaDoopa Hair Gel£1.102004-01-012006-06-02
6SoopaDoopa Hair Gel£1.252006-06-02NULL

Easy peasy. And this type of thing is really common in SSIS, and usually ends in a DataFlow something like this (this is very simplified for the purposes of this post):

SSIS data flow

The problem with this flow is that both OLE DB Commands which perform the updates are identical, as are the OLE DB Destinations used to insert the Type II changes.

However, if you introduce the Union All (merges flows) and Multicast (splits flows into multiple, identical outputs) you can reduce this code duplication like so:

SSIS data flow

Now I realise that the latter actually contains more objects, but the Multicast and Union All tasks are very very simple. So if, (when) we decide we need to change something, we only have to manage the updates and inserts in one place. The scenario I've described here is a little false but there are plenty of places where you can use this technique.

It maybe that there is some performance penalty that you pay but I'm yet to witness any. As usual, your mileage may vary.

What techniques do you employ to make your packages less brittle to change and more manageable?

Tags: SQL

 
Josh Post By Josh Twist
9:08 AM
02 Jun 2006

» Next Post: Resolve function in Visual Studio 2005
« Previous Post: Vista Woes

Comments are closed for this post.

© 2005 - 2017 Josh Twist - All Rights Reserved.