Slowly Changing Dimensions with PySpark and Delta Lake

Slowly Changing Dimensions (SCDs) are a core concept in data warehousing for managing dimension data that changes over time. As entities evolve, their attribute changes need to be tracked and managed deliberately, and SCD techniques define how a dimension table records those changes.

Take an example table:

There are three primary types of SCD:

  1. SCD Type 1 - The simplest approach: existing data is overwritten by new data, so only the latest values are stored and historical information is lost. This method is suitable when historical data isn't critical.

    Table 1:

    Table 2:

  2. SCD Type 2 - Type 2 SCDs keep track of both current and historical data. When an attribute changes, a new record is inserted into the dimension table with the updated attribute values and a new surrogate key, and the validity period of each version is recorded (often with effective dates). This preserves historical data, enabling analysis of how dimensions evolve over time. It typically adds tracking columns such as start_date and end_date, often together with a flag marking the current record.

    Table 1:

    Table 2:

  3. SCD Type 3 - Unlike Type 1, which keeps only the most recent state, and Type 2, which keeps full history as additional rows, Type 3 allows limited historical tracking while minimizing storage requirements. It keeps the previous value of a changed attribute in an extra column alongside the current value.

    Table 1:

    Table 2:
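
The three approaches above can be sketched in plain Python to make the row-level behaviour concrete. This is a minimal illustration, not a PySpark or Delta Lake implementation: the customer record, the column names (`sk`, `start_date`, `end_date`, `previous_city`) and the helper functions are all hypothetical, and each dimension table is modelled as a list of dicts. With Delta Lake, the same logic would typically be expressed as a MERGE against the dimension table.

```python
from copy import deepcopy
from datetime import date


def scd_type1(dim, change, key="id"):
    """Type 1: overwrite the matching row; no history is kept."""
    rows = deepcopy(dim)
    for row in rows:
        if row[key] == change[key]:
            row.update(change)  # old values are lost
            return rows
    rows.append(dict(change))  # unseen key: plain insert
    return rows


def scd_type2(dim, change, effective, key="id"):
    """Type 2: close the current row and append a new current version."""
    rows = deepcopy(dim)
    for row in rows:
        if row[key] == change[key] and row["end_date"] is None:
            if all(row[k] == v for k, v in change.items()):
                return rows  # nothing changed, keep the current row open
            row["end_date"] = effective  # close the old version
    next_sk = max((r["sk"] for r in rows), default=0) + 1
    rows.append({"sk": next_sk, **change,
                 "start_date": effective, "end_date": None})
    return rows


def scd_type3(dim, change, column, key="id"):
    """Type 3: keep only the previous value in a `previous_<column>` field."""
    rows = deepcopy(dim)
    for row in rows:
        if row[key] == change[key] and row[column] != change[column]:
            row[f"previous_{column}"] = row[column]
            row[column] = change[column]
    return rows


# Hypothetical customer who moves from Pune to Delhi.
base = {"id": 1, "name": "Asha", "city": "Pune"}
change = {"id": 1, "name": "Asha", "city": "Delhi"}

type1 = scd_type1([base], change)
type2 = scd_type2(
    [{"sk": 1, **base, "start_date": date(2023, 1, 1), "end_date": None}],
    change,
    effective=date(2024, 1, 1),
)
type3 = scd_type3([{**base, "previous_city": None}], change, "city")
```

After the change is applied, `type1` holds a single row with the new city, `type2` holds two rows (the closed Pune version and the open Delhi version with a new surrogate key), and `type3` holds a single row carrying both the current and the previous city.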