I’ve had a recent engagement troubleshooting some data warehouse/ETL issues, one of which was some broken referential integrity (RI) between a fact table and its dimensions. This made me think about a recurrent issue in data warehouse design which is should you use foreign key constraints. So, here’s my current thinking/practice on the issue for my and others reference.
Every fact table has foreign keys to the dimensions that it uses/references. The relationship is one-to-many with the dimension on the one side and the fact on the many. There cannot be an id for a dimension row in the fact table that does not exist in the dimension itself – that would break referential integrity and result in spurious results in any reports. That’s not to say that a fact can’t exist if the dimension row is still unknown or not yet loaded – there are techniques to handle late arriving dimension rows – but whatever value exists for the fact table’s foreign key for the dimension in question must exist. If it doesn’t something is badly wrong.
The easy way to guarantee that this never happens is to create a foreign key constraint between the dimension and the fact. When this is enforced it is not possible to insert (or update) a fact with an invalid or missing dimension key. So, this is the simple solution. However, there is a cost to maintaining referential integrity and one of the goals of data warehouses is performance. Now the existence of foreign key constraints won’t have any impact on report performance, but it may impact ETL. If there are issues with ETL performance, such as a time-consuming load process or a very small ETL window before up-to-date reports must be available, then optimising the ETL process may be required. When the fact table rows are inserted, the existence of foreign key constraint will impact the time taken to load the fact table, because after all, this constraint requires referential integrity is checked at insertion time. No related dimension record will cause the integrity check to fail and the fact won’t get inserted. This would normally be a good thing, except where you have to squeeze the last bit of performance out of your ETL process, in which case disabling the constraint may help.
However, once this is done, the door is now potentially open for compromised referential integrity. There is an argument that a well designed and tested ETL process should look after referential integrity and you should never see a fact with an invalid/missing foreign key. While this is true, things never stay the same and there are always future unknown unknowns such as inexperienced developers working on the ETL and breaking something.
Often, when I am faced with decisions such as this I refer to Kimball. This is what The Data Warehouse ETL Toolkit (pp 212-214) says on the matter:
Option 1, check RI during data preparation in the ETL process
Option 2, do it at insert time – i.e. use foreign key constraints
Option 3, create a process to check RI after the fact has been inserted.
Kimball says option 1 is often the most practical, because of the performance hit of option 2, but points out that this is just a software/hardware limitation in practice and may not be an issue with a particular system or environment. Option 3 is generally not preferred to the other two options.
So Kimball seems to lean towards checking RI in the ETL over foreign key constraints. However, I subscribe to the principle that simple is better than complex, and ETL processes can get quite complex, which means that during further development/maintenance subtle errors may creep in. Foreign key constraints are by contrast simple to understand and simple to implement.
So, without getting too deep into a theoretical and subjective argument here is my current practice for foreign keys constraints:
- At design time foreign key constraints should be used. They guarantee referential integrity when the facts are inserted and will throw an error that the developer/admin can catch and fix before the users report issues with the data.
- Design the ETL as if the constraints are not there; i.e. follow Kimball’s recommendation to enforce RI during data preparation.
- If ETL performance is an issue and removing the foreign key constraints will make a difference then remove the constraints. But, test the pre and post ETL performance to see what’s been gained, and only do this if it makes real difference and other performance optimisations haven’t been tried first or have been ruled out for other reasons, such as cost.
This gives the assurance that RI cannot be compromised, even if modifications are made to a complex ETL process, while at the same time providing scope to optimise fact table load speed by disabling the foreign key constraints should it be required.