Heat maps in Power BI

As of the March 2019 update, Power BI includes a heat map option available through the existing map visual.  It works well and is a welcome addition to the built-in map features of Power BI, but there are a few issues to consider.  There is also a custom Heatmap visual, published by a Microsoft employee on AppSource, that may fit your needs better.  Read on for more details…

The March 2019 release of Power BI has introduced heat maps for displaying spatial point data.  It’s nice to see the mapping capabilities getting some attention at last, as this is one area of huge untapped potential in Power BI.  To a large extent, we’ve been left with the default bubble and filled maps, with the shape map still stuck in preview for well over a year!  For anything beyond basic mapping, especially where you need some spatial intelligence, you’re really looking at third-party offerings from MapBox and Esri.  It really doesn’t seem like mapping is a high priority for Power BI.  However, in October 2018 Tableau released heat maps, and where Tableau goes, expect Power BI to not be far behind.  (The Tableau heat map feature appears to have broader application than maps, though, being a “density mark type” that can also be applied to other visual types such as scatter plots.)  So now we have heat maps in Power BI!

As part of this review I also looked at what alternative heat map custom visuals are available via AppSource.  There is currently a visual available for free called Heatmap, created by a developer named Weiwei Cui.  A little investigation reveals that Weiwei works for Microsoft Research Asia, so although this is a “third-party” custom visual, that third party is in effect Microsoft.  I mention this because I, like many other developers, am cautious about using custom visuals in production because of the ongoing support and maintenance implications.

Previously the built-in mapping options for point data (i.e. a single data point with a latitude and longitude) were limited to the bubble map, and it is this visual that has been updated with a heat map option.  The other built-in map – the filled map – also takes point data with latitudes and longitudes, but fills or shades an area of the map according to some existing boundary, so the data is aggregated to the defined area rather than plotted at its precise location as it is with the bubble map.

You may be familiar with the appearance you get when you add some spatial data to a bubble map in Power BI.  The initial effect is rarely what you want, especially if you have a lot of data points.  You tend to get a dense cluster of points that completely obliterates the underlying layers of the map.  You can tweak the bubble size and change the colour, but that’s about it when it comes to customisation.  For this reason it’s a map type that I’ve used infrequently, but with the heat map capability it’s time to look at this visual again.

For this review I downloaded the street crime data for Lancashire from data.police.uk.  This is a nice open data set for this kind of demonstration, with each recorded crime including its latitude and longitude as well as some other interesting attributes.  I’ve included data from February 2016 to January 2019 in the dataset and, as you can see below, this results in the usual blob of data points obscuring the underlying map.  We can do better with a heat map!

Here are my thoughts and findings:

The heat map feature is straightforward to use and is built into the existing map visual.  You can now choose the default bubble map, or turn the heat map setting on to convert the map to a heat map and enable a different set of customisations relevant to this map style.  It can take some time to tweak the data point radius and colour scheme to get the effect that you want, and there are a few things to be aware of.

The radius of the data points can be set in pixels or metres.  This has implications when zooming in or out.  As you zoom in, you might expect the areas of highest intensity (most data points) to become more prominent or easier to see, but this doesn’t happen if the radius is set in pixels.  In fact the opposite occurs, as the data points become less dense when spread over a greater part of the screen.  You can alter this behaviour by setting the radius in metres.  This fixes the area of intensity so that zooming retains the “hotspots”.  However, if you zoom out far enough, the pixel radius performs better, as the metres radius shrinks until it becomes practically invisible.  This matters because your map may auto-zoom depending on what cross-filtering you have enabled, so you need to understand the characteristics of the data – i.e. how spread out it is on the map – to determine how to specify the radius.

The custom Heatmap handles zooming differently.  As you zoom out, the data points become relatively closer together, so the heat intensifies.  As you zoom in, the opposite happens.  But instead of the heat spreading (and the colours at the low end of the colour scale predominating), each point retains its intensity.  In effect, on the new heat map, when the zoom is high enough that each point is identifiable, its colour is at the low end of the range (e.g. green).  This can make the patterns in the data difficult to discern when zooming in.  The custom Heatmap avoids this by revealing each data point as a dot using the colour from the high end of the range.  The behaviour of the custom visual just feels more intuitive.  However, there is one issue to bear in mind: since heat maps display a higher density of data in a more intense colour, you may not want a single point when zoomed in to show in high intensity (e.g. red), unless of course there are multiple data points at that precise location.  In that regard the new heat map handles this better, despite the effect on how clear the data points are when displayed in low-intensity colours (e.g. pale green).  I didn't find any way to change the default behaviour of the custom Heatmap to behave like the new heat map.  At least you have a choice between the two visuals depending on what you want to see when zoomed in, but it's a pity you can't toggle this behaviour in a single visual.

Try zooming in and out on the maps to see this effect.

The bottom line here is that the scale/zoom level has an impact on the heat map display.  Some experimentation will be needed with zoom levels (and radius settings with the new heat map) to get the desired effect.  Disabling auto-zoom might help maintain the default zoom level, but the user could manually zoom anyway, and disabling auto-zoom would also stop the map zooming in to the current bounds of the data when slicers or filters are applied, which is probably not what you want to happen.

The bubble map has also been included for comparison.  If your only choice is between the original bubble map view and the new heat map view, your choice would be determined by how many data points there are on the map and what level of scale or zoom you need the map to be at.  The bubble map would be more effective when zooming in to a few data points, as they would stand out better on the map than on the heat map.  For a higher-level view with more data points, the heat map gives a better representation, as the overlapping data points are turned to advantage to display the density of the data.  This overcomes the major drawback of the bubble map, where overlapping points obscure the underlying pattern in the data – how do you know how many data points lie under the pile you can see?  Of course you can reduce the size of the bubbles, as I have done in the example, but the whole point of the heat map is to show dense data that would cause even very small points to overlap.  The custom Heatmap has the best of both worlds, as it provides the heat map representation of the data but reveals more clearly identifiable points when zoomed in.  (N.B. A single point on the map is not necessarily only one data point, as points at the same precise location are aggregated.)

The heat map suffers from the major shortcoming of all the built-in Power BI maps: no map legend or distance scale!  These are such fundamental map basics that their omission is mystifying.  The custom Heatmap visual has both a legend for the data and a distance scale.

One area where the new heat map is better than both the bubble map and the custom Heatmap visual is map labels.  The new heat map puts labels above the data on the map, so the colours do not obscure place names.  On both the bubble map and the custom Heatmap, labels are obscured.  This can be mitigated somewhat by setting the transparency of the data colour, but the new heat map approach is better.  The best option of all would be a configurable setting to specify the z-order of the data layer and map labels, but none of these maps provide this.

There appears to be a bug in the new heat map where the data becomes invisible.  It seems to happen intermittently, e.g. when switching pages, but is difficult to reproduce and doesn't seem to be an issue on published reports, so it might only be an issue in Power BI Desktop.

There’s an irritating behaviour in the radius settings. If you delete the current value, you must enter the new value immediately (i.e. within about 250ms) or the default value suddenly appears.  I found I often ended up with a concatenation of the value I typed followed by the default value when I was experimenting with this setting.

The custom Heatmap is far more customisable.  There are five levels for the colour range, whereas the new heat map has three.  The base map in the custom Heatmap has the option to turn the display of certain map features – such as labels, place names and roads – on or off.  It also supports multiple languages.

The custom Heatmap also has an additional rendering style called contour, which renders the data as an isoline (e.g. isobar, isotherm) map.  This arguably makes it something other than a heat map, but it is an interesting option to consider.

I experienced an issue with the custom Heatmap once deployed to the Power BI service, in that it appears to show nothing other than the base map when on a page with the other maps.  Selecting a filter, though – or just clicking on one of the other visuals – does show the data.  The map on its own page did not have this issue.

In summary, if you need heat map functionality from one of the Power BI built-in maps, can live with the limitations and can't use a custom visual, then now you have it.  But Weiwei Cui’s Heatmap offers a bit more customisation and may fit your needs better.  Either way, the addition of the new heat map feature to the map visual is a welcome improvement.  I'm looking forward to more map features in future Power BI releases.

Foreign key constraints in data warehouses

I’ve had a recent engagement troubleshooting some data warehouse/ETL issues, one of which was some broken referential integrity (RI) between a fact table and its dimensions. This made me think about a recurrent issue in data warehouse design: should you use foreign key constraints? So, here’s my current thinking/practice on the issue, for my own and others’ reference.

Every fact table has foreign keys to the dimensions that it uses/references. The relationship is one-to-many, with the dimension on the one side and the fact on the many. There cannot be an id for a dimension row in the fact table that does not exist in the dimension itself – that would break referential integrity and result in spurious results in any reports. That’s not to say that a fact can’t exist if the dimension row is still unknown or not yet loaded – there are techniques to handle late-arriving dimension rows – but whatever value the fact table’s foreign key holds for the dimension in question must exist in the dimension. If it doesn’t, something is badly wrong.
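
To make this concrete, here is a minimal sketch of a fact table with foreign key constraints to two dimensions. The table and column names (DimDate, DimStore, FactSales and so on) are purely illustrative, not taken from any particular warehouse:

    -- Illustrative star schema: two dimensions and one fact table.
    -- All names here are hypothetical examples.
    CREATE TABLE DimDate (
        DateKey      INT  NOT NULL PRIMARY KEY,
        CalendarDate DATE NOT NULL
    );

    CREATE TABLE DimStore (
        StoreKey  INT          NOT NULL PRIMARY KEY,
        StoreName VARCHAR(100) NOT NULL
    );

    CREATE TABLE FactSales (
        DateKey     INT           NOT NULL,
        StoreKey    INT           NOT NULL,
        SalesAmount DECIMAL(18,2) NOT NULL,
        -- Each constraint enforces the one-to-many relationship:
        -- a fact row cannot reference a dimension key that does not exist.
        CONSTRAINT FK_FactSales_DimDate  FOREIGN KEY (DateKey)  REFERENCES DimDate (DateKey),
        CONSTRAINT FK_FactSales_DimStore FOREIGN KEY (StoreKey) REFERENCES DimStore (StoreKey)
    );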

The easy way to guarantee that this never happens is to create a foreign key constraint between the dimension and the fact. When this is enforced, it is not possible to insert (or update) a fact with an invalid or missing dimension key. So, this is the simple solution. However, there is a cost to maintaining referential integrity, and one of the goals of data warehouses is performance. Now, the existence of foreign key constraints won’t have any impact on report performance, but it may impact ETL. If there are issues with ETL performance, such as a time-consuming load process or a very small ETL window before up-to-date reports must be available, then optimising the ETL process may be required. When the fact table rows are inserted, the existence of a foreign key constraint will impact the time taken to load the fact table, because, after all, the constraint requires that referential integrity is checked at insertion time. If there is no related dimension record, the integrity check will fail and the fact won’t get inserted. This would normally be a good thing, except where you have to squeeze the last bit of performance out of your ETL process, in which case disabling the constraint may help.
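
If load performance does become the deciding factor, a constraint can be disabled around the bulk insert and re-enabled afterwards. The sketch below uses SQL Server syntax (NOCHECK / WITH CHECK); other platforms have their own equivalents, and the table and constraint names are the hypothetical ones from the sketch above:

    -- Disable checking of one foreign key before a bulk load (SQL Server syntax).
    ALTER TABLE FactSales NOCHECK CONSTRAINT FK_FactSales_DimDate;

    -- ... load the fact table here ...

    -- Re-enable the constraint afterwards. The extra CHECK makes the engine
    -- re-validate existing rows, so the constraint is trusted again rather
    -- than merely switched back on.
    ALTER TABLE FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimDate;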

However, once this is done, the door is potentially open for compromised referential integrity. There is an argument that a well-designed and well-tested ETL process should look after referential integrity and you should never see a fact with an invalid/missing foreign key. While this is true, things never stay the same and there are always unknown unknowns, such as an inexperienced developer working on the ETL and breaking something.

Often, when I am faced with decisions such as this, I refer to Kimball. This is what The Data Warehouse ETL Toolkit (pp. 212–214) says on the matter:

  • Option 1: check RI during data preparation in the ETL process.
  • Option 2: check RI at insert time, i.e. use foreign key constraints.
  • Option 3: create a process to check RI after the facts have been inserted.

Kimball says option 1 is often the most practical, because of the performance hit of option 2, but points out that this is just a software/hardware limitation in practice and may not be an issue with a particular system or environment. Option 3 is generally not preferred to the other two options.
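
Whichever option you choose, the underlying test is the same: find fact rows whose dimension key has no match in the dimension. Here is a minimal sketch of that orphan check, reusing the hypothetical tables above, which could be run against staged data (option 1) or the loaded fact table (option 3):

    -- Count fact rows whose StoreKey has no matching row in DimStore.
    SELECT f.StoreKey, COUNT(*) AS OrphanedRows
    FROM FactSales AS f
    LEFT JOIN DimStore AS d
        ON d.StoreKey = f.StoreKey
    WHERE d.StoreKey IS NULL
    GROUP BY f.StoreKey;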

So Kimball seems to lean towards checking RI in the ETL over foreign key constraints. However, I subscribe to the principle that simple is better than complex, and ETL processes can get quite complex, which means that during further development/maintenance subtle errors may creep in. Foreign key constraints are by contrast simple to understand and simple to implement.

So, without getting too deep into a theoretical and subjective argument, here is my current practice for foreign key constraints:

  • At design time foreign key constraints should be used. They guarantee referential integrity when the facts are inserted and will throw an error that the developer/admin can catch and fix before the users report issues with the data.
  • Design the ETL as if the constraints are not there; i.e. follow Kimball’s recommendation to enforce RI during data preparation.
  • If ETL performance is an issue and removing the foreign key constraints will make a difference, then remove the constraints. But test the pre- and post-ETL performance to see what’s been gained, and only do this if it makes a real difference and other performance optimisations haven’t been tried first or have been ruled out for other reasons, such as cost.

This gives the assurance that RI cannot be compromised, even if modifications are made to a complex ETL process, while at the same time providing scope to optimise fact table load speed by disabling the foreign key constraints should it be required.