Power BI/Fabric model performance – Query Folding

This is a quick post about how to get the most out of Power BI (Fabric) model performance: keeping those models under control, refresh times low and live dashboards snappy. In this post I am going to focus on data sources and loading data efficiently.

Power Query and M can do everything!

While Power Query and its underlying M language are great and can do a lot, as with anything, that does not mean you should use them for everything.

I imagine most Power Query users at some point reduce the data in their model. This might be with filters on certain columns to reduce the rows, or by deleting unused columns entirely. This makes sense, and if you are not doing this, start doing it right now. The eagle-eyed among you may have noticed, however, that this does not always reduce the number of rows that are initially loaded, and as such your model refreshes still take a long time. This is because your steps may be forcing all the data to be loaded into Power Query before those steps are applied. You may also find that sometimes it does improve the load times and you are not sure why.
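To make this concrete, here is a minimal sketch of what a query with foldable reduction steps might look like, assuming a SQL Server source (the server, database, table and column names here are made up):

```
let
    // Connecting via Sql.Database keeps query folding available
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
    // Row filter - can fold to a WHERE clause at the source
    Filtered = Table.SelectRows(Sales, each [OrderDate] >= #date(2023, 1, 1)),
    // Column removal - can fold to a narrower SELECT list
    Reduced = Table.RemoveColumns(Filtered, {"InternalNotes", "LegacyCode"})
in
    Reduced
```

When these steps fold, the filtering and column removal happen at the database, so only the reduced result set ever travels to Power Query.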

Using SQL queries as data sources

A way to improve the load times of your model is to reduce the result set up front by using a SQL query as your data source. This is not wholly a bad idea; however, it can quickly become more painful than it first seems, depending on the scale of your activities.

Doing this introduces another piece of custom code that needs to be managed and, dare I say, source and change controlled. Even if you put the code in a stored procedure or view so that it is managed in your data platform, the above still applies. And all of this assumes you have permissions on the data platform to do it in the first place.

Perhaps equally important is that doing this removes the ability for Power Query to perform query folding for you (there are workarounds, but this will be true for most). If your query only has the single source load step, then this might not be a problem, but just because that is the case now doesn't mean it will be in the future.

Short version: use tables, not queries… yes, even if the query is "SELECT * FROM myTable".
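As an illustration of the difference, assuming a SQL Server source with hypothetical names, the two connection styles look like this in M:

```
// Native SQL query as the source: later steps generally cannot fold
// (Value.NativeQuery with EnableFolding is the main workaround)
Source = Sql.Database("myserver", "mydb", [Query = "SELECT * FROM myTable"])

// Table navigation: later steps can fold back to the source
Source = Sql.Database("myserver", "mydb"),
myTable = Source{[Schema = "dbo", Item = "myTable"]}[Data]
```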

What is Query Folding?

The short version is: Power Query takes the steps you created and "folds" them into a single efficient SQL query, reducing the rows you have to import and pushing the processing back onto the database platform, which is most likely more efficient at it.
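As a rough illustration (hypothetical table and column names), here are two simple applied steps and the kind of single query they can fold into:

```
// These two steps...
Filtered = Table.SelectRows(Sales, each [Region] = "EMEA"),
Reduced = Table.SelectColumns(Filtered, {"OrderID", "Amount"})

// ...can fold into roughly this one query at the source:
// SELECT [OrderID], [Amount] FROM [dbo].[Sales] WHERE [Region] = 'EMEA'
```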

You can read more about this here: Understanding query evaluation and query folding in Power Query – Power Query | Microsoft Learn

How do I know when query folding is happening?

This is actually really easy. For each step in the 'Applied Steps' section of the Power Query editor, there are some icons/indicators.

You want to see the folding indicator on as many steps as possible. The following link contains a much more detailed explanation of the indicators: Query folding indicators in Power Query – Power Query | Microsoft Learn

Further to the indicators, you can right-click a step and open the query plan for it. Here you can see the plan for that step, giving a more detailed view of the parts that fold (marked with the Remote tag), and you can even see the query that is being run. More on the query plan here: Query plan – Power Query | Microsoft Learn

Order Matters

As soon as a step cannot fold, no following step will fold (technically this is not exactly true, but it is a good rule to follow for simplicity's sake). This presents an opportunity: where the order is not fundamental to your transformations, get as much folding done as possible before the non-foldable step(s).
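Here is a sketch of what this looks like in practice, again with made-up names; Table.AddIndexColumn is a common example of a step that does not fold against many SQL sources:

```
let
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
    // Folds: the row reduction happens at the source
    Filtered = Table.SelectRows(Sales, each [Amount] > 0),
    // Typically does not fold, so everything from here on
    // runs locally in Power Query - do your reductions first
    Indexed = Table.AddIndexColumn(Filtered, "RowNumber", 1, 1)
in
    Indexed
```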

Order also matters literally. For example (shamelessly taken from the MS link above), if you use LastN, it cannot fold because there is no BOTTOM in SQL; there is, however, a TOP, so consider reversing the sort order of the data and using TopN instead. Sounds obvious, but only if you are aware of what query folding is.
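In M terms, the swap might look like this (hypothetical table and column names):

```
// Table.LastN cannot fold - there is no BOTTOM in SQL
Last10 = Table.LastN(Sales, 10)

// Sort descending and take the first N instead - this can fold,
// roughly to: SELECT TOP 10 ... ORDER BY [OrderDate] DESC
Sorted = Table.Sort(Sales, {{"OrderDate", Order.Descending}}),
Top10 = Table.FirstN(Sorted, 10)
```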

What will fold?

The easiest way to think about all this is to ask what is possible in the data source. Things like filtering (a WHERE clause), removing columns (the SELECT list) and some basic calculations (multiplying a column by 10) will generally fold against a SQL database source, for example.
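For instance, the multiply-by-10 example might look like this in M (hypothetical column name), and would generally fold to an expression in the SELECT list:

```
// Can fold to something like: SELECT [Amount] * 10 AS [Amount] ...
Multiplied = Table.TransformColumns(Sales, {{"Amount", each _ * 10, type number}})
```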

While this is true, there is a level of complication involved. Just because you might be able to do something in SQL doesn't mean that Power Query will know how to, so another good rule is to keep the applied steps as simple as possible for as long as possible, especially those that reduce the size of the data set.

While you may not have much influence over this, a simple dimensional model can also make this easier, as it will naturally lean towards your Power Query steps being simple.

Deliberate

As you have probably gathered, there is a lot of information on query folding, and it can get complex. What I hope to have provided is a short introduction to get you started and, ultimately, to promote awareness so that you can use it deliberately in the future.

Power BI/Fabric model performance – VertiPaq

This is a quick post about how to get the most out of Power BI (Fabric) model performance: keeping those models under control, refresh times low and live dashboards snappy. In this post I am going to focus on the data storage engine used in Power BI.

The same engine, with some tweaks, powers SQL Server columnstore technology, so what I will cover here also applies to that.

VertiPaq

VertiPaq is the data engine in Power BI. It's how the data is stored and retrieved once you have got it into your Power BI semantic model. It is primarily an in-memory engine with high compression capabilities, and it is sometimes referred to as the xVelocity engine.
Compression and performance are largely driven by the number of unique values in any given column. The fewer unique values a column has, the better that column will compress.

Date and Time

Let's consider the use of dates and times. Imagine we have a sales table which has an entry for each sales transaction and stores the date and time to the nearest hour. Which is more performant in Power BI: a combined Date and Time column, or two separate Date and Time columns?
There are 365 days in a year, so if our data only covered a single year, a separate Date column would have a maximum of 365 unique values.
There are 24 hours in a day. As we are only storing the time of transactions to the nearest hour, the Time column would have a maximum of 24 unique values.
If we had a combined Date and Time column, the uniqueness is the Cartesian product of the two columns: in our case 365 x 24 = 8,760 potential unique values, versus 365 + 24 = 389 values across the two separate columns. That is a multiplicative increase, and it translates into a correspondingly larger model.
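A minimal sketch of the split in Power Query, assuming a combined column called SaleDateTime (the names are made up):

```
// Derive separate low-cardinality Date and Time columns...
WithDate = Table.AddColumn(Sales, "SaleDate", each DateTime.Date([SaleDateTime]), type date),
WithTime = Table.AddColumn(WithDate, "SaleTime", each DateTime.Time([SaleDateTime]), type time),
// ...then drop the high-cardinality combined column
Result = Table.RemoveColumns(WithTime, {"SaleDateTime"})
```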

Unique Identifiers

Unique identifiers in your model will also have a big impact. Let's assume you are reporting over your very large customer base. You want to do things like counts of distinct customers, by demographics and more. Each row in your customer table represents an individual customer, along with a unique ID. Suddenly you have a completely incompressible column. Worse still, the customer ID is not stored as an efficient INT or BIGINT; it's a VARCHAR.
Not a lot you can do about this, right? We need it to count customers, don't we? Well, no, actually. If you know each row represents a customer, and you are not explicitly using the customer ID for anything, you can just count the number of rows and remove the completely incompressible column. The row count will still represent a count of the unique customers, under whatever filtering you apply, but your model will be noticeably smaller and more performant.
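A sketch of the removal in Power Query, with hypothetical names; the distinct count is then replaced by a plain row count (COUNTROWS rather than DISTINCTCOUNT in DAX):

```
// If the ID is not used for relationships or reporting,
// drop it before load and count rows instead
Reduced = Table.RemoveColumns(Customer, {"CustomerID"})
```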

Summary

Always consider column uniqueness (cardinality) in your Power BI/Fabric semantic models. Can you split columns and still provide the same functionality, as in the Date and Time example? Are you explicitly using unique identifiers, or can you get the same functionality another way?