What makes Power BI so fast and efficient? So efficient that it performs complex calculations over millions of rows in the blink of an eye.
In this article, we are going to dig deep to discover what is "under the hood" of Power BI, how your data is being stored, compressed, queried, and finally, brought back to your report. Once you finish reading, I hope you will have a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model to get maximum performance from the Power BI engine.
First look under the hood — Formula Engine and Storage Engine
First, I want you to meet the VertiPaq engine, the "brain & muscles" of the system behind not only Power BI, but also Analysis Services Tabular and Excel Power Pivot. Truth be told, VertiPaq represents just one part of the storage engine within the Tabular model, the other being DirectQuery, which we will discuss separately in one of the next articles.
When you send a query to get data for your Power BI report, here's what happens:
- The Formula Engine (FE) accepts the request, processes it, generates the query plan, and finally executes it
- The Storage Engine (SE) pulls the data out of the Tabular model to satisfy the request issued within the query generated by the Formula Engine
The Storage Engine works in two different ways to retrieve the requested data: VertiPaq keeps a snapshot of the data in memory. This snapshot can be refreshed from time to time from the original data source.
On the contrary, DirectQuery doesn't store any data. It simply forwards the query straight to the data source for every single request.
Data in the Tabular model is therefore stored either as an in-memory snapshot (VertiPaq) or in DirectQuery mode. However, there is also the possibility of implementing a hybrid Composite model, which relies on both architectures in parallel.
Formula Engine — the "Brain" of Power BI
As I already stressed, the Formula Engine accepts the query, and since it is able to "understand" DAX (and MDX as well, but that is out of the scope of this series), it "translates" DAX into a specific query plan, consisting of the physical operations that need to be executed to get the results back.
These physical operations can be joins between multiple tables, filtering, or aggregations. It is important to know that the Formula Engine works in a single-threaded way, which means that requests to the Storage Engine are always sent sequentially.
Storage Engine — the "Muscles" of Power BI
Once the query has been generated and executed by the Formula Engine, the Storage Engine comes onto the scene. It physically goes through the data stored within the Tabular model (VertiPaq) or goes directly to a different data source (SQL Server, for example, if DirectQuery storage mode is in place).
When it comes to specifying the storage mode for a table, there are three possible options to choose from:
- Import mode — based on VertiPaq. Table data is stored in memory as a snapshot. Data can be refreshed periodically
- DirectQuery mode — data is retrieved from the data source at query time. Data resides in its original source before, during, and after the query execution
- Dual mode — a combination of the first two options. Data from the table is loaded into memory, but at query time it can also be retrieved directly from the source
As opposed to the Formula Engine, which doesn't support parallelism, the Storage Engine can work asynchronously.
Meet the VertiPaq Storage Engine
Now that we have drawn the big picture, let me explain in more detail what VertiPaq does in the background to boost the performance of our Power BI reports.
When we choose Import mode for our Power BI tables, VertiPaq performs the following actions:
- Reads the data source, transforms the data into a columnar structure, and encodes and compresses the data within each of the columns
- Establishes a dictionary and index for each of the columns
- Prepares and establishes relationships
- Computes all calculated columns and calculated tables, and compresses them
The two main characteristics of VertiPaq are:
- VertiPaq is a columnar database
- VertiPaq is an in-memory database

As you can see in the illustration above, columnar databases store and compress data differently from traditional row-store databases. Columnar databases are optimized for vertical data scanning, which means that every column is structured in its own way and physically separated from the other columns!
Without going into a deep analysis of the advantages and disadvantages of row-store vs. column-store databases, since that would require a separate series of articles, let me just point out a few key differences in terms of performance.
With columnar databases, single-column access is fast and efficient. Once the computation starts to involve multiple columns, things become more complex, since the results of the intermediate steps need to be temporarily stored in some way.
Simply put, columnar databases are more CPU-intensive, while row-store databases increase I/O, because of the many scans of unnecessary data.
So far, we have painted the big picture of the architecture that enables Power BI to fully shine as an ultimate BI tool. Now, we are ready to dive deeper into specific architectural features and leverage this knowledge to get the most out of our Power BI reports, by tuning our data model to extract the maximum from the underlying engine.
Inside VertiPaq in Power BI — Compress for success!

As you may recall from the earlier part of this article, we scratched the surface of VertiPaq, a powerful storage engine that is "responsible" for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or a Composite model).
3, 2, 1…Fasten your seatbelts!
One of the key characteristics of VertiPaq is that it is a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from the other columns.
That fact allows VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.
Compression is achieved by encoding the values within the column. But, before we dive deeper into a detailed overview of the encoding techniques, just keep in mind that this architecture is not exclusive to Power BI — in the background is a Tabular model, which is also "under the hood" of Analysis Services Tabular and Excel Power Pivot.
Value Encoding
This is the most desirable encoding type, since it works exclusively with integers and, therefore, requires less memory than, for example, working with text values.
How does this look in reality? Let's say we have a column containing the number of phone calls per day, and the values in this column vary from 4,000 to 5,000. What VertiPaq does is take the minimum value in this range (4,000) as a starting point, then calculate the difference between this value and all the other values in the column, storing that difference as the new value.

At first glance, 3 bits per value might not look like a significant saving (storing the raw values between 4,000 and 5,000 takes 13 bits, while the offsets from the minimum, 0 to 1,000, fit into 10 bits), but multiply it by millions or even billions of rows and you will appreciate the amount of memory saved.
As I already stressed, Value encoding is applied exclusively to integer data type columns (the currency data type is also stored as an integer).
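Purely to illustrate the idea (this is a sketch, not how VertiPaq is implemented internally), the same offset-from-minimum trick can be expressed in T-SQL, assuming a hypothetical dailyCalls table with a callsPerDay column:
-- Hypothetical example: store the offset from the minimum instead of the raw value
SELECT callsPerDay
      ,callsPerDay - MIN(callsPerDay) OVER () AS encodedValue   -- e.g. 4,105 becomes 105
FROM dailyCalls;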
Hash Encoding (Dictionary Encoding)
This is probably the compression type most frequently used by VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within a column and afterwards replaces the "real" values with index values from the dictionary.
Here is an example to make things clearer:

As you may notice, VertiPaq identified the distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored the index values as pointers to the "real" values. I assume you are aware that integer values require far less memory than text, so that is the logic behind this type of data compression.
Additionally, since it can build a dictionary for any data type, VertiPaq is practically data type independent!
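To make the dictionary idea more tangible, here is a minimal T-SQL sketch (again, an illustration of the concept, not VertiPaq's internal code), assuming a hypothetical Students table with a Subject column:
-- Hypothetical example: replace each text value with its index into a dictionary of distinct values
SELECT Subject
      ,DENSE_RANK() OVER (ORDER BY Subject) - 1 AS dictionaryIndex   -- 0-based pointer to the dictionary entry
FROM Students;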
This brings us to another key takeaway: no matter if your column is of text, bigint, or float data type — from VertiPaq's perspective it's the same — it needs to create a dictionary for each of those columns, which implies that all of these columns will provide the same performance, both in terms of speed and memory space allocated! Of course, this assumes that there are no significant differences in dictionary sizes between those columns.
So, it is a myth that the data type of a column affects its size within the data model. On the contrary, the number of distinct values within the column, which is known as cardinality, is what mostly influences column memory consumption.
RLE (Run-Length Encoding)
The third algorithm (RLE) creates a kind of mapping table, containing ranges of repeating values, thus avoiding storing every single (repeated) value separately.
Again, looking at an example will help you better understand this concept:

In real life, VertiPaq doesn't store the Start values, because it can quickly calculate where the next entry starts by summing the previous Count values.
As powerful as it may look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and change more frequently, then RLE would not be an optimal solution.
One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn't store data the way it is shown in the illustration above. First, it performs Hash encoding and creates a dictionary of the subjects, and then applies the RLE algorithm, so the final logic, in its most simplified form, would be something like this:

So, RLE occurs after Value or Hash encoding, in those scenarios where VertiPaq "thinks" that it makes sense to compress the data additionally (when the data is ordered in such a way that RLE would achieve better compression).
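As one more conceptual sketch (assuming the same hypothetical Students table, plus a rowID column that defines the physical order), the classic gaps-and-islands pattern in T-SQL produces exactly the kind of value/count mapping table that RLE relies on:
-- Hypothetical example: collapse consecutive repeating values into (value, run length) pairs
WITH runs AS (
    SELECT Subject
          ,ROW_NUMBER() OVER (ORDER BY rowID)
           - ROW_NUMBER() OVER (PARTITION BY Subject ORDER BY rowID) AS runGroup   -- constant within one run
    FROM Students
)
SELECT Subject
      ,COUNT(*) AS runLength   -- how many consecutive rows share this value
FROM runs
GROUP BY Subject, runGroup;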
Re-encoding considerations
No matter how "smart" VertiPaq is, it can also make some bad decisions based on incorrect assumptions. Before I explain how re-encoding works, let me briefly walk through the process of data compression for a specific column:
- VertiPaq scans a sample of rows from the column
- If the column data type is not integer, it will look no further and use Hash encoding
- If the column is of an integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that it is probably a primary key and chooses Value encoding
- If the numbers in the column are reasonably close to each other (the value range is not very wide, like in our example above with 4,000–5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when the values fluctuate significantly within the range (for example, between 1,000 and 1,000,000), Value encoding doesn't make sense, and VertiPaq will apply the Hash algorithm
However, it can sometimes happen that VertiPaq decides which algorithm to use based on the sample data, but then some outlier pops up and the column needs to be re-encoded from scratch.
Let's use our previous example with the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, all of a sudden it finds a value of 500,000 (it could be an error, or whatever). Now, VertiPaq re-evaluates the choice, and it can decide to re-encode the column using the Hash algorithm instead. Obviously, that would affect the whole process in terms of the time needed for reprocessing.
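If you want to anticipate this situation before the data ever reaches Power BI, a quick profiling query on the source shows the value range and the cardinality that drive VertiPaq's choice. A sketch, reusing the hypothetical dailyCalls table from above:
-- Profile a numeric column before import to spot wide ranges, outliers, and high cardinality
SELECT MIN(callsPerDay)            AS minValue
      ,MAX(callsPerDay)            AS maxValue
      ,COUNT(DISTINCT callsPerDay) AS distinctValues
      ,COUNT(*)                    AS totalRows
FROM dailyCalls;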
Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:
- Number of distinct values in the column (cardinality)
- Data distribution in the column — a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
- Number of rows in the table
- Column data type — affects only the dictionary size
Reducing the data model size by 90% — a real story!
Now that we have laid the theoretical ground for understanding the architecture behind the VertiPaq storage engine, and which types of compression it uses to optimize your Power BI data model, it is the right moment to get our hands dirty and apply our knowledge to a real-life case!
Starting point = 776 MB
Our data model is quite simple, yet memory-intensive. We have a fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct), which relates to the fact table. Our fact table has around 9 million rows, which shouldn't be a big deal for Power BI, but the table was imported as is, without any additional optimization or transformation.

Now, this pbix file consumes a whopping 777 MB! You can't believe it? Just take a look:

Just remember this picture! Of course, I don't need to tell you how much time this report needs to load or refresh, and how slow our calculations are because of the file size.
…and it’s even worse!
Additionally, it is not just the 776 MB that takes up our memory, since memory consumption is calculated taking into account the following factors:
- PBIX file
- Dictionary (you learned about the dictionary in the earlier sections of this article)
- Column hierarchies
- User-defined hierarchies
- Relationships
Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I'll see that it consumes more than 1 GB of memory!
Oh, man, that really hurts! And we haven't even interacted with the report! So, let's see what we can do to optimize our model…
Rule #1 — Import only those columns you really need
The first and most important rule is: keep in your data model only those columns you really need for the report!
That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is a primary key from the source system? Both of these values are unique, so even if I need to count the total number of chats, I would still be fine with only one of them.
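Before dropping either column, it is worth confirming at the source that both really are unique; a quick sanity check in T-SQL (using the column names from this example) could look like this:
-- If both distinct counts equal the total row count, either column alone can be used to count chats
SELECT COUNT(*)                 AS totalRows
      ,COUNT(DISTINCT chatID)   AS distinctChatIDs
      ,COUNT(DISTINCT sourceID) AS distinctSourceIDs
FROM factChat;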

So, I'll remove the sourceID column and check how the file looks now:

By removing just one unnecessary column, we saved more than 100 MB! Let's examine further what can be removed without taking a deeper look (we will come back to this later, I promise).
Do we really need both the original start time of the chat and the UTC time, one stored as a Date/Time/Timezone type, the other as Date/Time, and both going down to the second level of precision?!
Let me get rid of the original start time column and keep only the UTC values.

Another 100 MB of wasted space gone! By removing just two columns we don't need, we reduced the size of our file by 30%!
Now, that was without even looking into the details of the memory consumption. Let's now fire up DAX Studio, my favorite tool for troubleshooting Power BI reports. As I already stressed a few times, this tool is a MUST if you plan to work seriously with Power BI — and it's completely free!
One of the features in DAX Studio is VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect to my pbix file with DAX Studio, here are the numbers related to my data model size:

Here I can see what the most expensive columns are in my data model and decide whether I can discard some of them, or whether I need to keep them all.
At first glance, I have a few candidates for removal — the sessionReferrer and referrer columns have high cardinality and therefore can't be optimally compressed. Moreover, as these are text columns and need to be encoded using the Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you can notice that these two columns take up almost 40% of my table size!
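If you prefer to query this metadata yourself, DAX Studio also lets you run the Analysis Services DMVs that VertiPaq Analyzer reads its numbers from. As a rough sketch (the exact set of returned columns may vary by version), the following returns per-column storage details, including dictionary sizes:
-- DMV query (run from DAX Studio or SSMS while connected to the Tabular model behind the pbix file)
SELECT *
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS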
After checking with my report users whether they need any of these columns, or maybe only one of them, I got confirmation that they don't perform any analysis on them. So, why on earth should we bloat our data model with them?!
Another strong candidate for removal is the LastEditDate column. This column just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users, and they didn't even know that this column existed!
I removed these three columns, and here is the result:

Oh my God, we halved the size of our data model just by removing a few unnecessary columns!
Truth be told, there are a few more columns that could be dismissed from the data model, but let's now focus on other techniques for data model optimization.
Rule #2 — Reduce the column cardinality!
As you may recall from the previous part of this article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data. Especially if we are not working with integer values.
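Since cardinality is the single most important factor, it pays off to measure it at the source before importing anything; a simple sketch, using two of the columns from this example:
-- Compare the cardinality of candidate columns before deciding how to optimize them
SELECT COUNT(DISTINCT chatID)         AS chatIDCardinality
      ,COUNT(DISTINCT datetmStartUTC) AS startTimeCardinality
FROM factChat;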
Let's take a deeper look at the VertiPaq Analyzer results:

As you can see, even though the chatID column has a higher cardinality than the datetmStartUTC column, it takes almost 7 times less memory! Since it is an integer surrogate key, VertiPaq applies Value encoding, and the size of the dictionary is irrelevant. On the other hand, Hash encoding is applied to the date/time column with high cardinality, so the dictionary size is significantly larger.
There are multiple ways of reducing column cardinality, such as splitting columns. Here are a few examples of using this technique.
For integer columns, you can split them into two columns using division and modulo operations. In our case, it would be:
SELECT chatID / 1000 AS chatID_div
      ,chatID % 1000 AS chatID_mod
      -- ...remaining columns...
FROM factChat
This optimization technique has to be performed on the source side (in this case, by writing a T-SQL statement). If we used calculated columns instead, there would be no benefit at all, since the original column would have to be stored in the data model first.
A similar technique can bring significant savings when you have decimal values in a column. You can simply split the values before and after the decimal point, as explained in this article.
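As a sketch of that decimal-splitting idea (assuming a hypothetical factSales table with a non-negative salesAmount column holding two decimal places), the two resulting whole-number columns have far lower cardinality than the original decimal one:
-- Hypothetical example: split a decimal value into two low-range integer columns
SELECT CAST(FLOOR(salesAmount) AS int)                       AS salesAmount_units   -- part before the decimal point
      ,CAST((salesAmount - FLOOR(salesAmount)) * 100 AS int) AS salesAmount_cents   -- part after the decimal point
FROM factSales;
A measure in the report then simply reassembles the original value as salesAmount_units + salesAmount_cents / 100.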
Since we don't have any decimal values, let's focus on our problem — optimizing the datetmStartUTC column. There are multiple valid options for optimizing this column. The first is to check whether your users need granularity higher than the day level (in other words, whether you can remove hours, minutes, and seconds from your data).
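If day-level granularity turns out to be enough, the change on the source side is trivial; a sketch (other columns omitted for brevity):
-- Keep only the date part of the start time, dropping hours, minutes, and seconds
SELECT chatID
      ,CAST(datetmStartUTC AS date) AS dateStartUTC
      -- ...the remaining factChat columns stay unchanged
FROM factChat;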
Let's check what savings this solution would bring:

The first thing we notice is that our file is now 271 MB, roughly a third of what we started with. VertiPaq Analyzer's results show that this column is now almost perfectly optimized, going from taking up 62% of our data model to just barely over 2.5%! That's huge!

However, it turned out that day-level grain was not fine enough, and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, which will also reduce the cardinality of the column.
So, I imported the values rounded to the hour:
SELECT chatID
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,topic
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat
It turned out that my users also didn't need the chatVariables column for analysis, so I removed it from the data model as well.
Finally, after disabling the Auto Date/Time option under Options for Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost a third of my table. And this is just a surrogate key, which is not used in any of the relationships within my data model.

So, here I examined two different solutions: the first was to simply remove this column and aggregate the number of chats, counting them with a GROUP BY clause. In the end, though, there was no benefit in keeping the chatID column at all, since it is not used anywhere in our data model. Once I removed it from the model, let's check the pbix file size one last time:

Recall the number we started at: 776 MB! So, I managed to reduce my data model size by almost 90%, by applying a few simple techniques that enabled the VertiPaq storage engine to perform more optimal compression of the data.
And this was a real use case, which I faced during the last year!
General rules for reducing the data model size
To conclude, here is the list of general rules you should keep in mind when trying to reduce the data model size:
- Keep only those columns your users need in the report! Just sticking to this one single rule will save you an incredible amount of space, I assure you…
- Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test…and if there is a significant benefit from, for example, splitting one column into two, or substituting a decimal column with two whole-number columns, then do it! But also keep in mind that your measures will need to be rewritten to handle these structural changes in order to display the expected results. So, if your table is not large, or if you would have to rewrite hundreds of measures, maybe it is not worth splitting the column. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
- The same goes for rows: keep only those rows you need. For example, maybe you don't need to import data from the last 10 years, but only 5! That will also reduce your data model size. Talk to your users and ask them what they really need before blindly putting everything into your data model
- Aggregate your data whenever possible! That means fewer rows and lower cardinality, so all the good things you are aiming for! If you don't need the hour, minute, or second level of granularity, don't import it (a sketch follows right after this list)! Aggregations in Power BI (and the Tabular model in general) are an important and wide topic, which is out of the scope of this series, but I strongly recommend you check Phil Seamark's blog and his series of posts on creative aggregation usage
- Avoid using DAX calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them in the Power Query Editor
- Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type; the Date data type will suffice)
- Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)
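As an illustration of the aggregation rule (and of pushing calculations to the source instead of using DAX calculated columns), here is a sketch of how a fact table like factChat could be pre-aggregated to the hour level, with a derived value computed on the SQL side; the column list is shortened and only serves as an example:
-- Pre-aggregate at the source and compute derived values there, instead of in DAX calculated columns
SELECT dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
      ,productID
      ,countryID
      ,COUNT(*)                    AS totalChats
      ,SUM(chatDurationSec) / 60.0 AS totalChatDurationMin   -- computed at the source, not in DAX
FROM factChat
GROUP BY dateadd(hour, datediff(hour, 0, datetmStartUTC), 0)
        ,productID
        ,countryID;
Fewer rows, lower cardinality, and no calculated columns in the model: exactly what VertiPaq likes.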
Conclusion
Now that you have learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this article by showing you a real-life example of how we can "help" VertiPaq (and, consequently, Power BI) deliver the best possible report performance and optimal resource consumption.
Thanks for reading, I hope you enjoyed the article!