    Inequality in Practice: E-commerce Portfolio Analysis | by Piotr Gruszecki | Jan, 2025



    From Mathematical Principle to Actionable Insights: A 6-Year Shopify Case Study

    Towards Data Science

    Image generated by DALL-E, based on the author’s prompt, inspired by “The Bremen Town Musicians”

    Are your top-selling products making or breaking your business?

    It’s terrifying to think your entire revenue could collapse if one or two products fall out of favor. Yet spreading too thin across hundreds of products often leads to mediocre results and brutal price wars.

    Discover how a 6-year Shopify case study uncovered the right balance between concentration and diversification.

    Why bother?

    Understanding concentration in your product portfolio is more than an intellectual exercise; it has a direct impact on critical business decisions. From inventory planning to marketing spend, knowing how your revenue is distributed among products shapes your approach.

    This post walks through practical methods for monitoring concentration, explaining what these measurements actually mean and how to get useful insights from your data.

    I’ll take you through fundamental metrics and more advanced analysis, together with interactive visualisations that bring the data to life.

    I’m also sharing chunks of the R code used in this analysis. Use it directly or adapt the logic to your preferred programming language.

    Looking at market analysis or investment theory, we often focus on concentration — how value is distributed across different components. In e-commerce, this translates into a fundamental question: How much of your revenue should come from your top products?

    Is it better to have a few strong sellers or a broad product range? This isn’t only a theoretical question …

    Having most of your revenue tied to a few products means your operations are streamlined and focused. But what happens when market preferences shift? Conversely, spreading revenue across hundreds of products may seem safer, but it often means you lack any real competitive advantage.

    So where’s the optimal point? Or rather, what’s the optimal range, and how do various ratios describe it?

    What makes this analysis particularly valuable is that it’s based on real data from a business that kept expanding its product range over time.

    On Datasets

    This analysis was done for a real US-based e-commerce store — one of our clients, who kindly agreed to share their data for this article. The data spans six years of their growth, giving us a rich view of how product concentration evolves as a business matures.

    While working with actual business data gives us genuine insights, I’ve also created a synthetic dataset in one of the later sections. This small, artificial dataset helps illustrate the relationships between various ratios in a more controlled setting — showing patterns you can count on your fingers.

    To be clear: this synthetic data was created entirely from scratch and only loosely mimics general patterns seen in real e-commerce — it has no direct connection to our client’s actual data. That is different from my earlier article, where I generated synthetic data based on real patterns using Snowflake functionality.

    Data Export

    The main analysis draws from real data, but that small synthetic dataset serves an important purpose — it helps explain relationships between various ratios in a way that’s easy to grasp. And trust me, having such a micro dataset with clear visuals comes in really handy when explaining complex dependencies to stakeholders 😉

    The raw transaction export from Shopify contains everything we need, but we must arrange it properly for concentration analysis. The export contains all the products for each transaction, but the date sits in only one row per transaction, so we must propagate it to all products while retaining the transaction id. Probably not for the first iteration of the study, but if we want to fine-tune it, we should consider how to handle discounts, returns, and so on. In the case of international sales, conduct both a global and a country-specific study.

    We have a product name and an SKU, both of which should follow some naming convention and logic when dealing with variants. If we have a master catalogue with all of these descriptions and codes, we are very fortunate. If you have it, use it, but check it against the ‘ground truth’ of actual transaction data.

    Product Variants

    In my case, the product names were structured as a base name and a variant separated by a dash. Very simple to use, split into main product and variants. Exceptions? Of course, they are always present, especially when dealing with 6 years of highly successful ecommerce data :). For instance, some names (e.g. “All-Purpose”) included a dash, while others didn’t. Then, some had variants, while others didn’t. So expect some tweaks here, but this is a critical stage.

    Number of unique products, with and without variants — all charts rendered by the author, with own R code

    If you’re wondering why we need to exclude variants from the concentration analysis, the figure above illustrates it clearly. The values are significantly different, and we would expect radically different results if we analysed concentration with variants included.

    The analysis is based on transactions, counting the number of products with/without variants in a given month. But if we have a lot of variants, not all of them will be present in one month’s transactions. Yes, that’s correct — so let us consider a larger time range, one year.

    Products and their variants, by transaction date, yearly

    I calculated the number of variants per base product in a calendar year, based on what we have in transactions. The number of variants per base product is divided into several bins. Let’s take the year 2024. The plot shows that we have roughly 170 base items, with less than half having just one variant (light green bar). However, the other half had more than one version, and what’s noteworthy (and, I believe, non-obvious, unless you work in apparel ecommerce) is that we have products with a really large number of variants. The black bin contains items that come in 100 or more different variants.

    If you guessed that they were growing their offering by introducing new products while keeping old ones available, you are correct. But wouldn’t it be interesting to know whether the variants stem from heritage or new products? What if we included only products launched in the current year? We can check that by using the date of product introduction rather than transaction dates. Because our only dataset is a transaction dump, the first transaction for each product is taken as the introduction date. And for each product, we take all variants that appeared in transactions, with no time constraints (from product introduction to the most recent record).

    Products and their variants, by product introduction date, yearly

    Now let’s put these two plots side by side for easy comparison. Taking transaction dates, we have more products in each year, and the difference grows — since there are also transactions with products launched previously. No surprises here, as expected. If you were wondering why the data for 2019 differ — good catch. In fact, the shop started operating in 2018, but I removed those few initial months; still, it is their influence that makes the difference in 2019.

    Product variants and their impact on revenue are not our focus in this article. But as is often the case in real analysis, there are ‘branching’ decisions as we progress, even in the initial phase. We haven’t even finished data preparation, and it’s already getting interesting.

    Products, their variants, by product introduction and transaction, yearly
    Same data as above, side by side per bin

    Understanding the product structure is essential for conducting meaningful concentration analyses. Now that our data is correctly formatted, we can examine actual concentration measurements and what they reveal about ideal portfolio structure. In the following part, we’ll look at these measurements and what they mean for e-commerce businesses.

    When it comes to measuring concentration, economists and market analysts have done the heavy lifting for us. Over decades of research into markets, competitiveness, and inequality, they have produced powerful analytical methods that have proven useful in a variety of sectors. Rather than inventing novel metrics for e-commerce portfolio analysis, we can use existing, time-tested methods.

    Let’s see how these theoretical frameworks can shed light on practical e-commerce questions.

    Herfindahl-Hirschman Index

    HHI (Herfindahl-Hirschman Index) is probably the most common way to measure concentration. Regulators use it to check whether a market is becoming too concentrated — they take the percentage market share of each company, square it, and add them up. Simple as that. The result can be anywhere from nearly 0 (many small players) to 10,000 (one company takes it all).

    Why use HHI for e-commerce portfolio analysis? The logic is simple — instead of companies competing in a market, we have products competing for revenue. The math works exactly the same way — we take each product’s share of total revenue, square it, and sum up. High HHI means revenue depends on a few products, while low HHI shows revenue is spread across many products. This gives us a single number to track portfolio concentration over time.
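
    To make the arithmetic concrete, here is a minimal sketch of that calculation in R, assuming a monthly revenue-per-product table like the x3a_dt built later in this article (columns month and revenue):

    #-- HHI per month: each product's percentage share of monthly revenue, squared and summed
    library(data.table)
    hhi_dt <- x3a_dt[, .(hhi = sum((100 * revenue / sum(revenue))^2)), by = month]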

    HHI, and Products for context

    Pareto

    Who has not heard of Pareto’s rule? In 1896, Italian economist Vilfredo Pareto observed that 20% of the population held 80% of Italy’s land. Since then, this pattern has been found in a variety of fields, including wealth distribution and retail sales.

    While popularly known as the “80/20 rule,” the Pareto principle is not restricted to those figures. We can use any x-axis criterion (for example, the top 30% of products) to determine the corresponding y value (revenue contribution). The Lorenz curve, formed by linking these points, provides a complete picture of concentration.

    Pareto lines for different revenue share thresholds

    The chart above shows how many products we need to achieve a certain revenue share (of the monthly revenue). I took arbitrary cuts at .2, .3, .5, .8, .95, and of course also included 1 — which means the total number of products contributing to 100% of revenue in a given month.

    Lorenz curve

    If we sort products by their revenue contribution and chart the line, we get the Lorenz curve. On both axes we have percentages: of products and of their revenue share. In the case of a perfectly uniform revenue distribution we would get a straight line, while in the case of “perfect concentration” a very steep curve, climbing close to 100% of revenue and then turning sharply to the right to include some residual revenue from the other products.

    Lorenz curve

    It’s interesting to see that line, but typically it will look rather similar, like a “bent stick”. So let us now compare these lines for a few previous months, and also a few years back (sticking to October). The monthly lines are quite similar, and if you think — it would be nice to have some interactivity on this plot — you are absolutely right. The yearly comparison shows more differences (we still use monthly data, taking October of each year), and this is understandable, since these measurements are more distant in time.

    Lorenz curve — comparing periods

    So we do see differences between the lines, but can’t we quantify them somehow, so as not to rely solely on visual similarity? Definitely, and there is a ratio for that — the Gini ratio. And by the way, we will have plenty of ratios in the following chapters.

    Gini Ratio

    To translate the shape of the Lorenz curve into a numeric value, we can use the Gini ratio — defined as a ratio between two areas, above and below the equality line. On the plot below it is the ratio between the dark and light blue areas.
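
    For intuition, the Gini value can also be computed by hand from the sorted revenues. A minimal sketch (my own helper, not code from the original analysis) using the classical formula, which should match ineq::Gini() on the same vector:

    #-- Gini from a vector of product revenues
    gini_manual <- function(revenue) {
      x <- sort(revenue)                                    # ascending, as on a Lorenz curve
      n <- length(x)
      2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n  # 0 = perfect equality, ~1 = full concentration
    }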

    Gini ratio visualization

    Let us then visualize it for two periods — October 2019 and October 2024, the very same periods as on one of the plots before.

    Gini ratio, comparing two periods

    Once we have a good understanding, with visuals, of how the Gini ratio is calculated, let’s plot it over the whole period.

    I use R for the analysis, so the Gini ratio is readily available (as are other ratios, which I’ll show later). The initial data table (x3a_dt) contains revenue per product, per month. The resulting one has the Gini ratio per month.

    #-- calculate Gini ratio, monthly
    library(data.table)
    library(ineq)
    x3a_ineq_dt <- x3a_dt[, .(gini = ineq::ineq(revenue, type = "Gini")), month]

    Nice that we have all these packages for the heavy lifting. The math behind it is not terribly complicated, but our time is precious.

    The plot below shows the result of the calculations.

    Gini over time

    I haven’t included a smoothing line with its confidence interval band, since we don’t have raw measurement points but the result of a Gini calculation, which has its own error distribution. To be very strict and precise about the math, we would have to calculate that confidence interval and base the smoothed line on it. The results are below.

    Gini over time, with trend line

    Since we don’t directly use the statistical significance of the calculated ratio, this very strict approach is a little bit of an overkill. I haven’t done it while charting the trend line for HHI, nor will I in subsequent plots. But it’s good to be aware of this nuance.
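
    For completeness, one way to get such a per-month confidence interval is to bootstrap the product revenues. This is only a sketch of the idea (my own addition, not part of the original study), again assuming the x3a_dt table with revenue per product and month:

    #-- bootstrap a 95% confidence interval for the monthly Gini ratio
    set.seed(42)
    gini_ci_dt <- x3a_dt[, {
      boots <- replicate(1000, ineq::ineq(sample(revenue, replace = TRUE), type = "Gini"))
      .(gini = ineq::ineq(revenue, type = "Gini"),
        gini_lo = quantile(boots, .025),
        gini_hi = quantile(boots, .975))
    }, by = month]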

    We have seen two ratios so far — HHI and Gini — and they are far from identical. A Lorenz curve closer to the diagonal indicates a more uniform distribution, which is what we have for October 2019, yet the HHI is higher than for 2024, indicating more concentration in 2019. Maybe I made a mistake in the calculations, or even worse, early on during data preparation? That would be really unfortunate. Or is the data fine, and we are struggling with the proper interpretation?

    I quite often have moments of such doubt, especially when moving through an analysis really fast. So how do we deal with that, tightening our grip on the data and our understanding of the dependencies? Remember that whatever analysis you do, there is always a first time. And quite often we don’t have the luxury of ‘leisure’ research; more often it is already work for a Client (or a superior, a stakeholder, whoever requested it — even ourselves, if it is our own initiative).

    We need a good understanding of how to interpret all these ratios, including the dependencies between them. If you plan to present your results to others, questions here are guaranteed, so better to be well prepared. We can work with an existing dataset, or we can generate a small set where it will be easier to catch the dependencies. Let us follow the latter approach.

    Let us start by creating a small dataset,

    library(data.table)

    #-- Create sample revenue data
    revenue <- list(
    "2021" = rep(15, 10), # 10 values of 15
    "2022" = c(rep(100, 5), rep(10, 25)), # 5 values of 100, 25 values of 10
    "2023" = rep(25, 50), # 50 values of 25
    "2024" = c(rep(100, 30), rep(10, 70)) # 30 values of 100, 70 values of 10
    )

    combining it into a data.table.

    #-- Convert to data.table in one step
    x_dt <- data.table(
    year = rep(names(revenue), sapply(revenue, length)),
    revenue = unlist(revenue)
    )

    A quick overview of the data.

    Example dataset

    It seems we have what we needed — a simple dataset, but still quite realistic. Now we proceed with calculations and charts, similar to what we did for the real dataset before.

    #-- HHI, Gini
    xh_dt <- x_dt[, .(hhi = ineq::Herfindahl(revenue),
    gini = ineq::Gini(revenue)), year]
    #-- Lorenz
    xl_dt <- x_dt[order(-revenue), .(
    cum_prod_pct = seq_len(.N)/.N,
    cum_rev_pct = cumsum(revenue)/sum(revenue)), year]

    And we render the plots.

    Ratios comparison

    These charts help a lot in understanding the ratios, the relations between them, and how they relate to the data. It’s always a good idea to have such a micro analysis at hand, for ourselves and for stakeholders — as ‘back pocket’ slides, or even shared upfront.

    Nerdy detail — how to slightly shift a line so it doesn’t overlap, and add labels inside the plot? Render the plot, and then do manual fine-tuning, expecting a few iterations.

    #-- shift the line
    xl_dt[year == "2021", `:=` (cum_rev_pct = cum_rev_pct - .01)]

    For labelling I use ggrepel, but by default it will label all the points, while we need just one per line. And in addition we have to decide which one, for a good-looking chart.

    #-- decide which points to label
    labs_key2_dt <- data.table(
    year = c("2021", "2022", "2023", "2024"), position = c(4, 5, 25, 30))

    #-- set keys
    list(xl_dt, labs_key2_dt) |> lapply(setkey, year)

    #-- join
    label_positions2 <- xl_dt[
    labs_key2_dt, on = .(year), # join on 'year'
    .SD[get('position')], # use get('position') to reference the position from labs_key2_dt
    by = .EACHI] # for each year

    Render the plot.

    #-- render plot
    plot_22b <- xl_dt |>
    ggplot(aes(cum_prod_pct, cum_rev_pct, colour = year, group = year, label = year)) +
    geom_line(linewidth = .2) +
    geom_point(alpha = .8, shape = 21) +
    theme_bw() +
    scale_color_viridis_d(option = "H", begin = 0, end = 1) +
    ggrepel::geom_label_repel(
    data = label_positions2, force = 10,
    box.padding = 2.5, point.padding = .3,
    seed = 3, direction = "x") +
    ... more styling

    I started with HHI, the Lorenz curve, and the accompanying Gini ratios, as they seemed good starting points for concentration and inequality measurements. However, there are numerous other ratios used to describe distributions, whether for inequality or in general. It’s unlikely that we would use all of them at once, so pick the subset that provides the most insight for your specific problem.

    With a proper dataset structure, it’s quite easy to calculate them. I’m sharing code snippets with a number of ratios calculated monthly. We use a dataset we already have — monthly revenue per product (base products, excluding variants).

    Starting with ratios from the ineq package.

    #---- inequality ----
    x3_ineq_dt <- x3a_dt[, .(
    # Classical inequality/concentration measures
    gini = ineq::ineq(revenue, type = "Gini"), # Gini coefficient
    hhi = ineq::Herfindahl(revenue), # Herfindahl-Hirschman Index
    hhi_f = sum((rev_pct*100)^2), # HHI - formula
    atkinson = ineq::ineq(revenue, type = "Atkinson"), # Atkinson index
    theil = ineq::ineq(revenue, type = "Theil"), # Theil entropy index
    kolm = ineq::ineq(revenue, type = "Kolm"), # Kolm index
    rs = ineq::ineq(revenue, type = "RS"), # Ricci-Schutz index
    entropy = ineq::entropy(revenue), # Entropy measure
    hoover = mean(abs(revenue - mean(revenue)))/(2 * mean(revenue)), # Hoover (Robin Hood) index

    Distribution shape, and top/bottom shares and ratios.

     # Distribution shape measures
    cv = sd(revenue)/mean(revenue), # Coefficient of Variation
    skewness = moments::skewness(revenue), # Skewness
    kurtosis = moments::kurtosis(revenue), # Kurtosis

    # Ratio measures
    p90p10 = quantile(revenue, 0.9)/quantile(revenue, 0.1), # P90/P10 ratio
    p75p25 = quantile(revenue, 0.75)/quantile(revenue, 0.25), # Interquartile ratio
    palma = sum(rev_pct[1:floor(.N*.1)])/sum(rev_pct[floor(.N*.6):(.N)]), # Palma ratio

      # Concentration ratios and shares
    top1_share = max(rev_pct), # Share of top product
    top3_share = sum(head(sort(rev_pct, decreasing = TRUE), 3)), # CR3
    top5_share = sum(head(sort(rev_pct, decreasing = TRUE), 5)), # CR5
    top10_share = sum(head(sort(rev_pct, decreasing = TRUE), 10)), # CR10
    top20_share = sum(head(sort(rev_pct, decreasing = TRUE), floor(.N*.2))), # Top 20% share
    mid40_share = sum(sort(rev_pct, decreasing = TRUE)[floor(.N*.2):floor(.N*.6)]), # Middle 40% share
    bottom40_share = sum(tail(sort(rev_pct), floor(.N*.4))), # Bottom 40% share
    bottom20_share = sum(tail(sort(rev_pct), floor(.N*.2))), # Bottom 20% share

    Basic statistics, quantiles.

     # Basic statistics
    unique_products = .N, # Number of unique products
    revenue_total = sum(revenue), # Total revenue
    mean_revenue = mean(revenue), # Mean revenue per product
    median_revenue = median(revenue), # Median revenue
    revenue_sd = sd(revenue), # Revenue standard deviation

    # Quantile values
    q20 = quantile(revenue, 0.2), # 20th percentile
    q40 = quantile(revenue, 0.4), # 40th percentile
    q60 = quantile(revenue, 0.6), # 60th percentile
    q80 = quantile(revenue, 0.8), # 80th percentile

    Count measures.

     # Count measures
    above_mean_n = sum(revenue > mean(revenue)), # Number of products above the mean
    above_2mean_n = sum(revenue > 2*mean(revenue)), # Number of products above 2x mean
    top_quartile_n = sum(revenue > quantile(revenue, 0.75)), # Number of products in the top quartile
    zero_revenue_n = sum(revenue == 0), # Number of products with zero revenue
    within_1sd_n = sum(abs(revenue - mean(revenue)) <= sd(revenue)), # Products within 1 SD
    within_2sd_n = sum(abs(revenue - mean(revenue)) <= 2*sd(revenue)), # Products within 2 SD

    Revenue above (or below) a threshold.

      # Revenue above threshold
    rev_above_mean = sum(revenue[revenue > mean(revenue)]) # Revenue from products above the mean
    ), month]

    The resulting table has 40 columns and 72 rows (months).

    As mentioned earlier, it’s hard to imagine anyone working with 40 ratios at once, so I’m rather showing a way to calculate them; you should pick the relevant ones. As always, it’s good to visualise them and see how they relate to one another.

    Selected ratios over time

    We can calculate the correlation matrix between all ratios, or a chosen subset.

    # Choose key metrics for a clearer visualization
    key_metrics <- c("gini", "hhi", "atkinson", "theil", "entropy", "hoover",
    "top1_share", "top3_share", "top5_share", "unique_products")

    cor_matrix <- x3_ineq_dt[, .SD, .SDcols = key_metrics] |> cor()

    Change column names to friendlier ones.

    # Make variable names more readable
    pretty_names <- c(
    "Gini", "HHI", "Atkinson", "Theil", "Entropy", "Hoover",
    "Top 1%", "Top 3%", "Top 5%", "Products"
    )
    colnames(cor_matrix) <- rownames(cor_matrix) <- pretty_names

    And render the plot.

    corrplot::corrplot(cor_matrix,
    type = "upper",
    method = "color",
    tl.col = "black",
    tl.srt = 45,
    diag = F,
    order = "AOE")
    Correlation matrix, selected ratios

    And then we can plot some interesting pairs. Of course, some of them have a positive or negative correlation by definition, while in other cases it isn’t that obvious.

    Selected ratios, with negative correlation

    We started the analysis with ratios and the Lorenz curve as a top-down overview. It’s a good start, but there are two complications — the ratios have a relatively broad range when the business is doing okay, and there is hardly any connection to actionable insights. Even if we find that a ratio is on the edge, or outside of the safe range, it is unclear what we should do. And instructions like “decrease concentration” are rather ambiguous.

    E-commerce talks and breathes products, so to make the analysis relatable we need to refer to particular products. People would also like to understand which products constitute the core 50% or 80% of revenue, and, equally important, whether those products stay consistently among the top contributors.

    Let us take one month, August 2024, and see which products contributed 50% of the revenue in that month. Then we check the revenue from those exact products in other months. There are 5 products generating (at least) 50% of the revenue in August.
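
    A minimal sketch of that selection logic, assuming the monthly revenue-per-product table x3a_dt built in the appendix (the actual implementation uses keys and rolling joins, shown later; this simple filter is just to illustrate the idea):

    #-- products that together cover (at least) 50% of August 2024 revenue
    core_prods <- x3a_dt[month == as.Date("2024-08-01")][order(-revenue)
      ][(cumsum(revenue) - revenue) / sum(revenue) < .5, base_product]

    #-- revenue of exactly those products in every month
    x3a_dt[base_product %in% core_prods, .(month, base_product, revenue)]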

    Products revenue, facets by product

    We can also render a more visually appealing plot with a streamgraph. Both plots show the very same dataset, but they complement each other nicely — bar plots for precision, the streamgraph for a story.

    Products revenue, stream graph

    The pink line indicates the chosen month. If you feel the “itch” to shift that line, like on an old-fashioned radio, you are absolutely right — this should be an interactive chart, and it actually is, including a slider for the revenue share percentage (we produced it for a Client).

    So what if we shift that pink ‘tuning line’ a little bit backwards, maybe to 2020? The logic in the data preparation is very similar — get products contributing to a certain revenue share threshold, and check the revenue from these products in other months.

    Products revenue, stream graph

    With interactivity on two elements — revenue contribution share and the date — one can learn a lot about the business, and that is exactly the point of these charts. One can look from different angles:

    • concentration: how many products do we need for a certain revenue threshold,
    • the products themselves: do they stay in a certain revenue contribution bin, or do they change, and why? Is it seasonality, a valid replacement, a lost supplier or something else?
    • time window: whether we look at one month or a whole year,
    • seasonality: comparing a similar time of year with previous periods.

    What the Data Tells Us

    Our 6-year dataset reveals the evolution of an e-commerce business from high concentration to balanced growth. Here are the key patterns and lessons:

    With 6 years of data, I had a unique chance to watch concentration metrics evolve as the business grew. Starting with only a handful of products, I saw exactly what you’d expect — sky-high concentration. But as new products entered the mix, things got more interesting. The business found its rhythm with a dozen or so top performers, and the HHI settled into a comfortable 700–800 range.

    Here’s something fascinating I discovered: concentration and inequality might sound like twins, but they’re more like distant cousins. I noticed this while comparing HHI against Lorenz curves and their Gini ratios. Trust me, you’ll want to get comfortable with the math before explaining these patterns to stakeholders — they can smell uncertainty from a mile away.

    Want to really understand these metrics? Do what I did: create a dummy dataset so simple it’s almost embarrassing. I’m talking basic patterns that a fifth-grader could grasp. Sounds like overkill? Maybe, but it saved me countless hours of head-scratching and misinterpretation. Keep those examples in your back pocket — or better yet, share them upfront. Nothing builds confidence like showing you’ve done your homework.

    Look, calculating these ratios isn’t rocket science. The real magic happens when you dig into how each product contributes to your revenue. That’s why I added the “show me the money” section — I don’t believe in quick fixes or magic formulas. It’s about rolling up your sleeves and understanding how each product really behaves.

    As you’ve probably noticed yourself, the streamgraphs I showed you are practically begging for interactivity. And boy, does that add value! Once you’ve got your keys and joins sorted out, it’s not even that complicated. Give your users an interactive tool, and suddenly you’re not drowning in one-off questions anymore — they’re discovering insights themselves.

    Here’s a pro tip: use this concentration analysis as your foot in the door with stakeholders. Show your product teams that streamgraph, and I guarantee their eyes will light up. When they start asking for interactive versions, you’ve got them hooked. The best part? They’ll think it was their idea all along. That’s how you get real adoption — by letting them discover the value themselves.

    Data Engineering Takeaways

    While we usually know roughly what to expect in a dataset, it’s almost guaranteed that there will be some nuances, exceptions, or maybe even surprises. It’s good to spend some time reviewing the dataset, using dedicated functions (like str or glimpse in R), looking for empty fields and outliers, but also simply scrolling through it to understand the data. I like comparisons, and in this case I would compare it to smelling the fish at the market before jumping in to prepare sushi 🙂

    Then, if we work with a raw data export, there will quite likely be a lot of columns in the dump; after all, if we click ‘export all’, wouldn’t we expect exactly that? For most analyses we will need only a subset of these columns, so it’s good to trim and keep only what we need. I assume we work with a script, so if it turns out we need more, it’s not an issue — just add the missing column and rerun that chunk.

    In the dataset dump there was a timestamp in one row per transaction, while we needed it for every product. Hence some light data wrangling to propagate these timestamps to all the products.

    After cleaning the dataset, it’s essential to consider the context of the analysis, including the questions to be answered and the necessary adjustments to the data. This “contextual cleaning/wrangling” is crucial, as it determines whether the analysis succeeds or fails. In our scenario, the goal was to analyse product concentration, therefore filtering out variants (size, colour, etc.) was essential. If we had skipped that, the outcome would have been radically different.

    Quite often we can expect some “traps”, where initially it seems we can apply a simple approach, while actually we should add a bit of sophistication. For example — the Lorenz curve, where we need to calculate how many products we need to reach a certain revenue threshold. This is where I use rolling joins, which fit here perfectly.

    The core logic behind the streamgraphs is to find the products which constitute a certain revenue share in a given month, then “freeze” them and get their revenue in the other months. The toolset I used was adding an extra column with a product number, after sorting per month, and then playing with keys and joins.

    An important element of this analysis was adding interactivity, allowing users to play with some parameters. That raises the bar, as we need all these operations to be performed lightning fast. The components we need are the right data structure, extra columns, and proper keys and joins. Prepare as much as possible, precalculating in a data warehouse, so the dashboarding tool is not overloaded. Take caching into consideration.

    How to Start?

    Strike a balance between delivering what stakeholders request and exploring potentially valuable insights they haven’t asked for yet. The analysis I presented follows this pattern — getting the initial concentration ratios is easy, while building an interactive streamgraph optimized for lightning-fast operation requires significant effort.

    Start small and engage others. Share basic findings, discuss what you could learn together, and only proceed with the more labor-intensive analysis once you’ve secured genuine interest. And always keep a solid grip on your raw data — it’s invaluable for answering those inevitable ad-hoc questions quickly.

    Building a prototype before going to full production allows you to validate interest and gather feedback without devoting too much time. In my case, such simple concentration ratios sparked debates that eventually led to the more advanced interactive study on which stakeholders rely today.

    Start small, secure genuine interest … :-)) / image generated by DALL-E, based on the author’s prompt.

    I’ll show you how I prepared the data at each step of this analysis. Since I used R, I’ll include the actual code snippets — they’ll help you get started faster, even if you’re working in a different language. This is the code I used for the study, though you’ll probably need to adapt it to your specific needs rather than just copying it over. I decided to keep the code separate from the main analysis, to make it more streamlined and readable for both technical and business users.

    While I’m presenting an analysis based on a Shopify export, there is no limitation to a particular platform; we just need transaction data.

    Shopify export

    Let’s start by getting our data from Shopify. The raw export needs some work before we can dive into concentration analysis — here’s what I had to deal with first.

    We start by exporting the raw transaction data from Shopify. It may take some time, and when it’s ready, we get an e-mail with links to download.

    #-- 0. libs
    pacman::p_load(data.table)

    #-- 1.1 load data; the csv files are what we get as a full export from Shopify
    xs1_dt <- fread(file = "shopify_raw/orders_export_1.csv")
    xs2_dt <- fread(file = "shopify_raw/orders_export_2.csv")
    xs3_dt <- fread(file = "shopify_raw/orders_export_3.csv")

    Once we have the data, we need to combine these files into one dataset, trim the columns and perform some cleaning.

    #-- 1.2 check all columns, limit them to the essential ones (for this analysis) and bind into one data.table
    xs1_dt |> colnames()
    # there are 79 columns in the full export,
    # so we select a subset, relevant for this analysis
    sel_cols <- c("Name", "Email", "Paid at", "Fulfillment Status", "Accepts Marketing", "Currency", "Subtotal",
    "Lineitem quantity", "Lineitem name", "Lineitem price", "Lineitem sku", "Discount Amount",
    "Billing Province", "Billing Country")

    #-- combine into one data.table, with a subset of columns
    xs_dt <- data.table::rbindlist(l = list(xs1_dt, xs2_dt, xs3_dt),
    use.names = T, fill = T, idcol = T) %>% .[, ..sel_cols]

    Some data preparation.

    #-- 2. data prep
    #-- 2.1 replace spaces in column names, for easier handling
    sel_cols_new <- sel_cols |> stringr::str_replace(pattern = " ", replacement = "_")
    setnames(xs_dt, old = sel_cols, new = sel_cols_new)

    #-- 2.2 transaction as integer
    xs_dt[, `:=` (Transaction_id = stringr::str_remove(Name, pattern = "#") |> as.integer())]

    Anonymize the emails, as we don’t need/want to deal with real email addresses during the analysis.

    #-- 2.3 anonymize e-mail 
    new_cols <- c("Email_hash")
    xs_dt[, (new_cols) := .(digest::digest(Email, algo = "md5")), .I]

    Change column types; this depends on personal preference.

    #-- 2.4 change Accepts_Marketing to logical column
    xs_dt[, `:=` (Accepts_Marketing_lgcl = fcase(
    Accepts_Marketing == "yes", TRUE,
    Accepts_Marketing == "no", FALSE,
    default = NA))]

    Now we focus on the transactions dataset. In the export files, the transaction number and timestamp appear in only one row per basket. We need to get these timestamps and propagate them to all items.

    #-- 3 transactions dataset
    #-- 3.1 subset transactions
    #-- limit columns to those essential for the transaction only
    trans_sel_cols <- c("Transaction_id", "Email_hash", "Paid_at",
    "Subtotal", "Currency", "Billing_Province", "Billing_Country")

    #-- get the transactions table based on the requirement of a non-null payment - as the payment (date, amount) is not present for all products, it appears only once per basket
    xst_dt <- xs_dt[!is.na(Paid_at) & !is.na(Transaction_id), ..trans_sel_cols]

    #-- date columns
    xst_dt[, `:=` (date = as.Date(`Paid_at`))]
    xst_dt[, `:=` (month = lubridate::floor_date(date, unit = "months"))]

    Some extra information — derivatives, as I call them.

    #-- 3.2 is the user returning? their n-th transaction
    setkey(xst_dt, Paid_at)
    xst_dt[, `:=` (tr_n = 1)][, `:=` (tr_n = cumsum(tr_n)), Email_hash]

    xst_dt[, `:=` (returning = fcase(tr_n == 1, FALSE, default = TRUE))]

    Do we have any NA’s in the dataset?

    xst_dt[!complete.cases(xst_dt), ]

    Products dataset.

    #-- 4 products dataset
    #-- 4.1 subset of columns
    sel_prod_cols <- c("Transaction_id", "Lineitem_quantity", "Lineitem_name",
    "Lineitem_price", "Lineitem_sku", "Discount_Amount")

    Now we join these two datasets, to have the transaction characteristics (trans_sel_cols) for all the products.

    #-- 5 join the two datasets
    list(xs_dt, xst_dt) |> lapply(setkey, Transaction_id)
    x3_dt <- xs_dt[, ..sel_prod_cols][xst_dt]

    Let’s check which columns we have in the x3_dt dataset.

    And it’s also a good moment to examine the dataset.

    x3_dt |> str()
    x3_dt |> dplyr::glimpse()
    x3_dt |> head()

    Time for data cleaning. First up: splitting Lineitem_name into base products and their variants. In theory, these are separated by a dash (“-”). Simple, right? Not quite — some product names, like ‘All-Purpose’, contain dashes as part of their name. So we need to handle these special cases first, temporarily replacing the problematic dashes, doing the split, and then restoring the original product names.

    #-- 6. cleaning, aggregation on product names
    #-- 6.1 split product name into base and variants
    #-- split product names into core and variants
    product_cols <- c("base_product", "variants")
    #-- with special treatment for 'All-Purpose'
    x3_dt[stringr::str_detect(string = Lineitem_name, pattern = "All-Purpose"),
    (product_cols) := {
    tmp = stringr::str_replace(Lineitem_name, "All-Purpose", "AllPurpose")
    s = stringr::str_split_fixed(tmp, pattern = "[-/]", n = 2)
    s = stringr::str_replace(s, "AllPurpose", "All-Purpose")
    .(s[1], s[2])
    }, .I]

    It’s good to validate after each step.

    # validation
    x3_dt[stringr::str_detect(
    string = Lineitem_name, pattern = "All-Purpose"), .SD,
    .SDcols = c("Transaction_id", "Lineitem_name", product_cols)]

    We keep moving with the data cleaning — the exact steps depend of course on the particular dataset, but I share my flow as an example.

    #-- two conditions, to handle `(32-ounce)` in a product name; we don't want that hyphen to cut the name
    x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = "ounce", negate = T) &
    stringr::str_detect(string = `Lineitem_name`, pattern = "All-Purpose", negate = T),
    (product_cols) := {
    s = stringr::str_split_fixed(string = `Lineitem_name`, pattern = "[-/]", n = 2); .(s[1], s[2])
    }, .I]

    x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = "ounce", negate = F) &
    stringr::str_detect(string = `Lineitem_name`, pattern = "All-Purpose", negate = T),
    (product_cols) := {
    s = stringr::str_split_fixed(string = `Lineitem_name`, pattern = ") - ", n = 2); .(paste0(s[1], ")"), s[2])
    }, .I]

    #-- small patch for exceptions
    x3_dt[stringr::str_detect(string = base_product, pattern = "\\)\\)$", negate = F),
    base_product := stringr::str_replace(string = base_product, pattern = "\\)\\)$", replacement = ")")]

    Validation.

    # validation
    x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = "ounce")
    ][, .SD, .SDcols = c(eval(sel_cols[6]), product_cols)
    ][, .N, c(eval(sel_cols[6]), product_cols)]

    x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = "All")
    ][, .SD, .SDcols = c(eval(sel_cols[6]), product_cols)
    ][, .N, c(eval(sel_cols[6]), product_cols)]

    x3_dt[stringr::str_detect(string = base_product, pattern = "All")]

    We use eval(sel_cols[6]) to get the name of the column sel_cols[6], which is Currency.

    We also need to deal with NA’s, but with an understanding of the dataset — where we may have NA’s and where they are not supposed to be, indicating an issue. In some columns, like `Discount_Amount`, we have values (actual discounts), zeros, but also sometimes NA’s. Checking the final price, we conclude they are zeros.

    #-- deal with NA's - replace them with 0
    sel_na_cols <- c("Discount_Amount")
    x3_dt[, (sel_na_cols) := lapply(.SD, fcoalesce, 0), .SDcols = sel_na_cols]

    For consistency and convenience, we change all column names to lowercase.

    setnames(x3_dt, tolower(names(x3_dt)))

    And verification.

    Of course, review the dataset, with some test aggregations, and also by simply printing it out.

    Save the dataset as both Rds (native R format) and csv.

    x3_dt |> fwrite(file  = "data/products.csv")
    x3_dt |> saveRDS(file = "data/x3_dt.Rds")

    Completing the steps above, we should have a clean dataset for further analysis. The code should serve as a guideline, but it can also be used directly, if you work in R.

    Variants

    As a first glimpse, we will check the number of products per month, both base_product and including all variants.

    As a small cleanup, I take only full months.

    month_last <- x3_dt[, max(month)] - months(1)

    Then we count the monthly numbers, storing them in temporary tables, which are then joined.

    x3_a_dt <- x3_dt[month <= month_last, .N, .(base_product, month)
    ][, .(base_products = .N), keyby = month]

    x3_b_dt <- x3_dt[month <= month_last, .N, .(lineitem_name, month)
    ][, .(products = .N), keyby = month]

    x3_c_dt <- x3_a_dt[x3_b_dt]

    Some data wrangling.

    #-- names, as we want them on the plot
    setnames(x3_c_dt, old = c("base_products", "products"), new = c("base", "all, with variants"))

    #-- long form
    x3_d_dt <- x3_c_dt[, melt.data.table(.SD, id.vars = "month", variable.name = "Products")]

    #-- reverse factors, so they appear on the plot in the proper order
    x3_d_dt[, `:=` (Products = forcats::fct_rev(Products))]

    We’re ready to plot the dataset.

    plot_01_w <- x3_d_dt |>
    ggplot(aes(month, value, colour = Products, fill = Products)) +
    geom_line(show.legend = FALSE) +
    geom_area(alpha = .8, position = position_dodge()) +
    theme_bw() +
    scale_fill_viridis_d(direction = -1, option = "G", begin = 0.3, end = .7) +
    scale_color_viridis_d(direction = -1, option = "G", begin = 0.3, end = .7) +
    labs(x = "", y = "Products",
    title = "Unique products, monthly", subtitle = "Impact of aggregation") +
    theme(... more styling)

    The next plot shows the number of variants grouped into bins. This gives us a chance to talk about chaining operations in R, particularly with the data.table package. In data.table, we can chain operations by opening a new bracket right after closing one — resulting in the ][ syntax. It creates a compact, readable chain that’s still easy to debug, since you can execute it piece by piece. I prefer succinct code, but that’s just my style — use whatever approach works best for you. We can write the code in one line, or multi-line, with logical steps, as in the toy example below.
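
    A toy illustration of the ][ chaining pattern, on a throwaway table (hypothetical data, just to show the syntax):

    #-- count rows per group, keep groups with more than one row, then order by count
    dt <- data.table(grp = c("a", "a", "b", "c", "c", "c"))
    dt[, .N, by = grp][N > 1][order(-N)]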

    On one of the plots we look at the date when each product was first seen. To get that date, we set a key on date, and then take the first occurrence, date[1], per base_product.

    #-- variants per year and product, with the date when it was first seen
    x3c_dt <- x3_dt[, .N, .(base_product, variants)
    ][, .(variants = .N), base_product][order(-variants)]

    x3_dt |> setkey(date)
    x3d_dt <- x3_dt[, .(date = date[1]), base_product]

    list(x3c_dt, x3d_dt) |> lapply(setkey, base_product)

    x3e_dt <- x3c_dt[x3d_dt][order(variants)
    ][, `:=` (year = year(date) |> as.factor())][year != 2018
    ][, .(products = .N), .(variants, year)][order(-variants)
    ][, `:=` (
    variant_bin = cut(
    variants,
    breaks = c(0, 1, 2, 5, 10, 20, 100, Inf),
    include.lowest = TRUE,
    right = FALSE
    ))
    ][, .(total_products = sum(products)), .(variant_bin, year)
    ][order(variant_bin)
    ][, `:=` (year_group = fcase(
    year %in% c(2019, 2020, 2021), "2019-2021",
    year %in% c(2022, 2023, 2024), "2022-2024"
    ))
    ][, `:=` (variant_bin = forcats::fct_rev(variant_bin))]

    The resulting table is exactly what we need for charting.

    The second plot uses the transaction date, so the data wrangling is similar, but without the date[1] step.

    If we want to combine a couple of plots, we can produce them separately and combine them using, for example, ggpubr::ggarrange(), or we can combine the tables into one dataset and then use faceting. The former is for when the plots are of a completely different nature, while the latter is useful when we can naturally build a combined dataset.
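
    For the first route, a minimal sketch (p1 and p2 stand for two previously built ggplot objects — hypothetical names):

    #-- combine two independent plots side by side, sharing one legend
    ggpubr::ggarrange(p1, p2, ncol = 2, common.legend = TRUE, legend = "bottom")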

    For the latter route, a few more lines from my script.

    x3h_dt <- data.table::rbindlist(
    l = list(
    introduction = x3e_dt[, `:=` (year = as.numeric(as.character(year)))],
    transaction = x3g_dt),
    use.names = T, fill = T, idcol = T)

    And the plot code.

    plot_04_w <- x3h_dt |>
    ggplot(aes(year, total_products,
    colour = variant_bin, fill = variant_bin, group = .id)) +
    geom_col(alpha = .8) +
    theme_bw() +
    scale_fill_viridis_d(direction = 1, option = "G") +
    scale_color_viridis_d(direction = 1, option = "G") +
    labs(x = "", y = "Base Products",
    title = "Products, and their variants",
    subtitle = "Yearly",
    fill = "Variants",
    colour = "Variants") +
    facet_wrap(".id", ncol = 2) +
    theme(... other styling options)

    Faceting has a big advantage, because we operate on one table, which helps a lot in ensuring data consistency.

    Pareto

    The essence of the Pareto calculation is to find how many products we need to achieve a certain revenue share. We need to prepare the dataset, in a couple of steps.

    #-- calculate quantity and revenue per base_product, monthly
    x3a_dt <- x3_dt[, {
    items = sum(lineitem_quantity, na.rm = T);
    revenue = sum(lineitem_quantity * lineitem_price);
    .(items, revenue)}, keyby = .(month, base_product)
    ][, `:=` (i = 1)][order(-revenue)][revenue > 0, ]

    #-- calculate percentage share, and cumulative share
    x3a_dt[, `:=` (
    rev_pct = revenue / sum(revenue),
    cum_rev_pct = cumsum(revenue) / sum(revenue), prod_n = cumsum(i)), month]

    In case we need to mask the real product names, let us create a new variable.

    #-- product name masking
    x3a_dt[, masked_name := paste("Product", .GRP), by = base_product]

    And a dataset printout, with a subset of columns.

    And filtered for one month, showing a few lines from the top and from the bottom.

    The essential column is cum_rev_pct, which indicates the cumulative share of revenue from products 1-n. We need to find which prod_n covers each revenue share threshold, as defined in the pct_thresholds_dt table.

    So we’re ready for the actual Pareto calculation. The code below, with comments.

    #-- pareto
    #-- set share thresholds
    pct_thresholds_dt <- data.table(cum_rev_pct = c(0, .2, .3, .5, .8, .95, 1))

    #-- set key for the join
    list(x3a_dt, pct_thresholds_dt) |> lapply(setkey, cum_rev_pct)

    #-- subset columns (optional)
    sel_cols <- c("month", "cum_rev_pct", "prod_n")

    #-- perform a rolling join - the essential step!
    x3b_dt <- x3a_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_cols]

    Why do we perform a rolling join? We need to find the first cum_rev_pct that covers each threshold.
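
    If the rolling-join behaviour feels abstract, here is a toy sketch with made-up numbers (not from the real dataset), showing how roll = -Inf matches each threshold to the next cumulative share at or above it:

    #-- toy example: thresholds rolled to the next cum_rev_pct that covers them
    toy_dt <- data.table(cum_rev_pct = c(.35, .60, .85, 1.00), prod_n = 1:4)
    thr_dt  <- data.table(cum_rev_pct = c(.5, .8, 1.00))
    setkey(toy_dt, cum_rev_pct)
    setkey(thr_dt, cum_rev_pct)
    toy_dt[thr_dt, roll = -Inf]   # .5 -> prod_n 2, .8 -> prod_n 3, 1 -> prod_n 4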

    We need 2 products for 20% of revenue, 4 products for 30%, and so on. And to reach 100% of revenue, of course, we need contributions from all 72 products.

    And a plot.

    #-- data prep
    x3b1_dt <- x3b_dt[month < month_max,
    .(month, cum_rev_pct = as.factor(cum_rev_pct) |> forcats::fct_rev(), prod_n)]

    #-- charting
    plot_07_w <- x3b1_dt |>
    ggplot(aes(month, prod_n, colour = cum_rev_pct, fill = cum_rev_pct)) +
    geom_line() +
    theme_bw() +
    geom_area(alpha = .2, show.legend = F, position = position_dodge(width = 0)) +
    scale_fill_viridis_d(direction = -1, option = "G", begin = 0.2, end = .9) +
    scale_color_viridis_d(direction = -1, option = "G", begin = 0.2, end = .9,
    labels = function(x) scales::percent(as.numeric(as.character(x))) # convert factor to numeric first
    ) +
    ... other styling options ...

    Lorenz curve

    To plot the Lorenz curve, we need to sort products by their contribution to total revenue, and normalize both the number of products and the revenue.

    Before the main code, a handy way to pick the n-th month from the dataset, from the beginning or from the end.

    month_sel <- x3a_dt$month |> unique() |> sort(decreasing = T) |> dplyr::nth(2)

    And the code.

    xl_oct24_dt <- x3a_dt[month == month_sel, 
    ][order(-revenue), .(
    cum_prod_pct = seq_len(.N)/.N,
    cum_rev_pct = cumsum(revenue)/sum(revenue))]

    To chart separate lines per time period, we need to adjust accordingly.

    #-- Lorenz curve, yearly aggregation
    xl_dt <- x3a_dt[order(-revenue), .(
    cum_prod_pct = seq_len(.N)/.N,
    cum_rev_pct = cumsum(revenue)/sum(revenue)), month]

    The xl_dt dataset is ready for charting.

    Indices, ratios

    The code is simple here, assuming sufficient prior data preparation. The logic and some snippets are in the main body of this article.

    Streamgraph

    The streamgraph shown earlier is an example of a chart which may appear difficult to render, especially when interactivity is required. One of the reasons I included it in this blog is to show how we can simplify such tasks with keys, joins, and data.table syntax in particular. Using keys, we can achieve very efficient filtering for interactivity. Once we have a handle on the data, we’re almost done; all that remains are some settings to fine-tune the plot.

    We start with the thresholds table.

    #-- set share thresholds
    pct_thresholds_dt <- data.table(cum_rev_pct = c(0, .2, .3, .5, .8, .95, 1))

    Since we want the joins performed monthly, it’s good to create a data subset covering one month, to test the logic, before extending it to the full dataset.

    #-- check logic for one month
    month_sel <- as.Date("2020-01-01")
    sel_a_cols <- c("month", "rev_pct", "cum_rev_pct", "prod_n", "masked_name")
    x3a1_dt <- x3a_dt[month == month_sel, ..sel_a_cols]

    We have 23 products in January 2020, sorted by revenue share, and we also have the cumulative revenue, reaching 100% with the last, 23rd product.

    Now we need to create an intermediate table, telling us how many products we need to achieve each revenue threshold.

    #-- set key for the join
    list(x3a1_dt, pct_thresholds_dt) |> lapply(setkey, cum_rev_pct)

    #-- perform a rolling join - the essential step!
    sel_b_cols <- c("month", "cum_rev_pct", "prod_n")
    x3b1_dt <- x3a1_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_b_cols]

    Because we work with a one-month data subset (and pick a month with not that many products), it is very easy to check the result — comparing the x3a1_dt and x3b1_dt tables.

    And now we need to get the product names, for a chosen threshold.

    #-- get products
    #-- set keys
    list(x3a1_dt, x3b1_dt) |> lapply(setkey, month, prod_n)

    #-- specify threshold
    x3b1_dt[cum_rev_pct == .8][x3a1_dt, roll = -Inf, nomatch = 0]

    #-- or, equivalently, specify the table's row
    x3b1_dt[5, ][x3a1_dt, roll = -Inf, nomatch = 0]

    To reach 80% of revenue, we need 7 products, and from the join above, we get their names.

    I think you already see why we use rolling joins, and can’t use simple < or > logic.

    Now, we need to extend the logic to all months.

    #-- extend to all months

    #-- set key for the join
    list(x3a_dt, pct_thresholds_dt) |> lapply(setkey, cum_rev_pct)

    #-- subset columns (optional)
    sel_cols <- c("month", "cum_rev_pct", "prod_n")

    #-- perform a rolling join - the essential step!
    x3b_dt <- x3a_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_cols]

    Get the products.

    #-- set keys, join
    list(x3a_dt, x3b_dt) |> lapply(setkey, month, prod_n)
    x3b6_dt <- x3b_dt[cum_rev_pct == .8][x3a_dt, roll = -Inf, nomatch = 0][, ..sel_a_cols]

    And verify, for the same month as in the test data subset.

    If we want to freeze the products for a certain month, and see the revenue from them over the whole period (which is what the second streamgraph shows), we can set a key on the product name and perform a join.

    #-- freeze products
    x3b6_key_dt <- x3b6_dt[month == month_sel, .(masked_name)]
    list(x3a_dt, x3b6_key_dt) |> lapply(setkey, masked_name)

    sel_b2_cols <- c("month", "revenue", "masked_name")
    x3a6_dt <- x3a_dt[x3b6_key_dt][, ..sel_b2_cols]

    And we get exactly what we needed.

    Using joins, including rolling ones, and deciding what can be precalculated in a warehouse versus what is left for dynamic filtering in a dashboard, does require some practice, but it definitely pays off.

    Image generated by DALL-E, based on the author’s prompt, inspired by “The Bremen Town Musicians”


