Machine Learning Incidents in AdTech
by Ben Weber · Artificial Intelligence · January 30, 2025


Source: https://unsplash.com/photos/a-couple-of-signs-that-are-on-a-fence-xXbQIrWH2_A

Challenges with deep learning in production

Towards Data Science · 14 min read

One of the biggest challenges I encountered in my career as a data scientist was migrating the core algorithms in a mobile AdTech platform from classic machine learning models to deep learning. I worked on a Demand Side Platform (DSP) for user acquisition, where the role of the ML models is to predict whether showing an ad impression to a device will result in the user clicking on the ad and installing a mobile app. For a quick hands-on overview of the click prediction problem, please take a look at my past post.

While we were able to quickly get to a state where the offline metrics for the deep learning models were competitive with the logistic regression models, it took a while to get the deep learning models running smoothly in production, and we encountered many incidents along the way. We started with small-scale tests using Keras for model training and Vertex AI for managed TensorFlow serving, and ran experiments to compare iterations of our deep learning models with our champion logistic regression models. We were eventually able to get the deep learning models to outperform the classic ML models in production and modernize our ML platform for user acquisition.

When working with machine learning models at the core of a complex system, there are going to be situations where things go off the rails, and it's important to be able to quickly recover and learn from these incidents. During my time at Twitch, we used the Five W's approach to writing postmortems for incidents. The idea is to identify "what" went wrong, "when" and "where" it happened, "who" was involved, and "why" the problem occurred. The follow-up is to establish how to avoid this type of incident in the future and to set up guardrails to prevent similar issues. The goal is to build an increasingly robust system over time.

In one of my previous roles in AdTech, we ran into several issues when migrating from classic ML models to deep learning. We eventually got to a state where we had a robust pipeline for training, validating, and deploying models that improved upon our classic models, but we hit incidents along the way. In this post we'll cover 8 of the incidents that occurred and describe the following steps we took for incident management:

1. What was the issue?
2. How was it found?
3. How was it fixed?
4. What did we learn?

We identified a variety of root causes, but often aligned on similar solutions when making our model pipelines more robust. I hope sharing details about these incidents provides some guidance on what can go wrong when using deep learning in production.

    Incident 1: Untrained Embeddings

What was the issue?
We found that many of the models we deployed, such as those predicting click and install conversion, were poorly calibrated. This meant that the conversion rate predicted by the model was much higher than the actual conversion rate we observed for the impressions we served. After drilling down further, we found that the miscalibration was worse on categorical features where we had sparse training data. Eventually we discovered that we had embedding layers in our install model where no training data was available for some of the vocabulary entries. What this meant is that when fitting the model, we weren't making any updates to these entries, and the coefficients remained set to their randomly initialized weights. We called this incident "Untrained Embeddings", because we had embedding layers where some of the layer weights never changed during model training.

How was it found?
We mostly discovered this issue through intuition after reviewing our models and data sets. We used the same vocabularies for categorical features across two models, and the install model data set was smaller than the click model data set. This meant that some of the vocabulary entries that were fine to use for the click model were problematic for the install model, because some of the vocab entries had no training examples in the smaller data set. We confirmed that this was the issue by comparing the weights in the embedding layers before and after training, and finding that a subset of the weights were unchanged after fitting the model. Because we randomly initialized the weights in our Keras models, this led to issues with model calibration on live data.
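
To make this kind of check concrete, here is a minimal sketch of how unchanged embedding rows could be detected in Keras; the layer name and the pre-fit snapshot dictionary are illustrative assumptions rather than our production tooling.

```python
import numpy as np
import tensorflow as tf

def untrained_embedding_rows(model: tf.keras.Model,
                             initial_weights: dict,
                             layer_name: str) -> np.ndarray:
    """Return vocab indices whose embedding rows are identical before and after fit."""
    before = initial_weights[layer_name]                   # snapshot taken before fit
    after = model.get_layer(layer_name).get_weights()[0]   # (vocab_size, embed_dim)
    unchanged = np.all(np.isclose(before, after), axis=1)
    return np.where(unchanged)[0]

# Usage sketch: snapshot embedding weights before training, then compare after fit.
# initial = {layer.name: layer.get_weights()[0] for layer in model.layers
#            if isinstance(layer, tf.keras.layers.Embedding)}
# model.fit(train_ds)
# stale_rows = untrained_embedding_rows(model, initial, "app_bundle_embedding")
```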

How was it fixed?
We first limited the size of the vocabularies used for categorical features, to reduce the chance of this issue occurring. The second change we made was setting the weights to 0 for any embedding layer entries whose weights were unchanged during training. Longer term, we moved away from reusing vocabularies across different prediction tasks.

What did we learn?
We discovered that this was one of the issues leading to model instability, where models with similar performance on offline metrics would show noticeably different performance when deployed to production. We ended up building additional tooling to compare model weights across training runs as part of our model validation pipeline.

Incident 2: Padding Issue with Batching for TensorFlow Serving

What was the issue?
We migrated from Vertex AI for model serving to an in-house deployment of TensorFlow Serving, to deal with some of the tail-end latency issues we were encountering with Vertex at the time. When making this change, we ran into an issue with how to handle sparse tensors when enabling batching for TensorFlow Serving. Our models contained sparse tensors for features, such as the list of known apps installed on a device, that could be empty. When we enabled batching while serving on Vertex AI, we were able to use empty arrays without issue, but for our in-house model serving we got error responses when using batching and passing empty arrays. We ended up passing "[0]" values instead of "[]" tensor values to avoid this issue, but this again resulted in poorly calibrated models. The core issue is that "0" referred to a specific app rather than being used for out-of-vocab (OOV). We were introducing a feature parity issue into our models, because we only made this change for model serving and not for model training.

How was it found?
Once we identified the change that had been made, it was straightforward to demonstrate that this padding approach was problematic. We took data with an empty tensor and changed the value from "[]" to "[0]" while keeping all of the other tensor values constant, and showed that this change resulted in different prediction values. This made sense, because we were altering the tensor data to assert that an app was installed on the device when that was not actually the case.

How was it fixed?
Our initial fix was to change the model training pipeline to perform the same logic we used for model serving, where we replace empty arrays with "[0]", but this didn't completely address the issue. We later changed the vocab range from [0, n-1] to [0, n], where 0 had no meaning and was added to every tensor. This meant that every sparse tensor had at least 1 value, and we were able to use batching with our sparse tensor setup.
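
A rough sketch of the vocab-shift idea, using a simplified dict-based vocabulary; the helper names are hypothetical and only meant to show why reserving index 0 as a meaningless padding value keeps sparse tensors non-empty without colliding with a real app.

```python
def build_shifted_vocab(app_bundles):
    # Real entries occupy indices [1, n]; 0 is reserved as a padding value
    # that carries no meaning.
    return {bundle: idx + 1 for idx, bundle in enumerate(app_bundles)}

def encode_installed_apps(installed, vocab):
    # Always include the padding index so the resulting tensor is never empty,
    # which keeps request batching happy.
    return [0] + [vocab[b] for b in installed if b in vocab]

vocab = build_shifted_vocab(["com.example.game", "com.example.shop"])
print(encode_installed_apps([], vocab))                    # [0]
print(encode_installed_apps(["com.example.game"], vocab))  # [0, 1]
```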

What did we learn?
This issue mostly came about because of separate threads of work on the model training and model serving pipelines, and a lack of coordination between them. Once we identified the differences between the training and serving pipelines, it was obvious that this discrepancy could cause issues. We worked to improve on this incident by including data scientists as reviewers on pull requests for the production pipeline, to help catch these types of issues.

Incident 3: Untrained Model Deployment

What was the issue?
Early on in our migration to deep learning models, we didn't have many guardrails in place for model deployments. For each model variant we were testing, we would retrain and automatically redeploy the model every day, to make sure the models were trained on recent data. During one of the training runs, training produced a model that always predicted a 25% click rate regardless of the input data, and the ROC AUC metric on the validation data set was 0.5. We had essentially deployed a model to production that always predicted a 25% click rate regardless of any of the feature inputs.

How was it found?
We first identified the issue using our system monitoring metrics in Datadog. We logged our click predictions (p_ctr) as a histogram metric, and Datadog provides p50 and p99 aggregations. When the model was deployed, we saw the p50 and p99 values for the model converge to the same value of ~25%, indicating that something had gone wrong with the click prediction model. We also reviewed the model training logs and saw that the metrics from the validation data set indicated a training error.
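
As an illustration of this kind of monitoring, here is a small sketch using the Datadog DogStatsD client; the metric name, tags, and agent address are assumptions for the example.

```python
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def log_click_prediction(p_ctr: float, model_version: str) -> None:
    # Histogram metrics give p50/p95/p99 aggregations in Datadog, so a
    # collapsed prediction distribution (p50 == p99) stands out quickly.
    statsd.histogram(
        "dsp.model.p_ctr",
        p_ctr,
        tags=[f"model_version:{model_version}", "task:click"],
    )
```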

How was it fixed?
In this case, we were able to roll back to the click model from the previous day to resolve the issue, but it did take some time for the incident to be discovered, and our rollback approach at the time was somewhat manual.

What did we learn?
We found that this bad model training issue occurred around 2% of the time, and we needed to set up guardrails against deploying these models. We added a model validation module to our training pipeline that checked thresholds on the validation metrics and also compared the outputs of the new and prior runs of the model on the same data set. We also set up alerts in Datadog to flag large changes in the p50 p_ctr metric and worked on automating our model rollback process.
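
A minimal sketch of what such a validation gate can look like; the thresholds and inputs are illustrative assumptions, not our exact production checks.

```python
import numpy as np

def validate_candidate(roc_auc: float,
                       new_preds: np.ndarray,
                       prior_preds: np.ndarray,
                       min_auc: float = 0.6,
                       max_mean_shift: float = 0.05) -> bool:
    """Block deployment if the candidate looks degenerate or drifts too far
    from the prior model's predictions on the same evaluation set."""
    if roc_auc < min_auc:
        return False            # e.g. the constant 25% model scored AUC 0.5
    if np.std(new_preds) < 1e-4:
        return False            # collapsed predictions (p50 == p99)
    if abs(new_preds.mean() - prior_preds.mean()) > max_mean_shift:
        return False            # large shift versus the prior day's model
    return True
```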

Incident 4: Bad Warmup Data for TensorFlow Serving

What was the issue?
We used warmup files for TensorFlow Serving to improve the rollout time of new model deployments and to help with serving latency. We ran into an issue where the tensors defined in the warmup file didn't correspond to the tensors defined in the TensorFlow model, resulting in failed model deployments.

How was it found?
In an early version of our in-house serving, this mismatch between warmup data and model tensor definitions would bring all model serving to a halt and require a model rollback to stabilize the system. This is another incident that was initially captured by our operational metrics in Datadog, since we saw a large spike in model serving error requests. We confirmed that there was an issue with the newly deployed model by deploying it to Vertex AI and verifying that the warmup data was the root cause of the issue.

How was it fixed?
We updated our model deployment module to confirm that the model tensors and warmup data were compatible, by launching a local instance of TensorFlow Serving in the model training pipeline and sending sample requests using the warmup file data. We also did additional manual testing with Vertex AI when launching new types of models with noticeably different tensor shapes.
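
The smoke test amounts to something like the sketch below, which posts a sample request to a locally launched TensorFlow Serving instance over its REST API and fails the pipeline on any error; the model name, port, and payload structure are illustrative.

```python
import requests

def smoke_test_model(sample_instances: list,
                     model_name: str = "click_model",
                     host: str = "http://localhost:8501") -> None:
    # TensorFlow Serving exposes /v1/models/<name>:predict for REST requests.
    url = f"{host}/v1/models/{model_name}:predict"
    resp = requests.post(url, json={"instances": sample_instances}, timeout=10)
    resp.raise_for_status()  # fail the deployment check on any serving error
    predictions = resp.json()["predictions"]
    assert len(predictions) == len(sample_instances), "unexpected response shape"
```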

What did we learn?
We realized that we needed separate environments for testing TensorFlow model deployments before pushing them to production. We were able to do some testing with Vertex AI, but eventually set up a staging environment for our in-house version of TensorFlow Serving to provide a proper CI/CD setup for model deployment.

Incident 5: Problematic Time-Based Features

What was the issue?
We explored some time-based features in our models, such as weeks_ago, to capture changes in behavior over time. For the training pipeline, this feature was calculated as floor(date_diff(today, day_of_impression)/7). It was a highly ranked feature in some of our models, but it also added unintended bias. During model serving, this value is always set to 0, since we are making model predictions in real time and today is the same as day_of_impression. The key issue is that the model training pipeline was finding patterns in the training data that would create bias issues when applying the model to live data.
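
The skew is easy to see in a small example; the function below is an illustration of the feature definition above, not the production code.

```python
from datetime import date

def weeks_ago(today: date, day_of_impression: date) -> int:
    # floor(date_diff(today, day_of_impression) / 7)
    return (today - day_of_impression).days // 7

# Training pipeline, scoring an impression that is months old:
print(weeks_ago(date(2024, 12, 1), date(2024, 9, 15)))   # 11

# Serving pipeline, where today == day_of_impression:
print(weeks_ago(date(2024, 12, 1), date(2024, 12, 1)))   # 0
```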

How was it found?
This was another incident that we found mostly through intuition and later confirmed to be a problem by comparing the implementation logic across the training and model serving pipelines. We found that the model serving pipeline always set the value to 0, while the training pipeline used a range of values, given that we often use months-old examples for training.

How was it fixed?
We created a variant of the model with all of the relative time-based features removed and ran an A/B test to compare the performance of the variants. The model that included the time-based features performed better on the holdout metrics during offline testing, but the model with the features removed worked better in the A/B test, and we ended up removing the features from all of our models.

What did we learn?
We found that we had introduced bias into our models in an unintended way. The features were compelling to explore, because user behavior does change over time, and introducing them did lead to better offline metrics for our models. Ultimately we decided to categorize these as problematic under the feature parity category, where we see differences in values between the model training and serving pipelines.

Incident 6: Feedback Features

What was the issue?
We had a feature called clearing_price that logged how high we were willing to bid on an impression for a device the last time we served an ad impression to that device. This was a useful feature, because it helped us bid on devices with a high bid floor, where the model needs high confidence that a conversion event will occur. This feature by itself generally wasn't problematic, but it did become a problem during an incident where we introduced bad labels into our training data set. We ran an experiment that resulted in false positives in our training data, and we started to see a feedback issue where the model bias became a problem.

How was it found?
This was a very tricky incident to root-cause, because the experiment that generated the false positive labels ran on a small cohort of traffic, so we didn't see a sudden change in operational metrics in Datadog like we did with some of the other incidents. Once we identified which devices and impressions were impacted by this test, we looked at the feature drift of our data set and found that the average value of the clearing_price feature had been rising steadily since the rollout of the experiment. The false positives in the label data were the root cause of the incident, and the drift in this feature was a secondary issue that was causing the models to make bad predictions.
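
Here is a sketch of the kind of drift check that surfaces this pattern, comparing a recent window of daily feature means against a baseline window; the column names and thresholds are assumptions for illustration.

```python
import pandas as pd

def feature_drifted(df: pd.DataFrame,
                    feature: str = "clearing_price",
                    baseline_days: int = 28,
                    recent_days: int = 7,
                    max_ratio: float = 1.2) -> bool:
    """df has one row per impression with a 'date' column and the feature.
    Returns True if the recent mean exceeds the baseline mean by max_ratio."""
    daily_mean = df.groupby("date")[feature].mean().sort_index()
    baseline = daily_mean.iloc[-(baseline_days + recent_days):-recent_days].mean()
    recent = daily_mean.iloc[-recent_days:].mean()
    return recent > max_ratio * baseline
```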

How was it fixed?
The first step was to roll back to the best-known model from before the problematic experiment was launched. We then cleaned up the data set and removed the false positives that we could identify from the training data. We continued to see issues and also made the call to remove the problematic feature from our models, similar to the time-based features, to prevent it from creating feedback loops in the future.

What did we learn?
We learned that some features are useful for making the model more confident in predicting user conversions, but are not worth the risk, because they can introduce a tailspin effect where the models quickly deteriorate in performance and create incidents. To replace the clearing price feature, we introduced new features using the minimum-bid-to-win values from auction callbacks.

Incident 7: Bad Feature Encoding

What was the issue?
We explored a few features that were numeric and computed as ratios, such as the average click rate of a device, computed as the number of clicks divided by the number of impressions served to the device. We ran into a feature parity issue where we handled divide-by-zero differently between the training and serving model pipelines.

How was it found?
We have a feature parity check where we log the tensors created during model inference for a subset of impressions, run the training pipeline on those impressions, and compare the values generated in the training pipeline with the values logged at serving time. We noticed a large discrepancy for the ratio-based features and found that we encoded divide-by-zero as -1 in the training pipeline and 0 in the serving pipeline.
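
Conceptually, the check joins the logged serving-time values with recomputed training-time values for the same impressions and reports any mismatches, roughly as in the sketch below; the column names are illustrative.

```python
import pandas as pd

def feature_parity_diff(serving_log: pd.DataFrame,
                        training_values: pd.DataFrame,
                        feature: str,
                        tol: float = 1e-6) -> pd.DataFrame:
    """Both frames are keyed by impression_id and contain the feature column."""
    merged = serving_log.merge(training_values, on="impression_id",
                               suffixes=("_serve", "_train"))
    mismatch = (merged[f"{feature}_serve"] - merged[f"{feature}_train"]).abs() > tol
    return merged.loc[mismatch,
                      ["impression_id", f"{feature}_serve", f"{feature}_train"]]

# For the device click-rate feature, rows with serve == 0.0 and train == -1.0
# would flag the divide-by-zero encoding mismatch described above.
```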

How was it fixed?
We updated the serving pipeline to match the logic in the training pipeline, where we set the value to -1 when a divide-by-zero occurs for the ratio-based features.

What did we learn?
Our pipeline for detecting feature parity issues allowed us to quickly identify the root cause of this problem once the model was deployed to production, but it's also a situation we want to avoid before a model is deployed. We applied the same learning from incident 2, where we included data scientists on pull request reviews to help identify potential issues between our training and serving model pipelines.

    Incident 8: String Parsing

What was the issue?
We used a 1-hot encoding approach where we choose the top k values, which are assigned indices from 1 to k, and use 0 as an out-of-vocab (OOV) value. We ran into a problem with the encoding from strings to integers when dealing with categorical features such as app bundle, which often has extra characters. For example, the vocabulary may map the bundle com.dreamgames.royalmatch to index 3, but in the training pipeline the bundle is set to com.dreamgames.royalmatch$hl=en_US and the value gets encoded to 0, because it's considered OOV. The core issue we ran into was different logic for sanitizing string values between the training and serving pipelines before applying vocabularies.
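
The mismatch can be illustrated with a simplified dict vocabulary where 0 is the OOV index; the sanitization rule shown is an assumption for the example rather than the production parsing logic.

```python
def encode_bundle(bundle: str, vocab: dict) -> int:
    return vocab.get(bundle, 0)   # 0 == out-of-vocab (OOV)

def sanitize_bundle(bundle: str) -> str:
    # Strip locale-style suffixes such as "$hl=en_US" before the vocab lookup.
    return bundle.split("$")[0]

vocab = {"com.dreamgames.royalmatch": 3}
raw = "com.dreamgames.royalmatch$hl=en_US"
print(encode_bundle(raw, vocab))                    # 0 -> silently treated as OOV
print(encode_bundle(sanitize_bundle(raw), vocab))   # 3 -> the intended index
```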

How was it found?
This was another incident that we discovered with our feature parity checker. We found several examples where one pipeline encoded the values as OOV while the other pipeline assigned non-zero values. We then compared the feature values prior to encoding and noticed discrepancies in how we did string parsing between the training and serving pipelines.

How was it fixed?
Our short-term fix was to update the training pipeline to perform the same string parsing logic as the serving pipeline. Longer term, we focused on truncating the app bundle names at the data ingestion step, to reduce the need for manual parsing steps in the different pipelines.

What did we learn?
We learned that dealing with problematic strings at data ingestion provided the most consistent results. We also ran into issues with Unicode characters showing up in app bundle names and worked to correctly parse these during ingestion. We also found it important to occasionally inspect the vocabulary entries generated by the system to make sure special characters weren't showing up in entries.

    Takeaways

While it may be tempting to use deep learning in production for model serving, there are many potential issues you can encounter with live model serving. It's important to have robust plans in place for incident management when working with machine learning models, so that you can quickly recover when model performance becomes problematic and learn from these missteps. In this post we covered 8 different incidents I encountered when using deep learning to predict click and install conversion in a mobile AdTech platform. Here are the key takeaways I learned from these machine learning incidents:

• It's important to log feature values, encoded values, tensor values, and model predictions during model serving, to make sure that you don't have feature parity or model parity issues in your model pipelines.
• Model validation is an essential step in model deployment, and test environments can help reduce incidents.
• Be careful about the features you include in your model; they may be introducing bias or causing unintended feedback loops.
• If you have different pipelines for model training and model serving, the team members working on those pipelines should review each other's pull requests for ML feature implementations.

Machine learning is a discipline that can learn a lot from DevOps to reduce the occurrence of incidents, and MLOps should include processes for efficiently responding to issues with ML models in production.


