Running experiments is a task that often falls to data scientists. If that’s you, congrats! It can be a rewarding and high-impact area of work, but it also requires tools found outside the typical ML-heavy data science curriculum.
Even with the best tools, only a small share of experiments deliver meaningful business value. I’ve been lucky to design and execute many experiments. Of those, I have a few winners. From those, I’m sharing some stories to illustrate key concepts related to experiments.
Background: I work at a company called IntelyCare. We help connect nurses with various work opportunities (full-time, part-time, contracts, per-diem… the whole menu).
- One of our core offerings is a nursing-only job board. If you take a look in the year 2025, you’ll find two possible ways of sorting jobs: by date and by relevance.
Why it matters: The sort-by-relevance feature is our current best lever to guarantee a premium experience for paying customers. It also gives us a chance to improve the overall efficiency of our job board by steering eyeballs away from low-quality jobs.
Unfortunately, we can’t put every job at the top of a search result. We face a tradeoff between the quantity of top-page listings and the quality of the experience in the form of increased applies.
How it works: “Relevance” doesn’t mean what it usually means. Sorry!
We give each job a score between 0 and 100. When filling a page with jobs, sorting by relevance means we sort the results by that score. That’s it! For brevity, we’ll say any job with a score greater than 0 is “boosted.”
I know what you’re thinking, “This isn’t relevance!” And you’re right, at least in the normal sense of the word. The score doesn’t vary across job-seekers or search terms. A better name might be “relevant to Google.” We’re OK with that because a huge share of our job-board traffic comes from Google, as shown below.
In math: We have N jobs. Every day we generate a vector of N integers between 0 and 100. We feed this vector into a black box named Google. If we do a good job, the black box rewards us with many job applications.
By putting the “right” jobs at the top of the page (loaded word there), we can improve upon a chronological sort. Before we can identify the right jobs, we need to know how much Google actually rewards higher-placed jobs.
Day 0: Making progress when you know nothing
Sometimes, just to justify all the simplifying assumptions I’m going to make later, I start a project by writing down the math equation I’d like to solve. I imagine ours looks something like this:

- S is our vector of relevancy scores. There are N jobs, so each s_i (an element of S) corresponds to a different job. A function called applies turns S into a scalar. Each day we’d like to find the S that makes that number as large as possible: the relevancy scores that generate the greatest number of job applications for intelycare.com/jobs.
- applies is a fine objective function on Day 0. Later on our objective function may change (e.g. revenue, lifetime value). Applies are easy to count, though, and that lets me spend my complexity tokens elsewhere. It’s Day 0, people. We’ll come back to these questions on Day 1.
- Problem. We know nothing about the applies function until we start feeding it relevancy scores. 😱
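In symbols, the problem described in these bullets could be written along the following lines. The notation is my assumption, pieced together from the description above (a vector of N integer scores between 0 and 100, and a scalar-valued applies function); $S^{*}$ is just a label for the best daily score vector.

```latex
S^{*} \;=\; \underset{S \,\in\, \{0, 1, \dots, 100\}^{N}}{\arg\max}\; \operatorname{applies}(S),
\qquad S = (s_1, s_2, \dots, s_N)
```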
First things first: Seeing that we know nothing about the applies function, our first question is, “how do we choose an initial wave of daily S vectors so we can learn what the applies function looks like?”
- We know (1) which jobs are boosted and when, (2) how many applies each job receives each day. Note the absence of page-load data. It’s Day 0! You might not have all the data you want on Day 0, but if we’re clever, we can make do with what we have.
- Note the subtle change in our objective. Earlier, our goal was to accomplish some business objective (maximize applies), and eventually we’ll come back to that goal. We took off our business hat for a minute and put on our science hat. Our only goal now is to learn something. If we can learn something, we can use it (later) to help achieve some business objective. 🤓
- Since our goal is to learn something, above all we want to avoid learning nothing. Remember it’s Day 0 and we have no guarantee that the Google Monster will pay any attention to how we sort things. We may as well go for broke and make sure this thing works before throwing more time at improving it.
How do we choose an initial wave of daily S vectors? We’ll give every job a score of 0 (default score), and choose a random subset of jobs to boost to 100.
- Maybe I’m stating the obvious, but it needs to be random if you want to isolate the effect of page position on job applications. We want the only difference between boosted jobs and other jobs to be their relative ordering on the page as determined by our relevance scores. [I can’t tell you how many phone screens I’ve conducted where a candidate doubled down on running an A/B test with the good customers in one group and the bad customers in the other group. In fairness, I’ve also vetted marketing-tech vendors who do the same thing 😭].
- The randomness will be good later on for other reasons. It’s likely that some jobs benefit from page placement more than others. We’ll have an easier time identifying those jobs with a big, randomly generated dataset.
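A minimal sketch of that first scoring rule, in Python. The function name, the `boost_share` parameter, and the job ids are all hypothetical; the only parts taken from the plan above are the default score of 0, the boost to 100, and the random choice of which jobs get it.

```python
import random


def random_boost_scores(job_ids, boost_share, seed=None):
    """Return {job_id: score}: 100 for a random subset, 0 for every other job."""
    if not 0.0 <= boost_share <= 1.0:
        raise ValueError("boost_share must be between 0 and 1")
    job_ids = list(job_ids)
    rng = random.Random(seed)  # seeded so a given day's draw can be reproduced
    n_boost = round(len(job_ids) * boost_share)
    boosted = set(rng.sample(job_ids, n_boost))
    return {job_id: (100 if job_id in boosted else 0) for job_id in job_ids}


# Hypothetical usage: 20 made-up jobs, boost a random 10% of them.
scores = random_boost_scores([f"job-{i}" for i in range(20)], boost_share=0.10, seed=7)
n_boosted = sum(1 for s in scores.values() if s == 100)
```

Because the draw ignores everything about the job, boosted and non-boosted jobs should look alike on average, which is what lets page position take the credit for any difference in applies.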
The plan: Subtle but important details
We know we can’t boost every job. Anytime I put a job at the top of the page, I bump all other jobs down the page (classic example of a “spillover”).
The spillover gets worse as I boost more and more jobs: I impose a greater and greater punishment on all other jobs by pushing them down in the sort (including other boosted jobs).
- With little exception, nursing jobs are in-person and local, so any boosting spillovers will be limited to other nearby jobs. This is important.
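To make the spillover concrete, here is a toy sketch of a relevance sort within one geography. The tie-break by recency is my assumption (the post only says results are sorted by score), and the job ids are made up.

```python
def rank_jobs(scores, posted_order):
    """Page order: highest score first; ties broken by recency.

    posted_order lists job ids newest-first, so with no boosts at all
    this reduces to the plain chronological sort.
    """
    recency = {job_id: i for i, job_id in enumerate(posted_order)}
    return sorted(posted_order, key=lambda j: (-scores.get(j, 0), recency[j]))


def positions(order):
    """Map each job id to its 1-based position on the page."""
    return {job_id: i + 1 for i, job_id in enumerate(order)}


posted_order = ["a", "b", "c", "d", "e"]  # newest first
baseline = rank_jobs({}, posted_order)
boosted = rank_jobs({"d": 100, "e": 100}, posted_order)

# Positive shift = pushed down the page, negative = moved up.
before, after = positions(baseline), positions(boosted)
shift = {job_id: after[job_id] - before[job_id] for job_id in posted_order}
```

Boosting two jobs moves each of them up three places and pushes every other job down two, so the "control" jobs are not untouched. That is why the treatment has to vary at the geography level.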
How do we choose an initial wave of daily S vectors? (final answer) We’ll give every job a score of 0 (default score), and choose a random subset of jobs to boost to 100. The size of the random subset will vary across geographies.
- We create 4 groups of distinct geographies with roughly the same amount of web traffic in each group. Each group is balanced along the key dimensions we think are important. We randomly boost a different percentage of jobs in each group.
Here’s how it looked…
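One way such a design could be sketched in Python is below. Everything specific here is an assumption for illustration: the blocked randomization on traffic is just one simple way to get roughly balanced groups (not necessarily how ours were built), the function names are hypothetical, and the boost shares in the usage example are placeholders.

```python
import random


def assign_geo_groups(traffic_by_geo, n_groups=4, seed=None):
    """Blocked randomization: rank geographies by traffic, then deal each
    consecutive block of n_groups geographies across the groups at random."""
    rng = random.Random(seed)
    ranked = sorted(traffic_by_geo, key=traffic_by_geo.get, reverse=True)
    assignment = {}
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        groups = list(range(n_groups))
        rng.shuffle(groups)
        for geo, group in zip(block, groups):
            assignment[geo] = group
    return assignment


def daily_scores(jobs_by_geo, assignment, boost_share_by_group, seed=None):
    """Score every job 0, then boost a random subset to 100 within each
    geography, sized by the share assigned to that geography's group."""
    rng = random.Random(seed)
    scores = {}
    for geo, job_ids in jobs_by_geo.items():
        job_ids = list(job_ids)
        share = boost_share_by_group[assignment[geo]]
        boosted = set(rng.sample(job_ids, round(len(job_ids) * share)))
        for job_id in job_ids:
            scores[job_id] = 100 if job_id in boosted else 0
    return scores


# Hypothetical usage with made-up geographies, traffic, and shares.
traffic = {f"geo-{i}": 1000 - 10 * i for i in range(8)}
jobs = {geo: [f"{geo}/job-{j}" for j in range(20)] for geo in traffic}
groups = assign_geo_groups(traffic, seed=1)
shares = {0: 0.05, 1: 0.10, 2: 0.15, 3: 0.25}  # placeholder values
scores = daily_scores(jobs, groups, shares, seed=1)
```

Dealing geographies out in traffic-ranked blocks keeps each group's total traffic close to the others while still leaving the group assignment to chance. Balancing on more dimensions (urban/rural, large/small) would need a richer blocking key.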

- Each black circle represents a different geography. Its elevation shows the difference in applies-per-job between boosted jobs and all other jobs (measured as a percent).
- While groups are balanced in aggregate, the individual geographies vary considerably. The balance is still important though. Otherwise, what you see in the chart could be an artifact of the mix of urban/rural or large/small geographies in each group. As it is, we’re confident the results come from our relevancy scores.
- A quick-and-dirty interpretation of this chart is something like, “the 5% of jobs at the top of the page have ~26% more applies per day than the 95% of jobs positioned below. The 10% of jobs at the top of the page have ~21% more applies per day than the 90% of jobs below…” and so on. I would never be so bold as to say that in real life, but in a perfect-experiment world it would be true.
- By the time we boost 25% of jobs, the boost experience is completely averaged out! We diluted the perks of premium placement to almost nothing for the median geography. “And when everyone is super, no one will be!” Can you imagine learning this the hard way?
- There are many other layers to peel back. Perhaps dilution happens more quickly for nursing specialties with many pages of listings? What about states that overlap with our long-standing per-diem staffing business? Many fine questions. We have answers for some, but all of them together are more than I can include in this post.
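The number each circle plots can be computed from the two things we track on Day 0: which jobs were boosted, and how many applies each job got each day. Here is a rough sketch, assuming one row per job per day in a hypothetical `(geo, boosted, applies)` layout; the toy rows are invented, not our data.

```python
from collections import defaultdict
from statistics import median


def boost_lift_by_geo(rows):
    """Percent difference in applies per job-day, boosted vs. all other jobs.

    rows: iterable of (geo, boosted, applies) tuples, one per job per day.
    Geographies where the lift is undefined (no boosted rows, no other
    rows, or zero applies among the other jobs) are left out.
    """
    totals = defaultdict(lambda: {True: [0, 0], False: [0, 0]})  # [applies, job-days]
    for geo, boosted, applies in rows:
        cell = totals[geo][bool(boosted)]
        cell[0] += applies
        cell[1] += 1
    lift = {}
    for geo, cells in totals.items():
        (b_applies, b_days), (o_applies, o_days) = cells[True], cells[False]
        if b_days == 0 or o_days == 0 or o_applies == 0:
            continue
        lift[geo] = 100.0 * ((b_applies / b_days) / (o_applies / o_days) - 1.0)
    return lift


# Invented toy rows for two geographies.
rows = [
    ("geo-A", True, 3), ("geo-A", False, 2), ("geo-A", False, 2),
    ("geo-B", True, 2), ("geo-B", False, 2), ("geo-B", False, 2),
]
lift = boost_lift_by_geo(rows)
median_lift = median(lift.values())
```

Taking the median across the geographies in a group gives one summary number per boost share, which is the comparison the chart makes by eye.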
What comes next? Day 1 is when the real fun begins! 🎉
- We now have guardrails against diluting our premium experience (super important), but what’s the best ~10% of jobs to boost each day? Obviously our paying customers have priority, but then what?
- Does boost help some jobs more than others? The randomly generated data from our experiment is well suited to answer this and many other questions. We’ll save these questions for future posts.
- Once we have a strategy for boosting, is our objective really to maximize the total number of applies? Or do we only care about the applies for boosted jobs? 🤔 (Sometimes I miss the Day 0 days when all the jobs were equally relevant. Might be time to revisit those equations at the top of the post.)
Key takeaways for those who made it this far
- By being thoughtful about how we generated our initial data, we quickly found a convincing answer to our question, set ourselves up to answer many future questions, and saved ourselves a ton of time trying to build an uplift model on non-existent historical data.
- Thinking of a test? Go for it! If you execute well, you can see the results clearly in a chart and avoid all the complicated statistics (obligatory xkcd reference). [hmm, maybe *most* of the statistics. I still love a good regression table.]
- Spillovers are everywhere. Sometimes varying the treatment across an aggregated group can help like it did here. That can quickly axe your sample size, but I find it better to have a small data set with meaning than a big data set that’s hot garbage.
Bonus: We ran this experiment in 2023. How are things now?
At the time of our little geo-randomized experiment, you can see in the charts that our premium job openings performed ~25% better than regular jobs (meaning they had 25% more applies on average).
Why it matters: We’ve taken over a year to develop and iterate on our product to ensure our premium listings deliver the best possible experience. Looking at some recent numbers… (literally running the queries as I write this)
- Boosted job openings receive 425% more applies than regular openings
- Boosted jobs are 450% more likely to have received at least one apply compared to regular openings
Not bad! This isn’t randomized, so that 425% includes all sorts of selection bias, additional product work, a crack SEO team, and a successful email operation, all in addition to the incremental effects from premium page position. Importantly, all the extra product and marketing work is focused on a small number of jobs, as our initial testing recommends. 🏆