Offline Experimentation in ML Teams: A Product Manager’s Perspective

Why offline experimentation is a key part of the process of iterative machine learning product development

Humberto Corona (totó pampín)
Product Coalition

--

Testing new recommendation algorithms can be simulated offline before showing them to millions of customers

If you are a product manager working in an ML/AI team and you have not yet got your hands dirty with offline experimentation, this post should serve as a welcome encouragement. If you want to work in an ML/AI team, diving deeper will help you prepare for when that moment arrives. Let’s dive into it.

One of the main advantages of working on many machine learning products is the ability to simulate a scenario based on historical data by performing offline experiments. Problems such as predicting whether a customer will contact a customer support agent, or finding the right dress to recommend, can be simulated in this environment, and the results can be a good indicator of the future performance of the system in real life. Building an intuition for offline experimentation and understanding its drawbacks is as important as understanding its advantages.

Why does offline experimentation matter?

Offline experimentation allows teams to quickly iterate through different solutions for a given problem. These can be a few widely different approaches, or dozens of small variations. It is both quicker and cheaper than doing A/B testing, and often gives more (and different) information than UX research studies. However, I would argue that from a product manager’s perspective there are three other main advantages of performing offline experimentation.

  • It allows teams to align early on a given problem to solve, a hypothesis to test, and a set of metrics to evaluate success. It is very important that offline metrics, online metrics and business KPIs are as aligned as possible.
  • It also allows teams to get a rough estimate of the expected performance of a solution for a given problem, which can be used to calculate the return on investment, secure further funding for a promising solution, or kill a project that might be either too costly to invest in, or for which the solution is not mature enough.
  • When the simulation is accompanied by an extensive offline evaluation, it can be used to gain further insights on user behaviour, the quality of the input data, or the effect of the proposed solution on other systems.

In this figure, where each block represents an experiment, we can clearly see the advantages of running many offline experiments early on, and later experimenting online on a small subset of promising solutions.

How does offline experimentation work?

In order to perform offline evaluation, data is collected from the live system that you want to simulate. If the data is not easily available, you can either use publicly available datasets for offline evaluation, or go build a very naive system to collect live data. If you go for the second option, you should consider that solutions are often domain sensitive. If you build a naive system for preference elicitation, take into account that the quality of the data heavily depends on the elicitation approach.

The next step is to split the data into a training and evaluation set. I would argue that this is one of the most important steps in the whole process of building a machine learning product, as it has big implications for the quality of the solution. For example, in our work on “Session-based Complementary Fashion Recommendations”, one of the most interesting things was the problem definition, and building a dataset to adequately model the problem that we wanted to solve. The most important thing is to ensure that you (as a product manager) understand how the dataset is built, how it is split into training and testing sets, and that the split is as close to the real system you are trying to simulate as possible.

Common ways to separate data into training and testing sets; the best approach is problem-specific
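
To make the splitting step more concrete, here is a minimal sketch in Python of two common strategies, assuming a hypothetical interaction log with user_id, item_id and timestamp columns; which split is right depends entirely on the system you are trying to simulate.

```python
import pandas as pd

def random_split(interactions: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Randomly hold out a fraction of interactions.
    Simple, but can leak future behaviour into training."""
    test = interactions.sample(frac=test_fraction, random_state=seed)
    train = interactions.drop(test.index)
    return train, test

def temporal_split(interactions: pd.DataFrame, test_fraction: float = 0.2):
    """Train on the past, evaluate on the most recent interactions,
    which is usually closer to how the live system sees data."""
    ordered = interactions.sort_values("timestamp")
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]
```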

The next steps are to run the simulations and share the results. Running the simulation is usually done by the engineers or scientists in the team, so I purposely leave that part out. However, when sharing the experiment results, it is important that the product manager is involved.

Sharing the Experiment Results

Once the team has designed and run the experiment, there are a multitude of formats in which to present the results. However, take into consideration that experiments should be both immutable and accessible. Thus, they should be shared in a format that everyone can understand — without reading the code — and they should be archived in a format that makes them easy to retrieve and where results cannot be modified. As the experiments should be easy to understand by anyone in the team (engineers, scientists, product managers, designers), they should contain:

  • A short explanation of the setup: What is the hypothesis being tested? What is the experiment design? Which evaluation metrics define success?
  • A summary of the experiment, showcasing the main findings, supported by data obtained through the experiment
  • A conclusion or suggestion: Should this experiment be A/B tested, or should we forget about this idea, bury it deep, and never explore it again?

Good examples are sharing the results in Jupyter notebooks that are easy to find, or in the team’s flavour of Airbnb’s knowledge repository (for more details on this, you can read the blog post or talk I did with my colleague Sergio G.Z.).
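
As one possible shape for such an archive entry, here is a hedged sketch of an immutable experiment record in Python; the fields mirror the list above, and all names and example values are purely illustrative rather than a description of any specific tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the archived record cannot be modified later
class ExperimentSummary:
    hypothesis: str       # what is being tested
    design: str           # how the dataset was built and split
    metrics: tuple        # evaluation metrics that define success
    findings: str         # main findings, supported by the experiment data
    recommendation: str   # e.g. "A/B test", "iterate further", or "drop"

# Purely illustrative placeholder content:
summary = ExperimentSummary(
    hypothesis="Variant X improves offline recall over the baseline ranker",
    design="Temporal split; most recent weeks held out for evaluation",
    metrics=("recall@20", "ndcg@20"),
    findings="Describe the observed effect and the segments driving it",
    recommendation="Decide as a team: A/B test, iterate, or drop",
)
```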

What are the main shortcomings of offline experimentation?

In a recent research paper from my (previous) team we discuss how offline and online metrics are not always aligned. Moreover, they are domain sensitive. It is therefore very important to define the minimum offline effect you would like to see before moving forward with further testing, and this is where domain knowledge and intuition play a big role. If a 10% offline increase in a KPI only translates to a 0.1% increase online, and you obtained a 2% increase for that KPI in an experiment, it is very unlikely that the experiment is worth testing online, because it will not bring customer value. There are cases where this does not hold, especially when the problem is not easily modelled with existing data; I discuss that below.
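
As a back-of-the-envelope illustration of that reasoning, the sketch below scales an offline lift by an assumed offline-to-online translation ratio; in practice you would estimate that ratio from past pairs of offline and online experiments, and it is domain specific.

```python
def expected_online_lift(offline_lift: float, offline_to_online_ratio: float) -> float:
    """Scale an offline metric improvement by a historically observed ratio."""
    return offline_lift * offline_to_online_ratio

# From the example above: a 10% offline lift translated to roughly 0.1% online,
# so the assumed ratio is 0.001 / 0.10 = 0.01. A 2% offline lift then suggests
# only about 0.02% online, probably not enough to justify an A/B test on its own.
ratio = 0.001 / 0.10
print(expected_online_lift(0.02, ratio))  # ~0.0002, i.e. about 0.02%
```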

The second drawback is that if a system is designed to change behaviour, offline evaluation is unlikely to work well. I think this is a big issue when building recommender systems that favour exploration and serendipitous behaviour (rather than exploitation). However, you can add randomization to the online system before using its data for offline experimentation, an approach that worked really well for Pinterest.
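
The sketch below shows what that kind of randomization could look like in the serving path; the epsilon value and the swap strategy are illustrative assumptions, not a description of Pinterest’s actual implementation.

```python
import random

def rank_with_exploration(candidates, score_fn, epsilon=0.05, rng=None):
    """Rank candidates by the model score, but occasionally swap in a random
    candidate so the logged data also covers items the model would not show.
    This makes later offline replay of the logs less biased."""
    rng = rng or random.Random()
    ranked = sorted(candidates, key=score_fn, reverse=True)
    for position in range(len(ranked)):
        if rng.random() < epsilon:
            other = rng.randrange(len(ranked))
            ranked[position], ranked[other] = ranked[other], ranked[position]
    return ranked
```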

Bias and fairness are big issues when designing machine learning systems, and they are getting more and more attention. Offline experimentation can be a really good point in time to have discussions on those topics with your team. Moreover, having comprehensive offline evaluation criteria will help you uncover those biases. For example, is it desirable that a 10% uplift in revenue comes from over-performing for a small group of customers, while slightly under-performing for a large part of the customer base? Simply put, if as a product manager you don’t dive deep into the metrics and results of the offline evaluation, you will not know.
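
One way to make that check concrete is to slice the offline results by customer segment. Here is a minimal sketch, assuming a hypothetical results table with segment, metric_control and metric_treatment columns:

```python
import pandas as pd

def uplift_by_segment(results: pd.DataFrame) -> pd.DataFrame:
    """Average the metric per segment and compute the relative uplift,
    so an overall gain cannot hide losses in large customer groups."""
    per_segment = results.groupby("segment")[["metric_control", "metric_treatment"]].mean()
    per_segment["uplift"] = per_segment["metric_treatment"] / per_segment["metric_control"] - 1.0
    return per_segment.sort_values("uplift")
```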

Another important issue is overfitting: with limited data, an overly complex model might perform really well in the offline experimentation setting, but might not generalize well once the product is launched.

After offline experimentation, what comes next?

Depending on the product you are trying to build, and on whether the experiments validated or invalidated the hypothesis offline, you usually have three options: go back to the drawing board with the team and decide whether the problem can be solved at all (the art of the possible); iterate further and, if the results look promising, test online (we usually never release before A/B testing); or try some of the other experiments from the list the team built together. This decision should be made as a team.

In summary, offline experimentation is a cost-effective tool to accelerate product development in machine learning teams, and product managers should be involved in their team’s offline experimentation, particularly when designing the datasets, sharing experiments, and deciding on the success criteria for those experiments. What is your take on offline experimentation?
