Designing Online Controlled Experiments — Important Concepts and Considerations

Wilson Wong
Published in Product Coalition
13 min read · Dec 10, 2020


From time to time, we come up with ideas and speculate about how they might play out in reality. Take, for instance, a food scientist who believes that a new flavour enhancer could improve consumer demand, or a pharmaceutical company that has developed a medication which it believes could relieve high blood pressure. At other times, we may have two or more variations of an idea and need to choose between them. Consider a chemist who needs to find out which of his two new fertiliser formulations would deliver better growth for nightshade plants compared to the prevailing one.

Controlled experiments, which have their roots in science, offer a good way to collect data for analysis to learn more about an idea and its impact on customers and commercials. In a controlled experiment, we assemble two or more groups of items or subjects and expose them to almost identical conditions except for the factor whose effect we are trying to understand. Typically, we maintain a group called the control group in which the factor of interest is either kept at its status quo or prevailing level or absent altogether. For the remaining groups, the factor of interest is manipulated and we observe how the items or subjects in those groups respond. The control group serves as the standard or benchmark against which comparisons are made.

Photo by CHUTTERSNAP on Unsplash

In our chemist example, the factor of interest is the fertiliser formulation and we are keen to understand its effect on plant growth rate. We can create three or more groups of nightshade plants and ensure that the plants and the conditions they are placed in are practically identical in every way (e.g., temperature, light, soil) except for the fertiliser. The plants in the first and the second groups can be treated with the chemist’s two new fertiliser formulations, respectively. As for the third group, its plants receive the prevailing formulation. Depending on the exact goal of the experiment, a fourth group can also be created whose plants receive no fertiliser at all. The chemist can then observe and record the rate at which the plants in the different groups grow. If the experiment was planned and executed properly, we can rely on the results and the comparison with considerable confidence to establish a cause-and-effect relationship, e.g., if the plants in the second group grew the fastest, the difference can be attributed to the second formulation.

The use of controlled experiments to understand and test ideas also extends to the online world, in the form of A/B testing. For instance, a data scientist who believes that a new data source could improve click-through for their product recommendation engine can use A/B testing to compare the results of that idea against the status quo. Similarly, a product team that has, through user testing, arrived at blue and red as colours potentially better for the existing sign-up button can conduct an online controlled experiment to gather data and arbitrate between the two candidates and the current colour.

Much has been written on the topic of highly successful tech companies such as Google, Amazon and others using A/B testing to iterate on ideas quickly. In addition to the tens and hundreds of millions of dollars in revenue attributed to their practice of online controlled experiments, this data-driven approach to assessing the merits of ideas has also been lauded by some as a potential answer to taming the HiPPO (whereby people of authority unduly exert their influence in decision-making processes). If it sounds too good to be true, there is often a catch (or two). Broadly speaking, the main pitfalls revolve around (1) poor experimental setup, and (2) the tendency to want to test anything and everything just because terms like A/B testing or hypothesis sound sophisticated and businesses want to appear “scientific” or metric-driven.

Photo by Collin Armstrong on Unsplash

In this article, we unpack the first point, which is the importance of proper experimental design in the context of product testing. This article assumes a basic understanding of A/B testing. The aim is to advance that understanding by diving into three important concepts and the recommended practices when designing online controlled experiments with the help of an example. It is important to note that planning for and executing an experiment is just the first half of the exercise. During the experiment, data will be collected on how the items or subjects across the various groups interact with or react to the factors that are being manipulated (or not). The second half involves analysing those data and interpreting the results to answer the questions of interest or establish causality. This second half will be covered in a separate article.

The importance of well-thought-out experiments stems from my firm belief that no experiment is better than a poorly conducted one. The reason is simple. The results that you get from a flawed experiment may lead you to decide wrongly, which in itself may not be a big issue as you can justify it as a learning experience. However, it becomes less defensible when you consider the opportunity cost and the time and effort spent on the experiment, only to arrive at the same bad decisions that you could have reached by relying on gut feel or even selecting from amongst the ideas at random. This is simply a bad return on effort. If you are a product manager or someone with a vested interest in an idea or product who relies on A/B testing, or a data scientist responsible for the experiments, you would not want too many incidents of poorly conducted experiments.

Important experimental design concepts

The upfront thinking and proper planning of experiments ensure that the right type of data, in sufficient quantity, is collected for analysis to answer questions of interest or establish causality. This process is called experimental design. It is achieved by keeping the focus on the thing that is being tested (e.g., fertiliser formulation) and eliminating whatever influence other variables might have (e.g., temperature, light). Designing an experiment requires an understanding of the following three core concepts.

Variables are the things that we manipulate, expect to be affected, or hold constant. There are three types of variables to understand and think about for every experiment.

  • Independent variables (also known as input or predictor variables) are the ones that we want to manipulate, e.g., the colour of a button, the order in which the search results are displayed, the composition of the fertiliser. We also refer to this as the “treatment”.
  • Dependent variables (sometimes referred to as output or response variables) are those that we want to see changed, e.g., users’ likelihood of clicking on the button or the search results, the rate at which the plant grows.
  • Control variables (or constant variables) are the things that we are aware of but are not investigating or interested in, e.g., characteristics of the users who access the button or the search results, the temperature that the plant is growing in. These variables, if not managed or kept constant during the experiment, have the potential to influence the dependent variables and can make us think that the effects we see are due to our independent variables when in reality they are not.
Photo by Brett Jordan on Unsplash

Sampling strategies refer to the different ways of assigning users to groups for testing. There are three common ones to discuss.

  • Randomisation is the process of splitting users into groups at random. It offers the experimenter some level of control over variables that either cannot be or are difficult to hold constant, in order to reduce any potential confounding effects. This is achieved by “spreading out” whatever effects the confounding variables might have across the test groups (a small sketch of this and the next strategy follows the list).
  • Stratification is the splitting of the entire population that we want to study into relatively homogeneous sub-populations or strata based on certain known traits to ensure that the variability within each stratum is less than the variability of the entire collection. Stratification is sometimes called blocking. Some common examples include splitting users by their industries, activity levels or the devices they use, e.g., stratum A is all users in the Healthcare industry, stratum B for Finance users, etc. We can then perform randomisation within a stratum to create samples for the test groups or across multiple strata to create test groups with representation from the corresponding sub-populations.
  • Convenience sampling (also known as accidental or opportunity sampling) is the selection of users for testing from a part of the population that is close to hand. The test groups are constructed from users simply because they are readily available or the process is the most convenient, rather than for any other reason. Whatever conclusions are drawn from an experiment using such a strategy, they cannot be scientifically generalised to the total population. This type of sampling is most useful for pilot testing.
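
To make the first two strategies concrete, here is a minimal sketch in Python. The user records, the industry attribute, the experiment salt and the group names are all hypothetical; deterministically hashing a user ID is just one common way to make random assignment reproducible, not the only way.

```python
import hashlib
from collections import defaultdict

def assign_group(user_id: str, salt: str = "exp-001", n_groups: int = 2) -> str:
    """Deterministically randomise a user into one of n_groups buckets."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % n_groups
    return "control" if bucket == 0 else f"treatment_{bucket}"

def stratified_assignment(users, strata_key="industry"):
    """Randomise within each stratum so every test group has representation from it."""
    groups = defaultdict(list)
    for user in users:
        stratum = user[strata_key]  # e.g., "Healthcare", "Finance"
        group = assign_group(user["id"], salt=f"exp-001:{stratum}")
        groups[(stratum, group)].append(user["id"])
    return dict(groups)

# Hypothetical usage
users = [
    {"id": "u1", "industry": "Healthcare"},
    {"id": "u2", "industry": "Finance"},
    {"id": "u3", "industry": "Healthcare"},
]
print(stratified_assignment(users))
```

Salting the hash per experiment (and here, per stratum) keeps assignments stable for a given user while remaining independent across experiments.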

Metrics are the means through which we capture and quantify the variables that we are affecting or controlling. It is good practice to maintain two types of metrics for each experiment.

  • Evaluation metrics (aka success metrics) are the metrics that we use to detect the changes in the dependent variables that we want to see. It is important that the evaluation metrics specifically target the dependent variables, e.g., if we think that changing the colour of the button will affect the likelihood of clicks, then the metrics need to be able to directly capture that.
  • Guardrail metrics (aka invariant or sanity metrics) are the metrics that we monitor for unintended side effects or violation of any assumptions that we may have during the experiment. Such metrics are commonly used to monitor critical business indicators that we do not want to be adversely affected by the treatment or to keep an eye on the control variables that we are keeping constant.

A metric can be continuous or be calculated as a proportion or ratio. At the per-subject level, metrics take one of the following forms (a small sketch after the list shows how each can be computed):

  • Continuous metrics contain one numeric value column, e.g., clicks per user.
  • Proportion metrics contain one binary indicator value column, e.g., proportion of searches with a click.
  • Ratio metrics, as the name implies, have both numerator and denominator values, e.g., impression-to-click ratio.
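
As an illustration, here is a minimal sketch, assuming a hypothetical per-search event log with assumed column names, of how the three forms might be computed with pandas.

```python
import pandas as pd

# Hypothetical log: one row per search performed by a user
events = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u2", "u3"],
    "impressions": [10, 8, 12, 5, 7],
    "clicks":      [2, 0, 3, 1, 0],
})

per_user = events.groupby("user_id")[["impressions", "clicks"]].sum()

# Continuous metric: clicks per user (one numeric value per subject)
clicks_per_user = per_user["clicks"]

# Proportion metric: share of searches with at least one click (binary indicator per search)
searches_with_click = (events["clicks"] > 0).mean()

# Ratio metric: clicks over impressions, aggregated per user
click_through_rate = per_user["clicks"] / per_user["impressions"]

print(clicks_per_user, searches_with_click, click_through_rate, sep="\n")
```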

Good experimental design practices

In the previous section, we covered in detail the three important concepts that form the basis of an experiment, which a good design revolves around. A properly designed and conducted experiment allows the experimenter to reliably make causal inferences about the relationship between the independent and dependent variables. In this section, we will use an example experiment to illustrate the kind of thinking around those three concepts.

The example involves a commonly used technique in search engine improvement called query expansion. The idea is that we augment queries as they come through from users, typically by adding more words to the original queries, in order to return more relevant results. For instance, imagine that, in an online marketplace, a buyer searches for advertisements using the query bike. Some sellers, on the other hand, write and post their ads using only the word bicycle to describe the same item. In more traditional search engines that are not optimised, the user query will not match those ads that contain the word bicycle, which in turn means they will not be returned in the results. This makes the situation worse for search terms that already produce a very low number of results or none at all.
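
As a purely illustrative sketch of the technique (the synonym table and function name are hypothetical; real systems usually learn or curate their expansion rules), synonym-based query expansion with OR semantics might look like this:

```python
# Hypothetical synonym table; production systems learn or curate these
SYNONYMS = {
    "bike": ["bicycle", "pushbike"],
    "scooter": ["kick scooter"],
}

def expand_query(query: str) -> str:
    """Expand each query term with its known synonyms using OR semantics."""
    expanded = []
    for term in query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        expanded.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(expanded)

print(expand_query("bike"))       # (bike OR bicycle OR pushbike)
print(expand_query("road bike"))  # (road) AND (bike OR bicycle OR pushbike)
```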

Photo by Mikkel Bech on Unsplash

The goal of this example experiment is to demonstrate that query expansion, which is the treatment, can reduce the cases of users abandoning search sessions due to inadequate or no relevant results. By extension, the number of ads being clicked on as well as those that the users eventually transact with should increase. The details of query expansion are covered in my previous article on Search Quality In Practice.

Now that we have some context about the idea to be tested, let us move on to planning for the experiment. The first step involves thinking about the variables that we want to manipulate and those that would be affected. In our case, the independent variable is the number of relevant ads that the users get when their queries are expanded (or not). There are several dependent variables that matter in our case. They are about how our subjects, who are the prospective buyers searching for ads, react to the change. The main ones are:

  • Whether the users engage with the search results or not.
  • The intensity of the engagement in terms of clicks on ads and transactions.

From experience, the task of identifying the independent and dependent variables tends to be straightforward. Often, the challenging task is figuring out the other variables that need to be controlled to ensure the validity of the conclusions drawn from the experiment results. For this example experiment, the first two things that come to mind which need to be considered are:

  • Consumer psychology — If the user’s intent is to purchase a low-cost item like a smartphone screen protector, then their engagement level with the search results can be inherently lower. The opposite tends to be true with the users investing more time into engaging with options (ads in the search results) for big ticket items such as a $2,000 electric scooter.
  • Inventory supply and demand — If there are pockets of demand where supply is scarce, these search queries tend to yield smaller sets of ads anyway. This in turn means that engagement with the search results is going to be low. If you only have one to three ads, you can only click on them so many times, as opposed to, say, three to five pages of ads which will inevitably attract more clicks.

Based on the above, the variables that we need to be aware of and potentially control include the users’ search queries, the number of items or ads available (prior to the experiment) and the nature of the items. This is when having access to the users’ historical purchasing patterns or their profile would come in handy.

In the second step, we need to think about blocking what we can and randomising what we cannot when creating test groups from the population. This is typically done based on the analysis of the variables discussed above as candidates to be kept constant, in conjunction with our understanding of the mechanics of the treatment. In this example, let us also assume that the query expansion treatment deals mostly with queries about outdoor and sports equipment, e.g., bike, bicycle, trike, scooter, skate. With those things in mind, we will need to control the categories of the ads or items that the users are likely to search for. We do this by implementing blocking on the ad category variable. We partition the users into two strata or blocks: users whose searching and purchasing habits fall into the outdoor and sports equipment category, and users in all the other categories.

From here on, the decision on what to do with the control variable depends on the conclusions that we want to draw. If we are also interested in finding out how the subjects in other categories would respond, we will implement random sampling from both blocks into the test groups so that users who are more likely to find and purchase outdoor and sports equipment versus other categories have equal representation in those groups. If we do not need to generalise the treatment effect to the whole population and are happy to keep the conclusion’s validity restricted to just the sub-population of users with a keen interest in outdoor and sports equipment, we can create the test groups by randomly sampling from just that block and ignore the others. The same randomised block design can be applied to the other variables that we want to control in order to reduce the impact of any confounding factors or variances in the data. A small sketch of this design follows.
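
Here is a minimal sketch of that randomised block design, assuming a hypothetical user record with a primary category of interest; users are blocked on that category and then randomised within each block into control and treatment.

```python
import hashlib

def block_of(user: dict) -> str:
    """Block on the category the user mostly searches and buys in (hypothetical field)."""
    return "outdoor_sports" if user["primary_category"] == "outdoor_sports" else "other"

def assign(user: dict, salt: str = "query-expansion-exp") -> tuple:
    """Randomise within a block; returns (block, group)."""
    block = block_of(user)
    digest = hashlib.md5(f"{salt}:{block}:{user['id']}".encode()).hexdigest()
    group = "treatment" if int(digest, 16) % 2 else "control"
    return block, group

users = [
    {"id": "u1", "primary_category": "outdoor_sports"},
    {"id": "u2", "primary_category": "home_garden"},
]
for u in users:
    print(u["id"], assign(u))
```

If the conclusions only need to hold for the outdoor and sports sub-population, the same assignment can simply be restricted to users in that block.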

Lastly, the metrics that we choose to detect the effects of the treatment on the dependent variables have to be sensitive and robust. Even before that, best practice suggests that if we can keep an eye on the independent variable or the treatment at work and monitor it quantitatively, we should. In our case, we want to look out for indicators that the treatment is working as we expect, namely that the search result sets are growing. The following sanity metrics are valid ones (a sketch of how to compute them follows the list):

  • # of results per search, either in general or for queries that we are expanding. This should rise.
  • % of searches with 0-results. This should fall.
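
A minimal sketch of how these two guardrail metrics might be computed from a hypothetical per-search log (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical per-search log collected during the experiment
searches = pd.DataFrame({
    "group":       ["control", "control", "treatment", "treatment"],
    "num_results": [0, 14, 9, 22],
})

guardrails = searches.groupby("group").agg(
    results_per_search=("num_results", "mean"),                    # should rise under treatment
    pct_zero_results=("num_results", lambda s: (s == 0).mean()),   # should fall under treatment
)
print(guardrails)
```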

The two metrics above should be sensitive enough, in that we should be able to detect those changes very quickly without needing many observations. The success metrics, on the other hand, should be able to capture and reflect any increase in engagement from the additional relevant results, in line with our dependent variables. They include (see the sketch after the list):

  • # of clicks per search session and per user, which should indicate an increase if the query expansion works as we expect it to.
  • # of transactions per user, which is an important metric as we need to determine if more transactions have occurred as a result of buyers being able to see more relevant ads or options in their search results.
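
And a matching sketch for the two success metrics, assuming a hypothetical per-user rollup joined with each user's assigned group:

```python
import pandas as pd

# Hypothetical per-user rollup over the experiment window
per_user = pd.DataFrame({
    "group":        ["control", "control", "treatment", "treatment"],
    "clicks":       [3, 1, 5, 4],
    "transactions": [0, 1, 1, 2],
})

success = per_user.groupby("group").agg(
    clicks_per_user=("clicks", "mean"),
    transactions_per_user=("transactions", "mean"),
)
print(success)
```

Whether any difference between the groups is large enough to act on is a question for the analysis half of the exercise, which is covered separately.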

Robustness is another characteristic that we need to think about. The metrics should ideally be robust against changes that we are not interested in. In other words, a metric should not move much when variables other than the independent one are manipulated. Take our example metric, # of transactions per user: we have to ask ourselves what other factors in our search engine or experience might cause a substantial increase or drop in it. Perhaps the quality of the photos of the item in the advertisement? If we suspect or know that buyers are less likely to transact when the images are grainy, we need to rethink whether that metric is a suitable one. Otherwise, we can return to the step of identifying variables that we might want to control and apply blocking and sampling.


I'm a seasoned data x product leader trained in artificial intelligence. I code, write and travel for fun. https://wilsonwong.ai