The Ultimate Guide to Experimentation for Product Teams

Negar Mokhtarnia 🚀
Published in Product Coalition
8 min read · Sep 14, 2020



Part 1 of a series on experimentation

Why run Experiments?

What is a Minimum Viable Test?

What are the different types of experiments?

How to define success in experiments?

10 tips to avoid common pitfalls

What is experimentation in the product development context?

Experimentation is simply applying scientific principles to establish causation between changes to a product and their outcomes. The topic has become more relevant in the past decade as products have grown more complex and product managers have had to continually deliver incremental value. Running experiments adds a structured approach to discovering unbiased learnings and uncovering the real causes of changes in metrics, even when those changes are too small to be measured independently.

Why do you need to run Experiments?

○ Understand the real impact of change- In an organization where many teams take on various initiatives to affect a KPI, it is very hard to attribute the exact impact of each change. Experimentation reduces that noise and makes the drivers of change readily identifiable.

○ Continuously optimize the design- Designers often have many ideas for a single feature. Experimentation allows them to optimize the UI based on how large numbers of customers interact with the feature.

○ Reduce risk of complicated releases- By releasing to a small portion of customers, the product team has the opportunity to understand the full impact on all KPIs, any outage risks, and the best path to scale the release. These learnings can then be applied to a full release or become part of the release playbook.

○ Predict change to a complicated system with hard-to-measure inter-dependencies- For example, in a two-sided market, any change to one side of the market will have knock-on effects that are hard to predict. Experimenting with a small cohort allows the team to see these effects in real time and reduce unwanted side effects.

○ Get a quick signal with minimal investment and disruption- It is often possible to test a hypothesis with a simplified implementation (a Minimum Viable Test) at reduced risk, extract learnings and plan next steps. This allows your team to spend their time and talent only on ideas that are validated and prioritized based on measured impact.

What is a Minimum Viable Test?

On average, less than a quarter of experiments are successful. To reduce the investment and the time to results, we start with Minimum Viable Tests (MVTs). An MVT is the most efficient way to gather valid data about a hypothesis. Often, the data and learnings from MVTs inform the design of more complicated experiments, thereby increasing their success rate.

There are a few key guidelines we follow when we design MVTs:

1. Break down the hypothesis into the smallest possible assumptions

2. Plan to collect as much data as possible from the variants to improve the odds of finding a key learning

3. Simplify the flows and UI as much as possible

4. Plan for the simplest development implementation

5. Analyze the results for insights, not to declare ideas winners

6. Have a clear exit plan so you can stop the experiment quickly if things don't go as planned

What are the different types of experiments?

○ A/B testing- Allows you to randomly assign users to different product experiences and see which variant users interact with the most. This methodology works best for conversion rate optimization, UX concepts, pricing strategy and new feature development. The deciding metrics are then compared between the original and the variants to select the winner (see the assignment sketch after this list).

○ Multivariate testing- Similar to A/B testing, but with multiple variables changing at the same time. This is most useful when the alternatives have several different components and we want to measure the effectiveness of various combinations. Once we have enough data points, we can tell which elements, and which combinations of elements, perform best (see the combinations sketch after this list).

○ Funnel testing- Also similar to A/B testing, but with changes spanning the multiple website pages a customer moves through. This is most useful when components need to stay consistent across pages (e.g. new navigation across multiple pages) or when we are changing how customers flow through the pages (new paths, shortcuts, etc.).

○ Split testing- In this methodology the assets themselves are divided into similar groups and changes are applied to some of those groups. It is mostly used for SEO optimization, where changing all assets at once can be costly and search engines may penalize a site for serving multiple versions (A/B variants) of the same page.

○ Time lapse testing- Compares a benchmark of KPIs before and after the change has been released. Since the before and after samples are not comparable, the results are not as accurate as A/B testing. This method is not recommended: only large directional changes can be observed, and causality cannot be established.
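
To make the A/B mechanics concrete, here is a minimal Python sketch; the event log and function names are hypothetical, and a real test would pull events from your analytics pipeline:

```python
import random

def assign_variant(variants=("control", "treatment")):
    # Uniform random assignment. In practice assignment should be sticky
    # per user so nobody sees both variants (see tip 9 later in this article).
    return random.choice(variants)

def conversion_rates(events):
    """Compare the deciding metric (here, conversion) across variants."""
    totals, conversions = {}, {}
    for variant, converted in events:
        totals[variant] = totals.get(variant, 0) + 1
        conversions[variant] = conversions.get(variant, 0) + int(converted)
    return {v: conversions[v] / totals[v] for v in totals}

# Hypothetical event log of (variant, converted) pairs gathered during the test
events = [("control", False), ("control", True), ("treatment", True)]
print(conversion_rates(events))  # e.g. {'control': 0.5, 'treatment': 1.0}
```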
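
Multivariate variants are the cross-product of the elements under test. A small sketch with hypothetical page components shows why the number of variants, and therefore the required sample size, grows quickly:

```python
from itertools import product

# Hypothetical components under test and their alternatives
elements = {
    "headline": ["Save more", "Shop smarter"],
    "cta_color": ["green", "blue"],
    "layout": ["grid", "list"],
}

# Full-factorial multivariate test: every combination is one variant
variants = [dict(zip(elements, combo)) for combo in product(*elements.values())]
print(len(variants))  # 2 x 2 x 2 = 8 variants
```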

How to define and measure success in experiments?

It is best practice to measure the success of experiments by calculating the statistical significance of the treatment's effect on the main metrics. The higher the statistical significance, the more certain we can be that the control and variant are truly different and that the change in the metrics is not simply due to chance. Most product teams use a 95% confidence level (a significance threshold of p < 0.05): if there were truly no difference, a result at least this extreme would show up by chance less than 5% of the time.
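
For a conversion metric, one common way to compute this is a two-proportion z-test. A minimal standard-library sketch, with made-up numbers:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF tail
    return z, p_value

# Example: 4.0% vs 4.6% conversion with 10,000 users per variant
z, p = two_proportion_z_test(400, 10_000, 460, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p < 0.05: significant at the 95% level
```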

One of the main costs of experimentation is the time it takes to reach a significant result. Depending on how impactful the change is and how much traffic the product gets, the time to results can vary from days to months. This delay not only slows down the change's time to market but also reduces the total number of experiments that can be run, due to overlaps. It is recommended to calculate ahead of time how long it will take to gather a large enough sample to reach statistical significance, using the expected treatment effect (the change to the variants' metrics), the confidence level (usually 95%) and the number of users that can be treated (traffic or usage).
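
The standard two-proportion sample-size formula makes this calculation concrete. Below is a sketch that assumes a 95% confidence level, a conventional 80% power target (a choice this article does not prescribe) and hypothetical traffic numbers:

```python
def sample_size_per_variant(p_base, relative_uplift, z_alpha=1.96, z_beta=0.84):
    """Users needed per variant; 1.96 -> 95% confidence, 0.84 -> 80% power."""
    p_var = p_base * (1 + relative_uplift)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return (z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2

# Example: 4% baseline conversion, detecting a 10% relative uplift,
# with 5,000 eligible users per day split evenly across two variants
n = sample_size_per_variant(0.04, 0.10)
print(f"{n:,.0f} users per variant")             # roughly 39,000
print(f"~{2 * n / 5_000:.0f} days to complete")  # roughly 16 days
```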

Sometimes teams are tempted to conclude a test early, either to scale the gains to the full base as soon as possible or to keep an MVT design from persisting too long. It is, however, best practice to run the experiment not just until the minimum number of treated users is reached, but also long enough to observe a few product cycles. These cycles depend entirely on the nature of the product: an e-commerce product might see weekly ebbs and flows in customer type and traffic, whereas a hospitality product might encounter seasonal patterns.

To better understand the context of the impact and optimize further, it is recommended to analyze the response of various cohorts and customer segments. This not only allows the team to target the eventual change at the users most likely to gain value from it, but also deepens the team's understanding of different segments and the nuances in their behaviour. For example, as the composition of new customers changes with acquisition targeting, the effectiveness of new features may differ across cohorts; likewise, customers who primarily use a mobile device may prefer a different UI than those on bigger screens.
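
As a sketch of that kind of breakdown, assuming a hypothetical per-user results table in pandas:

```python
import pandas as pd

# Hypothetical per-user results: variant, segment and outcome
df = pd.DataFrame({
    "variant":   ["control", "treatment", "treatment", "control"],
    "segment":   ["mobile",  "mobile",    "desktop",   "desktop"],
    "converted": [0,         1,           0,           1],
})

# Conversion rate and sample size per (segment, variant) cell; a treatment
# that looks flat overall may still win clearly in one segment.
summary = df.groupby(["segment", "variant"])["converted"].agg(rate="mean", n="size")
print(summary)
```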

10 tips to avoid common experimentation pitfalls:

1. Set your hypothesis upfront with specific metrics- This has a few advantages. First, it makes it very clear to everyone involved which metrics you are trying to influence and aligns the team to that objective. Second, since it's a hypothesis, you have already inherently accepted that there is a chance of failure, so you avoid the reputation risk and the pressure many teams feel to deliver successful features. Third, you become less likely to get derailed by changes in metrics that are not primary to the experiment.

2. Make sure there is enough tracking upfront- There is nothing worse than waiting a few weeks only to realize the metrics you were reporting were not accurately tracked. Avoiding this often requires your developers to understand the context of the experiment and which metrics will be affected and how.

3. Estimate how long the experiment will run before you start- To properly engage stakeholders and ensure other business changes do not contaminate your results, you will need an approximate timeline. To estimate it, project the approximate change you expect and how quickly you can reach the required sample size (the rest is just statistics), as in the sample-size sketch earlier in this article.

4. Define which counter metrics you will watch- Before you start your experiment, define the other business metrics that may be indirectly impacted by it. This is crucial if you are in a complex business and are experimenting with high-risk ideas. Counter metrics assure you and your stakeholders that your experiments are not having negative side effects.

5. Experiment in phases, gradually increasing risk and reward- New and novel ideas are by definition high risk, high reward. To reduce exposure, start with smaller components of your hypothesis or smaller changes to the experience. If the building blocks are successful, you can gradually add more functionality and complexity, iterating where required.

6. Design for high velocity testing- To increase the impact of your tests, design for a high velocity of experiments. If an experiment succeeds, double down and optimize further; if it fails, try something new.

7. Triangulate findings with qualitative user data- Sometimes it is hard to fully understand why certain experiments succeed or fail. This is where a deep understanding of users and qualitative research becomes the missing piece of the puzzle. By layering experiment results with user research and empathy, you can surface insights that experimentation alone will not unlock.

8. Customize your experiments for different personas- Often, a change applied across your full user base is not statistically significant as a whole: only some of your users respond positively, and if they are a small proportion of your base, you will not see a significant change. Instead, analyze your experiments for each segment separately and consider which segments warrant their own experimentation idea backlog.

9. Keep customers in the same treatment- To avoid incoherence in the customer journey and contamination of your data, attach an experiment identifier to each of your users that ensures they do not see multiple variants of the same experiment. Most commercially available experimentation platforms have this functionality out of the box (a minimal sketch follows this list).

10. Take external changes into account- Your competitors' actions or overall market conditions can often change how your users react to a certain experience. A common example is seasonal trends: users behave very differently when signing up for self-improvement programs in January versus the rest of the year. If you have a continuous experimentation pipeline, keep in mind which experiments may be vulnerable to these changes and consider retesting under different conditions.
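
On tip 9: if you are building this yourself rather than relying on a platform, one common approach is deterministic hashing of user and experiment identifiers; the identifiers below are hypothetical:

```python
import hashlib

def sticky_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")):
    # Hashing the (experiment, user) pair gives every user a stable variant
    # for this experiment, while keeping assignments independent
    # across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same treatment
assert sticky_variant("user-42", "new-checkout") == sticky_variant("user-42", "new-checkout")
```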

This is part 1 of a series on product experimentation and growth. In the next article I will cover "How to design high impact experiments?"
