Feature Experimentation

Overview

At DevCycle we believe that experimentation should be part of the natural lifecycle of all features, so no matter which feature type is selected, it can be experimented on. Experiments can be as simple as comparing any target audiences against a metric, or as rigorous as fully randomized A/B tests using statistical methodologies.

This article outlines why and how to run and analyze experiments on your features within DevCycle.

Why Experiment

Experimentation is crucial for testing modifications to your product and investigating which changes produce the best outcomes. Depending on who you ask, it's also known as split testing, A/B testing, or comparative analysis.

Experimentation can be used to test new features, design changes, marketing campaigns, or anything that could potentially impact how a product or service is used. You may want to experiment on any of these things and more:

  • Validate that application performance remains the same or improves.
  • Validate in a controlled way whether code changes increase or decrease error rates.
  • Confirm that a new feature is driving more conversions or revenue.
  • Measure the real impact of features on SLAs and SLOs.

You've likely been doing "experimentation" without knowing it: whenever you release a new feature or service, you compare the before and after (and during). When combined with Features, DevCycle can give you direct metrics on a feature's performance during a release, allowing you to react and make changes accordingly.

Of course, with this in mind, your team isn't restricted to a simple on-or-off approach. Using DevCycle, a team can release and test numerous Variations at the same time, giving an even deeper view with greater flexibility.

Using Experimentation

To run an experiment on any feature, you need just two things:

  1. At least two Variations served to your users
  2. At least one metric defined and attached to your feature

Comparing Multiple Variations

The primary concept of an experiment is the need for at least two different experiences whose performance can be compared. There are several ways in DevCycle to run multiple experiences for users; we go into depth on this in our Targeting documentation.

To get started with your first feature experiment, it is best to keep it simple and run a basic A/B test comparing two variations, one control and one treatment Variation, delivered randomly to all your users.

To set this up, create a targeting rule in Production that delivers to All Users and serves Variations randomly, with the percentages split equally: 50% to your first Variation and 50% to your second Variation.

Random Distribution 50/50
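Once this rule is live, your application code doesn't need to know about the split at all; it simply reads the variable and renders whichever experience was served. Below is a minimal sketch using the DevCycle JavaScript SDK, where the SDK key, the user, and the `checkout-redesign` variable key are placeholder values for illustration:

```typescript
import { initializeDevCycle } from '@devcycle/js-client-sdk'

// Placeholder SDK key and user; substitute your real values.
const devcycleClient = initializeDevCycle('<DEVCYCLE_CLIENT_SDK_KEY>', {
  user_id: 'user-123',
})

async function renderCheckout() {
  await devcycleClient.onClientInitialized()

  // The 50/50 targeting rule decides which Variation this user is
  // bucketed into; the SDK just returns the resulting variable value.
  const showNewCheckout = devcycleClient.variableValue('checkout-redesign', false)

  if (showNewCheckout) {
    // treatment Variation experience
  } else {
    // control Variation experience
  }
}
```

Because bucketing is deterministic per user, a given user keeps seeing the same Variation for the duration of the experiment.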

Adding Metrics to Your Feature

info

Experimentation relies on custom events.

Experimentation is available to all customers on any plan. However, to perform experiments, events must be sent to DevCycle to calculate metrics. These events are added to your existing plan. To learn more, read about our pricing, or contact us.
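In code, this means instrumenting the conversion points you care about with `track` calls. Here is a hedged sketch, reusing the `devcycleClient` from the earlier example; the `purchase_completed` event type is an illustrative name and must match the event your metric is defined against:

```typescript
// Fire a custom event at the moment the behaviour you want to measure
// happens. DevCycle aggregates these events into metrics per Variation.
devcycleClient.track({
  type: 'purchase_completed', // illustrative event type
  value: 49.99, // optional numeric value for value-based metrics
})
```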

Now that you have two segments receiving different experiences, the only other thing you need to run an experiment is a metric to evaluate the comparative performance of those experiences.

To add a metric to your feature, click “Comparative Analysis” under the “Data & Results” section on the sidebar of the feature editing page. Click the “Choose a Metric” dropdown. This will bring up the option to add a metric that has already been created in the project or to create a new one.

Add Metrics from Feature Editing Page

To create new metrics, check out our documentation here.

Once you have metrics in your project, all you need to do is:

  1. Select a metric you want to use to judge the performance of your experiment
  2. Set the Variation that you want to use as your control Variation

Adding a Metric

Now that you have a metric added and a control Variation selected, the performance of the experiment will be tracked over time. The performance of the treatment Variation is compared to the control Variation, with Difference and Chance of Beating Baseline percentages shown and updated in real time as the experiment runs.

Reviewing Metric Performance

Any number of metrics can be added to a feature for analysis; keep clicking “Choose a Metric” to add pre-existing metrics or create new ones as needed.

Determining a Winner

The most important part of an experiment is determining a winner.

The length of time an experiment needs to run to determine a winner varies depending on the overall traffic, the observed conversion rate, and the size of the difference in conversion or values between the Variations. Typically, experiments should run for a minimum of 1–2 weeks to achieve valid statistical significance and to capture a proper cross-section of your user base.

Given the time it takes, your team should generally avoid early analysis and create a process by which an experiment runs with no review of results until a pre-determined amount of time has passed.

Once this time has passed, the charts and graphs for any added metrics can be reviewed to determine which Variation performed best. When metrics are created, you define whether a decrease or an increase is the targeted improvement. Our results graphs take this into account and show clearly whether the metrics have driven positive or negative results. The charts also indicate whether statistical significance has been achieved via the Chance of Beating Baseline value.

Statistical significance has been achieved if this number is either 0% or 100%.

Positive Metric Results

Negative Metric Results

Experimentation using a Custom Property for Randomization

info

For documentation on this functionality outside the context of experimentation, see our dedicated documentation here.

DevCycle typically uses the User ID as the primary key for Feature rollouts and randomization. However, in certain scenarios, the Features you release are intended to be rolled out to a cohort of users rather than to individual users. For example, a new feature in a B2B platform might impact an entire organization rather than a single user within that organization. In such cases, you can randomize and roll out using a Custom Property.

What are Experiments that Randomize Using a Custom Property?

When running an experiment where you randomize using a Custom Property, the experiment is applied to sets of users grouped by a Custom Property value rather than to individual users. This means that every user sharing that Custom Property value will experience the same Feature Variation, such as being part of the control or the test variant. This approach allows you to assess the impact of changes on the group as a whole.

Groups in DevCycle are defined using Custom Properties. These groups could be companies, tenants, geographic locations, or any set of users sharing common characteristics.

Randomization Grouping
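In practice, this means including the grouping property on the user object you send to DevCycle. Below is a rough sketch with the JavaScript SDK, where `organizationId` is an illustrative Custom Property key:

```typescript
// All users identified with the same organizationId value are bucketed
// into the same Variation when that property is used for randomization.
await devcycleClient.identifyUser({
  user_id: 'user-456',
  customData: {
    organizationId: 'org-123', // illustrative grouping property
  },
})
```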

How to Randomize Using a Custom Property in Experiments

To set this up, create a Targeting Rule that serves a Random Distribution of the Variations.

When you select Random Distribution, a Randomize Using field will appear at the bottom of the Targeting Rule under the Schedule section. The dropdown will populate with all existing Custom Properties; select the Custom Property you wish to use for your random distribution. If you are both randomizing distribution and using a gradual rollout of some form, the Custom Property will be used for both forms of randomization, keeping distribution sticky based on that property.

Experimentation with Custom Property Randomization

Risks to Experimentation

There are several risks to be aware of when randomizing your Experiments in this way:

  1. Less Statistical Power: In experiments with randomization using a Custom Property, each group is treated as a single data point, reducing the overall statistical power of the experiment. For example, a platform might have millions of users but only a few thousand companies using it. This typically requires running these types of experiments for a longer period to achieve statistically significant results.

  2. Higher Randomization Risk: There's a greater risk of improper randomization when assigning Custom Properties to control or test variants. With fewer data points, any imbalance can significantly skew the results. For example, if a new pricing model is tested across different companies, an imbalance in the distribution of company sizes could lead to inaccurate conclusions about the model’s effectiveness.

  3. Fewer User-Level Insights: Custom Property-targeted experiments provide insights at an aggregate level, potentially obscuring user-level behaviors and preferences. For example, a new feature might increase overall usage within a company, but it might not reveal which specific roles or user types are most engaged with the feature.

  4. Randomization Collisions: Our random distribution system is based on a murmur hash, and we purposely limit User IDs to fewer than 200 characters to reduce the risk of collisions. If you randomize off of a Custom Property whose values exceed 200 characters, there is a potential for collisions that could impact randomization; see the sketch after this list.
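If your group identifiers can run long, one possible mitigation (a sketch, not an official recommendation) is to hash them down to a short, deterministic string before sending them as the Custom Property value:

```typescript
import { createHash } from 'node:crypto'

// Map an arbitrarily long group identifier to a short, stable key so the
// Custom Property value stays well under 200 characters. The same input
// always produces the same output, so bucketing remains sticky.
function toShortGroupKey(longId: string): string {
  return createHash('sha256').update(longId).digest('hex') // 64 hex chars
}
```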

Regardless of the type of risk, if you are worried about the statistical validity of your experiment, make sure there is both a significant number of groups and well-balanced stratification across the groups you're testing against. These two factors protect you against the most substantial risks.