A/B Testing Guidelines: Discover
Scope
Treatments
In a competitive bake-off, the A/B test should involve no more than two personalization technologies, including Discover. While some sites may have the traffic to support additional treatments, the more solutions packed into the test, the harder it is to measure impact reliably, especially within an experiment targeting a single page type.
Success Metrics
If the intent is to assess revenue impact, your primary success metric will be the “attributable order” Revenue Per Visitor (RPV), for which Discover typically delivers a 1-2% lift. If transaction volume is more important for your business (perhaps your customers exhibit high lifetime value, so you are willing to sacrifice immediate session revenue to get a sale of any value), then attributable-order Conversion Rate (CVR) should be your success metric.
The attributable-order versions of RPV, CVR, and AOV use nuanced definitions of “orders” and “revenue” to home in on Discover’s ability to produce more revenue from the Category page results. Specifically, an “order” refers to a transaction that contains at least one Discover-attributable item, and “revenue” refers to the total sales from these orders, attributable and non-attributable items included.
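To make these definitions concrete, here is a minimal sketch (in Python/pandas) of how the attributable-order metrics could be computed from visit-level data. The table and column names are illustrative assumptions, not a prescribed schema:

```python
import pandas as pd

# Hypothetical visit-level data; column names are assumptions, not a real schema.
# attributable_order: the visit's order contained >= 1 Discover-attributable item
# order_revenue: total revenue of that order (attributable + non-attributable items)
sessions = pd.DataFrame({
    "visit_id":           [1, 2, 3, 4, 5],
    "attributable_order": [True, False, True, False, False],
    "order_revenue":      [120.0, 0.0, 45.0, 0.0, 0.0],
})

visits = len(sessions)
attr = sessions[sessions["attributable_order"]]

cvr = len(attr) / visits                    # attributable-order Conversion Rate
rpv = attr["order_revenue"].sum() / visits  # attributable-order Revenue Per Visitor
aov = attr["order_revenue"].mean()          # attributable-order Average Order Value

print(f"CVR={cvr:.2%}  RPV={rpv:.2f}  AOV={aov:.2f}")
```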
Platforms
You will want to test Discover on your highest-trafficked digital channel(s), typically Desktop alone or Desktop, Mobile, and Tablet in aggregate. Keep in mind that the ability to achieve a statistically significant result is a function of session and sales volumes. If a particular channel does not get much traffic, it is unlikely you will be able to isolate Discover’s performance for that channel without running an extended A/B test.
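As a rough illustration of how volume drives test feasibility, the sketch below uses a standard two-proportion power calculation to estimate the sessions needed per test group. The baseline CVR, lift, and alpha/power settings are hypothetical placeholders, not Discover benchmarks:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: baseline attributable-order CVR of 3.0% and a
# 5% relative lift (to 3.15%); alpha and power are conventional choices.
baseline, lifted = 0.030, 0.030 * 1.05
effect = proportion_effectsize(lifted, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_arm:,.0f} sessions needed per test group")
```

A channel that produces only a few thousand sessions per day would take months to accumulate a sample of this size, which is why low-traffic channels are hard to evaluate in isolation.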
Design
Traffic Split
To expedite the evaluation, allocate as much traffic as possible to each test group. This usually means executing a 50/50 split on 100% of site traffic (for a two-way A/B). Of course, the split must be random so as not to introduce bias. There are a number of ways to verify this; we suggest running a brief A/A test and comparing the population statistics of the resulting groups (e.g., country, channel, browser, and referrer breakdowns) to ensure there are no substantive differences in representation.
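One way to perform this check is a chi-square test on the A/A breakdowns, sketched below with hypothetical visit counts. A very small p-value on any dimension would suggest the split is not random:

```python
from scipy.stats import chi2_contingency

# Hypothetical A/A channel breakdowns (visit counts); any population
# dimension (country, browser, referrer) can be checked the same way.
#            desktop  mobile  tablet
group_a = [52_100, 31_400, 6_300]
group_b = [51_800, 31_900, 6_100]

chi2, p_value, dof, _ = chi2_contingency([group_a, group_b])
print(f"p={p_value:.3f}")  # a very small p suggests a biased split
```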
User Experience
It is important to control the experiment and not introduce unintended variables. When evaluating Discover against another category sort experience, ensure parity in all aspects that drive or define the user experience: product presentation, the catalog data used for sort decisioning, and even page performance (i.e., load times).
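As one example of a parity check, the sketch below compares hypothetical page load-time samples from the two treatments with a Mann-Whitney U test. A significant difference would indicate a performance imbalance worth fixing before the test proceeds:

```python
from scipy.stats import mannwhitneyu

# Hypothetical page load times (ms) sampled from each treatment.
control_ms  = [820, 910, 775, 1_040, 860, 930, 805]
discover_ms = [835, 905, 790, 1_010, 870, 945, 815]

stat, p_value = mannwhitneyu(control_ms, discover_ms, alternative="two-sided")
print(f"p={p_value:.3f}")  # a very small p flags a load-time imbalance
```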
Measurement
Duration
Discover A/B tests should run for at least two weeks, with the precise duration depending on site traffic and sales volumes. Algonomy can help you estimate how long the experiment needs to run to reach a credible, statistically significant result.
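For a back-of-the-envelope estimate, you can divide the required sessions per group (e.g., from a power calculation like the one above) by your daily per-group traffic; the numbers below are hypothetical:

```python
import math

# Hypothetical inputs: required sessions per arm and daily traffic per arm.
n_per_arm = 250_000
daily_sessions_per_arm = 14_000

days = math.ceil(n_per_arm / daily_sessions_per_arm)
weeks = max(2, math.ceil(days / 7))  # run full weeks to cover weekly seasonality
print(f"~{days} days -> run for {weeks} weeks")
```

Rounding up to full weeks, with a two-week floor, keeps day-of-week purchasing patterns balanced across the test window.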
Data Exclusions
It is absolutely critical that the A/B reporting focus only on traffic that is relevant to the test. Not doing so can produce results that do not accurately represent what is being tested. The following sessions/visits must be eliminated from the reporting for accurate measurement (a filtering sketch follows this list):
- Ineligible users: Visits not exposed to the area of the site on which the tested variable exists. For example, a visit that only viewed the Home page should be excluded from the results.
- Ineligible orders: Orders not containing at least one item found in the Category page results (i.e., an item clicked on in the results and ultimately purchased). Remember, since we are trying to prove that Discover sells more merchandise from the Category page, it is important to focus on orders that meet this criterion.
- Treatment flippers: Visits that were exposed to more than one treatment. When this happens, it is not possible to attribute the resulting conversion or non-conversion to any particular side of the test.
- Outliers: Converting visits that resulted in an order whose value is more than 3 standard deviations from the mean of log-transformed order values (i.e., extreme orders). Including these large orders can skew the results, as their size is unlikely to be caused by the tested variable.
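The sketch below shows how these four exclusions might be applied to a visit-level data frame. The column names, and the choice to trim only large (upper-tail) orders, are illustrative assumptions, not Algonomy's reporting implementation:

```python
import numpy as np
import pandas as pd

def apply_exclusions(visits: pd.DataFrame) -> pd.DataFrame:
    # Ineligible users: visits that never reached a Category page.
    visits = visits[visits["viewed_category_page"]]

    # Treatment flippers: visits exposed to more than one treatment.
    visits = visits[visits["treatments_seen"] == 1]

    # Ineligible orders: converting visits whose order contains no item
    # clicked in the Category page results.
    visits = visits[~(visits["converted"] & ~visits["order_has_category_item"])]

    # Outliers: orders more than 3 standard deviations above the mean of
    # log-transformed order values (extreme orders skew the comparison).
    logs = np.log(visits.loc[visits["converted"], "order_value"])
    cutoff = np.exp(logs.mean() + 3 * logs.std())
    return visits[~(visits["converted"] & (visits["order_value"] > cutoff))]
```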
Confidence
While we strive for 95% confidence, 85-90% is acceptable in certain situations.
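As a rough illustration, confidence for an RPV comparison can be approximated as one minus the p-value of a Welch t-test on per-visit revenue (zeros for non-converting visits). The data below is hypothetical, and real evaluations may use different statistical machinery:

```python
from scipy.stats import ttest_ind

# Hypothetical per-visit revenue (zeros for non-converting visits),
# taken after the exclusions above have been applied.
control_rpv  = [0.0, 0.0, 120.0, 0.0, 45.0, 0.0, 0.0, 80.0]
discover_rpv = [0.0, 95.0, 0.0, 130.0, 0.0, 60.0, 0.0, 0.0]

stat, p_value = ttest_ind(discover_rpv, control_rpv, equal_var=False)
confidence = 1 - p_value  # common shorthand for "confidence" in A/B reporting
print(f"confidence={confidence:.1%}")  # compare against the 95% (or 85-90%) bar
```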