A/B Testing Guidelines: Recommend
Overview
At Algonomy, we have executed many A/B tests against web experiences without recommendations and against other personalization technologies, and we have learned a lot. This document distills the most salient of those methodological best practices into basic guidelines that will help you successfully assess the value of Recommend and reduce the very real risk of an invalid test result. These guidelines are relevant whether you are using Algonomy's MVT tool or 3rd party testing software. That said, MVT makes compliance easier because it automatically handles some of these best practices (for example, data sanitization in the test reporting).
A/B testing lets you compare the results of different conditions, whether that is Recs On vs. Recs Off, Recommend vs. another solution, etc., so that you have a solid indication of the value each condition supplies.
While A/B testing is not difficult or overly complex, it requires precision in its scope, design, and measurement so as not to end up with a flawed result. Involving Algonomy in the A/B planning and following these methodological guidelines will vastly improve your chances of extracting the insights you seek from the evaluation.
Algonomy provides an MVT tool that allows you to set up A/B tests. There are also 3rd party solutions available outside of Algonomy.
Scope
Treatments
In a competitive bake-off, the A/B test should involve no more than three personalization technologies—ideally two (inclusive of Recommend). The more solutions packed into the test, the harder it is to successfully measure impact.
Success Metrics
If the intent is to assess revenue impact, the primary success metric will be Revenue Per Visitor (RPV), for which Recommend typically delivers a 1-3% lift against competing solutions and a 2-5% lift against no recommendations. However, for some retailers, Conversion Rate (CVR) is most important due to the high lifetime value (LTV) of their customers; that is, they are willing to sacrifice immediate session revenue in order to get a sale of any value, because once they gain a customer, they can monetize them over a long period of time. Recs Sales and CTR should not be used as success metrics, as they do not correlate positively with incremental revenue.
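As a quick illustration of the arithmetic behind RPV and lift, here is a minimal sketch in Python. All figures are hypothetical and exist only to show how the metric is computed; your testing tool will report these numbers for you.

```python
# Hypothetical figures, for illustration only.
control_revenue, control_visitors = 250_000.00, 100_000      # e.g., recs off
treatment_revenue, treatment_visitors = 255_500.00, 100_000  # e.g., recs on

rpv_control = control_revenue / control_visitors         # Revenue Per Visitor, control
rpv_treatment = treatment_revenue / treatment_visitors   # Revenue Per Visitor, treatment

lift = (rpv_treatment - rpv_control) / rpv_control       # relative RPV lift
print(f"RPV control: ${rpv_control:.2f}, RPV treatment: ${rpv_treatment:.2f}, lift: {lift:.1%}")
```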
Page Types
The revenue lift Recommend delivers comes from a combination of placements that serve the distinct merchandising objectives at each point in the shopping funnel. The fewer placements you test, the less impressive and insightful the results—and the greater the likelihood of failing to measure solution value. A minimally viable implementation for measuring Recommend value includes recommendations on the Item Page and at least one of the following: Add-to-Cart or Cart.
Platforms
You will want to test Recommend on your highest-trafficked digital channels, which are typically Desktop alone or Desktop, Mobile, and Tablet in aggregate. Keep in mind that the ability to achieve a statistically significant result is a function of session and sales volumes. If a particular channel does not get much traffic, it is unlikely you will be able to isolate Recommend's performance for that channel without running an extended A/B test.
Design
Traffic Split
To expedite the evaluation, allocate as much traffic as possible to each test group. This usually means executing a 50/50 split on 100% of site traffic (if a two-way A/B). Of course, the split must be random so as not to introduce bias. There are a number of ways of verifying this; we suggest running a brief A/A test and comparing the population statistics—e.g., country, channel, browser, and referrer breakdowns—of the resulting groups to ensure there are no substantive differences in representation.
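The sketch below illustrates one common way to handle both points: deterministic hash-based bucketing for an approximately 50/50 random split, and a chi-square check on the channel breakdown of an A/A run. The function name, salt, and channel counts are hypothetical; your testing tool (MVT or 3rd party) may handle assignment for you.

```python
import hashlib
from scipy.stats import chi2_contingency

def assign_group(visitor_id: str, salt: str = "recs-ab-test") -> str:
    """Deterministically bucket a visitor into A or B with an ~50/50 split."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# A/A sanity check: compare channel breakdowns of the two groups (hypothetical counts).
observed = [
    [51_200, 30_500, 18_300],  # group A: desktop, mobile, tablet sessions
    [50_750, 31_020, 18_080],  # group B: desktop, mobile, tablet sessions
]
chi2, p_value, dof, expected = chi2_contingency(observed)
# A large p-value (e.g., > 0.05) is consistent with no substantive difference in mix.
print(f"A/A channel-mix check p-value: {p_value:.3f}")
```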
User Experience
It is important to control the experiment and not introduce any unintended variables. If evaluating recommendation technologies, ensure that there is parity in all other aspects that drive or define the user experience: the recs' look and feel, the number and location of placements, available catalog data, and even page performance (i.e., load times). If evaluating Recommend against no recs, it is imperative that, prior to testing, we stamp out any technical issues that could impact Recommend's ability to create the intended shopping experience and deliver both quantitative and qualitative value.
Measurement
Duration
A/B tests typically run for 2-3 weeks, or longer if traffic levels necessitate it. Using some basic site statistics, Algonomy can help you estimate how long the experiment needs to run to reach a credible and statistically significant result.
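For a rough sense of how such an estimate can be made, the sketch below uses a standard two-sample sample-size formula for comparing means. The function, its inputs, and the example figures are hypothetical assumptions for illustration; Algonomy's actual estimate may use a different method.

```python
from math import ceil
from scipy.stats import norm

def estimate_days(baseline_rpv, rpv_std, rel_lift, daily_visitors,
                  alpha=0.05, power=0.80):
    """Rough duration estimate for detecting a relative RPV lift with a 50/50 split."""
    delta = baseline_rpv * rel_lift               # minimum detectable absolute RPV difference
    z_alpha = norm.ppf(1 - alpha / 2)             # two-sided significance level
    z_beta = norm.ppf(power)                      # desired statistical power
    n_per_group = 2 * (z_alpha + z_beta) ** 2 * rpv_std ** 2 / delta ** 2
    return ceil(2 * n_per_group / daily_visitors) # days at 100% of site traffic

# Hypothetical inputs: $2.50 baseline RPV, $8 per-visitor revenue standard deviation,
# a 2% minimum detectable lift, and 60,000 visitors per day site-wide.
print(estimate_days(2.50, 8.0, 0.02, 60_000))     # roughly two weeks under these assumptions
```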
Data Exclusions
It is absolutely critical that the A/B reporting focuses only on traffic that is relevant for the test. Not doing so can produce results that do not accurately represent what is being tested. For accurate measurement, the following sessions/visits must be excluded from the reporting (a filtering sketch follows the list):
- Ineligible users: Visits not exposed to the area of the site on which the tested variable exists. For example, if testing an implementation on the Item and Cart pages, a visit that only viewed the Home and Category pages should be excluded from the results.
- Treatment flippers: Visits that were exposed to more than one treatment. When this happens, it is not possible to attribute the resulting conversion or non-conversion to any particular side of the test.
- Outliers: Converting visits that resulted in an order whose value is more than 3 standard deviations from the log-normalized mean (i.e., extreme orders). Including these large orders can skew the results because the size of the order is likely not caused by the tested variable.
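The sketch below shows how these three exclusions might be applied to a session-level results table. The column names (saw_tested_pages, groups_seen, order_value) are hypothetical assumptions; as noted in the Overview, MVT's reporting handles this kind of data sanitization automatically.

```python
import numpy as np
import pandas as pd

def filter_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
    """Apply the three exclusions to a session-level results table.
    Assumed (hypothetical) columns: visitor_id, group, saw_tested_pages (bool),
    groups_seen (int), order_value (0 for non-converting visits)."""
    df = sessions.copy()

    # 1. Ineligible users: never reached a page carrying the tested placements.
    df = df[df["saw_tested_pages"]]

    # 2. Treatment flippers: exposed to more than one treatment.
    df = df[df["groups_seen"] == 1]

    # 3. Outliers: converting visits whose order value is more than 3 standard
    #    deviations above the mean of log(order value).
    orders = df.loc[df["order_value"] > 0, "order_value"]
    log_orders = np.log(orders)
    cutoff = np.exp(log_orders.mean() + 3 * log_orders.std())
    return df[df["order_value"] <= cutoff]
```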
Confidence
While we strive for 95% confidence, 85-90% is acceptable in certain situations. In addition to the confidence level, it is important that the A/B is run long enough to achieve a credible result. As stated in the “Duration” section, this is usually 2-3 weeks but can be longer depending on your site’s traffic volume. Again, Algonomy can help you determine the appropriate test duration.
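As a rough illustration of how a confidence figure is typically read off a two-group comparison of per-visitor revenue, here is a minimal sketch using Welch's t-test on synthetic data. This is only one common approach under stated assumptions, not the specific statistical method MVT or Algonomy uses, and the synthetic revenue distribution is a crude stand-in for real per-visitor revenue.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic per-visitor revenue for each (already sanitized) group.
rng = np.random.default_rng(0)
revenue_a = rng.exponential(scale=2.45, size=50_000)  # control, mean ~$2.45 per visitor
revenue_b = rng.exponential(scale=2.50, size=50_000)  # treatment, mean ~$2.50 per visitor

t_stat, p_value = ttest_ind(revenue_b, revenue_a, equal_var=False)  # Welch's t-test
confidence = 1 - p_value  # loose shorthand: "confidence" as 1 minus the two-sided p-value
print(f"p-value: {p_value:.3f}, confidence: {confidence:.1%}")
```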