Introduction
“Immensely laborious calculations on inferior data may increase the yield from 95 to 100 percent. A gain of 5 percent, of perhaps a small total. A competent overhauling of the process of collection, or of the experimental design, may often increase the yield ten- or twelve-fold, for the same cost in time and labor. To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. To utilize this kind of experience he must be induced to use his imagination, and to foresee in advance the difficulties and uncertainties with which, if they are not foreseen, his investigations will be beset.” (R. A. Fisher’s Presidential address to the 1st Indian Statistical Congress)
While the statistical underpinnings of A/B testing are a century old, building a correct and reliable A/B testing platform and culture at a large scale is still a massive challenge. Mirroring Fisher’s observation above, carefully constructing the building blocks of an A/B platform and ensuring the data collected is correct is critical to guaranteeing correctness of experiment results, but it’s easy to get wrong. Uber went through a similar journey and this blog post describes why and how we rebuilt the A/B testing platform we had at Uber.
Uber had an experimentation platform, called Morpheus, that was built 7+ years ago in the early days to do both feature flagging and A/B testing. Uber outgrew Morpheus significantly since then in terms of scale, users, use cases, etc.
In early 2020, we took a deeper look at this ecosystem. We discovered that a large percentage of the experiments had fatal problems and often needed to be rerun. Obtaining high-quality results required an expert-level understanding of experimentation and statistics, and an inordinate amount of toil (custom analysis, pipelining, etc.). This slowed down decision-making, and re-running poorly conducted experiments was common.
To continue reading this article, click here.