The skill: reading experiment results honestly. The hardest part isn't the statistics. It's the discipline to not see what you desperately want to see.
Benchmarks
In a Nutshell
- Calculate sample size before you start. Not after. Use a calculator. Know how long the test needs to run.
- 95% confidence is the floor. 80% is a guess dressed in a lab coat: with no real effect, one test in five would still look like a winner.
- Practical significance > statistical significance. A statistically significant 0.3% lift is not worth the engineering complexity.
- Run for full weeks. Tuesday traffic is different from Sunday traffic. Partial weeks create false signals.
- Check segments. The aggregate might look flat while mobile users are up 20% and desktop users are down 15%.
- Document everything. Hypothesis, sample size calculation, duration, results, decision, and why. Even for tests you killed early.
- Guardrail metrics are non-negotiable. Define what would make you stop the test before it starts.
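To make the first and fourth points concrete, here is a minimal sketch of the upfront calculation, assuming a conversion-rate test analyzed with a two-sided two-proportion z-test. The function names, the 5% baseline, the 10% relative lift, the 80% power, and the 5,000 daily visitors are all illustrative assumptions, not figures from any real test.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_in_full_weeks(n_per_variant, daily_visitors, variants=2):
    """Round the run time up to whole weeks so every weekday is covered equally."""
    days = n_per_variant * variants / daily_visitors
    return math.ceil(days / 7)


# Hypothetical inputs: 5% baseline conversion, hoping to detect a 10% relative lift.
n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.10)
weeks = duration_in_full_weeks(n, daily_visitors=5000)
print(f"{n} visitors per variant, run for {weeks} full week(s)")
```

With these inputs the answer is on the order of 31,000 visitors per variant. Halving the detectable lift roughly quadruples that number, which is why the minimum effect you care about should be settled before launch.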
The peeking problem
Peeking is the silent killer of experiment quality. You launch a test on Monday, check the dashboard on Wednesday, and see a 15% lift with 90% confidence. Tempting to call it and ship. But early results are misleading.
Here's why. Small samples are noisy, and every time you check a test before it reaches its planned sample size, you run another implicit hypothesis test. Each peek inflates your false positive rate. Check a test 5 times at evenly spaced points, stopping at the first significant result, and your actual false positive rate is roughly 14%, not the 5% you think. Check daily for three weeks and it climbs to around 25%.
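A quick simulation makes the inflation visible. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "win" is a false positive) and compares a single look at the end against stopping at the first significant peek. The function names, the 20% conversion rate, and the traffic numbers are assumptions chosen only to keep the run fast.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def false_positive_rate(peeks, visitors_per_peek, rate=0.2,
                        simulations=2000, z_crit=1.96, seed=0):
    """Share of A/A tests declared significant at any of the peeks."""
    rng = random.Random(seed)
    false_positives = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        for _peek in range(peeks):
            # Another batch of traffic arrives in each arm, then we look.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            n += visitors_per_peek
            if abs(z_stat(conv_a, conv_b, n)) > z_crit:
                false_positives += 1
                break  # the peeker calls it and ships
    return false_positives / simulations


if __name__ == "__main__":
    # Same total traffic per arm (1,000 visitors), different number of looks.
    print("one look at the end:", false_positive_rate(peeks=1, visitors_per_peek=1000))
    print("five peeks:         ", false_positive_rate(peeks=5, visitors_per_peek=200))
```

The single look should land near the nominal 5%, while the five-peek version lands well above it, even though nothing about the data changed except how often someone looked.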
The fix is straightforward. Calculate your required sample size and test duration upfront. Then don't look until it's done. If you absolutely need to monitor for safety (breaking the checkout flow, tanking a key metric), set up automated guardrail alerts that trigger only on large, clearly negative movements. That's different from checking whether the test is "winning."
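One way to express "large, clearly negative" in code is to require both a big relative drop and a very strict one-sided significance level before alerting. This is a sketch under assumed thresholds: `guardrail_breached`, the 10% drop, and the 0.001 alpha are hypothetical choices, not a standard.

```python
import math
from statistics import NormalDist


def guardrail_breached(control_conv, control_n, variant_conv, variant_n,
                       max_relative_drop=0.10, alpha=0.001):
    """True only when the variant is down a lot AND the drop is clearly not noise."""
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    if p_c == 0:
        return False  # nothing to drop from
    relative_change = (p_v - p_c) / p_c
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return False
    z = (p_v - p_c) / se
    p_value = NormalDist().cdf(z)  # one-sided: only drops count as evidence
    return relative_change <= -max_relative_drop and p_value < alpha


# Hypothetical checkout conversions: 5.0% in control vs 3.5% in the variant.
print(guardrail_breached(500, 10_000, 350, 10_000))
```

The alert never fires on a positive movement, and the strict alpha keeps repeated automated checks from paging someone over ordinary noise. It answers "is this test doing harm?" and nothing else.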
If your team can't resist peeking, use sequential testing methods that account for multiple looks. They typically require a larger maximum sample size than a fixed-horizon test, but at least the math is honest.
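As a rough illustration of what "accounting for multiple looks" means, the sketch below splits the 5% error budget evenly across a fixed number of planned looks. This is a Bonferroni split, swapped in because it is simple and safely conservative. It is not one of the purpose-built group-sequential boundaries (Pocock, O'Brien-Fleming), which spend the budget more efficiently. The simulation models the z statistic under no true effect as a scaled random walk over equal-sized batches, and all names are illustrative.

```python
import math
import random
from statistics import NormalDist


def bonferroni_z_threshold(planned_looks, alpha=0.05):
    """Two-sided z threshold when alpha is split evenly across the planned looks."""
    return NormalDist().inv_cdf(1 - alpha / (2 * planned_looks))


def any_look_significant_rate(planned_looks, z_crit, simulations=20000, seed=0):
    """False positive rate when stopping at the first look that crosses z_crit."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(simulations):
        running_sum = 0.0
        for look in range(1, planned_looks + 1):
            running_sum += rng.gauss(0, 1)  # evidence from one equal-sized batch
            if abs(running_sum / math.sqrt(look)) > z_crit:
                hits += 1
                break
    return hits / simulations


looks = 5
strict = bonferroni_z_threshold(looks)
print("naive 1.96 threshold:", any_look_significant_rate(looks, 1.96))
print(f"corrected {strict:.2f} threshold:", any_look_significant_rate(looks, strict))
```

The corrected threshold keeps the overall false positive rate under 5% no matter which look triggers the stop. The price is a higher bar at every look, which is where the extra sample size comes from.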
Do's and Don'ts