🧱 argil.io

How to run your first experiment

Last updated March 30, 2026

Use this when: You have enough traffic and a change you believe will improve a metric, and you want to prove it rather than hope it.

You're done when: You have a result, a documented decision (ship, iterate, or kill), and the team understands why the decision was made.

The Sequence


Key Terms

Before you touch a calculator, make sure you speak the language.

  • Baseline. The current value of the metric you're testing. If 35% of users complete onboarding today, 35% is your baseline. Everything else is measured relative to this number.
  • Lift. The improvement you're trying to detect, expressed as a percentage of the baseline. If you expect onboarding completion to go from 35% to 42%, that's a 7 percentage point increase, or a 20% relative lift (7 / 35 = 0.20). Always clarify whether you mean absolute (percentage points) or relative (percent of baseline). Mixing them up is how teams celebrate a "20% improvement" that's actually 2 percentage points.
  • Minimum Detectable Effect (MDE). The smallest lift you'd care about. This is a business decision, not a statistical one. If a 1% relative lift doesn't justify the engineering effort, set your MDE higher. Smaller MDE means you need more traffic and more time.
  • Statistical significance. A check that the result is unlikely to be random noise. The standard threshold is 95% (p < 0.05): if the change truly did nothing, you would see a difference this large less than 5% of the time. 80% confidence is not enough. It means roughly 1 in 5 "wins" is noise.
  • Confidence interval. The range where the true effect likely falls. "Lift of 12%, 95% CI [3%, 21%]" means you're fairly sure the real effect is somewhere between 3% and 21%. If the interval includes zero, you can't confidently say the change did anything.
  • Statistical power. The probability of detecting a real effect when one exists. Convention is 80%, meaning you'll catch a real improvement 4 out of 5 times. Lower power means you might miss a real win and incorrectly conclude the change didn't work.
  • Guardrail metrics. Metrics you're not trying to improve but refuse to damage. If your test boosts sign-ups but tanks retention, the guardrail catches it. Define these before you start, with hard thresholds for stopping the test.
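The absolute-versus-relative distinction above is the one teams most often fumble in dashboards. A minimal sketch, using the onboarding numbers from the list (the helper name is illustrative):

```python
def lifts(baseline, treatment):
    """Return (absolute lift in percentage points, relative lift as a fraction)."""
    absolute_pp = (treatment - baseline) * 100      # percentage points
    relative = (treatment - baseline) / baseline    # fraction of the baseline
    return absolute_pp, relative

# Onboarding completion going from 35% to 42%:
abs_pp, rel = lifts(0.35, 0.42)
print(f"{abs_pp:.1f} pp absolute, {rel:.0%} relative")  # 7.0 pp absolute, 20% relative
```

Report both numbers, and say which one you mean.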

Sample Size Reference

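For a back-of-envelope version of the reference numbers, the standard pooled two-proportion formula works with nothing but the standard library. This is a sketch: real calculators differ slightly in the variance term and in rounding, so treat the output as an estimate, not a contract.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_pp, alpha=0.05, power=0.80):
    """Users needed per variant to detect an absolute lift of `mde_pp`
    (as a fraction, e.g. 0.07 for 7 percentage points), using the
    pooled two-proportion approximation."""
    p1 = baseline
    p2 = baseline + mde_pp
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2
    return math.ceil(n)

# 35% baseline, detect a 7 pp absolute lift (the onboarding example):
n = sample_size_per_variant(0.35, 0.07)
print(n)
```

At these defaults this lands near 760 per variant; raising power to 90% pushes it to roughly 1,000, which is why conservative calculators quote bigger numbers for the same test.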

Template


The Peeking Problem

Peeking kills experiments quietly. You launch a test on Monday. On Wednesday, the treatment is up 15%. You stop the test and ship. The problem: at low sample sizes, random variation looks like a real signal. If you check results every day and stop when it looks good, your actual false positive rate is not 5% (which is what 95% significance means). It can be 20-30% or higher.

Evan Miller demonstrated this clearly: if you check a test daily and stop when you see significance, up to 1 in 4 "winning" tests are false positives. You shipped a change that did nothing, or worse, actually hurt the metric. You just caught it at a lucky moment.
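You can watch the inflation happen with a small A/A simulation: both arms are identical, so every "significant" result is by definition a false positive. The parameters here are illustrative, and under the null the z-statistic after k equal-sized batches behaves like a scaled random walk:

```python
import math
import random

random.seed(7)

def false_positive_rates(n_trials=2000, n_peeks=20, z_crit=1.96):
    """Simulate A/A tests. With no real effect, the z-statistic after k
    equal-sized batches is S_k / sqrt(k), where S_k is a sum of k
    standard normal increments."""
    peeked = 0   # declared a "win" at ANY of the daily peeks
    waited = 0   # looked exactly once, at the committed end date
    for _ in range(n_trials):
        s, crossed = 0.0, False
        for k in range(1, n_peeks + 1):
            s += random.gauss(0, 1)
            if abs(s) / math.sqrt(k) > z_crit:
                crossed = True      # a peeker would have stopped and shipped here
        peeked += crossed
        waited += abs(s) / math.sqrt(n_peeks) > z_crit
    return peeked / n_trials, waited / n_trials

peek_rate, wait_rate = false_positive_rates()
print(f"peek daily: {peek_rate:.0%} false positives, wait for the end: {wait_rate:.0%}")
```

The single-look rate hovers near the advertised 5%; the peek-every-day rate climbs toward 25%, matching the 1-in-4 figure above.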

The fix is simple and painful: commit to a sample size and runtime before you start, and do not look at the primary metric until the test ends. Monitor guardrail metrics to catch disasters. But leave the primary metric alone. If this feels hard, it is because it is. The discipline of not looking is the entire value of experimentation. Without it, you are just making decisions with extra steps and a false sense of rigor.

For startups with lower traffic, this matters even more. If you need 10,000 users per variant and you get 200 signups per day split evenly across two variants, each variant accrues 100 users per day, so the test needs 100 days. Peeking at day 5 means you are basing your decision on 5% of the required data. You would not make a hiring decision after reading 5% of a resume.

Example

A SaaS onboarding tool hypothesized that adding a progress bar to their setup flow would increase completion from 35% to 42% (a 20% relative lift). They calculated the sample size: at 35% baseline and 7 percentage point increase, they needed roughly 1,000 users per variant. At 80 signups per day, that meant about 25 days of runtime. They committed to a hard stop date.
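The runtime arithmetic from that setup is worth making mechanical, because it is where off-by-2x mistakes hide. A sketch (the helper name is illustrative):

```python
import math

def runtime_days(n_per_variant, variants, signups_per_day):
    """Days until every variant reaches its required sample size,
    assuming traffic is split evenly across variants."""
    total_needed = n_per_variant * variants
    return math.ceil(total_needed / signups_per_day)

# 1,000 per variant, two variants, 80 signups/day -> 25 days
days = runtime_days(1000, 2, 80)
print(days)  # 25
```

Note that the total you must collect is per-variant size times the number of variants; forgetting the multiplication halves your runtime estimate.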

Guardrail metrics: setup start rate (should not drop, meaning the progress bar does not scare people off), time to complete (should not increase significantly), and error rate. They launched the test and did not look at the completion metric for three weeks.

At the end: control completed at 34.8%, treatment at 40.1%. Statistically significant at 95% confidence. The guardrails held. They shipped the progress bar. More importantly, they documented the experiment: hypothesis, setup, results, decision. When a new PM joined three months later and proposed "what if we added a progress bar?", the answer took 30 seconds to find.
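Checking a result like this is a few lines of stdlib Python. The counts below are assumptions reconstructed from the reported 34.8% and 40.1% at the planned ~1,000 users per variant; the function is a standard two-proportion z-test, not the team's actual tooling.

```python
import math
from statistics import NormalDist

def two_proportion_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions, plus a 95% CI
    on the absolute difference (unpooled standard error for the CI)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    # pooled standard error for the hypothesis test
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # unpooled standard error for the confidence interval
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    return p_value, (lo, hi)

# 348/1000 control vs 401/1000 treatment (counts assumed, see above):
p, (lo, hi) = two_proportion_test(348, 1000, 401, 1000)
print(f"p = {p:.3f}, 95% CI on absolute lift: [{lo:.1%}, {hi:.1%}]")
```

With these assumed counts, p comes out below 0.05 and the interval on the absolute lift excludes zero, which is exactly the "can we confidently say the change did anything" check from the Key Terms.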


Written with ❤️ by a human (still)