
Principles

Last updated March 30, 2026

Bad data, bad reading, or both

Data analysis fails in exactly two ways. Either the data itself is bad (too biased, too noisy, incomplete, measuring the wrong thing), or the interpretation is bad: you drew the wrong conclusion from perfectly fine data. Usually it's some mix of both.

Bad data: your tracking missed a segment, your sample is too small, your attribution model double-counts, your definition of "active user" includes bots. The input is broken, so no amount of clever analysis fixes the output.

Bad interpretation: the data is fine but you averaged away the signal, confused correlation with causation, ignored base rates, or told a causal story from an observational dataset. The input was right but the conclusion was wrong.

The fix is different for each. Bad data requires better instrumentation, cleaner definitions, and data quality checks. Bad interpretation requires better thinking, segmentation, experimentation, base rate awareness, and intellectual honesty about what the data can actually tell you.

The default instinct is to blame the data ("we need better tracking") when the real problem is often interpretation ("we need better thinking"). Knowing which failure mode you're in changes where you invest.

Instrument before you need the data

By the time you realize you need a metric, it's too late to start tracking it. You'll want to know your activation rate three months from now. You'll want to compare retention before and after a redesign. You'll want to see how a cohort from last quarter behaved over time. None of that is possible if you didn't log the events when they happened.

The cost of logging an event is near zero. A few lines of code. A few bytes per user. The cost of not having three months of historical data when you need it is enormous: you're flying blind on the most important decision of your quarter, and the best you can do is start tracking now and wait.

The common pattern: a feature ships, people use it, someone asks "how's it doing?" and the answer is "we don't know, we didn't track it." Then there's a scramble to add tracking, but the damage is done. You've lost the baseline. You can't compare before and after because there is no before.

The fix is simple: ship tracking with every feature, not after. Every pull request that changes user-facing behavior should include the events that measure that behavior. Make it part of the definition of done. A feature without tracking is a feature you can't learn from, and a feature you can't learn from is a feature you're guessing about.
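
As a sketch of what "tracking ships with the feature" looks like in code, here is a minimal Python example. The event name, the `export_report` feature, and the in-memory log are all hypothetical; in production the event would go to your analytics pipeline, not a list.

```python
import json
import time

EVENT_LOG: list[dict] = []  # stand-in for a real analytics pipeline


def track(event: str, user_id: str, **props) -> dict:
    """Log one analytics event. Logging the timestamp now is the whole
    point: you cannot backfill it three months later."""
    record = {"event": event, "user_id": user_id, "ts": time.time(), "props": props}
    EVENT_LOG.append(record)
    return record


# The feature and its tracking ship in the same pull request:
def export_report(user_id: str, fmt: str) -> str:
    data = json.dumps({"report": "..."})  # stand-in for the real export
    track("report_exported", user_id, fmt=fmt)
    return data
```

Making the `track` call part of the feature's code path, rather than a follow-up ticket, is what guarantees a baseline exists when someone asks "how's it doing?"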

All data is wrong — some data is useful

Every dataset has gaps, biases, and errors. Your event tracking misses edge cases. Your attribution model over-credits the last touch. Your retention number counts users who opened the app accidentally. Waiting for perfect data means never making a decision.

George Box's famous statistical dictum captures it: "All models are wrong, but some are useful." The goal is reducing uncertainty enough to act. Did your conversion rate go up or down after last week's change? You don't need three decimal places to answer that.

The startup that acts on 80%-right data beats the one paralyzed waiting for 100%. Here's why: decisions have a time component. A good decision made today is often worth more than a perfect decision made in three months. Your competitor isn't waiting for clean data. They're shipping, measuring with rough data, learning, and iterating. While you're cleaning your dataset, they're on their third experiment.

This doesn't mean data quality doesn't matter. It means you need to calibrate your accuracy requirements to the decision at hand. A pricing change that affects revenue by millions deserves careful analysis. A button color change does not. The right question is always "is this data good enough for this decision?"

How to Measure Anything makes this point: the value of a measurement is the reduction in uncertainty it provides, weighted by the cost of being wrong. For most decisions, rough measurement reduces uncertainty enough to act. For the few decisions where the cost of being wrong is catastrophic, invest in precision.

The map is not the territory

Your dashboard is a model of reality, not reality itself. Metrics are abstractions. Users don't experience your retention rate, they experience your product. When the data says one thing and customer conversations say another, go talk to more customers. The most dangerous moment is when a team stops looking at the real world because the numbers look fine.

The imbalance is usually one-sided: teams over-index on dashboards and under-index on qualitative signals. A retention curve can tell you that users are leaving. It can't tell you why. A funnel chart can show you where users drop off. It can't show you the confused face they make when they hit that step. The dashboard is a compression of reality, and compression always loses information.

The practical consequence: never let quantitative data be your only input. Pair every metric with a qualitative signal. Watch session recordings. Read support tickets. Talk to users who churned. The numbers tell you what happened. The conversations tell you why.

This also means you don't always need big data to make a decision. Five customer interviews saying the same thing is a signal. Ten support tickets about the same bug is a pattern. A startup with 200 users can't run A/B tests with statistical significance, but they can still be data-informed. Usage logs, qualitative feedback, and directional signals are enough when the sample is small and the cost of waiting is high. Don't let "we don't have enough data" become an excuse for not paying attention to what's right in front of you.

In practice: A productivity app had strong retention numbers, 60% Day 30, well above industry benchmarks. The team was confident the product was working. Then a PM started doing weekly user calls and discovered something the dashboard couldn't show: most retained users had found one narrow workflow and were ignoring 80% of the product. The retention number was real. The mental model the team had built around it, that users loved the product, was wrong. Users tolerated one feature. The map looked great. The territory was fragile.

Measure the outcome, not the output

Teams gravitate toward metrics they can control: features shipped, emails sent, pages visited, tickets closed, campaigns launched. These are outputs. They feel productive because you can always make the number go up. Ship more features. Send more emails. Launch more campaigns.

But outputs don't tell you if you're winning. The only metrics that matter are outcomes: did revenue grow, did retention improve, did users accomplish what they came to do? A team that shipped twenty features and moved no outcome metrics had a busy quarter.

The distinction sounds obvious. In practice, it is surprisingly sticky. Here's why: output metrics are comforting. They move when you work hard. Outcome metrics are uncomfortable. They don't care how hard you worked. They only care whether the thing you worked on actually mattered. A team that shipped one feature and improved retention by 5% had a better quarter than a team that shipped twenty features and moved nothing.

A related distinction: leading measures versus lagging measures. Lagging measures are the outcomes you care about, revenue, retention, NPS, but they move slowly and you can't directly control them. Leading measures are the activities and behaviors you can influence today that predict the lagging outcome. If your lagging measure is monthly retention, a leading measure might be "percentage of users who complete their first workflow in week one." The leading measure tells you whether you're on track before the lagging measure moves. Teams that only watch lagging measures are always reacting. Teams that identify the right leading measures can course-correct in real time.
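
A leading measure like the one above is cheap to compute from event logs. A sketch, assuming hypothetical `signups` and `completions` mappings of user id to timestamp:

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def week_one_activation(signups: dict, completions: dict) -> float:
    """Leading measure: share of signed-up users who completed their first
    workflow within 7 days of signup. Inputs map user_id -> datetime.
    (Illustrative definition; your product's 'workflow' will differ.)"""
    if not signups:
        return 0.0
    activated = sum(
        1
        for user, signed_up in signups.items()
        if user in completions and completions[user] - signed_up <= WEEK
    )
    return activated / len(signups)


signups = {
    "a": datetime(2026, 3, 1),
    "b": datetime(2026, 3, 1),
    "c": datetime(2026, 3, 2),
    "d": datetime(2026, 3, 3),
}
completions = {
    "a": datetime(2026, 3, 4),   # day 3: counts
    "b": datetime(2026, 3, 20),  # day 19: too late
    "c": datetime(2026, 3, 5),   # day 3: counts
}
print(week_one_activation(signups, completions))  # 0.5
```

Watching this number weekly tells you whether next month's retention is on track while there is still time to react.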

The practical test: for every metric on your dashboard, ask "so what?" If the answer is "we shipped more stuff," it's an output. If the answer is "customers are getting more value," it's an outcome. Kill the outputs.

In practice: A SaaS startup tracked "number of integrations built" as a key metric. The team built fifteen integrations in a quarter and celebrated. Nobody checked usage. When they finally looked, three integrations accounted for 95% of all usage. The other twelve were waste. Switching to "percentage of users with at least one active integration" revealed a completely different priority list.

Segment or be misled

Averages hide everything. A 10% conversion rate might be 25% for power users and 2% for everyone else. Blending cohorts, geographies, or acquisition channels into a single number masks the signal you're looking for. The first question for any metric should be "for whom?"

The insight is almost never in the aggregate. Your overall churn rate is a blend of enterprise customers who never leave and small teams who churn in month two. Averaging those together produces a number that describes neither group. You'll build retention features targeting a user that doesn't exist, the "average" customer who's a fiction of your spreadsheet.

Simpson's paradox is the extreme version of this: a trend that appears in aggregated data reverses when you split it into subgroups. A treatment that looks effective overall might be harmful for every individual group, because of how the groups are distributed. It's not a theoretical curiosity. It shows up in A/B tests, marketing attribution, and cohort analysis more often than most teams realize.
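
The reversal is easy to reproduce in a few lines. The counts below are the classic kidney-stone figures from the literature, relabeled here as a hypothetical two-variant conversion test:

```python
# Each entry: (converted, total) per segment. The counts are borrowed
# from the classic kidney-stone example of Simpson's paradox.
data = {
    "A": {"power": (81, 87), "casual": (192, 263)},
    "B": {"power": (234, 270), "casual": (55, 80)},
}


def rate(converted: int, total: int) -> float:
    return converted / total


for variant, segments in data.items():
    total_c = sum(c for c, _ in segments.values())
    total_n = sum(n for _, n in segments.values())
    print(variant, "overall:", round(rate(total_c, total_n), 2),
          {seg: round(rate(c, n), 2) for seg, (c, n) in segments.items()})

# A overall: 0.78 {'power': 0.93, 'casual': 0.73}
# B overall: 0.83 {'power': 0.87, 'casual': 0.69}
# B wins in aggregate; A wins in every segment.
```

The trick is in the group sizes: variant A was mostly shown to the hard segment, variant B to the easy one, so the blend flips the comparison.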

The discipline: always break metrics down by at least one dimension before drawing conclusions. By cohort. By plan tier. By acquisition source. By company size. If the breakdown tells the same story as the aggregate, the aggregate is trustworthy. If the story changes, the aggregate was lying to you.

In practice: A marketplace startup reported 15% month-over-month growth in transactions. The board was thrilled. When the data team segmented by geography, the picture flipped: one city was growing at 40%, masking the fact that three other cities were shrinking. The aggregate growth number was real, but the business wasn't "growing." One market was carrying the others. The expansion strategy changed completely once the team saw the segments.

Data without context is noise

A 5% conversion rate means nothing without knowing: 5% of what? Compared to when? Compared to whom? Is that up or down? Is it typical for your industry? Every number needs three things to become information:

  • A baseline — what was it before?
  • A comparison — what is it for similar companies, segments, or time periods?
  • A time range — over what period?

Dashboards full of isolated numbers create the illusion of being data-driven while actually being data-distracted. A big green number that says "10,000 users" feels good. But if you had 12,000 last month, that number is a disaster. If you had 2,000 last month, it's a triumph. The number alone tells you nothing.

This is why every chart on your dashboard should answer one question before all others: compared to what? Storytelling with Data makes this its foundation: the audience needs to know what happened relative to what they expected. A chart without a reference point is a chart that can't drive a decision.

The practical failure mode is the metrics meeting where everyone stares at numbers without context. "Our DAU was 5,000 this week." Silence. Nobody knows if that's good or bad because nobody brought the comparison. That's data-theater.

The fix: every metric gets presented with at least one comparison. Week over week. Month over month. Versus target. Versus the same cohort last quarter. The comparison is what turns a number into insight. Without it, you're just reading digits off a screen.
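
Mechanically, the fix can be as small as refusing to render a metric without its baseline. A hypothetical helper:

```python
def with_context(name: str, current: float, baseline: float, period: str) -> str:
    """Render a metric with its comparison attached. Names are illustrative;
    the point is that the bare number never appears alone."""
    if baseline == 0:
        return f"{name}: {current:,.0f} ({period}, no baseline)"
    delta = (current - baseline) / baseline
    return f"{name}: {current:,.0f} ({period}, {delta:+.1%} vs prior)"


print(with_context("DAU", 5_000, 5_600, "this week"))
# DAU: 5,000 (this week, -10.7% vs prior)
```

The -10.7% is the part that drives a decision; the 5,000 on its own is just digits on a screen.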

In practice: A product team celebrated hitting 50,000 monthly active users, a milestone they'd been watching for months. When an analyst added context, the number looked different: 50,000 MAU on 200,000 total signups was a 25% activation rate. Industry benchmark for their category was 40-60%. The "milestone" was actually a red flag about their onboarding. Same number, completely different story once context was added.

Measure adoption before optimizing anything

You can't optimize a feature nobody uses. Before running A/B tests, building funnels, or tuning conversion flows, check whether users even reach the thing you're trying to improve. The most common waste in analytics: spending weeks optimizing a flow that 3% of users ever encounter.

Usage frequency is the first filter for where to invest analytical effort. A checkout flow used by every customer deserves deep funnel analysis. A settings page visited by 2% of users does not. This sounds obvious, but the pattern repeats constantly because teams fall in love with a feature and assume others will too.

The same applies to experiments. An A/B test on a feature with low exposure will never reach significance, no matter how long you run it. You'll wait weeks for results that can't tell you anything. Before asking "how do we make this better?" ask "do people even use this?" If the answer is no, the optimization question is premature. The real question is whether the feature should exist at all.
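
A back-of-the-envelope power calculation makes the exposure problem concrete. The sketch below uses Lehr's rule of thumb (roughly 80% power at a 5% significance level needs about 16·p(1−p)/δ² users per arm); the traffic numbers are invented:

```python
import math


def samples_per_arm(base_rate: float, relative_lift: float) -> int:
    """Rough sample size per arm via Lehr's rule of thumb:
    n ≈ 16 * p * (1 - p) / delta^2, for ~80% power at alpha = 0.05."""
    delta = base_rate * relative_lift   # absolute change we want to detect
    p = base_rate + delta / 2           # approximate pooled rate
    return math.ceil(16 * p * (1 - p) / delta ** 2)


# Detect a 10% relative lift on a 5% conversion rate...
n = samples_per_arm(0.05, 0.10)
# ...when only 3% of 10k daily actives ever reach the feature:
daily_exposed = 10_000 * 0.03
print(n, "users per arm ->", math.ceil(2 * n / daily_exposed), "days to finish")
```

With 3% exposure, even this modest test needs the better part of a year. The honest conclusion is "don't run this test," not "run it longer."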

The sequence matters: adoption first, then engagement, then optimization. Skipping steps means polishing something nobody touches.

In practice: A SaaS team spent six weeks A/B testing their advanced reporting dashboard, trying to increase time-on-page and feature engagement. The test was inconclusive. When someone finally checked, only 8% of users had ever visited the reporting tab. The team had been optimizing a ghost town. They redirected effort toward surfacing report insights in the main feed where users already spent time. Engagement with reporting data tripled without improving the reporting page at all.

Correlation will fool you every time

Confusing correlation with causation is expensive. "Users who use feature X retain better" is the sentence that has launched a thousand misdirected product investments. Maybe feature X causes retention. Or maybe power users, who would retain regardless, just discover more features. Without experimentation or careful causal reasoning, you can't tell.

Getting this wrong means investing three months into the wrong feature. Survivorship bias, selection bias, and confounders are everywhere in startup data:

  • Survivorship bias — you only study users who stuck around, ignoring the ones who left. The survivors look great. But what made them stay might have nothing to do with what you're measuring.
  • Selection bias — users who opt into a feature are different from those who don't. Comparing the two groups tells you about the users, not the feature.
  • Confounders — a third variable explains both the "cause" and the "effect." Ice cream sales and drowning deaths are correlated. Summer causes both.

Thinking, Fast and Slow explains why this mistake is so persistent: the human brain is a narrative machine. It sees two things that happen together and invents a causal story. "Users who connected their CRM integration retained 3x better" becomes "we need to push everyone to connect their CRM." Maybe. Or maybe the users connecting a CRM were already larger teams with a real workflow, the kind of users who would retain regardless. The CRM connection was a signal of commitment, not a cause of it. Building a quarter of product work around increasing CRM connections would have moved nothing.

The antidote is experimentation. A/B tests, when run correctly, are the only reliable way to isolate causation. When you can't experiment (not enough traffic, irreversible change, ethical concerns), be explicit about the limitation. Say "correlated with" instead of "caused by." Say "users who did X also did Y" instead of "X drives Y." The language matters because it changes how people act on the finding.
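
The confounding story is easy to simulate. In the synthetic world below, team size drives both integration and retention, and the integration itself has zero causal effect; all names and rates are invented for illustration:

```python
import random

random.seed(0)

# Synthetic users: team size causes BOTH integration and retention.
# Note that `retained` never looks at `integrated` at all.
users = []
for _ in range(20_000):
    big_team = random.random() < 0.3
    integrated = random.random() < (0.8 if big_team else 0.1)
    retained = random.random() < (0.7 if big_team else 0.2)
    users.append((big_team, integrated, retained))


def retention(rows):
    return sum(r for _, _, r in rows) / len(rows)


with_int = [u for u in users if u[1]]
without = [u for u in users if not u[1]]
print("integrated:", round(retention(with_int), 2),
      "vs not:", round(retention(without), 2))
# The naive comparison shows a large gap despite zero causal effect.

# Stratify by the confounder and the gap collapses:
for big in (True, False):
    grp = [u for u in users if u[0] == big]
    a = retention([u for u in grp if u[1]])
    b = retention([u for u in grp if not u[1]])
    print("big_team" if big else "small_team", round(a, 2), round(b, 2))
```

The naive comparison looks like a multiple-x retention win; within each stratum it vanishes. A holdback experiment would have surfaced the same thing without needing to know the confounder in advance.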

In practice: A SaaS startup found that customers who integrated their Slack workspace had 3x higher annual retention. The revenue team built an entire quarter around pushing Slack integration adoption, including making it a mandatory onboarding step and tying sales incentives to integration completion. Retention didn't move. The Slack integration was a signal that the customer had a large, active team, the kind of organization that would retain regardless. Smaller teams didn't have meaningful Slack workflows and found the forced integration annoying, increasing early churn. A simple holdback test would have revealed this in two weeks.

Speed of learning beats depth of analysis

The startup that runs ten rough experiments in a quarter learns more than the one that runs one perfect study. Optimize your data practice for cycle time, not precision. Ship a quick analysis, make a decision, see what happens, adjust. Analysis paralysis kills more startups than bad data ever will.

This is counterintuitive for anyone trained in statistics or data science. Rigor matters. But rigor has a cost, and in a startup that cost is time. A two-week analysis that produces a confident answer is worth less than a two-day analysis that produces a directional answer, if the decision can be reversed and the stakes are moderate.

The key distinction is reversibility. For reversible decisions (most product changes, pricing experiments you can roll back, feature launches with a kill switch), speed dominates. Get directional data, make the call, watch what happens. For irreversible decisions (shutting down a product line, signing a multi-year contract, entering a new market), slow down. The cost of being wrong is high enough to justify the analysis time.

The trap is getting the mapping backwards: agonizing over small reversible bets while rushing the big irreversible ones. The inversion works better: move fast on the small stuff, be thorough on the few decisions that actually matter.

In practice: A fintech startup debated for three weeks whether to change their onboarding flow, waiting for a full cohort analysis. Meanwhile, a competitor shipped four variations of their onboarding in the same period, found the winner, and moved on. By the time the first team had their "perfect" analysis, the insight was about a version of the product that had already changed. The analysis was rigorous. It was also stale.

Democratize access, centralize definitions

Everyone should be able to query data. Nobody should be able to redefine what "active user" means. The real problem is rarely lack of access; it's three teams using three different definitions of the same metric and getting three different answers. Marketing says there are 50,000 active users. Product says 35,000. Finance says 28,000. They're all "right," just using different definitions. Now try to have a productive meeting.

The fix is two-sided. First, centralize metric definitions. One source of truth for what "active user" means, what counts as "churn," how "MRR" is calculated. This lives in your data model, not in a wiki page nobody reads. When someone queries "active users," they should get the canonical number without having to know which filter to apply. The definition is baked into the data layer.
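
What "baked into the data layer" means in miniature: one function owns the definition, and every consumer calls it. The schema, window, and bot flags here are hypothetical:

```python
from datetime import datetime, timedelta

# Canonical definition, owned in one place. Dashboards, queries, and
# reports call this; nobody re-derives "active" ad hoc.
ACTIVE_WINDOW = timedelta(days=30)
BOT_FLAGS = {"crawler", "synthetic"}


def is_active(user: dict, as_of: datetime) -> bool:
    """Active = a qualifying action in the last 30 days, excluding bots.
    (Illustrative schema; your user records will differ.)"""
    if user.get("flag") in BOT_FLAGS:
        return False
    last = user.get("last_qualifying_action")
    return last is not None and as_of - last <= ACTIVE_WINDOW


def active_users(users: list[dict], as_of: datetime) -> int:
    return sum(is_active(u, as_of) for u in users)


users = [
    {"last_qualifying_action": datetime(2026, 3, 15)},                      # active
    {"last_qualifying_action": datetime(2026, 1, 1)},                       # stale
    {"last_qualifying_action": datetime(2026, 3, 28), "flag": "crawler"},   # bot
]
print(active_users(users, datetime(2026, 3, 30)))  # 1
```

When marketing, product, and finance all call `active_users`, the three-numbers problem disappears by construction.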

Second, open access for exploration. Self-serve analytics means PMs, marketers, and engineers can answer their own questions without filing a ticket. This doesn't mean everyone writes raw SQL. It means dashboards are discoverable, metrics are well-labeled, and there's a path from "I have a question" to "I have an answer" that doesn't require the data team as a bottleneck.

The balance matters. Too centralized and the data team becomes a service desk, answering every question because nobody else can. Too decentralized and you get metric anarchy, everyone inventing their own numbers. The sweet spot: locked definitions, open exploration.

In practice: A Series B startup had three different "revenue" numbers floating around, one from Stripe, one from the finance team's spreadsheet, and one from the product analytics dashboard. Each was calculated slightly differently (gross vs net, including trials or not, annualized or monthly). Every board meeting started with fifteen minutes of reconciliation. When the data team built a single metrics layer with canonical definitions, the reconciliation disappeared. Meetings started with "here's where we are" instead of "which number are we using?"

Written with ❤️ by a human (still)