🧱 argil.io

Debugging data quality

2 min read
Last updated March 30, 2026

The skill: Systematically finding where data breaks instead of guessing. Data quality issues are pipeline bugs, and they need the same rigor as code bugs.

The debugging flow

In a Nutshell

  • Definition first. Before debugging the data, confirm the metric definition hasn't changed. This is the cause more often than you'd think.
  • Work backwards through the pipeline. Collection, then transformation, then presentation. Not the other way around.
  • "What changed recently?" is always the first question. Deploys, SDK updates, consent banners, schema migrations.
  • Check row counts at every layer. If 1M events enter and 800K come out, you've got a broken join or an overly aggressive filter.
  • Bot traffic is real. Filter it. 20-40% of web events can be non-human. Your signup numbers might be lying.
  • Automate the boring checks. Freshness monitors, row count thresholds, NULL rate alerts. Catch issues before the CEO does.
  • Build a root cause log. Every time you debug an issue, write down what broke, why, and how you found it. After 6 months, you'll have a checklist that catches 80% of issues in 5 minutes.
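The "boring checks" above (freshness, row counts, NULL rates) can be automated with a handful of threshold functions. A minimal sketch in Python; the threshold values are assumptions you'd tune to your own pipeline's normal ranges:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds -- tune to your pipeline's normal ranges.
MAX_STALENESS = timedelta(hours=2)
MIN_ROW_RATIO = 0.8   # alert if a layer drops more than 20% of input rows
MAX_NULL_RATE = 0.05  # alert if more than 5% of a key column is NULL

def check_freshness(last_loaded_at, now=None):
    """Pass if the last load is no older than MAX_STALENESS."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= MAX_STALENESS

def check_row_counts(rows_in, rows_out):
    """Pass unless a layer silently drops rows (broken join, harsh filter)."""
    return rows_out / rows_in >= MIN_ROW_RATIO

def check_null_rate(null_count, total_count):
    """Pass if the NULL rate on a key column stays under the threshold."""
    return null_count / total_count <= MAX_NULL_RATE
```

Wire these into whatever scheduler already runs your pipeline and page on failure; the point is that the checks run before anyone looks at a dashboard.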

The 5-minute sanity check

When a number looks wrong, run through this before doing any deep investigation:

Minute 1: Check the obvious. Is the dashboard showing the right date range? Are the filters correct? Is the data fresh (check the last-updated timestamp)? You'd be surprised how often the answer is "someone changed a filter."

Minute 2: Check the trend. Pull up the metric over the last 30-90 days. Is the change sudden or gradual? Sudden changes point to deploys, tracking breaks, or data pipeline failures. Gradual changes are usually real business shifts.
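One rough way to operationalize sudden-vs-gradual: compare the latest day-over-day change to the history of day-over-day changes. This is an illustrative sketch, not a substitute for eyeballing the chart:

```python
from statistics import mean, stdev

def looks_sudden(daily_values, z_threshold=3.0):
    """Crude break test: is the latest day-over-day change an outlier
    relative to the earlier day-over-day changes?

    A tracking break or failed pipeline run shows up as one huge jump;
    a real business shift accumulates across many small ones."""
    deltas = [b - a for a, b in zip(daily_values, daily_values[1:])]
    history, latest = deltas[:-1], deltas[-1]
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > z_threshold * sigma
```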

Minute 3: Check the changelog. Was anything deployed in the last 24-48 hours? Did the tracking SDK update? Did someone modify a dbt model? Did a third-party tool change their API? Cross-reference the timing of the anomaly with the deploy log.
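The cross-reference can be as mechanical as filtering the deploy log to a look-back window before the anomaly. A toy helper; the data shape and window are assumptions:

```python
from datetime import datetime, timedelta

def deploys_near(anomaly_at, deploys, window=timedelta(hours=48)):
    """Return the names of deploys (name, timestamp) that landed in the
    look-back window before the anomaly -- the prime suspects."""
    return [name for name, ts in deploys
            if timedelta(0) <= anomaly_at - ts <= window]
```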

Minute 4: Check the segments. Does the issue affect all users or just one platform, one geography, one cohort? If mobile numbers dropped but desktop is fine, you're looking at a tracking issue, not a business issue.
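The segment check boils down to computing the relative change per segment and seeing whether the drop is concentrated. A hedged sketch, assuming you already have per-segment counts for the two periods:

```python
def segment_deltas(before, after):
    """Relative change per segment between two periods.

    before/after: dicts of segment -> count (e.g. signups by platform).
    A drop concentrated in one segment points at tracking, not the business."""
    return {
        seg: (after.get(seg, 0) - before[seg]) / before[seg]
        for seg in before
    }

def most_suspicious(before, after):
    """Segment with the largest relative decline."""
    deltas = segment_deltas(before, after)
    return min(deltas, key=deltas.get)
```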

Minute 5: Check the source. Go to the raw event table and count rows for today vs yesterday vs the same day last week. If raw events look normal, the issue is in the transformation. If raw events are off, the issue is in collection.
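Minute 5 really is just a GROUP BY. A self-contained illustration with sqlite3 standing in for the warehouse; the table name, column, and dates are made up:

```python
import sqlite3

# Hypothetical raw `events` table with an ISO-8601 `event_date` column;
# swap in your warehouse client and the real dates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT)")
rows = ([("2026-03-23", "u1")] * 950      # same day last week
        + [("2026-03-29", "u2")] * 1000   # yesterday
        + [("2026-03-30", "u3")] * 400)   # today
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

counts = dict(conn.execute(
    """
    SELECT event_date, COUNT(*)
    FROM events
    WHERE event_date IN ('2026-03-30', '2026-03-29', '2026-03-23')
    GROUP BY event_date
    """
))
# Today at 400 vs ~1000 on both comparison days: raw events are off,
# so the issue is in collection -- stop debugging the transformation layer.
print(counts)
```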

Five minutes. This catches the majority of data quality issues without writing a single query more complex than a GROUP BY.

Do's and Don'ts


Written with ❤️ by a human (still)