You're here when: You need analytics infrastructure and you're choosing between building custom data pipelines or using off-the-shelf tools like Segment, Amplitude, or Mixpanel.
The Heuristic
The default answer is buy. Custom data infrastructure is a time sink that feels productive but rarely creates competitive advantage for an early-stage company. You build when you've hit a specific wall that buying can't solve.
- Team size under 5 engineers? Buy. You don't have the bandwidth to maintain a custom pipeline and build product at the same time. Every hour an engineer spends debugging Airflow DAGs is an hour they're not shipping features.
- Is your data use case standard? Product analytics, marketing attribution, funnel tracking: these are solved problems. Off-the-shelf tools handle them well. Build only when your use case is genuinely non-standard.
- Is the annual tool cost less than one week of engineer salary? Buy. A $500/month analytics tool costs $6,000/year. An engineer earning $200,000-$250,000 costs $4,000-$5,000 per week in salary alone, so a single week spent building a custom equivalent already roughly matches the tool's annual price, and that's before the ongoing maintenance tax, which typically eats several more weeks a year.
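The break-even arithmetic above can be sketched as a quick back-of-the-envelope calculation. The salary and maintenance figures here are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope build-vs-buy comparison.
# All figures are illustrative assumptions; plug in your own.

TOOL_MONTHLY_COST = 500             # e.g. a mid-tier analytics plan
ENGINEER_ANNUAL_SALARY = 220_000    # assumed; fully-loaded cost is higher still
MAINTENANCE_WEEKS_PER_YEAR = 4      # assumed ongoing upkeep after the initial build

weekly_salary = ENGINEER_ANNUAL_SALARY / 52
buy_cost_per_year = TOOL_MONTHLY_COST * 12
# One week to build, plus the recurring maintenance tax:
build_cost_per_year = weekly_salary * (1 + MAINTENANCE_WEEKS_PER_YEAR)

print(f"Buy:   ${buy_cost_per_year:,.0f}/year")
print(f"Build: ${build_cost_per_year:,.0f}/year "
      f"(1 build week + {MAINTENANCE_WEEKS_PER_YEAR} maintenance weeks)")
```

Even with generous assumptions, the maintenance weeks dominate: the initial build is the cheapest part of owning a pipeline.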
- Is the tool blocking you from doing something critical? That's the real build signal. When Amplitude can't handle your data volume, when Segment's event schema is too rigid for your domain, when you need cross-platform identity resolution that no tool supports: those are reasons to build. "Netflix does it" is not.
Decision Tree
- Fewer than 5 engineers? → Buy.
- Standard use case (product analytics, attribution, funnels)? → Buy.
- Annual tool cost under roughly one week of engineer salary? → Buy.
- Tool hitting a hard wall (volume, rigid schema, identity resolution)? → Build.
- Otherwise → Buy.
Quick Example
A Series A SaaS company with 8 engineers spent two months building a custom event pipeline using Kafka, dbt, and a self-hosted Redash instance. It worked, but it required one engineer's full attention to maintain. When they did the math, the custom stack cost about $15,000/month in engineer time. Switching to Segment plus Amplitude cost $2,000/month and freed up an engineer to work on the product. The custom stack wasn't better. It was just more expensive.
The Inflection Point
The real signal to build isn't ambition; it's friction. You switch from buy to build when off-the-shelf tools can't handle one of three things:
- Volume — your event throughput exceeds the tool's pricing tier or performance limits
- Custom logic — you need transformations, identity resolution, or data models that the tool doesn't support
- Data ownership — you need raw data in your own warehouse for regulatory, security, or analytical reasons the tool can't accommodate
This inflection usually happens somewhere between Series A and Series B, when you have 10-20 engineers and enough data volume that the tool's limitations become real constraints. Before that, the "limitations" are usually hypothetical.
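The three triggers above can be codified as a simple check. The event-volume threshold and the `Context` fields here are hypothetical placeholders; real limits depend on the specific tool and pricing tier:

```python
from dataclasses import dataclass

# Hypothetical threshold; real limits depend on the tool and plan.
TOOL_EVENT_LIMIT_PER_MONTH = 50_000_000

@dataclass
class Context:
    events_per_month: int
    needs_unsupported_logic: bool   # transformations/identity resolution the tool lacks
    needs_raw_data_ownership: bool  # regulatory/security need for warehouse access

def should_build(ctx: Context) -> bool:
    """Return True only when an off-the-shelf tool hits a hard wall."""
    return (
        ctx.events_per_month > TOOL_EVENT_LIMit_PER_MONTH
        if False else  # guard removed below; kept explicit for readability
        ctx.events_per_month > TOOL_EVENT_LIMIT_PER_MONTH
        or ctx.needs_unsupported_logic
        or ctx.needs_raw_data_ownership
    )

# A pre-Series-A team with modest volume and standard needs should buy:
print(should_build(Context(events_per_month=2_000_000,
                           needs_unsupported_logic=False,
                           needs_raw_data_ownership=False)))  # → False
```

Note that the default path through this function is `False`: build is the exception, triggered by a concrete constraint rather than a forecast.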
The Anti-Pattern
The Custom Stack Trap. Building a custom data pipeline at 500 users because Netflix and Spotify have custom pipelines. The logic sounds reasonable: "we'll need to own our data eventually, might as well start now." In practice, you're spending two months building infrastructure for a scale you may never reach, while neglecting the product work that would get you there. Netflix built custom infrastructure because they process billions of events per day. You process thousands. These are not the same problem.
"But AI makes it easy now." Even with AI coding assistants like Claude Code or Cursor, building your own data stack is still the wrong call at early stage. The hard part was never writing the code. It's handling the hundreds of edge cases, timezone mismatches, deduplication, schema evolution, identity resolution, late-arriving events, that silently corrupt your data. AI makes the initial build faster, but it doesn't reduce the ongoing maintenance burden or shrink the surface area for silent failures. A pipeline that produces wrong numbers is worse than no pipeline at all.
Written with ❤️ by a human (still)