Use this when: Your raw event data is a mess of JSON blobs, duplicate rows, and inconsistent naming. Analysts are writing 200-line queries with five joins just to answer basic questions. You need a clean layer between raw data and business decisions.
You're done when: You have staging, intermediate, and mart tables documented and tested, and an analyst can answer "how many active users did we have last week?" with a single query against one table.
The Three Layers
A data model has three layers. Each one does a different job.
Staging is the first stop for raw data. You take whatever your sources give you, and you clean it up — just enough. Rename columns to a consistent convention. Cast types ("2024-01-15" as a string becomes a proper date). Drop true duplicates. Nothing more. One staging model per source table. The staging layer is 1:1 with the source, and it exists so that everything downstream never has to touch raw data directly.
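In SQL terms, a staging model is little more than a SELECT with renames, casts, and a dedupe. A minimal sketch using SQLite (all table and column names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical raw source table: camelCase names, string dates, one duplicate row.
con.executescript("""
CREATE TABLE raw_users (userId TEXT, createdAt TEXT);
INSERT INTO raw_users VALUES
    ('u1', '2024-01-15'),
    ('u2', '2024-02-03'),
    ('u1', '2024-01-15');   -- true duplicate

-- Staging model: 1:1 with the source. Rename, cast, dedupe. Nothing more.
CREATE VIEW stg_users AS
SELECT DISTINCT
    userId          AS user_id,     -- consistent snake_case naming
    DATE(createdAt) AS created_at   -- string becomes a proper date
FROM raw_users;
""")

rows = con.execute(
    "SELECT user_id, created_at FROM stg_users ORDER BY user_id"
).fetchall()
print(rows)  # duplicate dropped, columns renamed and cast
```

Everything downstream now queries `stg_users`, never `raw_users`.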
Intermediate is where you do the real transformation work. Join sources together. Build business entities. Resolve identities (the same user showing up in Stripe, Segment, and HubSpot as three different IDs). This is the logic layer — where you encode what "an active user" means, how you stitch anonymous sessions to authenticated accounts, or how you compute a subscription's current state from a stream of events. Intermediates feed the final layer but aren't usually queried directly.
Mart is what analysts and dashboards actually use. One mart, one purpose. mart_users has one row per user. mart_transactions has one row per transaction. These tables are wide (lots of columns), pre-joined, and ready to query. An analyst should be able to answer most business questions with a single SELECT against a mart table.
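A sketch of the payoff, using SQLite and a made-up `mart_users` table: the "how many active users did we have last week?" question from the goal above becomes one SELECT against one table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical mart table: one row per user, wide and pre-joined.
con.executescript("""
CREATE TABLE mart_users (
    user_id          TEXT PRIMARY KEY,
    plan             TEXT,
    signup_date      TEXT,
    last_active_date TEXT
);
INSERT INTO mart_users VALUES
    ('u1', 'pro',  '2023-11-02', '2024-03-14'),
    ('u2', 'free', '2024-01-20', '2024-02-01'),
    ('u3', 'pro',  '2024-02-11', '2024-03-12');
""")

# One SELECT, one table. The date literal stands in for "start of last week"
# in this toy data; a real query would compute it from the current date.
(active,) = con.execute("""
    SELECT COUNT(*) FROM mart_users
    WHERE last_active_date >= '2024-03-11'
""").fetchone()
print(active)
```

No joins, no business logic in the query. The logic lives upstream, in the intermediate models that built the mart.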
Think of it like a restaurant kitchen. Staging is the prep station — washing vegetables, portioning proteins, standardizing ingredients. Intermediate is the line cook station — building components like sauces and stocks that go into multiple dishes. Mart is plating — the finished dish the customer gets.
Schema Patterns
How you structure tables within those layers depends on what you're modeling. Two patterns come up constantly.
Star schema is the classic. You have a central fact table (things that happened — orders, sessions, payments) and surrounding dimension tables (context — users, products, dates). The fact table has foreign keys pointing to the dimensions. To analyze orders by user segment and product category, you join the fact table to the user and product dimensions. It's called a star because the diagram looks like one: fact in the middle, dimensions radiating out.
Star schema works well when your business revolves around transactions or events with rich context. B2C e-commerce, SaaS billing, marketing attribution.
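A minimal SQLite sketch of the star shape, with hypothetical `fct_orders`, `dim_users`, and `dim_products` tables: the fact table carries foreign keys, and the analysis joins out to the dimensions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical star: fact in the middle, dimensions radiating out.
con.executescript("""
CREATE TABLE dim_users    (user_id TEXT PRIMARY KEY, segment TEXT);
CREATE TABLE dim_products (product_id TEXT PRIMARY KEY, category TEXT);
CREATE TABLE fct_orders   (order_id TEXT, user_id TEXT, product_id TEXT, amount REAL);

INSERT INTO dim_users    VALUES ('u1', 'smb'), ('u2', 'enterprise');
INSERT INTO dim_products VALUES ('p1', 'widgets'), ('p2', 'gadgets');
INSERT INTO fct_orders   VALUES
    ('o1', 'u1', 'p1', 10.0),
    ('o2', 'u2', 'p1', 25.0),
    ('o3', 'u2', 'p2', 40.0);
""")

# Orders by user segment and product category: join the fact to its dimensions.
rows = con.execute("""
    SELECT u.segment, p.category, SUM(f.amount)
    FROM fct_orders f
    JOIN dim_users u    ON f.user_id = u.user_id
    JOIN dim_products p ON f.product_id = p.product_id
    GROUP BY u.segment, p.category
    ORDER BY u.segment, p.category
""").fetchall()
print(rows)
```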
Activity schema is newer and takes a different approach. Instead of separate fact and dimension tables, everything is modeled as a single activities table: one row per thing that happened, with a consistent structure (activity_type, ts, entity_id, and a JSON blob for extra details). You build mart views on top of that single table using window functions and pivots. The advantage is flexibility — adding a new event type doesn't require altering the schema, just inserting new rows with a new activity_type value.
Activity schema works well for product analytics and user behavior modeling, where you have many event types and the relationships between them are what matters.
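A sketch of the single-table shape in SQLite, with made-up activities: one `activities` table, and a conditional-aggregation pivot (one common way to build the mart view; the same idea extends to window functions) that turns event rows into one row per entity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical activities table: one row per thing that happened,
# with a consistent structure across all event types.
con.executescript("""
CREATE TABLE activities (activity_type TEXT, ts TEXT, entity_id TEXT, details TEXT);
INSERT INTO activities VALUES
    ('signed_up',     '2024-01-01', 'u1', '{}'),
    ('started_trial', '2024-01-02', 'u1', '{}'),
    ('subscribed',    '2024-01-20', 'u1', '{"plan":"pro"}'),
    ('signed_up',     '2024-02-05', 'u2', '{}');
""")

# Mart view built by pivoting the one table: first occurrence of
# selected activity types, one row per entity. MIN ignores the NULLs
# produced by non-matching rows.
rows = con.execute("""
    SELECT entity_id,
           MIN(CASE WHEN activity_type = 'signed_up'  THEN ts END) AS signed_up_at,
           MIN(CASE WHEN activity_type = 'subscribed' THEN ts END) AS subscribed_at
    FROM activities
    GROUP BY entity_id
    ORDER BY entity_id
""").fetchall()
print(rows)
```

Note that adding a `started_trial` column later means editing this view, not altering any table: the new event type already exists as rows.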
A useful rule: for financial or operational data with clear entities and fixed relationships, use a star schema. For user journeys and product behavior with many event types that evolve over time, an activity schema is worth considering.
The Staging Layer Is the Most Underrated Layer
When teams start modeling data, they want to jump straight to the interesting part: mart tables with business metrics. The staging layer feels boring. It's 1:1 with the source. You're just renaming columns and casting types. Why bother?
Because the staging layer is where you absorb change. When Stripe adds a new field to their API, your staging model handles it. When your product team renames an event, the staging layer maps the old name to the new one. When a third-party source changes their schema (and they will, without warning), only one file needs to update.
Without staging, a schema change in Stripe cascades through every query, dashboard, and report that touches billing data. With staging, it cascades to exactly one file. The rest of the model doesn't know or care.
This is the same principle as interface boundaries in software engineering. The staging layer is an abstraction barrier between systems you don't control (source data) and systems you do (your analytics models). Every week you skip building it, you're accumulating debt that compounds.
Example
A B2B SaaS company had three data sources: Segment events (product usage), Stripe (billing), and HubSpot (sales pipeline). Their analysts were writing raw SQL against all three, joining on email addresses that sometimes matched and sometimes didn't.
They built the model in layers. Staging: one model per source table, standardizing column names (created_at everywhere, not createdAt in Segment and create_date in HubSpot) and casting types. Intermediate: an identity stitching model that resolved anonymous Segment IDs to authenticated user IDs, and a subscription history model that joined Stripe charges with HubSpot deal stages. Marts: mart_users (one row per user with signup date, activation date, plan, LTV), mart_transactions (one row per charge with plan context), mart_sessions (one row per product session with feature usage).
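The identity-stitching intermediate can be sketched roughly like this in SQLite (table names and the resolution rule are illustrative, not the company's actual logic): collect the authenticated `user_id` ever seen on each `anonymous_id`, then use it to resolve every event.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical staged events: anonymous sessions, some later authenticated.
con.executescript("""
CREATE TABLE stg_events (anonymous_id TEXT, user_id TEXT, event TEXT);
INSERT INTO stg_events VALUES
    ('a1', NULL, 'page_view'),   -- anonymous session...
    ('a1', 'u1', 'login'),       -- ...later identified as u1
    ('a2', NULL, 'page_view');   -- never identified

-- Intermediate: one resolved user_id per anonymous_id.
-- MAX ignores NULLs, so it picks up the authenticated id when one exists.
CREATE VIEW int_identity_map AS
SELECT anonymous_id, MAX(user_id) AS user_id
FROM stg_events
GROUP BY anonymous_id;
""")

# Stitch: every event gets the resolved id, falling back to the anonymous id.
rows = con.execute("""
    SELECT e.event, COALESCE(m.user_id, e.anonymous_id) AS resolved_id
    FROM stg_events e
    JOIN int_identity_map m USING (anonymous_id)
    ORDER BY e.event, resolved_id
""").fetchall()
print(rows)
```

Real identity resolution is messier (conflicting ids, time windows, merges), but the shape is the same: one intermediate model owns the rule, and every mart inherits it.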
The whole build took eight days. The first payoff came two weeks later when Stripe changed their webhook payload format. One staging model updated. Zero downstream breakage. The analyst who used to spend Mondays fixing dashboards spent that Monday actually analyzing data.
Written with ❤️ by a human (still)