The skill: A data model organizes raw data into clean, queryable tables that match how the business thinks. Good modeling means an analyst can answer "how many active users did we have last week?" with one simple query instead of a 200-line join across three systems.
In a Nutshell
- Three layers, always. Staging (1:1 with source, cleaned and renamed). Intermediate (joined, deduped, identity-stitched). Marts (one table per business concept, ready for dashboards).
- Staging absorbs change. When a source system renames a field or changes a type, only the staging model updates. Everything downstream is insulated.
- Name by business concept.
mart_users,mart_transactions,mart_sessions. Notstripe_charges_v2_cleaned_final. The prefix tells you the layer:stg_for staging,int_for intermediate,mart_for business-ready tables. - Define the grain explicitly. Every table should have a one-sentence grain statement: "one row per user," "one row per transaction," "one row per daily session." If you can't state the grain, the table isn't ready.
- Denormalize in marts, not before. Marts are optimized for reading. Joining user attributes onto every transaction row is fine in a mart. Doing it in staging creates a mess.
- Test everything. Uniqueness on primary keys, not-null on required columns, row count comparisons to source. If a model breaks, a test should catch it before a stakeholder does.
- Surrogate keys decouple you from source systems. Generate your own keys in the intermediate layer. When a source system resets IDs or merges records, your marts don't break.
The Three Layers in Practice
Staging is boring on purpose. One model per source table. Rename columns to a consistent convention (created_at, not createdAt or create_date). Cast types. Filter out obvious junk (test accounts, null IDs). No joins, no business logic. The staging layer's job is to normalize the interface to each source system so the rest of your models don't have to care where the data came from.
Intermediate is where the real work happens. This is where you join Segment events with Stripe charges, resolve anonymous user IDs to authenticated ones, deduplicate records, and apply business rules. The intermediate layer is where "this user signed up, then activated, then upgraded" becomes a single timeline instead of three disconnected event streams. Keep intermediate models focused: one model per transformation concept (identity stitching, subscription history, session stitching), not one giant model that does everything.
Marts are the interface to the business. One table per business concept, wide and denormalized. mart_users has everything you'd want to know about a user in one place: signup date, plan, activation status, lifetime revenue, last login. mart_transactions has every transaction with user and product attributes already joined in. Analysts should never need to write a join to answer a basic question. That's the standard you're building toward.
When to Invest
If you have one data source and five people, you don't need three layers. A single clean view on top of your production database is fine. The staging-intermediate-mart pattern earns its keep when you have multiple data sources that need to be joined, when more than one person is writing queries, or when schema changes in source systems are breaking downstream dashboards. At that point, the hour you spend building a staging model saves dozens of hours in broken queries and debugging sessions over the next year.
Do's and Don'ts
Written with ❤️ by a human (still)