Leverage Entity-Centered Modeling

This page explains the Entity-Centered Modeling (ECM) approach and how to implement it with FeatureQL.

What is ECM?

Entity-Centered Modeling (ECM) organizes analytical data around business entities (customer, product, store) rather than transactions.

Each entity gets a wide, enriched view containing its attributes and arrays of related facts, eliminating most joins at query time. This is also named OBT (one Big Table) in the industry.

It's a modeling that is very intuitive for the business user and efficient as long as they dispose of the right tools to query the data.

Core Architecture Pattern

In the silver layer, once tables have been cleaned, normalized, enriched, we can build the ECM tables by aggregating the facts related to each entity.

-- Silver Layer: ECM tables built from normalized facts

CREATE TABLE silver.customer_obt_from_orders AS
SELECT
    customer_id,
    -- Immutable attributes
    MIN(order_date) as order_first_date,
    -- Mutable attributes with SCD2 tracking
    MAX_BY(country, order_date) as order_last_country,
    -- Arrays of facts with more or less details depending on the needs
    ARRAY_AGG(
        STRUCT(order_id, order_date, amount)
        ORDER BY order_date DESC
    ) AS order_history_sorted,
    $date_ref as ds
FROM silver.orders_normalized
GROUP BY customer_id;

-- Silver Layer: ECM tables built from normalized facts
CREATE TABLE silver.customer_obt_from_customers AS
SELECT
    customer_id,
    -- Immutable attributes
    registration_date,
    -- Mutable attributes with SCD2 tracking
    country, segment,
    $date_ref as ds
FROM silver.customers_normalized
GROUP BY customer_id;

-- View/table that joins the two tables
CREATE VIEW silver.customer_obt AS
SELECT * FROM silver.customer_obt_from_customers c
LEFT JOIN silver.customer_obt_from_orders o ON c.customer_id = o.customer_id and c.ds=o.ds;
sql

What Goes in Fact Arrays?

The key insight of ECM is to be selective about what goes into these fact arrays.

Only include:

  • IDs for lookup joins (customer_id, product_id, store_id) - integers that compress extremely well
  • Timestamps for filtering (order_date, event_time) - essential for time-based analysis and partition pruning
  • Enum-type attributes you filter on frequently (status, event_type, channel) - low-cardinality values that compress also extremely well

Keep these arrays minimal and highly compressible. Any other potentially useful data (product names, categories, descriptions, store locations) stays in dimension tables and is accessed through inexpensive lookup joins when needed. This way, you get the filtering performance of denormalized data without the storage and decompression overhead of wide tables.

Avoid adding attributes that are computed from other tables as it introduces dependencies and make the table slower to land. Beware of circular dependencies.

The magic happens when you filter first on these compressed arrays (especially timestamps), then join only the filtered subset with dimension tables. Since you're typically filtering to a small percentage of rows (e.g., last 30 days of a year), the subsequent joins are well optimized.

Smart History Management

Single Current Partition: Keep only the latest state as the primary table, rebuilt incrementally each day by merging new facts with yesterday's state. No wasteful daily snapshots.

SCD Type 2 for Attributes: Track changes to slowly-changing dimensions as an array of rows with value and validity dates. This preserves history without duplication.

Arrays for Point-in-Time: Historical facts stored in arrays enable time-travel without snapshots.

Key Design Principles

Additional tables built on normalized ones: ECM tables are materialized tables built in the silver layer, derived from normalized cleaned and conformed tables. This preserves data lineage and enables easy rebuilds.

Incremental Not Full Refresh: After initial build, each day only processes new facts and merges with existing state. Much more efficient than daily full rebuilds.

Use Arrays for Flexibility: Store facts in sorted arrays. This enables both local computations and historical reconstruction without storing multiple snapshots.

Keep Entities Pure: Entity tables should only group by actual IDs. For coarse-grain analysis (such as country or category), use views that flatten and reaggregate:

CREATE VIEW product_category_sales AS
SELECT category, FLATTEN(ARRAY_AGG(sale_history)) as sales
FROM silver.product GROUP BY category
sql

Implementation Recipe

  1. Initial Build: Create full ECM table from all historical silver facts
  2. Daily Incremental:
    • Process new facts only
    • Merge with yesterday's state
    • Update SCD2 records for changed attributes
    • Append to sorted fact arrays
  3. Set Array Boundaries: Keep all history (capped to n facts) or limit to recent period based on needs
  4. Monitor and Optimize: Check how columns are used and optimize them

Addressing Common Concerns

ConcernSolution
"Daily snapshots are wasteful"Don't create them. Keep one current partition + SCD2 for changes + arrays for fact history.
"Schema evolution is complex"Rebuild the array column with ARRAY_AGG GROUP BY. Arrays come from cleaned and conformed tables.
"How do I query historical state?"Filter arrays by date or use SCD2 validity ranges. No snapshot needed.
"Joins are expensive"Modern columnar databases handle ID lookups efficiently. You're joining on pre-filtered data (often <10% of rows), making it even faster.
"This duplicates data across entities"Only IDs and timestamps are duplicated. They compress very well. Additional storage cost is minimal vs compute savings.

When ECM Shines

  • Self-service analytics: Business users write simple filters instead of complex joins
  • Time-travel queries: Reconstruct any historical state from arrays and SCD2
  • ML features: Derived features are fast to compute and denormalization is not needed
  • Dashboard performance: Displaying a chart often requires only one table scan limiting the need for denormalisation

Summary

ECM trades storage for simplicity and speed while being smart about history. One current daily partition with SCD2 tracking and fact arrays gives you everything: current state performance, historical analysis capability, and storage efficiency without wasteful daily snapshots. Build incrementally in silver, preserve history intelligently, and deliver easy to use datasets to your users.


Based on concepts formalized by Maxime Beauchemin in 2023 with performance insights from modern database benchmarks

Last update at: 2025/11/06 07:00:15
Last updated: 2025-11-06 07:00:51