LLMs modeling

Core Philosophy

FeatureQL is a functional feature engineering language based on Entity-Centric Modeling (ECM). Features are pure functions that compose automatically through dependency tracking. When persisted through CREATE FEATURES AS SELECT ..., only definitions are stored, not data (equivalent to a SQL view at the column level).

Mental Model: From Problem to FeatureQL

Step 1: Identify Entities and Their Grain

  • List all entities (CUSTOMER, PRODUCT, ORDER, etc.)
  • Each entity needs one INPUT() that represents its primary key
  • This INPUT defines the "grain" at which features are computed

Step 2: Map External Data to Entities

  • Create one EXTERNAL_SQL per entity and origin table. EXTERNAL_SQL first argument must use the dialect of the backend database you are using.

  • For dimension tables:

EXTERNAL_SQL(
  'SELECT customer_email, customer_country, CAST(customer_date as DATE) as customer_created_at FROM customers', -- Note the optional casting and alias to match the mapped feature
  ON `SELF.customer_id = %CUSTOMER_ID`, -- Note the binding of the primary key column to the INPUT feature `CUSTOMER_ID` representing the entity `CUSTOMERS`
  AS ROW(customer_email VARCHAR, customer_country VARCHAR#COUNTRIES, customer_created_at DATE)
),
null
  • For fact tables, use the primary key as the grouping key and use ARRAY_AGG to create arrays of related records

  • Keep EXTERNAL_SQL queries simple:

    • SELECT with GROUP BY, no filters, no complex logic
    • Filters and business logic belong in features, not in SQL in the EXTERNAL_SQL clause

Step 3: Define Filters as Features

  • Never put WHERE clauses in EXTERNAL_SQL
  • Create boolean features for filters (IS_JULY_2017_ORDER, IS_REFERENCE_PRODUCT)
  • These features can be reused across queries

Step 4: Trace the Query Path

  • Determine the traversal: which entities do you need to walk through?
  • Example: PRODUCT → ORDERS → CUSTOMERS → ORDERS → PRODUCTS
  • At each step, you'll use ADD_FIELDS to enrich with related entity data

Step 5: Build Features by Composing

  • Features with the same INPUT automatically align (no join needed)
  • Use ADD_FIELDS() to enrich arrays with data from related entities
  • Use TRANSFORM() to filter, aggregate, or compute on arrays
  • Access nested fields with bracket notation: ORDERS[amount]

Key Patterns

Pattern 1: Generic, Reusable Base Features

-- Define entities (always generic)
CUSTOMERS := ENTITY(),
PRODUCTS := ENTITY(),

-- Define inputs (always generic)
CUSTOMER_ID := INPUT(BIGINT#CUSTOMERS),
PRODUCT_ID := INPUT(BIGINT#PRODUCTS),

-- External data (generic, no filters)
CUSTOMER_DETAILS := EXTERNAL_SQL(
  'SELECT customer_id, name, email FROM customers',
  AS ROW(customer_id BIGINT, name VARCHAR, email VARCHAR)
),
CUSTOMER_NAME := CUSTOMER_DETAILS[name],

-- Grouped relationships (generic)
CUSTOMER_ORDER_IDS := EXTERNAL_SQL(
  `SELECT customer_id, ARRAY_AGG(order_id) as order_ids
   FROM orders GROUP BY customer_id`,
  AS ROW(customer_id BIGINT, order_ids ARRAY(BIGINT#ORDERS))
)[order_ids],

Pattern 2: Filters as Features

If the filter is complex, you can define it as a boolean feature. If the filter is simple, you can use it directly in the query.

-- At ORDER level
ORDER_DATE := EXTERNAL_SQL(...)[order_date],
IS_JULY_2017 := ORDER_DATE >= '2017-07-01' AND ORDER_DATE < '2017-08-01',

-- At PRODUCT level
PRODUCT_NAME := EXTERNAL_SQL(...)[product_name],
IS_REFERENCE_PRODUCT := PRODUCT_NAME = 'Target Product',

Pattern 5: Array of rows handling

-- ORDER_ITEMS is already ARRAY(ROW(product_id, quantity))
-- Access field directly:
PRODUCT_IDS := ORDER_ITEMS[product_id],

-- If nested deeper:
-- ORDERS has array, each order has ORDER_ITEMS which has array
ALL_PRODUCT_IDS := CUSTOMER_ORDERS[ORDER_ITEMS][product_id],

Pattern 3: Enriching Arrays with ADD_FIELDS

-- CUSTOMER has array of order_ids called CUSTOMER_ORDER_IDS of type ARRAY(BIGINT#ORDERS)
-- Enrich with order-level features (automatic join on order_id)
CUSTOMER_ORDERS_ENRICHED := ADD_FIELDS(
  ORDER_DATE, ORDER_AMOUNT, PRODUCT_IDS  -- Features at ORDER level
  TO ZIP(CUSTOMER_ORDER_IDS as order_id)  -- Transform array into array of rows with an order_id field
  BINDING FIELDS order_id WITH ORDER_ID   -- Join condition (optional as order_id is already annotated with ORDER_ID)
),

-- If you already have an array of rows, no ROW() needed:
-- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
ORDER_ITEMS_ENRICHED := ADD_FIELDS(
  PRODUCT_NAME, PRODUCT_PRICE  -- Features at PRODUCT level
  TO ORDER_ITEMS                -- Already array of rows
  BINDING FIELDS product_id WITH PRODUCT_ID
),

Pattern 4: Operating on Arrays with TRANSFORM

Considering CUSTOMER_ORDERS_ENRICHED of type ARRAY(ROW(order_id BIGINT#ORDERS, order_date DATE, order_amount DECIMAL, product_ids ARRAY(BIGINT#PRODUCTS)))

-- Filter array
JULY_ORDERS := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  SELECT * WHERE ORDER_DATE BETWEEN '2017-07-01' AND '2017-07-31'
),

-- Aggregate on all rows
TOTAL_SPENT := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  SELECT SUM(ORDER_AMOUNT)
),

-- Check existence of a product in the customer's order items
HAS_BOUGHT_REFERENCE := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
  SELECT BOOL_OR(CONTAINS(PRODUCT_IDS, PRODUCT_ID))
),

-- Aggregate by any field or derived field
TOTAL_SPENT_BY_MONTH := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  WITH ORDER_MONTH := EXTRACT(MONTH FROM ORDER_DATE)
  SELECT
    ORDER_MONTH,
    -- Note the GROUP BY as part of the feature definition (contrarily to SQL where the GROUP BY is part of the query)
    COUNT(1) GROUP BY ORDER_MONTH AS total_orders_by_month,
    SUM(ORDER_AMOUNT) GROUP BY ORDER_MONTH AS total_spent_by_month
  ORDER BY ORDER_MONTH ASC
),

-- Get distinct values
UNIQUE_PRODUCT_IDS := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
  SELECT DISTINCT order_customer_id
),

Common Mistakes to Avoid

DO NOT: Put filters in EXTERNAL_SQL

-- WRONG
JULY_ORDERS := EXTERNAL_SQL(
  'SELECT * FROM orders WHERE order_date >= ''2017-07-01''',
  ...
)

DO: Filter with features

-- RIGHT
ALL_ORDERS := EXTERNAL_SQL('SELECT * FROM orders', ...),
JULY_ORDERS := ALL_ORDERS.TRANSFORM(SELECT * WHERE ORDER_DATE >= '2017-07-01'),

DO NOT: Use ROW() when you already have array of rows

-- WRONG: ORDER_ITEMS is already ARRAY(ROW(...))
ADD_FIELDS(
  PRODUCT_NAME
  TO ROW(product_id := ORDER_ITEMS)  -- Unnecessary ROW()
  BINDING FIELDS product_id WITH PRODUCT_ID
)

DO: Use ROW() only for scalar foreign keys

-- RIGHT: Converting scalar array to array of rows
ADD_FIELDS(
  ORDER_DATE
  TO ROW(order_id := CUSTOMER_ORDER_IDS)  -- CUSTOMER_ORDER_IDS is ARRAY(BIGINT)
  BINDING FIELDS order_id WITH ORDER_ID
)

-- RIGHT: Already array of rows
ADD_FIELDS(
  PRODUCT_NAME
  TO ORDER_ITEMS  -- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
  BINDING FIELDS product_id WITH PRODUCT_ID
)

DO NOT: Overcomplicate entity traversal

-- If you need to count something at PRODUCT level based on CUSTOMER qualification:
-- Don't create intermediate flattening/unnesting steps

DO: Keep it simple with direct enrichment

-- Get customers per product, enrich with customer features, aggregate
PRODUCT_CUSTOMER_IDS := EXTERNAL_SQL(...)  -- Array of customer_ids
CUSTOMERS_ENRICHED := ADD_FIELDS(HAS_QUALIFICATION TO ROW(...) BINDING...),
COUNT := CUSTOMERS_ENRICHED.TRANSFORM(SELECT COUNT(*) WHERE HAS_QUALIFICATION),

Problem-Solving Workflow

  1. Identify the target entity (what grain is your final answer?)
  2. List all entities in the path (what do you need to traverse?)
  3. Create generic base features (ENTITY, INPUT, EXTERNAL_SQL with no filters)
  4. Define filter features (boolean conditions at appropriate entity level)
  5. Trace forward from filters (enrich arrays, transform, enrich again)
  6. Aggregate at target entity (final TRANSFORM to compute answer)

Type Annotations

  • Use BIGINT#ENTITY_NAME to annotate foreign keys
  • This enables automatic validation and binding in ADD_FIELDS
  • Example: ARRAY(BIGINT#CUSTOMERS) for array of customer IDs

Understand feature engineering flow VS analytics flow

In feature engineering, we are building features that describe the characteristics of a known entity (a customer, a product, an order, etc.) represented by a primary key (CUSTOMER_ID, PRODUCT_ID, ORDER_ID, etc.).

In analytics, the keys to bind on are defined as the set of keys present in the dataset. The keys internal to the query are less strict and can be defined on the fly (revenue per month, category grouping, top countries...)

Note: Feature engineering queries can be executed Online or Offline. Analytics queries are always executed Offline.

Query Structure for feature engineering

WITH
  -- Entity definitions
  -- Input definitions
  -- External data (generic, reusable)
  -- Filter features
  -- Derived features (composition, enrichment, transformation)
SELECT
  PRIMARY_KEY := BIND_TABLE(...) or BIND_SQL(...),
  feature1,
  feature2...

Query Structure for analytics

WITH
  -- Same as feature engineering
SELECT
  PRIMARY_KEY := BIND_TABLE(...) or BIND_SQL(...),
  feature1,
  feature2
WHERE condition
ORDER BY feature1, feature2...
OFFSET n
LIMIT n;

Notes:

  • feature1 and feature2 can be aggregated features, not just simple features: in that case, a final aggregation will be performed.
  • Some SQL can be added wrapping this query to add more analytics complexity that would be cumbersome in FeatureQL.

Remember

  • Features are functions, not tables
  • Keep EXTERNAL_SQL simple and generic
  • Use ADD_FIELDS for enrichment across entities
  • Use TRANSFORM for operations within an entity's arrays
  • ROW() converts scalars to rows; only use when needed
  • Filters are features, not SQL WHERE clauses
  • Arrays of rows are first-class: access fields directly with brackets
Last update at: 2025/11/06 07:00:15
Last updated: 2025-11-06 07:00:51