LLMs modeling
Core Philosophy
FeatureQL is a functional feature engineering language based on Entity-Centric Modeling (ECM). Features are pure functions that compose automatically through dependency tracking. When persisted through CREATE FEATURES AS SELECT ..., only definitions are stored, not data (equivalent to a SQL view at the column level).
Mental Model: From Problem to FeatureQL
Step 1: Identify Entities and Their Grain
- List all entities (CUSTOMER, PRODUCT, ORDER, etc.)
- Each entity needs one INPUT() that represents its primary key
- This INPUT defines the "grain" at which features are computed
Step 2: Map External Data to Entities
Create one EXTERNAL_SQL per entity and origin table. EXTERNAL_SQL first argument must use the dialect of the backend database you are using.
For dimension tables:
EXTERNAL_SQL(
'SELECT customer_email, customer_country, CAST(customer_date as DATE) as customer_created_at FROM customers', -- Note the optional casting and alias to match the mapped feature
ON `SELF.customer_id = %CUSTOMER_ID`, -- Note the binding of the primary key column to the INPUT feature `CUSTOMER_ID` representing the entity `CUSTOMERS`
AS ROW(customer_email VARCHAR, customer_country VARCHAR#COUNTRIES, customer_created_at DATE)
), For fact tables, use the primary key as the grouping key and use ARRAY_AGG to create arrays of related records
Keep EXTERNAL_SQL queries simple:
- SELECT with GROUP BY, no filters, no complex logic
- Filters and business logic belong in features, not in SQL in the
EXTERNAL_SQLclause
Step 3: Define Filters as Features
- Never put WHERE clauses in EXTERNAL_SQL
- Create boolean features for filters (IS_JULY_2017_ORDER, IS_REFERENCE_PRODUCT)
- These features can be reused across queries
Step 4: Trace the Query Path
- Determine the traversal: which entities do you need to walk through?
- Example: PRODUCT → ORDERS → CUSTOMERS → ORDERS → PRODUCTS
- At each step, you'll use ADD_FIELDS to enrich with related entity data
Step 5: Build Features by Composing
- Features with the same INPUT automatically align (no join needed)
- Use ADD_FIELDS() to enrich arrays with data from related entities
- Use TRANSFORM() to filter, aggregate, or compute on arrays
- Access nested fields with bracket notation: ORDERS[amount]
Key Patterns
Pattern 1: Generic, Reusable Base Features
-- Define entities (always generic)
CUSTOMERS := ENTITY(),
PRODUCTS := ENTITY(),
-- Define inputs (always generic)
CUSTOMER_ID := INPUT(BIGINT#CUSTOMERS),
PRODUCT_ID := INPUT(BIGINT#PRODUCTS),
-- External data (generic, no filters)
CUSTOMER_DETAILS := EXTERNAL_SQL(
'SELECT customer_id, name, email FROM customers',
AS ROW(customer_id BIGINT, name VARCHAR, email VARCHAR)
),
CUSTOMER_NAME := CUSTOMER_DETAILS[name],
-- Grouped relationships (generic)
CUSTOMER_ORDER_IDS := EXTERNAL_SQL(
`SELECT customer_id, ARRAY_AGG(order_id) as order_ids
FROM orders GROUP BY customer_id`,
AS ROW(customer_id BIGINT, order_ids ARRAY(BIGINT#ORDERS))
)[order_ids],
Pattern 2: Filters as Features
If the filter is complex, you can define it as a boolean feature. If the filter is simple, you can use it directly in the query.
-- At ORDER level
ORDER_DATE := EXTERNAL_SQL(...)[order_date],
IS_JULY_2017 := ORDER_DATE >= '2017-07-01' AND ORDER_DATE < '2017-08-01',
-- At PRODUCT level
PRODUCT_NAME := EXTERNAL_SQL(...)[product_name],
IS_REFERENCE_PRODUCT := PRODUCT_NAME = 'Target Product',
Pattern 5: Array of rows handling
-- ORDER_ITEMS is already ARRAY(ROW(product_id, quantity))
-- Access field directly:
PRODUCT_IDS := ORDER_ITEMS[product_id],
-- If nested deeper:
-- ORDERS has array, each order has ORDER_ITEMS which has array
ALL_PRODUCT_IDS := CUSTOMER_ORDERS[ORDER_ITEMS][product_id],
Pattern 3: Enriching Arrays with ADD_FIELDS
-- CUSTOMER has array of order_ids called CUSTOMER_ORDER_IDS of type ARRAY(BIGINT#ORDERS)
-- Enrich with order-level features (automatic join on order_id)
CUSTOMER_ORDERS_ENRICHED := ADD_FIELDS(
ORDER_DATE, ORDER_AMOUNT, PRODUCT_IDS -- Features at ORDER level
TO ZIP(CUSTOMER_ORDER_IDS as order_id) -- Transform array into array of rows with an order_id field
BINDING FIELDS order_id WITH ORDER_ID -- Join condition (optional as order_id is already annotated with ORDER_ID)
),
-- If you already have an array of rows, no ROW() needed:
-- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
ORDER_ITEMS_ENRICHED := ADD_FIELDS(
PRODUCT_NAME, PRODUCT_PRICE -- Features at PRODUCT level
TO ORDER_ITEMS -- Already array of rows
BINDING FIELDS product_id WITH PRODUCT_ID
),
Pattern 4: Operating on Arrays with TRANSFORM
Considering CUSTOMER_ORDERS_ENRICHED of type ARRAY(ROW(order_id BIGINT#ORDERS, order_date DATE, order_amount DECIMAL, product_ids ARRAY(BIGINT#PRODUCTS)))
-- Filter array
JULY_ORDERS := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
SELECT * WHERE ORDER_DATE BETWEEN '2017-07-01' AND '2017-07-31'
),
-- Aggregate on all rows
TOTAL_SPENT := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
SELECT SUM(ORDER_AMOUNT)
),
-- Check existence of a product in the customer's order items
HAS_BOUGHT_REFERENCE := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
SELECT BOOL_OR(CONTAINS(PRODUCT_IDS, PRODUCT_ID))
),
-- Aggregate by any field or derived field
TOTAL_SPENT_BY_MONTH := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
WITH ORDER_MONTH := EXTRACT(MONTH FROM ORDER_DATE)
SELECT
ORDER_MONTH,
-- Note the GROUP BY as part of the feature definition (contrarily to SQL where the GROUP BY is part of the query)
COUNT(1) GROUP BY ORDER_MONTH AS total_orders_by_month,
SUM(ORDER_AMOUNT) GROUP BY ORDER_MONTH AS total_spent_by_month
ORDER BY ORDER_MONTH ASC
),
-- Get distinct values
UNIQUE_PRODUCT_IDS := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
SELECT DISTINCT order_customer_id
),
Common Mistakes to Avoid
DO NOT: Put filters in EXTERNAL_SQL
-- WRONG
JULY_ORDERS := EXTERNAL_SQL(
'SELECT * FROM orders WHERE order_date >= ''2017-07-01''',
...
)
DO: Filter with features
-- RIGHT
ALL_ORDERS := EXTERNAL_SQL('SELECT * FROM orders', ...),
JULY_ORDERS := ALL_ORDERS.TRANSFORM(SELECT * WHERE ORDER_DATE >= '2017-07-01'),
DO NOT: Use ROW() when you already have array of rows
-- WRONG: ORDER_ITEMS is already ARRAY(ROW(...))
ADD_FIELDS(
PRODUCT_NAME
TO ROW(product_id := ORDER_ITEMS) -- Unnecessary ROW()
BINDING FIELDS product_id WITH PRODUCT_ID
)
DO: Use ROW() only for scalar foreign keys
-- RIGHT: Converting scalar array to array of rows
ADD_FIELDS(
ORDER_DATE
TO ROW(order_id := CUSTOMER_ORDER_IDS) -- CUSTOMER_ORDER_IDS is ARRAY(BIGINT)
BINDING FIELDS order_id WITH ORDER_ID
)
-- RIGHT: Already array of rows
ADD_FIELDS(
PRODUCT_NAME
TO ORDER_ITEMS -- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
BINDING FIELDS product_id WITH PRODUCT_ID
)
DO NOT: Overcomplicate entity traversal
-- If you need to count something at PRODUCT level based on CUSTOMER qualification:
-- Don't create intermediate flattening/unnesting steps
DO: Keep it simple with direct enrichment
-- Get customers per product, enrich with customer features, aggregate
PRODUCT_CUSTOMER_IDS := EXTERNAL_SQL(...) -- Array of customer_ids
CUSTOMERS_ENRICHED := ADD_FIELDS(HAS_QUALIFICATION TO ROW(...) BINDING...),
COUNT := CUSTOMERS_ENRICHED.TRANSFORM(SELECT COUNT(*) WHERE HAS_QUALIFICATION),
Problem-Solving Workflow
- Identify the target entity (what grain is your final answer?)
- List all entities in the path (what do you need to traverse?)
- Create generic base features (ENTITY, INPUT, EXTERNAL_SQL with no filters)
- Define filter features (boolean conditions at appropriate entity level)
- Trace forward from filters (enrich arrays, transform, enrich again)
- Aggregate at target entity (final TRANSFORM to compute answer)
Type Annotations
- Use
BIGINT#ENTITY_NAMEto annotate foreign keys - This enables automatic validation and binding in ADD_FIELDS
- Example:
ARRAY(BIGINT#CUSTOMERS)for array of customer IDs
Understand feature engineering flow VS analytics flow
In feature engineering, we are building features that describe the characteristics of a known entity (a customer, a product, an order, etc.) represented by a primary key (CUSTOMER_ID, PRODUCT_ID, ORDER_ID, etc.).
In analytics, the keys to bind on are defined as the set of keys present in the dataset. The keys internal to the query are less strict and can be defined on the fly (revenue per month, category grouping, top countries...)
Note: Feature engineering queries can be executed Online or Offline. Analytics queries are always executed Offline.
Query Structure for feature engineering
WITH
-- Entity definitions
-- Input definitions
-- External data (generic, reusable)
-- Filter features
-- Derived features (composition, enrichment, transformation)
SELECT
PRIMARY_KEY := BIND_TABLE(...) or BIND_SQL(...),
feature1,
feature2...
Query Structure for analytics
WITH
-- Same as feature engineering
SELECT
PRIMARY_KEY := BIND_TABLE(...) or BIND_SQL(...),
feature1,
feature2
WHERE condition
ORDER BY feature1, feature2...
OFFSET n
LIMIT n;
Notes:
- feature1 and feature2 can be aggregated features, not just simple features: in that case, a final aggregation will be performed.
- Some SQL can be added wrapping this query to add more analytics complexity that would be cumbersome in FeatureQL.
Remember
- Features are functions, not tables
- Keep EXTERNAL_SQL simple and generic
- Use ADD_FIELDS for enrichment across entities
- Use TRANSFORM for operations within an entity's arrays
- ROW() converts scalars to rows; only use when needed
- Filters are features, not SQL WHERE clauses
- Arrays of rows are first-class: access fields directly with brackets