Semantic Model¶
Starting from version 0.2.4, semantic models serve as schema extensions that enrich database tables with semantic information. They define table structures (columns, dimensions, measures, entities) to help the agent better understand data for ad-hoc query generation.
Core Value¶
Enhances database schema understanding for better SQL generation: - Rich column descriptions: Augment DDL comments with usage patterns and filter examples - Dimension & measure classification: Explicitly mark columns as dimensions (for grouping) or measures (for aggregation) - Entity relationships: Define foreign key relationships between tables for accurate joins - Query pattern insights: Capture how columns are typically filtered (LIKE, IN, FIND_IN_SET, etc.) from historical queries
How It Works¶
Semantic models define the foundational schema layer. Starting from version 0.2.4, they operate independently:
- Semantic Models (this document): Schema extensions with dimensions, measures, and entity relationships
- Storage:
semantic_modeltable in LanceDB -
Purpose: Help agent understand table structures for ad-hoc SQL generation
-
Metrics (see metrics.md): Business calculations built on semantic models
- Storage:
metricstable in LanceDB - Purpose: Standardized KPIs queryable via MetricFlow
Semantic models provide the building blocks (dimensions, measures) that metrics reference.
Storage Structure¶
Semantic model objects are stored at field level:
# Stored objects (kind field):
- "table": Table-level metadata
- "column": Column-level metadata with semantic flags
- "entity": Entity definitions for relationships
# Semantic flags for columns:
- is_dimension: Column used for grouping/filtering
- is_measure: Column used for aggregation
- is_entity_key: Column used for table joins
Usage¶
Prerequisites: This command relies on datus-semantic-metricflow, install it first with pip install datus-semantic-metricflow.
Basic Command¶
# From CSV (historical SQLs)
datus-agent bootstrap-kb \
--namespace <your_namespace> \
--components semantic_model \
--success_story path/to/success_story.csv
# From YAML (existing semantic model files)
datus-agent bootstrap-kb \
--namespace <your_namespace> \
--components semantic_model \
--semantic_yaml path/to/semantic_model.yaml
Key Parameters¶
| Parameter | Required | Description | Example |
|---|---|---|---|
--namespace |
✅ | Database namespace | sales_db |
--components |
✅ | Components to initialize | semantic_model |
--success_story |
⚠️ | CSV file with historical SQLs (required if no --semantic_yaml) |
success_story.csv |
--semantic_yaml |
⚠️ | Semantic model YAML file (required if no --success_story) |
semantic_model.yaml |
--kb_update_strategy |
✅ | Update strategy | overwrite/incremental |
Data Source Formats¶
CSV Format¶
question,sql
How many orders per customer?,SELECT customer_id, COUNT(*) as order_count FROM orders GROUP BY customer_id;
What is the average order amount?,SELECT AVG(amount) FROM orders WHERE status = 'completed';
The agent analyzes these SQLs to: - Extract table structures - Identify dimensions (GROUP BY columns) and measures (aggregated columns) - Discover column usage patterns (WHERE clause filters) - Infer entity relationships (JOIN patterns)
YAML Format¶
MetricFlow-compatible semantic model YAML:
data_source:
name: orders
description: "Order transactions table"
identifiers:
- name: order
type: PRIMARY
expr: id
description: "Unique order identifier"
- name: customer
type: FOREIGN
expr: customer_id
description: "References customers table"
dimensions:
- name: status
type: CATEGORICAL
expr: status
description: "Order status. Use IN for filtering. Common values: 'pending', 'completed', 'cancelled'"
- name: created_at
type: TIME
expr: created_at
description: "Order creation timestamp"
measures:
- name: amount
agg: sum
expr: amount
description: "Order amount in USD"
- name: order_count
agg: count
expr: id
description: "Count of orders"
How Semantic Models Enhance SQL Generation¶
When generating SQL, the agent uses semantic models to:
- Schema Linking: Find relevant tables and columns using vector search on descriptions
- Proper JOIN Construction: Use entity relationships to generate correct JOIN conditions
- Smart Filtering: Apply usage patterns (e.g., use
FIND_IN_SET()for comma-separated tag columns) - Accurate Aggregation: Select appropriate measures for aggregation functions
Example query flow:
User: "Show me total revenue by customer status"
Agent process:
1. Search semantic objects: "revenue", "customer", "status"
2. Find: orders.amount (measure, agg=sum), customers.status (dimension)
3. Use entity relationship: orders.customer_id → customers.id
4. Generate SQL:
SELECT c.status, SUM(o.amount) as revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.status
Integration with Context Search¶
Semantic models are searchable via @subject context command:
# In CLI chat mode
@subject <domain>/<layer1>/<layer2>
# Search for semantic objects
search_semantic_objects(query="customer revenue", kinds=["table", "column"])
Summary¶
Semantic models extend database schemas with rich semantic information, enabling more accurate ad-hoc SQL generation. Starting from version 0.2.4, they operate independently from metrics:
Key characteristics: - Schema extensions: Augment physical schemas with business semantics - Field-level storage: Store tables, columns, entities separately for flexible search - Usage pattern enrichment: Capture real-world query patterns from historical SQLs - Independent from metrics: Focus on schema understanding, not metric calculations
This separation ensures semantic models remain focused on their core purpose: helping the agent understand your data structure for generating correct, efficient SQL queries.