Builtin Subagents¶
Overview¶
The Builtin Subagents are specialized AI assistants integrated within the Datus Agent system. Each subagent focuses on a specific aspect of data engineering automation — analyzing SQL, generating semantic models, and converting queries into reusable metrics — together forming a closed-loop workflow from raw SQL to knowledge-aware data products.
This document covers three core subagents:
- gen_sql_summary — Summarizes and classifies SQL queries
- gen_semantic_model — Generates MetricFlow semantic models
- gen_metrics — Generates MetricFlow metric definitions
Prerequisites¶
Configuration Required
Before using builtin subagents, you must configure them in your agent.yml file.
Steps to enable builtin subagents:
1. Copy configuration from the example file

   The builtin subagents are defined in `conf/agent.yml.example` under the `agentic_nodes` section.

2. Add to your agent.yml

   Copy the relevant subagent configurations from the `agentic_nodes` section of `conf/agent.yml.example` to your `conf/agent.yml`:

   ```yaml
   agent:
     # ... other configuration ...
     agentic_nodes:
       gen_semantic_model:
         model: anthropic
         system_prompt: gen_semantic_model
         prompt_version: "1.0"
         tools: db_tools.*, generation_tools.*, filesystem_tools.*
         hooks: generation_hooks
         mcp: metricflow_mcp
         workspace_root: ~/.datus/data/semantic_models
         agent_description: "Semantic model generation assistant"
         # ... more configuration ...
       gen_metrics:
         model: anthropic
         system_prompt: gen_metrics
         # ... more configuration ...
       gen_sql_summary:
         model: deepseek_v3
         system_prompt: gen_sql_summary
         # ... more configuration ...
   ```

3. Verify configuration

   Start the Datus CLI and check that the subagents are available.
Quick Start
See the complete example configuration at conf/agent.yml.example
gen_sql_summary¶
Overview¶
The SQL Summary feature helps you analyze, classify, and catalog SQL queries for knowledge reuse. It automatically generates structured YAML summaries that are stored in a searchable Knowledge Base, making it easy to find and reuse similar queries in the future.
What is a SQL Summary?¶
A SQL summary is a structured YAML document that captures:
- Query Text: The complete SQL query
- Business Context: Domain, categories, and tags
- Semantic Summary: Detailed explanation for vector search
- Metadata: Name, comment, file path
Quick Start¶
Launch the SQL summary generation subagent:
```
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales GROUP BY region
```

You can also include a short description of the SQL in the message.
Generation Workflow¶
```mermaid
graph LR
    A[User provides SQL + description] --> B[Agent analyzes query]
    B --> C[Retrieves context]
    C --> D[Generates unique ID]
    D --> E[Creates YAML]
    E --> F[Saves file]
    F --> G[User confirms]
    G --> H[Syncs to Knowledge Base]
```
Detailed Steps:

1. Understand SQL: The AI analyzes your query structure and business logic
2. Get Context: Calls `prepare_sql_summary_context()` to retrieve:
   - Existing taxonomy (domains, categories, tags)
   - Similar SQL summaries for classification reference
3. Generate Unique ID: Uses `generate_sql_summary_id()` based on the SQL + comment
4. Create Unique Name: Generates a descriptive name (max 20 characters)
5. Classify Query: Assigns domain, layer1, layer2, and tags following existing patterns
6. Generate YAML: Creates a structured summary document
7. Save File: Writes the YAML to the workspace using `write_file()`
8. User Confirmation: Shows the generated YAML and prompts for approval
9. Sync to Knowledge Base: Stores the summary in LanceDB for semantic search
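The unique-ID step can be pictured with a small sketch. This is an illustrative guess at the scheme, not the actual `generate_sql_summary_id()` implementation — the document only states that the ID is derived from the SQL plus the comment, and the YAML example shows an MD5-style hash:

```python
import hashlib

def sql_summary_id(sql: str, comment: str) -> str:
    """Illustrative stand-in for generate_sql_summary_id():
    derive a stable MD5 hex digest from the SQL text plus its comment."""
    payload = f"{sql.strip()}\n{comment.strip()}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()

sql = "SELECT SUM(revenue) FROM sales GROUP BY region"
comment = "Calculate total revenue grouped by region"
print(sql_summary_id(sql, comment))  # 32-character hex string, stable across runs
```

Because the ID is content-derived, re-summarizing the same SQL with the same comment produces the same ID, which is what enables duplicate detection.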
Interactive Confirmation¶
After generation, you'll see:
```
==========================================================
Generated Reference SQL YAML
File: /path/to/sql_summary.yml
==========================================================

[YAML content with syntax highlighting]

SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
```
Configuration¶
Agent Configuration¶
In agent.yml:
```yaml
agentic_nodes:
  gen_sql_summary:
    model: deepseek                  # LLM model for analysis
    system_prompt: gen_sql_summary   # Prompt template
    prompt_version: "1.0"
    tools: generation_tools.prepare_sql_summary_context, generation_tools.generate_sql_summary_id, filesystem_tools.write_file
    hooks: generation_hooks          # Enable confirmation workflow
    workspace_root: /path/to/reference_sql   # Directory to save YAML files
    agent_description: "reference SQL analysis assistant"
```
Key Configuration Options¶
| Parameter | Description | Example |
|---|---|---|
| `model` | LLM model for SQL analysis | `deepseek`, `claude`, `openai`, `kimi` |
| `workspace_root` | Directory for SQL summary YAML files | `/Users/you/.datus/data/reference_sql` |
| `tools` | Required tools for the workflow | See tools section below |
| `hooks` | Enable interactive confirmation | `generation_hooks` |
YAML Structure¶
The generated SQL summary follows this structure:
```yaml
id: "abc123def456..."         # Auto-generated MD5 hash
name: "Revenue by Region"     # Descriptive name (max 20 chars)
sql: |                        # Complete SQL query
  SELECT
    region,
    SUM(revenue) as total_revenue
  FROM sales
  GROUP BY region
comment: "Calculate total revenue grouped by region"
summary: "This query aggregates total revenue from the sales table, grouping results by geographic region. It uses SUM aggregation to calculate revenue totals for each region."
filepath: "/Users/you/.datus/data/reference_sql/revenue_by_region.yml"
domain: "Sales"               # Business domain
layer1: "Reporting"           # Primary category
layer2: "Revenue Analysis"    # Secondary category
tags: "revenue, region, aggregation"  # Comma-separated tags
```
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
| `id` | Yes | Unique hash (auto-generated) | `abc123def456...` |
| `name` | Yes | Short descriptive name (max 20 chars) | `Revenue by Region` |
| `sql` | Yes | Complete SQL query | `SELECT ...` |
| `comment` | Yes | Brief one-line description | User's message or generated summary |
| `summary` | Yes | Detailed explanation (for search) | Comprehensive query description |
| `filepath` | Yes | Actual file path | `/path/to/file.yml` |
| `domain` | Yes | Business domain | `Sales`, `Marketing`, `Finance` |
| `layer1` | Yes | Primary category | `Reporting`, `Analytics`, `ETL` |
| `layer2` | Yes | Secondary category | `Revenue Analysis`, `Customer Insights` |
| `tags` | Optional | Comma-separated keywords | `revenue, region, aggregation` |
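A downstream consumer can sanity-check a summary against the table above before syncing it. A minimal sketch (a plain Python dict stands in for the parsed YAML; the field names come from this document, the validation logic itself is illustrative):

```python
# Required fields per the table above; `tags` is optional.
REQUIRED = ["id", "name", "sql", "comment", "summary",
            "filepath", "domain", "layer1", "layer2"]

def validate_summary(doc: dict) -> list:
    """Return a list of problems; an empty list means the summary looks valid."""
    problems = [f"missing field: {f}" for f in REQUIRED if not doc.get(f)]
    if len(doc.get("name", "")) > 20:
        problems.append("name exceeds 20 characters")
    return problems

summary = {
    "id": "abc123def456",
    "name": "Revenue by Region",
    "sql": "SELECT region, SUM(revenue) FROM sales GROUP BY region",
    "comment": "Total revenue by region",
    "summary": "Aggregates revenue per region from the sales table.",
    "filepath": "/tmp/revenue_by_region.yml",
    "domain": "Sales",
    "layer1": "Reporting",
    "layer2": "Revenue Analysis",
    "tags": "revenue, region",  # optional field
}
print(validate_summary(summary))  # []
```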
gen_semantic_model¶
Overview¶
The semantic model generation feature helps you create MetricFlow semantic models from database tables through an AI-powered assistant. The assistant analyzes your table structure and generates comprehensive YAML configuration files that define metrics, dimensions, and relationships.
What is a Semantic Model?¶
A semantic model is a YAML configuration that defines:
- Measures: Metrics and aggregations (SUM, COUNT, AVERAGE, etc.)
- Dimensions: Categorical and time-based attributes
- Identifiers: Primary and foreign keys for relationships
- Data Source: Connection to your database table
Quick Start¶
Start Datus CLI with `datus --namespace <namespace>`, and begin with a subagent command, for example:

```
/gen_semantic_model Create a semantic model for the sales table
```
How It Works¶
Interactive Generation¶
When you request a semantic model, the AI assistant:
- Retrieves your table's DDL (structure)
- Checks if a semantic model already exists
- Generates a comprehensive YAML file
- Validates the configuration using MetricFlow
- Prompts you to save it to the Knowledge Base
Generation Workflow¶
```mermaid
graph LR
    A[User Request] --> B[DDL Analysis]
    B --> C[YAML Generation]
    C --> D[Validation]
    D --> E[User Confirmation]
    E --> F[Storage]
```
Interactive Confirmation¶
After generating the semantic model, you'll see:
```
=============================================================
Generated YAML: table_name.yml
Path: /path/to/file.yml
=============================================================

[YAML content with syntax highlighting]

SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
```
Options:
- Option 1: Saves the semantic model to your Knowledge Base (RAG storage) for AI-powered queries
- Option 2: Keeps the YAML file only without syncing to the Knowledge Base
Configuration¶
Agent Configuration¶
In agent.yml, configure the semantic model generation node:
```yaml
agentic_nodes:
  gen_semantic_model:
    model: claude                # LLM model to use
    system_prompt: gen_semantic_model
    prompt_version: "1.0"
    tools: db_tools.*, generation_tools.*, filesystem_tools.*
    hooks: generation_hooks      # Enables user confirmation workflow
    mcp: metricflow_mcp          # MetricFlow validation server
    workspace_root: /path/to/semantic_models
    agent_description: "Semantic model generation assistant"
    rules:
      - Use get_table_ddl tool to get complete table DDL
      - Generate comprehensive semantic models
      - Validate using metricflow_mcp
```
Key Configuration Options¶
| Parameter | Description | Example |
|---|---|---|
| `model` | LLM model for generation | `claude`, `deepseek` (`claude` is recommended) |
| `workspace_root` | Directory to save YAML files | `/Users/you/.datus/data/semantic_models` |
| `tools` | Available tools for the assistant | `db_tools.*`, `filesystem_tools.*` |
| `hooks` | Enable user confirmation | `generation_hooks` |
| `mcp` | MetricFlow validation server | `metricflow_mcp` |
Semantic Model Structure¶
Basic Template¶
```yaml
data_source:
  name: table_name               # Required: lowercase with underscores
  description: "Table description"
  sql_table: schema.table_name   # For databases with schemas
  # OR
  sql_query: |                   # For custom queries
    SELECT * FROM table_name
  measures:
    - name: total_amount         # Required
      agg: SUM                   # Required: SUM|COUNT|AVERAGE|etc.
      expr: amount_column        # Column or SQL expression
      create_metric: true        # Auto-create queryable metric
      description: "Total transaction amount"
  dimensions:
    - name: created_date
      type: TIME                 # Required: TIME|CATEGORICAL
      type_params:
        is_primary: true         # One primary time dimension required
        time_granularity: DAY    # Required for TIME: DAY|WEEK|MONTH|etc.
    - name: status
      type: CATEGORICAL
      description: "Order status"
  identifiers:
    - name: order_id
      type: PRIMARY              # PRIMARY|FOREIGN|UNIQUE|NATURAL
      expr: order_id
    - name: customer
      type: FOREIGN
      expr: customer_id
```
Summary¶
The semantic model generation feature provides:
- ✅ Automated YAML generation from table DDL
- ✅ Interactive validation and error fixing
- ✅ User confirmation before storage
- ✅ Knowledge Base integration
- ✅ Duplicate prevention
- ✅ MetricFlow compatibility
gen_metrics¶
Overview¶
The metrics generation feature helps you convert SQL queries into reusable MetricFlow metric definitions. Using an AI assistant, you can analyze SQL business logic and automatically generate standardized YAML metric configurations that can be queried consistently across your organization.
What is a Metric?¶
A metric is a reusable business calculation built on top of semantic models. Metrics provide:
- Consistent Business Logic: One definition, used everywhere
- Type Safety: Validated structure and measure references
- Metadata: Display names, formats, business context
- Composability: Build complex metrics from simpler ones
Example: Instead of writing `SELECT SUM(revenue) / COUNT(DISTINCT customer_id)` repeatedly, define an `avg_customer_revenue` metric once.
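For instance, that calculation could be expressed as a ratio metric along these lines. This fragment is illustrative: the measure names `revenue` and `customer_count` are assumed to exist in the underlying semantic model.

```yaml
metric:
  name: avg_customer_revenue
  description: Average revenue per distinct customer
  type: ratio
  type_params:
    numerator: revenue            # assumed measure: SUM(revenue)
    denominator: customer_count   # assumed measure: COUNT(DISTINCT customer_id)
```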
Quick Start¶
Start Datus CLI with datus --namespace <namespace>, and use the metrics generation subagent:
```
/gen_metrics Generate a metric from this SQL: SELECT SUM(amount) FROM transactions, the corresponding question is total amount of all transactions
```
How It Works¶
Generation Workflow¶
```mermaid
graph LR
    A[User provides SQL and question] --> B[Agent analyzes logic]
    B --> C[Finds semantic model]
    C --> D[Reads measures]
    D --> E[Checks for duplicates]
    E --> F[Generates metric YAML]
    F --> G[Appends to file]
    G --> H[Validates]
    H --> I[User confirms]
    I --> J[Syncs to Knowledge Base]
```
Important Limitations¶
⚠️ Single Table Queries Only

The current version only supports generating metrics from single-table SQL queries. Multi-table JOINs are not supported.

Supported:

```sql
SELECT SUM(revenue) FROM transactions WHERE status = 'completed'
SELECT COUNT(DISTINCT customer_id) / COUNT(*) FROM orders
```

Not supported (any query joining multiple tables), for example:

```sql
SELECT SUM(o.amount) FROM orders o JOIN customers c ON o.customer_id = c.id
```
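A rough way to pre-screen queries before invoking the subagent is a keyword check. This heuristic is purely illustrative — it is not how Datus detects multi-table queries, and a real implementation would use a SQL parser rather than regular expressions:

```python
import re

def is_single_table(sql: str) -> bool:
    """Naive screen: reject queries containing JOINs or comma-separated
    FROM lists. Illustrative only -- a real check should parse the SQL."""
    s = re.sub(r"\s+", " ", sql.strip().lower())
    if re.search(r"\bjoin\b", s):
        return False
    # Inspect only the FROM clause (up to WHERE/GROUP/ORDER or end of query)
    m = re.search(r"\bfrom\b(.*?)(\bwhere\b|\bgroup\b|\border\b|$)", s)
    return "," not in (m.group(1) if m else "")

print(is_single_table("SELECT SUM(revenue) FROM transactions WHERE status = 'completed'"))  # True
print(is_single_table("SELECT a FROM t1 JOIN t2 ON t1.id = t2.id"))                         # False
```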
Interactive Confirmation¶
After generation, you'll see:
```
==========================================================
Generated YAML: transactions.yml
Path: /Users/you/.datus/data/semantic_models/transactions.yml
==========================================================

[YAML content with syntax highlighting showing the new metric]

SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
```
Options:
- Option 1: Syncs the metric to your Knowledge Base for AI-powered semantic search
- Option 2: Keeps the YAML file only without syncing to the Knowledge Base
Configuration¶
Agent Configuration¶
In agent.yml, configure the metrics generation node:
```yaml
agentic_nodes:
  gen_metrics:
    model: claude                # LLM model for metric generation
    system_prompt: gen_metrics   # Prompt template name
    prompt_version: "1.0"        # Template version
    tools: generation_tools.*, filesystem_tools.*
    hooks: generation_hooks      # Enable user confirmation workflow
    mcp: metricflow_mcp          # MetricFlow validation server
    max_turns: 40                # Max conversation turns
    workspace_root: /path/to/semantic_models
    agent_description: "Metric definition generation assistant"
    rules:
      - Analyze user-provided SQL queries to generate MetricFlow metrics
      - Use list_allowed_directories to find existing semantic model files
      - Use read_file to read semantic model and understand measures
      - Use check_metric_exists tool to avoid duplicate generation
      - Use edit_file to append metrics (DO NOT use write_file)
      - Use mf validate-configs to validate configurations
      - After validation, call end_generation to trigger confirmation
```
Key Configuration Options¶
| Parameter | Description | Example |
|---|---|---|
| `model` | LLM model for generation | `claude`, `openai`, `deepseek` (`claude` is recommended) |
| `workspace_root` | Directory containing semantic model YAML files | `/Users/you/.datus/data/semantic_models` |
| `tools` | Available tools (generation + filesystem) | `generation_tools.*`, `filesystem_tools.*` |
| `hooks` | Enable interactive confirmation workflow | `generation_hooks` |
| `mcp` | MetricFlow validation server | `metricflow_mcp` |
| `max_turns` | Max conversation turns for complex queries | `40` |
Usage Examples¶
Example 1: Simple Aggregation¶
User Input (for example):

```
/gen_metrics Generate a total orders metric from this SQL: SELECT COUNT(*) FROM orders
```
Agent Actions:
1. Finds orders.yml semantic model
2. Reads file to discover order_count measure
3. Generates MEASURE_PROXY metric:
```yaml
---
metric:
  name: total_orders
  description: Total number of orders
  type: measure_proxy
  type_params:
    measure: order_count
  locked_metadata:
    display_name: "Total Orders"
    increase_is_good: true
```
Example 2: Conversion Rate¶
User Input:
```
/gen_metrics Create a metric from this SQL:
SELECT
  COUNT(DISTINCT CASE WHEN status = 'completed' THEN order_id END) /
  COUNT(DISTINCT order_id) AS completion_rate
FROM orders
```
Agent Actions:
1. Analyzes SQL logic (ratio with conditional counting)
2. Finds orders.yml and reads measures
3. Generates RATIO metric with constraint:
```yaml
---
metric:
  name: order_completion_rate
  description: Percentage of orders that reached completed status
  type: ratio
  type_params:
    numerator:
      name: order_count
      constraint: status = 'completed'
    denominator: order_count
  locked_metadata:
    display_name: "Order Completion Rate"
    value_format: ".2%"
    increase_is_good: true
```
Example 3: Complex Calculation¶
User Input:
```
/gen_metrics Generate average basket size metric:
SELECT SUM(total_amount) / COUNT(DISTINCT order_id)
FROM order_items
```
Agent Actions:
1. Locates order_items.yml
2. Identifies this as a RATIO (average)
3. Generates metric:
```yaml
---
metric:
  name: avg_basket_size
  description: Average order value (basket size)
  type: ratio
  type_params:
    numerator: total_amount
    denominator: order_count
  locked_metadata:
    display_name: "Average Basket Size"
    value_format: "$$,.2f"
    unit: "dollars"
    increase_is_good: true
```
How Metrics Are Stored¶
File Organization¶
Metrics are appended to existing semantic model files using the YAML document separator `---`:

```yaml
# Existing semantic model
data_source:
  name: transactions
  sql_table: transactions
  measures:
    - name: revenue
      agg: SUM
      expr: amount
  dimensions:
    - name: transaction_date
      type: TIME
---
# First metric (appended)
metric:
  name: total_revenue
  type: measure_proxy
  type_params:
    measure: revenue
---
# Second metric (appended)
metric:
  name: avg_transaction_value
  type: ratio
  type_params:
    numerator: revenue
    denominator: transaction_count
```
Why append instead of separate files?
- Keeps related metrics close to their semantic model
- Easier maintenance and validation
- MetricFlow can validate all definitions together
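The append behavior can be sketched in a few lines (illustrative Python, not Datus's actual `edit_file` tool):

```python
from pathlib import Path
import tempfile

def append_metric(path: Path, metric_yaml: str) -> None:
    """Append a metric document to a semantic model file,
    separated by the YAML document marker '---'."""
    with path.open("a", encoding="utf-8") as f:
        f.write("\n---\n" + metric_yaml.strip() + "\n")

# Demo in a temp directory: one semantic model plus two appended metrics.
model = "data_source:\n  name: transactions\n  sql_table: transactions\n"
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "transactions.yml"
    path.write_text(model, encoding="utf-8")
    append_metric(path, "metric:\n  name: total_revenue\n  type: measure_proxy")
    append_metric(path, "metric:\n  name: avg_transaction_value\n  type: ratio")
    # Splitting on the document separator recovers all three YAML documents.
    docs = path.read_text(encoding="utf-8").split("\n---\n")
    print(len(docs))  # 3: the semantic model plus two metric documents
```

Appending rather than overwriting is why the rules above insist on `edit_file` instead of `write_file`: a full-file write would clobber the existing semantic model and any previously defined metrics.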
Knowledge Base Storage¶
When you choose "1. Yes - Save to Knowledge Base", the metric is stored in a Vector Database with:
- Metadata: Name, description, type, domain/layer classification
- LLM Text: Natural language representation for semantic search
- References: Associated semantic model name
- Timestamp: Creation date
Summary¶
The metrics generation feature provides:
- ✅ SQL-to-Metric Conversion: Analyze SQL queries and generate MetricFlow metrics
- ✅ Intelligent Type Detection: Automatically selects the right metric type
- ✅ Duplicate Prevention: Checks for existing metrics before generation
- ✅ Validation: MetricFlow validation ensures correctness
- ✅ Interactive Workflow: Review and approve before syncing
- ✅ Knowledge Base Integration: Semantic search for metric discovery
- ✅ File Management: Appends to existing semantic model files safely
Summary¶
| Subagent | Purpose | Output | Stored In | Highlights |
|---|---|---|---|---|
| `gen_sql_summary` | Summarize and classify SQL queries | YAML (SQL summary) | `/data/reference_sql` | Taxonomy-based classification |
| `gen_semantic_model` | Generate semantic models from tables | YAML (semantic model) | `/data/semantic_models` | DDL → MetricFlow-compatible model |
| `gen_metrics` | Generate metrics from SQL | YAML (metric) | `/data/semantic_models` | SQL → MetricFlow metric |
Together, these subagents automate the data engineering knowledge pipeline — from query understanding → model definition → metric generation → searchable Knowledge Base.