Builtin Subagent¶
Overview¶
The Builtin Subagent are specialized AI assistants integrated within the Datus Agent system. Each subagent focuses on a specific aspect of data engineering automation — analyzing SQL, generating semantic models, and converting queries into reusable metrics — together forming a closed-loop workflow from raw SQL to knowledge-aware data products.
This document covers three core subagents:
- gen_sql_summary — Summarizes and classifies SQL queries
- gen_semantic_model — Generates MetricFlow semantic models
- gen_metrics — Generates MetricFlow metric definitions
Configuration¶
Builtin subagents work out of the box with minimal configuration. Most settings (tools, hooks, MCP servers, system prompts) are built-in. You can optionally customize them in your agent.yml file:
agent:
agentic_nodes:
gen_semantic_model:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_metrics:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_sql_summary:
model: deepseek # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
Optional configuration parameters:
model: The AI model to use (e.g.,claude,deepseek). Defaults to your configured model.max_turns: Maximum conversation turns (default: 30)
Built-in configurations (no setup needed):
- Tools: Automatically configured based on subagent type
- Hooks: User confirmation workflow in interactive mode
- MCP Servers: MetricFlow validation (for gen_semantic_model and gen_metrics)
- System Prompts: Built-in templates version 1.0
- Workspace: ~/.datus/data/{namespace}/ with subagent-specific subdirectories
gen_sql_summary¶
Overview¶
The SQL Summary feature helps you analyze, classify, and catalog SQL queries for knowledge reuse. It automatically generates structured YAML summaries that are stored in a searchable Knowledge Base, making it easy to find and reuse similar queries in the future.
What is a SQL Summary?¶
A SQL summary is a structured YAML document that captures:
- Query Text: The complete SQL query
- Business Context: Domain, categories, and tags
- Semantic Summary: Detailed explanation for vector search
- Metadata: Name, comment, file path
Quick Start¶
Launch the SQL summary generation subagent:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales GROUP BY region. (You can also add some description on this SQL)
Generation Workflow¶
graph LR
A[User provides SQL + description] --> B[Agent analyzes query]
B --> C[Retrieves context]
C --> D[Generates unique ID]
D --> E[Creates YAML]
E --> F[Saves file]
F --> G[User confirms]
G --> H[Syncs to Knowledge Base]
Detailed Steps:
- Understand SQL: The AI analyzes your query structure and business logic
- Get Context: Automatically retrieves from Knowledge Base:
- Existing subject trees (domain/layer1/layer2 combinations)
- Similar SQL summaries (top 5 most similar queries) for classification reference
- Generate Unique ID: Uses
generate_sql_summary_id()tool based on SQL + comment - Create Unique Name: Generates a descriptive name (max 20 characters)
- Classify Query: Assigns domain, layer1, layer2, and tags following existing patterns
- Generate YAML: Creates structured summary document
- Save File: Writes YAML to workspace using
write_file()tool - User Confirmation: Shows the generated YAML and prompts for approval
- Sync to Knowledge Base: Stores in LanceDB for semantic search
Interactive Confirmation¶
After generation, you'll see:
==========================================================
Generated Reference SQL YAML
File: /path/to/sql_summary.yml
==========================================================
[YAML content with syntax highlighting]
SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
Subject Tree Categorization¶
Subject tree allows organizing SQL summaries by domain and layers. In CLI mode, include it in your question:
Example with subject_tree:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales, subject_tree: sales/reporting/revenue_analysis
Example without subject_tree:
When not provided, the agent suggests categories based on existing subject trees and similar queries in the Knowledge Base.
YAML Structure¶
The generated SQL summary follows this structure:
id: "abc123def456..." # Auto-generated MD5 hash
name: "Revenue by Region" # Descriptive name (max 20 chars)
sql: | # Complete SQL query
SELECT
region,
SUM(revenue) as total_revenue
FROM sales
GROUP BY region
comment: "Calculate total revenue grouped by region"
summary: "This query aggregates total revenue from the sales table, grouping results by geographic region. It uses SUM aggregation to calculate revenue totals for each region."
filepath: "/Users/you/.datus/data/reference_sql/revenue_by_region.yml"
domain: "Sales" # Business domain
layer1: "Reporting" # Primary category
layer2: "Revenue Analysis" # Secondary category
tags: "revenue, region, aggregation" # Comma-separated tags
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
id |
Yes | Unique hash (auto-generated) | abc123def456... |
name |
Yes | Short descriptive name (max 20 chars) | Revenue by Region |
sql |
Yes | Complete SQL query | SELECT ... |
comment |
Yes | Brief one-line description | User's message or generated summary |
summary |
Yes | Detailed explanation (for search) | Comprehensive query description |
filepath |
Yes | Actual file path | /path/to/file.yml |
domain |
Yes | Business domain | Sales, Marketing, Finance |
layer1 |
Yes | Primary category | Reporting, Analytics, ETL |
layer2 |
Yes | Secondary category | Revenue Analysis, Customer Insights |
tags |
Optional | Comma-separated keywords | revenue, region, aggregation |
gen_semantic_model¶
Overview¶
The semantic model generation feature helps you create MetricFlow semantic models from database tables through an AI-powered assistant. The assistant analyzes your table structure and generates comprehensive YAML configuration files that define metrics, dimensions, and relationships.
What is a Semantic Model?¶
A semantic model is a YAML configuration that defines:
- Measures: Metrics and aggregations (SUM, COUNT, AVERAGE, etc.)
- Dimensions: Categorical and time-based attributes
- Identifiers: Primary and foreign keys for relationships
- Data Source: Connection to your database table
Quick Start¶
Start Datus CLI with datus --namespace <namespace>, and begin with a subagent command:
How It Works¶
Interactive Generation¶
When you request a semantic model, the AI assistant:
- Retrieves your table's DDL (structure)
- Checks if a semantic model already exists
- Generates a comprehensive YAML file
- Validates the configuration using MetricFlow
- Prompts you to save it to the Knowledge Base
Generation Workflow¶
graph LR
A[User Request] --> B[DDL Analysis]
B --> C[YAML Generation]
C --> D[Validation]
D --> E[User Confirmation]
E --> F[Storage]
Interactive Confirmation¶
After generating the semantic model, you'll see:
=============================================================
Generated YAML: table_name.yml
Path: /path/to/file.yml
=============================================================
[YAML content with syntax highlighting]
SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
Options:
- Option 1: Saves the semantic model to your Knowledge Base (RAG storage) for AI-powered queries
- Option 2: Keeps the YAML file only without syncing to the Knowledge Base
Semantic Model Structure¶
Basic Template¶
data_source:
name: table_name # Required: lowercase with underscores
description: "Table description"
sql_table: schema.table_name # For databases with schemas
# OR
sql_query: | # For custom queries
SELECT * FROM table_name
measures:
- name: total_amount # Required
agg: SUM # Required: SUM|COUNT|AVERAGE|etc.
expr: amount_column # Column or SQL expression
create_metric: true # Auto-create queryable metric
description: "Total transaction amount"
dimensions:
- name: created_date
type: TIME # Required: TIME|CATEGORICAL
type_params:
is_primary: true # One primary time dimension required
time_granularity: DAY # Required for TIME: DAY|WEEK|MONTH|etc.
- name: status
type: CATEGORICAL
description: "Order status"
identifiers:
- name: order_id
type: PRIMARY # PRIMARY|FOREIGN|UNIQUE|NATURAL
expr: order_id
- name: customer
type: FOREIGN
expr: customer_id
Summary¶
The semantic model generation feature provides:
- ✅ Automated YAML generation from table DDL
- ✅ Built-in tools, hooks, and MCP server integration
- ✅ Interactive validation and error fixing
- ✅ User confirmation before storage
- ✅ Knowledge Base integration
- ✅ Duplicate prevention
- ✅ MetricFlow compatibility
gen_metrics¶
Overview¶
The metrics generation feature helps you convert SQL queries into reusable MetricFlow metric definitions. Using an AI assistant, you can analyze SQL business logic and automatically generate standardized YAML metric configurations that can be queried consistently across your organization.
What is a Metric?¶
A metric is a reusable business calculation built on top of semantic models. Metrics provide:
- Consistent Business Logic: One definition, used everywhere
- Type Safety: Validated structure and measure references
- Metadata: Display names, formats, business context
- Composability: Build complex metrics from simpler ones
Example: Instead of writing SELECT SUM(revenue) / COUNT(DISTINCT customer_id) repeatedly, define an avg_customer_revenue metric once.
Quick Start¶
Start Datus CLI with datus --namespace <namespace>, and use the metrics generation subagent:
/gen_metrics Generate a metric from this SQL: SELECT SUM(amount) FROM transactions, the corresponding question is total amount of all transactions
How It Works¶
Generation Workflow¶
graph LR
A[User provides SQL and question] --> B[Agent analyzes logic]
B --> C[Finds semantic model]
C --> D[Reads measures]
D --> E[Checks for duplicates]
E --> F[Generates metric YAML]
F --> G[Appends to file]
G --> H[Validates]
H --> I[User confirms]
I --> J[Syncs to Knowledge Base]
Important Limitations¶
⚠️ Single Table Queries Only
The current version only supports generating metrics from single-table SQL queries. Multi-table JOINs are not supported.
Supported:
SELECT SUM(revenue) FROM transactions WHERE status = 'completed'
SELECT COUNT(DISTINCT customer_id) / COUNT(*) FROM orders
Not Supported:
Interactive Confirmation¶
After generation, you'll see:
==========================================================
Generated YAML: transactions.yml
Path: /Users/you/.datus/data/semantic_models/transactions.yml
==========================================================
[YAML content with syntax highlighting showing the new metric]
SYNC TO KNOWLEDGE BASE?
1. Yes - Save to Knowledge Base
2. No - Keep file only
Please enter your choice: [1/2]
Options: - Option 1: Syncs the metric to your Knowledge Base for AI-powered semantic search - Option 2: Keeps the YAML file only without syncing to the Knowledge Base
Subject Tree Categorization¶
Subject tree allows organizing metrics by domain and layers. In CLI mode, include it in your question:
Example with subject_tree:
/gen_metrics Generate a metric from this SQL: SELECT SUM(amount) FROM transactions, subject_tree: finance/revenue/transactions
Example without subject_tree:
When not provided, the agent operates in learning mode and suggests categories based on existing metrics in the Knowledge Base.
Usage Examples¶
Example 1: Simple Aggregation¶
User Input:
Agent Actions:
1. Finds orders.yml semantic model
2. Reads file to discover order_count measure
3. Generates MEASURE_PROXY metric:
---
metric:
name: total_orders
description: Total number of orders
type: measure_proxy
type_params:
measure: order_count
locked_metadata:
display_name: "Total Orders"
increase_is_good: true
Example 2: Conversion Rate¶
User Input:
/gen_metrics Create a metric from this SQL:
SELECT
COUNT(DISTINCT CASE WHEN status = 'completed' THEN order_id END) /
COUNT(DISTINCT order_id) AS completion_rate
FROM orders
Agent Actions:
1. Analyzes SQL logic (ratio with conditional counting)
2. Finds orders.yml and reads measures
3. Generates RATIO metric with constraint:
---
metric:
name: order_completion_rate
description: Percentage of orders that reached completed status
type: ratio
type_params:
numerator:
name: order_count
constraint: status = 'completed'
denominator: order_count
locked_metadata:
display_name: "Order Completion Rate"
value_format: ".2%"
increase_is_good: true
Example 3: Complex Calculation¶
User Input:
/gen_metrics Generate average basket size metric:
SELECT SUM(total_amount) / COUNT(DISTINCT order_id)
FROM order_items
Agent Actions:
1. Locates order_items.yml
2. Identifies this as a RATIO (average)
3. Generates metric:
---
metric:
name: avg_basket_size
description: Average order value (basket size)
type: ratio
type_params:
numerator: total_amount
denominator: order_count
locked_metadata:
display_name: "Average Basket Size"
value_format: "$$,.2f"
unit: "dollars"
increase_is_good: true
How Metrics Are Stored¶
File Organization¶
Metrics are appended to existing semantic model files using the YAML document separator ---:
# Existing semantic model
data_source:
name: transactions
sql_table: transactions
measures:
- name: revenue
agg: SUM
expr: amount
dimensions:
- name: transaction_date
type: TIME
---
# First metric (appended)
metric:
name: total_revenue
type: measure_proxy
type_params:
measure: revenue
---
# Second metric (appended)
metric:
name: avg_transaction_value
type: ratio
type_params:
numerator: revenue
denominator: transaction_count
Why append instead of separate files? - Keeps related metrics close to their semantic model - Easier maintenance and validation - MetricFlow can validate all definitions together
Knowledge Base Storage¶
When you choose "1. Yes - Save to Knowledge Base", the metric is stored in a Vector Database with:
- Metadata: Name, description, type, domain/layer classification
- LLM Text: Natural language representation for semantic search
- References: Associated semantic model name
- Timestamp: Creation date
Summary¶
The metrics generation feature provides:
- ✅ SQL-to-Metric Conversion: Analyze SQL queries and generate MetricFlow metrics
- ✅ Intelligent Type Detection: Automatically selects the right metric type
- ✅ Duplicate Prevention: Checks for existing metrics before generation
- ✅ Subject Tree Support: Organize by domain/layer1/layer2 with predefined or learned categories
- ✅ Validation: MetricFlow validation ensures correctness
- ✅ Interactive Workflow: Review and approve before syncing
- ✅ Knowledge Base Integration: Semantic search for metric discovery
- ✅ File Management: Appends to existing semantic model files safely
Summary¶
| Subagent | Purpose | Output | Stored In | Key Features |
|---|---|---|---|---|
gen_sql_summary |
Summarize and classify SQL queries | YAML (SQL summary) | /data/reference_sql |
Subject tree categorization, auto context retrieval |
gen_semantic_model |
Generate semantic model from tables | YAML (semantic model) | /data/semantic_models |
DDL → MetricFlow model, built-in validation |
gen_metrics |
Generate metrics from SQL | YAML (metric) | /data/semantic_models |
SQL → MetricFlow metric, subject tree support |
Built-in Features Across All Subagents:
- Minimal configuration required (only model and max_turns optional)
- Automatic tool setup, hooks, and MCP server integration
- Built-in system prompts (version 1.0)
- User confirmation workflow in interactive mode
- Knowledge Base integration for semantic search
- Automatic workspace management
Together, these subagents automate the data engineering knowledge pipeline — from query understanding → model definition → metric generation → searchable Knowledge Base.