Builtin Subagent¶
Overview¶
The Builtin Subagent are specialized AI assistants integrated within the Datus Agent system. Each subagent focuses on a specific aspect of data engineering automation — analyzing SQL, generating semantic models, and converting queries into reusable metrics — together forming a closed-loop workflow from raw SQL to knowledge-aware data products.
This document covers thirteen core subagents:
- gen_sql_summary — Summarizes and classifies SQL queries
- gen_semantic_model — Generates MetricFlow semantic models
- gen_metrics — Generates MetricFlow metric definitions
- gen_ext_knowledge — Generates business concept definitions
- explore — Read-only data exploration and context gathering
- gen_sql — Specialized SQL generation with deep expertise
- gen_report — Flexible report generation with configurable tools
- gen_table — Database table creation via CTAS or natural language
- gen_job — Data pipeline execution (single-database ETL AND cross-database migration with reconciliation)
- gen_skill — Skill creation and optimization
- gen_dashboard — BI dashboard CRUD for Superset and Grafana
- scheduler — Airflow job lifecycle management
Configuration¶
Builtin subagents work out of the box with minimal configuration. Most settings (tools, hooks, MCP servers, system prompts) are built-in. You can optionally customize them in your agent.yml file:
agent:
agentic_nodes:
gen_semantic_model:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_metrics:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_sql_summary:
model: deepseek # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_ext_knowledge:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
explore:
model: haiku # Recommended: smaller model for tool-calling tasks
max_turns: 15 # Optional: defaults to 15
gen_sql:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
gen_report:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
tools: "semantic_tools.*, context_search_tools.list_subject_tree" # Optional: defaults to semantic + context tools
gen_table:
max_turns: 20 # Optional: defaults to 20
gen_job:
max_turns: 30 # Optional: defaults to 30
gen_skill:
max_turns: 30 # Optional: defaults to 30
gen_dashboard:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
bi_platform: superset # Optional: explicit platform (auto-detected when only one BI platform is configured)
scheduler:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
Optional configuration parameters:
model: The AI model to use (e.g.,claude,deepseek). Defaults to your configured model.max_turns: Maximum conversation turns (default: 30)
Built-in configurations (no setup needed):
- Tools: Automatically configured based on subagent type
- Hooks: Workflow-specific validation and Knowledge Base sync
- MCP Servers: MetricFlow validation (for gen_semantic_model and gen_metrics)
- System Prompts: Built-in templates; latest versions are used unless prompt_version is set
- Workspace: ~/.datus/data/{datasource}/ with subagent-specific subdirectories
gen_sql_summary¶
Overview¶
The SQL Summary feature helps you analyze, classify, and catalog SQL queries for knowledge reuse. It automatically generates structured YAML summaries that are stored in a searchable Knowledge Base, making it easy to find and reuse similar queries in the future.
What is a SQL Summary?¶
A SQL summary is a structured YAML document that captures:
- Query Text: The complete SQL query
- Business Context: Domain, categories, and tags
- Semantic Summary: Detailed explanation for vector search
- Metadata: Name, comment, file path
Quick Start¶
Launch the SQL summary generation subagent:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales GROUP BY region. (You can also add some description on this SQL)
Generation Workflow¶
graph LR
A[User provides SQL + description] --> B[Agent analyzes query]
B --> C[Retrieves context]
C --> D[Generates unique ID]
D --> E[Creates YAML]
E --> F[Saves file]
F --> G[Syncs to Knowledge Base]
Detailed Steps:
- Understand SQL: The AI analyzes your query structure and business logic
- Get Context: Automatically retrieves from Knowledge Base:
- Existing subject trees (domain/layer1/layer2 combinations)
- Similar SQL summaries (top 5 most similar queries) for classification reference
- Generate Unique ID: Uses
generate_sql_summary_id()tool based on SQL + comment - Create Unique Name: Generates a descriptive name (max 20 characters)
- Classify Query: Assigns domain, layer1, layer2, and tags following existing patterns
- Generate YAML: Creates structured summary document
- Save File: Writes YAML to workspace using
write_file()tool - Sync to Knowledge Base: Stores in LanceDB for semantic search
Sync Behavior¶
In interactive mode, after the YAML file is written successfully, the generation hook syncs it to the Knowledge Base automatically. In workflow/API mode, use the corresponding explicit sync step or tool.
Subject Tree Categorization¶
Subject tree allows organizing SQL summaries by domain and layers. In CLI mode, include it in your question:
Example with subject_tree:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales, subject_tree: sales/reporting/revenue_analysis
Example without subject_tree:
When not provided, the agent suggests categories based on existing subject trees and similar queries in the Knowledge Base.
YAML Structure¶
The generated SQL summary follows this structure:
id: "abc123def456..." # Auto-generated MD5 hash
name: "Revenue by Region" # Descriptive name (max 20 chars)
sql: | # Complete SQL query
SELECT
region,
SUM(revenue) as total_revenue
FROM sales
GROUP BY region
comment: "Calculate total revenue grouped by region"
summary: "This query aggregates total revenue from the sales table, grouping results by geographic region. It uses SUM aggregation to calculate revenue totals for each region."
filepath: "/Users/you/.datus/data/reference_sql/revenue_by_region.yml"
domain: "Sales" # Business domain
layer1: "Reporting" # Primary category
layer2: "Revenue Analysis" # Secondary category
tags: "revenue, region, aggregation" # Comma-separated tags
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
id |
Yes | Unique hash (auto-generated) | abc123def456... |
name |
Yes | Short descriptive name (max 20 chars) | Revenue by Region |
sql |
Yes | Complete SQL query | SELECT ... |
comment |
Yes | Brief one-line description | User's message or generated summary |
summary |
Yes | Detailed explanation (for search) | Comprehensive query description |
filepath |
Yes | Actual file path | /path/to/file.yml |
domain |
Yes | Business domain | Sales, Marketing, Finance |
layer1 |
Yes | Primary category | Reporting, Analytics, ETL |
layer2 |
Yes | Secondary category | Revenue Analysis, Customer Insights |
tags |
Optional | Comma-separated keywords | revenue, region, aggregation |
gen_semantic_model¶
Overview¶
The semantic model generation feature helps you create MetricFlow semantic models from database tables through an AI-powered assistant. The assistant analyzes your table structure and generates comprehensive YAML configuration files that define metrics, dimensions, and relationships.
What is a Semantic Model?¶
A semantic model is a YAML configuration that defines:
- Measures: Metrics and aggregations (SUM, COUNT, AVERAGE, etc.)
- Dimensions: Categorical and time-based attributes
- Identifiers: Primary and foreign keys for relationships
- Data Source: Connection to your database table
Quick Start¶
Start Datus CLI with datus --datasource <datasource>, and begin with a subagent command:
How It Works¶
Interactive Generation¶
When you request a semantic model, the AI assistant:
- Retrieves your table's DDL (structure)
- Checks if a semantic model already exists
- Generates a comprehensive YAML file
- Validates the configuration using MetricFlow
- Syncs it to the Knowledge Base after validation passes
Generation Workflow¶
graph LR
A[User Request] --> B[DDL Analysis]
B --> C[YAML Generation]
C --> D[Validation]
D --> E[Knowledge Base Sync]
Validation and Sync¶
The agent calls validate_semantic() before publishing. If validation fails, it edits the YAML and retries. In interactive mode, once validation passes, end_semantic_model_generation triggers automatic Knowledge Base sync; in workflow/API mode, use the explicit semantic sync step or tool.
Semantic Model Structure¶
Basic Template¶
data_source:
name: table_name # Required: lowercase with underscores
description: "Table description"
sql_table: schema.table_name # For databases with schemas
# OR
sql_query: | # For custom queries
SELECT * FROM table_name
measures:
- name: total_amount # Required
agg: SUM # Required: SUM|COUNT|AVERAGE|etc.
expr: amount_column # Column or SQL expression
create_metric: true # Auto-create queryable metric
description: "Total transaction amount"
dimensions:
- name: created_date
type: TIME # Required: TIME|CATEGORICAL
type_params:
is_primary: true # One primary time dimension required
time_granularity: DAY # Required for TIME: DAY|WEEK|MONTH|etc.
- name: status
type: CATEGORICAL
description: "Order status"
identifiers:
- name: order_id
type: PRIMARY # PRIMARY|FOREIGN|UNIQUE|NATURAL
expr: order_id
- name: customer
type: FOREIGN
expr: customer_id
Summary¶
The semantic model generation feature provides:
- ✅ Automated YAML generation from table DDL
- ✅ Built-in tools, hooks, and MCP server integration
- ✅ Interactive validation and error fixing
- ✅ Validation-gated Knowledge Base sync
- ✅ Interactive auto-sync and workflow/API explicit sync
- ✅ Duplicate prevention
- ✅ MetricFlow compatibility
gen_metrics¶
Overview¶
The metrics generation feature helps you convert SQL queries into reusable MetricFlow metric definitions. Using an AI assistant, you can analyze SQL business logic and automatically generate standardized YAML metric configurations that can be queried consistently across your organization.
What is a Metric?¶
A metric is a reusable business calculation built on top of semantic models. Metrics provide:
- Consistent Business Logic: One definition, used everywhere
- Type Safety: Validated structure and measure references
- Metadata: Display names, formats, business context
- Composability: Build complex metrics from simpler ones
Example: Instead of writing SELECT SUM(revenue) / COUNT(DISTINCT customer_id) repeatedly, define an avg_customer_revenue metric once.
Quick Start¶
Start Datus CLI with datus --datasource <datasource>, and use the metrics generation subagent:
/gen_metrics Generate a metric from this SQL: SELECT SUM(amount) FROM transactions, the corresponding question is total amount of all transactions
How It Works¶
Generation Workflow¶
graph LR
A[User provides SQL and question] --> B[Agent analyzes logic]
B --> C[Finds semantic model]
C --> D[Reads measures]
D --> E[Checks for duplicates]
E --> F[Generates metric YAML]
F --> G[Writes metric to metrics file]
G --> H[Validates]
H --> I[Dry-run SQL]
I --> J[Syncs to Knowledge Base]
Important Limitations¶
⚠️ Single Table Queries Only
The current version only supports generating metrics from single-table SQL queries. Multi-table JOINs are not supported.
Supported:
SELECT SUM(revenue) FROM transactions WHERE status = 'completed'
SELECT COUNT(DISTINCT customer_id) / COUNT(*) FROM orders
Not Supported:
Validation and Sync¶
Before publishing, the agent validates the YAML with validate_semantic() and compiles SQL with query_metrics(..., dry_run=True). In interactive mode, after both checks pass, end_metric_generation triggers automatic Knowledge Base sync; in workflow/API mode, use the explicit metric sync step or tool.
Subject Tree Categorization¶
Subject tree allows organizing metrics by domain and layers. In CLI mode, include it in your question:
Example with subject_tree:
/gen_metrics Generate a metric from this SQL: SELECT SUM(amount) FROM transactions, subject_tree: finance/revenue/transactions
Example without subject_tree:
When not provided, the agent operates in learning mode and suggests categories based on existing metrics in the Knowledge Base.
Usage Examples¶
Example 1: Simple Aggregation¶
User Input:
Agent Actions:
1. Finds orders.yml semantic model
2. Reads file to discover order_count measure
3. Generates MEASURE_PROXY metric:
---
metric:
name: total_orders
description: Total number of orders
type: measure_proxy
type_params:
measure: order_count
locked_metadata:
display_name: "Total Orders"
increase_is_good: true
Example 2: Conversion Rate¶
User Input:
/gen_metrics Create a metric from this SQL:
SELECT
COUNT(DISTINCT CASE WHEN status = 'completed' THEN order_id END) /
COUNT(DISTINCT order_id) AS completion_rate
FROM orders
Agent Actions:
1. Analyzes SQL logic (ratio with conditional counting)
2. Finds orders.yml and reads measures
3. Generates RATIO metric with constraint:
---
metric:
name: order_completion_rate
description: Percentage of orders that reached completed status
type: ratio
type_params:
numerator:
name: order_count
constraint: status = 'completed'
denominator: order_count
locked_metadata:
display_name: "Order Completion Rate"
value_format: ".2%"
increase_is_good: true
Example 3: Complex Calculation¶
User Input:
/gen_metrics Generate average basket size metric:
SELECT SUM(total_amount) / COUNT(DISTINCT order_id)
FROM order_items
Agent Actions:
1. Locates order_items.yml
2. Identifies this as a RATIO (average)
3. Generates metric:
---
metric:
name: avg_basket_size
description: Average order value (basket size)
type: ratio
type_params:
numerator: total_amount
denominator: order_count
locked_metadata:
display_name: "Average Basket Size"
value_format: "$$,.2f"
unit: "dollars"
increase_is_good: true
How Metrics Are Stored¶
File Organization¶
Metrics are stored in dedicated files, separate from their semantic models:
- Semantic Model:
{table_name}.yml—data_sourcedefinition (measures, dimensions, identifiers) - Metrics:
metrics/{table_name}_metrics.yml— one or more metric definitions, separated by the YAML document separator---
Semantic Model File (transactions.yml):
data_source:
name: transactions
sql_table: transactions
measures:
- name: revenue
agg: SUM
expr: amount
dimensions:
- name: transaction_date
type: TIME
Metrics File (metrics/transactions_metrics.yml):
metric:
name: total_revenue
type: measure_proxy
type_params:
measure: revenue
---
metric:
name: avg_transaction_value
type: ratio
type_params:
numerator: revenue
denominator: transaction_count
Why separate files? - Clear separation between schema definitions and business metrics - Metrics can be maintained independently from the underlying semantic model - MetricFlow validates all YAML documents under the semantic model directory together
See gen_metrics for full details.
Knowledge Base Storage¶
After validation and dry-run SQL succeed, the metric is synced to the Knowledge Base with:
- Metadata: Name, description, type, domain/layer classification
- LLM Text: Natural language representation for semantic search
- References: Associated semantic model name
- Timestamp: Creation date
Summary¶
The metrics generation feature provides:
- ✅ SQL-to-Metric Conversion: Analyze SQL queries and generate MetricFlow metrics
- ✅ Intelligent Type Detection: Automatically selects the right metric type
- ✅ Duplicate Prevention: Checks for existing metrics before generation
- ✅ Subject Tree Support: Organize by domain/layer1/layer2 with predefined or learned categories
- ✅ Validation: MetricFlow validation ensures correctness
- ✅ Publish Gate: Syncs only after semantic validation and dry-run SQL succeed
- ✅ Knowledge Base Integration: Semantic search for metric discovery
- ✅ File Management: Maintains dedicated per-table metrics files under
metrics/
gen_ext_knowledge¶
Overview¶
The external knowledge generation feature helps you create and manage business concepts and domain-specific definitions. Using an AI assistant, you can document business knowledge in a structured format that becomes searchable in the Knowledge Base, enabling better context retrieval for SQL generation and data analysis tasks.
What is External Knowledge?¶
External knowledge captures business-specific information that isn't directly stored in database schemas:
- Business Rules: Calculation logic and business constraints
- Domain Concepts: Industry or company-specific knowledge
- Data Interpretations: How to understand specific data fields or values
This knowledge helps the AI agent understand your business context when generating SQL queries or analyzing data.
Quick Start¶
Launch the external knowledge generation subagent:
/gen_ext_knowledge Extract knowledge from this sql
-- Question: What is the highest eligible free rate for K-12 students in the schools in Alameda County?
-- SQL:
SELECT
`Free Meal Count (K-12)` / `Enrollment (K-12)`
FROM
frpm
WHERE
`County Name` = 'Alameda'
ORDER BY
(
CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`
) DESC
LIMIT 1
Generation Workflow¶
The workflow follows a knowledge gap discovery approach: the agent first attempts to solve the problem independently, then compares with the reference SQL to identify implicit business knowledge.
graph LR
A[User provides Question + SQL] --> B[Agent attempts to solve]
B --> C[Compare with reference SQL]
C --> D[Identify knowledge gaps]
D --> E[Check for duplicates]
E --> F[Generate YAML]
F --> G[Save file]
G --> H[Sync to Knowledge Base]
Detailed Steps:
- Understand the Problem: Read the question from SQL comments and understand the goal
- Attempt to Solve: The agent uses available tools to try solving the problem
- Compare with Reference SQL: Find gaps between attempt and reference sql
- Extract Knowledge from Gaps: Discovering hidden business concepts in gaps
- Check for Duplicates: Use
search_knowledgeto verify extracted knowledge doesn't already exist - Generate YAML: Create structured knowledge entries with unique IDs via
generate_ext_knowledge_id() - Save File: Write YAML using
write_file(path, content, file_type="ext_knowledge") - Sync to Knowledge Base: Store in vector database for semantic search
Important: If no knowledge gaps are found (agent's attempt matches reference SQL), no knowledge file is generated.
Sync Behavior¶
In interactive mode, after the YAML file is written successfully, the generation hook syncs it to the Knowledge Base automatically. In workflow/API mode, use the explicit sync step or tool.
Subject Path Categorization¶
Subject path allows organizing external knowledge hierarchically. In CLI mode, include it in your question:
Example with subject_path:
/gen_ext_knowledge Extract knowledge from this sql
Question: What is the highest eligible free rate for K-12 students in the schools in Alameda County?
subject_tree: education/schools/data_integration
SQL: ***
Example without subject_path:
/gen_ext_knowledge Extract knowledge from this sql
Question: What is the highest eligible free rate for K-12 students in the schools in Alameda County?
SQL: ***
When not provided, the agent operates in learning mode and suggests categories based on existing subject trees in the Knowledge Base.
YAML Structure¶
The generated external knowledge follows this structure:
id: education/schools/data_integration/CDS code school identifier California Department of Education join tables
name: CDS Code as School Identifier
search_text: CDS code school identifier California Department of Education join tables
explanation: |
The CDS (California Department of Education) code serves as the primary identifier for linking educational datasets in California. Use `cds` field in SAT scores table to join
subject_path: education/schools/data_integration
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
id |
Yes | Unique ID generate_ext_knowledge_id()) |
education/schools/data_integration/CDS code school identifier California Department of Education join tables |
name |
Yes | Short identifier name (max 30 chars) | Free Meal Rate, GMV |
search_text |
Yes | Search keywords for retrieval (vector/inverted index) | eligible free rate K-12 |
explanation |
Yes | Concise explanation (2-4 sentences): what it is + when/how to apply | Business rule, calculation logic |
subject_path |
Yes | Hierarchical classification (slash-separated) | Education/School Metrics/FRPM |
explore¶
Overview¶
The explore subagent is a lightweight, read-only assistant designed for fast context gathering. It helps collect schema information, data samples, and knowledge base context to support downstream SQL generation — either by the chat agent or the gen_sql subagent.
Key Features¶
- Strictly read-only: Never modifies data, files, or database records. Only SELECT queries are allowed, always with LIMIT.
- Fast exploration: Limited to 15 conversation turns for quick context gathering.
- Three exploration directions:
- Schema+Sample: Discover tables, columns, types, constraints, and sample data
- Knowledge: Search metrics, reference SQL, business rules, and domain knowledge
- File: Browse workspace SQL files and documentation
Configuration¶
agent:
agentic_nodes:
explore:
model: haiku # Recommended: use a smaller model (haiku, gpt-4o-mini)
max_turns: 15 # Optional: defaults to 15
Tip: The explore subagent is optimized for tool-calling rather than reasoning. Using a smaller, faster model (e.g.,
haiku,gpt-4o-mini) is recommended to reduce cost and improve speed without sacrificing quality.
Available Tools¶
| Tool Category | Tools | Purpose |
|---|---|---|
| Database | list_databases, list_schemas, list_tables, search_table, describe_table, get_table_ddl, read_query |
Schema discovery and data sampling (read-only) |
| Context Search | search_metrics, search_reference_sql, search_knowledge, search_semantic_objects, list_subject_tree, get_metrics, get_reference_sql, get_knowledge |
Knowledge base retrieval |
| Filesystem | read_file, glob, grep |
Read-only file browsing |
| Date Parsing | get_current_date, parse_temporal_expressions |
Date context |
Output Format¶
The explore subagent returns a concise, structured summary optimized for consumption by other agents:
- Tables: Relevant tables with key columns, types, and notes
- Joins: Join paths between tables (one line per relationship)
- Data patterns: Non-obvious data patterns (delimiters, encodings, NULL prevalence)
- Context: Relevant metrics, reference SQL, and business rules found
- Recommendation: Which tables to use, how to join, and key caveats
Usage¶
The explore subagent is typically invoked automatically by the chat agent via task(type="explore"), but can also be launched manually:
gen_sql¶
Overview¶
The gen_sql subagent is a specialized SQL expert that generates optimized, validated SQL queries. It handles complex SQL generation tasks that require multi-step reasoning, intricate joins, or domain-specific logic.
Key Features¶
- Deep SQL expertise: Specialized in writing complex, production-quality SQL
- Automatic validation: Validates SQL executability before returning results
- File-based output: For complex queries (50+ lines), outputs SQL to a file with a preview
- Modification support: Returns unified diff format when modifying existing queries
Configuration¶
agent:
agentic_nodes:
gen_sql:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
How It Works¶
graph LR
A[User question + context] --> B[Analyze requirements]
B --> C[Discover schema]
C --> D[Search knowledge base]
D --> E[Generate SQL]
E --> F[Validate executability]
F --> G[Return result]
Output Format¶
The gen_sql subagent returns results in one of two formats:
Inline SQL (for shorter queries):
{
"sql": "SELECT region, SUM(revenue) FROM sales GROUP BY region",
"response": "Explanation of the query...",
"tokens_used": 1234
}
File-based SQL (for complex queries 50+ lines):
{
"sql_file_path": "/path/to/generated_query.sql",
"sql_preview": "First few lines of the query...",
"response": "Explanation of the query...",
"tokens_used": 5678
}
Usage¶
The gen_sql subagent is typically invoked automatically by the chat agent via task(type="gen_sql") for complex queries, but can also be launched manually:
gen_report¶
Overview¶
The gen_report subagent is a flexible report generation assistant that combines semantic tools, database tools, and context search capabilities to produce structured reports. It can be used directly or extended by specialized report nodes for domain-specific reporting tasks (e.g., attribution analysis).
Key Features¶
- Configurable tools: Supports
semantic_tools.*,db_tools.*, andcontext_search_tools.*via configuration - Flexible output: Generates structured report content with SQL queries and analysis
- Extensible: Can be subclassed for specialized report types
- Configuration-driven: Tool setup and system prompts are driven by
agent.ymlconfiguration
Configuration¶
agent:
agentic_nodes:
gen_report:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
tools: "semantic_tools.*, db_tools.*, context_search_tools.list_subject_tree" # Optional: customize available tools
Tool patterns:
| Pattern | Description |
|---|---|
semantic_tools.* |
All semantic tools (search metrics, semantic objects, etc.) |
db_tools.* |
All database tools (list tables, describe table, read query, etc.) |
context_search_tools.* |
All context search tools (search knowledge, reference SQL, etc.) |
semantic_tools.search_metrics |
A specific semantic tool method |
context_search_tools.list_subject_tree |
A specific context search method |
Default tools (when not configured): semantic_tools.*, context_search_tools.list_subject_tree
How It Works¶
graph LR
A[User question + context] --> B[Analyze requirements]
B --> C[Search knowledge base]
C --> D[Query database]
D --> E[Generate report]
E --> F[Return structured result]
Output Format¶
The gen_report subagent returns results as a structured report:
{
"report": "Structured report content with analysis...",
"response": "Summary explanation...",
"tokens_used": 2345
}
Usage¶
The gen_report subagent can be launched manually or invoked by the chat agent via task(type="gen_report"):
Custom Report Subagents¶
You can create custom subagents that use the gen_report node class by configuring them in agent.yml:
agent:
agentic_nodes:
attribution_report:
node_class: gen_report
tools: "semantic_tools.*, db_tools.*, context_search_tools.*"
max_turns: 30
Then use it via /attribution_report Analyze the conversion attribution for campaign X.
gen_skill¶
Overview¶
The gen_skill subagent guides users through creating or optimizing Datus skills. It can inspect existing skills, scaffold new skill directories, edit SKILL.md, validate the result, and search prior sessions for usage patterns.
Key Features¶
- Interactive skill authoring: interview-style workflow with
ask_user - Scoped filesystem access: read-only workspace tools plus write access to the configured skills directory
- Skill-aware editing: load existing skills before revising them
- Validation support: built-in
validate_skillchecks before finishing
Configuration¶
Output Format¶
{
"response": "Created a new validation skill for finance dashboards.",
"skill_name": "finance-dashboard-validation",
"skill_path": "/path/to/skills/finance-dashboard-validation",
"tokens_used": 1980
}
Usage¶
Launch it directly:
Or let the chat agent delegate via task(type="gen_skill").
gen_dashboard¶
Overview¶
The gen_dashboard subagent creates, updates, and manages BI dashboards on Superset and Grafana. It is invoked by the chat agent via task(type="gen_dashboard") and uses BI tools from BIFuncTool to build dashboard assets on existing serving tables or SQL datasets.
Key Features¶
- Multi-platform: Supports Apache Superset and Grafana; platform is explicit via
bi_platformor auto-detected fromagent.services.bi_platforms - Dynamic tool exposure: Tools are exposed based on adapter Mixin capabilities — only operations the platform actually supports appear as LLM tools
- Existing serving data only: data preparation is handled separately by
gen_job/scheduler; gen_dashboard builds BI dataset / chart / dashboard assets - Skill-guided workflows: Platform skills (
superset-dashboard,grafana-dashboard) provide step-by-step workflow guidance;bi-validationruns automatically after creation
Configuration¶
agent:
services:
datasources:
serving_pg:
type: postgresql
host: 127.0.0.1
port: 5433
database: superset_examples
schema: bi_public
username: "${SERVING_WRITE_USER}"
password: "${SERVING_WRITE_PASSWORD}"
bi_platforms:
superset:
type: superset
api_base_url: "http://localhost:8088"
username: "${SUPERSET_USER}"
password: "${SUPERSET_PASSWORD}"
dataset_db:
datasource_ref: serving_pg
bi_database_name: examples
agentic_nodes:
gen_dashboard:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
bi_platform: superset # Optional: auto-detected when only one BI platform is configured
Requirements:
- agent.services.bi_platforms section in agent.yml with platform credentials
- datus-bi-superset or datus-bi-grafana package installed
How It Works¶
graph LR
A[task gen_dashboard] --> C[GenDashboardAgenticNode]
C --> D[BIFuncTool.available_tools]
D --> E[LLM Function Calling]
E -->|Superset| F[list_bi_databases → create_dataset → create_chart → create_dashboard → add_chart_to_dashboard]
E -->|Grafana| G[create_dashboard → create_chart]
F --> H[ValidationHook.on_end]
G --> H
Available Tools¶
Tools are exposed dynamically based on which Mixins the platform adapter implements:
| Tool | Required Capability | Description |
|---|---|---|
list_dashboards |
All adapters | List and search dashboards |
get_dashboard |
All adapters | Get dashboard details |
list_charts |
All adapters | List charts in a dashboard |
get_chart |
All adapters | Get details for a specific chart or panel |
get_chart_data |
Supported adapters | Get chart query result rows for numeric validation |
list_datasets |
All adapters | List datasets/datasources |
create_dashboard |
DashboardWriteMixin |
Create a new dashboard |
update_dashboard |
DashboardWriteMixin |
Update dashboard title/description |
delete_dashboard |
DashboardWriteMixin |
Delete a dashboard |
create_chart |
ChartWriteMixin |
Create a chart or panel |
update_chart |
ChartWriteMixin |
Update chart configuration |
add_chart_to_dashboard |
ChartWriteMixin |
Add chart to a dashboard |
delete_chart |
ChartWriteMixin |
Delete a chart |
create_dataset |
DatasetWriteMixin |
Register a dataset |
list_bi_databases |
DatasetWriteMixin |
List BI platform database connections |
delete_dataset |
DatasetWriteMixin |
Delete a dataset |
get_bi_serving_target |
dataset_db configured |
Return serving DB contract for orchestrator hand-off |
Output Format¶
{
"response": "Created sales dashboard with 3 revenue trend charts.",
"dashboard_result": {
"dashboard_id": 42,
"url": "http://localhost:8088/superset/dashboard/42/"
},
"tokens_used": 3210
}
Usage¶
The gen_dashboard subagent is invoked automatically by the chat agent via task(type="gen_dashboard"), or launched manually:
You can also create a custom subagent using the gen_dashboard node class:
agent:
agentic_nodes:
sales_dashboard:
node_class: gen_dashboard
bi_platform: superset
max_turns: 30
scheduler¶
Overview¶
The scheduler subagent submits, monitors, updates, and troubleshoots scheduled jobs on Apache Airflow. It is invoked by the chat agent via task(type="scheduler") and provides the full Airflow job lifecycle through LLM function calling.
Key Features¶
- Full job lifecycle: Submit, trigger, pause, resume, update, and delete Airflow DAG jobs
- SQL and SparkSQL support: Submit both SQL and SparkSQL job types
- SQL file management: Create or update job SQL files with filesystem tools before submission
- Monitoring: List job runs, fetch run logs, and troubleshoot failures
- Connection discovery: List available Airflow connections for job configuration
Configuration¶
agent:
services:
schedulers:
airflow_prod:
type: airflow
api_base_url: "${AIRFLOW_URL}"
username: "${AIRFLOW_USER}"
password: "${AIRFLOW_PASSWORD}"
dags_folder: "${AIRFLOW_DAGS_DIR}"
agentic_nodes:
scheduler:
model: claude # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
scheduler_service: airflow_prod
Requirements:
- agent.services.schedulers section in agent.yml with Airflow credentials
- datus-scheduler-airflow package installed (datus-scheduler-core is pulled in transitively)
How It Works¶
graph LR
A[chat agent] -->|task type=scheduler| B[SchedulerAgenticNode]
B --> C[LLM Function Calling]
C --> D[write_file / edit_file]
D --> E[submit_sql_job / submit_sparksql_job]
C --> F[trigger_scheduler_job]
C --> G[pause_job / resume_job]
C --> H[list_job_runs / get_run_log]
Available Tools¶
| Tool | Description |
|---|---|
submit_sql_job |
Submit a scheduled SQL job from a .sql file with cron expression |
submit_sparksql_job |
Submit a scheduled SparkSQL job from a .sql file |
read_file / write_file / edit_file |
Read, create, or update SQL files used by scheduled jobs |
trigger_scheduler_job |
Manually trigger an existing job run |
pause_job |
Pause a scheduled job |
resume_job |
Resume a paused job |
delete_job |
Delete a scheduled job |
update_job |
Update job schedule or configuration |
get_scheduler_job |
Get job details and current status |
list_scheduler_jobs |
List all scheduled jobs |
list_scheduler_connections |
List available Airflow connections |
list_job_runs |
List recent runs for a job |
get_run_log |
Fetch logs for a specific job run |
Output Format¶
{
"response": "Submitted daily SQL job 'daily_revenue' scheduled at 8:00 AM every day.",
"scheduler_result": {
"job_id": "daily_revenue_dag",
"status": "active",
"schedule": "0 8 * * *"
},
"tokens_used": 1580
}
Usage¶
The scheduler subagent is invoked automatically by the chat agent via task(type="scheduler"), or launched manually:
/scheduler Submit /opt/sql/daily_revenue.sql as a daily job at 8am using the postgres_prod connection
You can also create a custom subagent using the scheduler node class:
Summary¶
| Subagent | Purpose | Output | Stored In | Key Features |
|---|---|---|---|---|
gen_sql_summary |
Summarize and classify SQL queries | YAML (SQL summary) | /data/reference_sql |
Subject tree categorization, auto context retrieval |
gen_semantic_model |
Generate semantic model from tables | YAML (semantic model) | /data/semantic_models |
DDL to MetricFlow model, built-in validation |
gen_metrics |
Generate metrics from SQL | YAML (metric) | /data/semantic_models |
SQL to MetricFlow metric, subject tree support |
gen_ext_knowledge |
Generate business concepts | YAML (external knowledge) | /data/ext_knowledge |
Question plus SQL to knowledge extraction |
explore |
Read-only data exploration | Structured context | N/A | Strictly read-only, fast turn budget, three exploration directions |
gen_sql |
Generate optimized SQL | SQL query / SQL file | N/A | Deep SQL expertise, auto-validation, file-based output |
gen_report |
Flexible report generation | Structured report | N/A | Configurable tools, extensible, custom report subagents |
gen_table |
Create tables interactively | DDL + execution result | Database | DDL confirmation, CTAS or natural-language schema creation |
gen_job |
Data pipeline jobs (intra-DB ETL + cross-DB transfer) | Job / transfer result | Source + target databases | DDL/DML execution, cross-dialect type mapping via MigrationTargetMixin, transfer_query_result, lightweight reconciliation when source != target |
gen_skill |
Create or optimize skills | Skill path | Skills directory | Interactive authoring, validation, skill loading |
gen_dashboard |
BI dashboard CRUD (Superset, Grafana) | Dashboard result | BI platform | Dynamic tool exposure, existing serving data, multi-platform |
scheduler |
Airflow job lifecycle management | Scheduler result | Airflow | Submit, monitor, update, and troubleshoot jobs |
Built-in Features Across All Subagents:
- Minimal configuration required (only model and max_turns optional)
- Automatic tool setup, hooks, and MCP server integration
- Built-in system prompts, with the latest available version selected by default
- Workflow-specific validation and Knowledge Base sync
- Knowledge Base integration for semantic search
- Automatic workspace management
Together, these subagents automate the data engineering knowledge pipeline — from data exploration → query generation → model definition → metric generation → business knowledge capture → searchable Knowledge Base.