SQL Summary Generation Guide¶
Overview¶
The SQL Summary feature helps you analyze, classify, and catalog SQL queries for knowledge reuse. It automatically generates structured YAML summaries that are stored in a searchable Knowledge Base, making it easy to find and reuse similar queries in the future.
What is a SQL Summary?¶
A SQL summary is a structured YAML document that captures:
- Query Text: The complete SQL query
- Business Context: Domain, categories, and tags
- Semantic Summary: Detailed explanation for vector search
- Metadata: Name, comment, file path
Use Cases:
- Build a searchable SQL query library
- Share and reuse proven queries across teams
- Document complex query patterns
- Enable semantic search: "Find queries related to revenue analysis"
Quick Start¶
Launch the SQL summary generation subagent:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales GROUP BY region.(You can also add some description on this SQL)
How It Works¶
Generation Workflow¶
User provides SQL + description → Agent analyzes query → Automatically retrieves context (taxonomy + similar queries) →
Generates unique ID → Creates YAML → Saves file → Syncs to Knowledge Base
Step-by-Step Process¶
- Understand SQL: The AI analyzes your query structure and business logic
- Get Context: Automatically retrieves from Knowledge Base:
- Existing subject trees (domain/layer1/layer2 combinations)
- Similar SQL summaries for classification reference
- Generate Unique ID: Uses
generate_sql_summary_id()tool based on SQL + comment - Create Unique Name: Generates a descriptive name (max 20 characters)
- Classify Query: Assigns domain, layer1, layer2, and tags following existing patterns
- Generate YAML: Creates structured summary document
- Save File: Writes YAML to workspace using
write_file()tool - Sync to Knowledge Base: Stores in LanceDB for semantic search
Sync Behavior¶
After the YAML is written successfully, the generation hook syncs it to the Knowledge Base automatically.
Configuration¶
Agent Configuration¶
Most configurations are built-in. In agent.yml, minimal setup is needed:
agentic_nodes:
gen_sql_summary:
model: deepseek # Optional: defaults to configured model
max_turns: 30 # Optional: defaults to 30
Built-in configurations (automatically enabled):
- Tools: Filesystem tools (read_file, write_file, edit_file, glob) and generate_sql_summary_id
- Hooks: Knowledge Base sync
- System Prompt: Built-in template; the latest available version is used unless prompt_version is set
- Workspace: ~/.datus/data/{datasource}/reference_sql
- Context Retrieval: Automatically queries existing subject trees and similar SQLs from Knowledge Base
Configuration Options¶
| Parameter | Required | Description | Default |
|---|---|---|---|
model |
No | LLM model to use | Uses default configured model |
max_turns |
No | Maximum conversation turns | 30 |
Subject Tree Categorization¶
Subject tree allows organizing SQL summaries by domain and layers. In CLI mode, include it in your question:
Example with subject_tree:
/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales, subject_tree: sales/reporting/revenue_analysis
Example without subject_tree:
When subject_tree is provided, the SQL summary will be categorized accordingly (e.g., domain: sales, layer1: reporting, layer2: revenue_analysis). If not provided, the agent operates in learning mode and automatically suggests categories based on: - Existing subject trees in the Knowledge Base - Similar SQL summaries (top 5 most similar queries)
YAML Structure¶
The generated SQL summary follows this structure:
id: "abc123def456..." # Auto-generated MD5 hash
name: "Revenue by Region" # Descriptive name (max 20 chars)
sql: | # Complete SQL query
SELECT
region,
SUM(revenue) as total_revenue
FROM sales
GROUP BY region
comment: "Calculate total revenue grouped by region"
summary: "This query aggregates total revenue from the sales table, grouping results by geographic region. It uses SUM aggregation to calculate revenue totals for each region."
filepath: "/Users/you/.datus/data/reference_sql/revenue_by_region.yml"
domain: "Sales" # Business domain
layer1: "Reporting" # Primary category
layer2: "Revenue Analysis" # Secondary category
tags: "revenue, region, aggregation" # Comma-separated tags
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
id |
Yes | Unique hash (auto-generated) | abc123def456... |
name |
Yes | Short descriptive name (max 20 chars) | Revenue by Region |
sql |
Yes | Complete SQL query | SELECT ... |
comment |
Yes | Brief one-line description | User's message or generated summary |
summary |
Yes | Detailed explanation (for search) | Comprehensive query description |
filepath |
Yes | Actual file path | /path/to/file.yml |
domain |
Yes | Business domain | Sales, Marketing, Finance |
layer1 |
Yes | Primary category | Reporting, Analytics, ETL |
layer2 |
Yes | Secondary category | Revenue Analysis, Customer Insights |
tags |
Optional | Comma-separated keywords | revenue, region, aggregation |
Summary¶
The SQL Summary feature provides:
✅ Automated Analysis: AI understands query structure and purpose
✅ Smart Classification: Taxonomy-based categorization with consistency using existing patterns
✅ Automatic Context Retrieval: Queries existing subject trees and similar SQLs from Knowledge Base
✅ Deduplication: Automatic hash-based duplicate detection using generate_sql_summary_id tool
✅ Semantic Search: Vector embeddings enable intelligent query discovery
✅ Automatic Sync: Syncs generated summaries after the YAML file is written
✅ Subject Tree Support: Organize by domain/layer1/layer2 with predefined or learned categories
✅ Knowledge Reuse: Build searchable SQL query library