SQL Summary Generation Guide¶

Overview¶

The SQL Summary feature helps you analyze, classify, and catalog SQL queries for knowledge reuse. It automatically generates structured YAML summaries that are stored in a searchable Knowledge Base, making it easy to find and reuse similar queries in the future.

What is a SQL Summary?¶

A SQL summary is a structured YAML document that captures:

Query Text: The complete SQL query
Business Context: Domain, categories, and tags
Semantic Summary: Detailed explanation for vector search
Metadata: Name, comment, file path

Use Cases:

Build a searchable SQL query library
Share and reuse proven queries across teams
Document complex query patterns
Enable semantic search: "Find queries related to revenue analysis"

Quick Start¶

Launch the SQL summary generation subagent:

/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales GROUP BY region.(You can also add some description on this SQL)

How It Works¶

Generation Workflow¶

User provides SQL + description → Agent analyzes query → Automatically retrieves context (taxonomy + similar queries) →
Generates unique ID → Creates YAML → Saves file → Syncs to Knowledge Base

Step-by-Step Process¶

Understand SQL: The AI analyzes your query structure and business logic
Get Context: Automatically retrieves from Knowledge Base:
Existing subject trees (domain/layer1/layer2 combinations)
Similar SQL summaries for classification reference
Generate Unique ID: Uses generate_sql_summary_id() tool based on SQL + comment
Create Unique Name: Generates a descriptive name (max 20 characters)
Classify Query: Assigns domain, layer1, layer2, and tags following existing patterns
Generate YAML: Creates structured summary document
Save File: Writes YAML to workspace using write_file() tool
Sync to Knowledge Base: Stores in LanceDB for semantic search

Sync Behavior¶

After the YAML is written successfully, the generation hook syncs it to the Knowledge Base automatically.

Configuration¶

Agent Configuration¶

Most configurations are built-in. In agent.yml, minimal setup is needed:

agentic_nodes:
  gen_sql_summary:
    model: deepseek      # Optional: defaults to configured model
    max_turns: 30        # Optional: defaults to 30

Built-in configurations (automatically enabled): - Tools: Filesystem tools (read_file, write_file, edit_file, glob) and generate_sql_summary_id - Hooks: Knowledge Base sync - System Prompt: Built-in template; the latest available version is used unless prompt_version is set - Workspace: ~/.datus/data/{datasource}/reference_sql - Context Retrieval: Automatically queries existing subject trees and similar SQLs from Knowledge Base

Configuration Options¶

Parameter	Required	Description	Default
`model`	No	LLM model to use	Uses default configured model
`max_turns`	No	Maximum conversation turns	30

Subject Tree Categorization¶

Subject tree allows organizing SQL summaries by domain and layers. In CLI mode, include it in your question:

Example with subject_tree:

/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales, subject_tree: sales/reporting/revenue_analysis

Example without subject_tree:

/gen_sql_summary Analyze this SQL: SELECT SUM(revenue) FROM sales

When subject_tree is provided, the SQL summary will be categorized accordingly (e.g., domain: sales, layer1: reporting, layer2: revenue_analysis). If not provided, the agent operates in learning mode and automatically suggests categories based on: - Existing subject trees in the Knowledge Base - Similar SQL summaries (top 5 most similar queries)

YAML Structure¶

The generated SQL summary follows this structure:

id: "abc123def456..."                      # Auto-generated MD5 hash
name: "Revenue by Region"                  # Descriptive name (max 20 chars)
sql: |                                     # Complete SQL query
  SELECT
    region,
    SUM(revenue) as total_revenue
  FROM sales
  GROUP BY region
comment: "Calculate total revenue grouped by region"
summary: "This query aggregates total revenue from the sales table, grouping results by geographic region. It uses SUM aggregation to calculate revenue totals for each region."
filepath: "/Users/you/.datus/data/reference_sql/revenue_by_region.yml"
domain: "Sales"                            # Business domain
layer1: "Reporting"                        # Primary category
layer2: "Revenue Analysis"                 # Secondary category
tags: "revenue, region, aggregation"       # Comma-separated tags

Field Descriptions¶

Field	Required	Description	Example
`id`	Yes	Unique hash (auto-generated)	`abc123def456...`
`name`	Yes	Short descriptive name (max 20 chars)	`Revenue by Region`
`sql`	Yes	Complete SQL query	`SELECT ...`
`comment`	Yes	Brief one-line description	User's message or generated summary
`summary`	Yes	Detailed explanation (for search)	Comprehensive query description
`filepath`	Yes	Actual file path	`/path/to/file.yml`
`domain`	Yes	Business domain	`Sales`, `Marketing`, `Finance`
`layer1`	Yes	Primary category	`Reporting`, `Analytics`, `ETL`
`layer2`	Yes	Secondary category	`Revenue Analysis`, `Customer Insights`
`tags`	Optional	Comma-separated keywords	`revenue, region, aggregation`

Summary¶

The SQL Summary feature provides:

✅ Automated Analysis: AI understands query structure and purpose ✅ Smart Classification: Taxonomy-based categorization with consistency using existing patterns ✅ Automatic Context Retrieval: Queries existing subject trees and similar SQLs from Knowledge Base ✅ Deduplication: Automatic hash-based duplicate detection using generate_sql_summary_id tool ✅ Semantic Search: Vector embeddings enable intelligent query discovery ✅ Automatic Sync: Syncs generated summaries after the YAML file is written ✅ Subject Tree Support: Organize by domain/layer1/layer2 with predefined or learned categories ✅ Knowledge Reuse: Build searchable SQL query library