External Knowledge Intelligence¶

Overview¶

Bootstrap-KB External Knowledge is a component that processes, stores, and indexes domain-specific business knowledge to create an intelligent searchable knowledge base. It transforms business rules and concepts into a structured repository with semantic search capabilities.

Core Value¶

What Problem Does It Solve?¶

Business Knowledge Silos: Domain knowledge scattered across teams without centralized access
Terminology Ambiguity: Different interpretations of business terms across the organization
Context Gap: SQL agents lacking understanding of business-specific concepts
Knowledge Onboarding: New team members struggling to understand domain-specific terminology

What Value Does It Provide?¶

Unified Knowledge Base: Centralized repository for business terminology and rules
Semantic Search: Find relevant knowledge using natural language queries
Agent Context Enhancement: Enriches SQL generation with domain understanding
Knowledge Preservation: Captures business expertise in a structured, searchable format

Usage¶

Basic Command¶

# From CSV (direct import)
datus-agent bootstrap-kb \
    --namespace <your_namespace> \
    --components ext_knowledge \
    --ext_knowledge /path/to/knowledge.csv \
    --kb_update_strategy overwrite

# From success story (AI generation)
datus-agent bootstrap-kb \
    --namespace <your_namespace> \
    --components ext_knowledge \
    --success_story /path/to/success_story.csv \
    --kb_update_strategy overwrite

Key Parameters¶

Parameter	Required	Description	Example
`--namespace`	✅	Database namespace	`analytics_db`
`--components`	✅	Components to initialize	`ext_knowledge`
`--ext_knowledge`	⚠️	Path to knowledge CSV file (required if no `--success_story`)	`/data/knowledge.csv`
`--success_story`	⚠️	Path to success story CSV file (required if no `--ext_knowledge`)	`/data/success_story.csv`
`--kb_update_strategy`	✅	Update strategy	`overwrite`/`incremental`
`--subject_tree`	❌	Predefined subject tree categories	`Finance/Revenue,User/Engagement`
`--pool_size`	❌	Concurrent processing threads, default is 4	`8`

Data Source Formats¶

Direct Import¶

Import pre-defined knowledge entries directly from a CSV file.

CSV Format¶

Column	Required	Description	Example
`subject_path`	Yes	Hierarchical category path	`Finance/Revenue/Metrics`
`name`	Yes	Knowledge entry name	`GMV Definition`
`search_text`	Yes	Searchable business term	`GMV`
`explanation`	Yes	Detailed description	`Gross Merchandise Volume...`

Example CSV¶

subject_path,name,search_text,explanation
Finance/Revenue/Metrics,GMV Definition,GMV,"Gross Merchandise Volume (GMV) represents the total value of merchandise sold through the platform, including both paid and unpaid orders."
User/Engagement/DAU,DAU Definition,DAU,"Daily Active Users (DAU) counts unique users who performed at least one activity within a calendar day."
User/Engagement/Retention,Retention Rate,retention rate,"The percentage of users who return to the platform after their first visit, typically measured at Day 1, Day 7, and Day 30 intervals."

AI Generation¶

Generate knowledge automatically from question-SQL pairs using AI agent.

CSV Format¶

Column	Required	Description	Example
`question`	Yes	Business question or query intent	`What is the total GMV for last month?`
`sql`	Yes	SQL query that answers the question	`SELECT SUM(amount) FROM orders...`
`subject_path`	No	Hierarchical category path (optional)	`Finance/Revenue/Metrics`

Example CSV¶

question,sql,subject_path
"What is the total GMV for last month?","SELECT SUM(amount) as gmv FROM orders WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)",Finance/Revenue/Metrics
"How many daily active users do we have?","SELECT COUNT(DISTINCT user_id) as dau FROM user_activity WHERE activity_date = CURDATE()",User/Engagement/DAU
"What is our 7-day retention rate?","SELECT COUNT(DISTINCT d7.user_id) / COUNT(DISTINCT d0.user_id) as retention FROM users d0 LEFT JOIN users d7 ON d0.user_id = d7.user_id",User/Engagement/Retention

How AI Generation Works¶

The success story mode uses GenExtKnowledgeAgenticNode which supports two operating modes:

Workflow Mode: When question and gold_sql are provided as structured fields (e.g., from CSV batch processing), they are used directly.
Agentic Mode: When only a user_message is provided (e.g., interactive chat), the system uses a lightweight LLM to parse and extract the question and reference SQL from the message.

The generation pipeline:

Analyze Question-SQL Pairs: Understands the business intent behind each query
Extract Business Concepts: Identifies key terminology, rules, and patterns
Generate Knowledge Entries: Creates structured knowledge with search_text and explanation
Classify Categories: Assigns appropriate subject paths for organization
Verify SQL: Uses verify_sql tool to compare agent-generated SQL against the hidden reference SQL. If verification fails, the system provides comparison feedback (match rate, column differences, data preview, and improvement suggestions) generated via CompareAgenticNode
Retry on Failure: If verification fails, the system automatically retries up to max_verification_retries times (default: 3), injecting retry prompts that guide the agent to correct or create knowledge entries and re-verify

Update Strategies¶

1. Overwrite Mode¶

Clears existing knowledge and loads fresh data:

datus-agent bootstrap-kb \
    --namespace analytics_db \
    --components ext_knowledge \
    --ext_knowledge /path/to/knowledge.csv \
    --kb_update_strategy overwrite

2. Incremental Mode¶

Adds new knowledge entries while preserving existing ones. Entries with the same subject_path + name are automatically updated (upsert):

datus-agent bootstrap-kb \
    --namespace analytics_db \
    --components ext_knowledge \
    --success_story /path/to/success_story.csv \
    --kb_update_strategy incremental

Subject Tree Categorization¶

Subject tree provides a hierarchical taxonomy for organizing knowledge entries.

1. Predefined Mode¶

datus-agent bootstrap-kb \
    --namespace analytics_db \
    --components ext_knowledge \
    --success_story /path/to/success_story.csv \
    --kb_update_strategy overwrite \
    --subject_tree "Finance/Revenue/Metrics,User/Engagement/DAU"

2. Learning Mode¶

When no subject_tree is provided, the system:

Reuses existing categories from the Knowledge Base
Creates new categories as needed based on content
Builds taxonomy organically over time

Integration with SQL Agent¶

External knowledge integrates with the SQL generation agent through context search tools:

Automatic Context Retrieval: When generating SQL, the agent queries relevant business knowledge
Term Resolution: Ambiguous business terms are resolved using stored definitions
Rule Application: Business rules stored in knowledge base guide SQL logic

Example Workflow¶

User asks: "Calculate the GMV for last month"
Agent searches knowledge for "GMV"
Finds definition: "GMV = total value of merchandise including paid and unpaid orders"
Generates SQL with correct business logic

Summary¶

The Bootstrap-KB External Knowledge component transforms scattered business knowledge into an intelligent, searchable knowledge base.

Key Features:

Dual Import Modes: Direct CSV import or AI-driven generation from success stories
Dual Operating Modes: Workflow mode (structured input) and agentic mode (free-form user message with LLM parsing)
Unified Repository: Centralized storage for business terminology and rules
Semantic Search: Find knowledge using natural language queries
Hierarchical Organization: Navigate knowledge through subject path taxonomy
Flexible Classification: Support both predefined and learning modes
Agent Integration: Enhances SQL generation with domain context
Upsert Deduplication: Entries with the same subject_path + name are automatically updated rather than duplicated
SQL Verification Loop: AI-generated knowledge is validated by comparing SQL results against a hidden reference, with automatic retry and feedback
Batch Knowledge Retrieval: get_knowledge supports fetching multiple entries by paths in a single call

By implementing external knowledge, teams can ensure consistent understanding of business concepts and enable intelligent SQL generation with domain awareness.

Best Practices: Building Knowledge Base End-to-End¶

This section walks through the complete process of building an external knowledge base using two approaches, demonstrated with the california_schools database.

Scenario¶

The goal is to build knowledge that enables the SQL agent to correctly answer:

"Under whose administration is the school with the highest number of students scoring 1500 or more on the SAT? Indicate their full names."

The key business knowledge is:

Full name means first name + last name
There are at most 3 administrators for each school (AdmFName1/AdmLName1, AdmFName2/AdmLName2, AdmFName3/AdmLName3)
SAT Scores >= 1500 refers to the NumGE1500 column in the satscores table

The expected SQL:

SELECT T2.AdmFName1, T2.AdmLName1, T2.AdmFName2, T2.AdmLName2, T2.AdmFName3, T2.AdmLName3
FROM satscores AS T1
INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode
ORDER BY T1.NumGE1500 DESC
LIMIT 1

Approach 1: Bootstrap from Success Stories (Batch Workflow)¶

Best for bulk import from existing question-SQL pairs, initial KB builds, and CI/CD pipelines.

Step 1: Prepare the Success Story CSV¶

Create a CSV file (e.g., success_story.csv) with question and sql columns. Optionally include subject_path for pre-classification:

question,sql,subject_path
"Under whose administration is the school with the highest number of students scoring 1500 or more on the SAT? Indicate their full names.","SELECT T2.AdmFName1, T2.AdmLName1, T2.AdmFName2, T2.AdmLName2, T2.AdmFName3, T2.AdmLName3 FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode ORDER BY T1.NumGE1500 DESC LIMIT 1",Education/SAT/Administrators

Step 2: Run the Bootstrap Command¶

datus-agent bootstrap-kb \
    --namespace california_schools \
    --components ext_knowledge \
    --success_story /path/to/success_story.csv \
    --kb_update_strategy overwrite \
    --subject_tree "Education/SAT/Administrators,Education/SAT/Scores,Education/Schools"

Step 3: What Happens Internally¶

Each CSV row is read and converted into an ExtKnowledgeNodeInput with question and gold_sql passed as structured fields (workflow mode)
A GenExtKnowledgeAgenticNode is created in workflow mode for each row
The agent analyzes the question-SQL pair, extracts business concepts, and generates a knowledge YAML file
The agent calls verify_sql to compare its generated SQL against the hidden reference SQL. If match_rate < 100%, CompareAgenticNode generates improvement suggestions, and the agent retries (up to max_verification_retries times, default 3)
Once verification passes, the knowledge entry is automatically saved to the Knowledge Base (no user confirmation needed in workflow mode)

Bootstrap Log Output¶

During execution, the bootstrap process outputs logs that track progress and verification status:

[info     ] Verification status updated: passed=True, match_rate=1.0 [datus.agent.node.gen_ext_knowledge_agentic_node]
[info     ] Agentic loop ended. Verification passed: True, attempt: 1/4 [datus.agent.node.gen_ext_knowledge_agentic_node]
[info     ] Successfully upserted 2 items in batch [datus.storage.subject_tree.store]
[info     ] Successfully upserted 2 external knowledge entries to Knowledge Base [datus.cli.generation_hooks]
[info     ] Successfully saved to database: Upserted 2 knowledge entries: SAT Score Record Types, Admin Full Names Columns [datus.agent.node.gen_ext_knowledge_agentic_node]
[info     ] Auto-saved to database: Education_SAT_Administrators_knowledge.yaml [datus.agent.node.gen_ext_knowledge_agentic_node]
[info     ] Generated knowledge for: Under whose administration is the school with the highest number of students scoring 1500 or more on the SAT? Indicate their full names. [datus.storage.ext_knowledge.ext_knowledge_init]
[info     ] Final Result: {'status': 'success', 'message': 'ext_knowledge bootstrap completed, knowledge_size=2'} [__main__]

Step 4: Expected Output¶

The system generates a knowledge YAML file like:

name: "SAT Score Record Types"
search_text: "SAT scores highest school district record type rtype filter"
explanation: "When querying SAT data for 'schools' with highest scores: (1) The satscores table has rtype column where 'S'=School level, 'D'=District level; (2) When question mentions 'school' without explicit specification, do NOT add rtype='S' filter - the question may refer to any educational entity including districts; (3) District-level records often have higher aggregate numbers than individual schools."
subject_path: "Education/SAT/Administrators"
created_at: "2025-01-15T10:00:00Z"
---
name: "Admin Full Names Columns"
search_text: "administrator full name school multiple administrators"
explanation: "When retrieving administrator 'full names' from schools table: (1) Include ALL administrator columns: AdmFName1, AdmLName1, AdmFName2, AdmLName2, AdmFName3, AdmLName3; (2) Schools may have multiple administrators (up to 3); (3) Even if most entries only have one administrator, query all 6 columns to ensure complete coverage of 'full names' as requested."
subject_path: "Education/SAT/Administrators"
created_at: "2025-01-15T10:00:00Z"

Step 5: Verify the Result¶

Start the cli, first use @subject to browse the generated knowledge entries, then test with the original question:

datus-agent --namespace california_schools

# Browse the knowledge tree and entries
Datus> @subject
# Should show Education/SAT/Administrators and Education/SAT/Scores with the generated knowledge entries

# Test with the original question
Datus> Under whose administration is the school with the highest number of students scoring 1500 or more on the SAT? Indicate their full names.

The agent should:

Search the knowledge base and retrieve entries about NumGE1500 and administrator full name columns
Generate SQL that joins satscores with schools, orders by NumGE1500 DESC, and selects all 6 administrator name columns (AdmFName1, AdmLName1, AdmFName2, AdmLName2, AdmFName3, AdmLName3)
Return results matching the expected SQL output

Approach 2: Subagent Interactive Mode¶

Best for ad-hoc knowledge creation, exploring and debugging, or refining individual entries.

Step 1: Start the REPL¶

datus-agent --namespace california_schools

Step 2: Invoke the Subagent¶

Use the /gen_ext_knowledge slash command, pasting the question and reference SQL together in the message:

Datus> /gen_ext_knowledge Under whose administration is the school with the highest number of students scoring 1500 or more on the SAT? Indicate their full names. Reference SQL: SELECT T2.AdmFName1, T2.AdmLName1, T2.AdmFName2, T2.AdmLName2, T2.AdmFName3, T2.AdmLName3 FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode ORDER BY T1.NumGE1500 DESC LIMIT 1

Step 3: What Happens Internally¶

The system creates a GenExtKnowledgeAgenticNode in interactive mode
Since only user_message is provided (no structured question/gold_sql fields), a lightweight LLM parses the message to extract the question and reference SQL separately
The agent enters its agentic loop: analyzes the question, queries the database schema, generates knowledge entries, and writes them to a YAML file
verify_sql validates the generated SQL against the hidden reference. If it fails, the agent automatically retries with targeted feedback

Step 4: Confirm Database Sync¶

In interactive mode, GenerationHooks intercepts the write_file call and prompts for confirmation:

[Knowledge Generated] School Administrator Full Names
Sync this knowledge entry to the Knowledge Base? [y/n]: y

Approve to save the entry to the Knowledge Base. You can also decline and edit the YAML file manually before re-syncing.

When to Use Which¶

Aspect	Bootstrap (batch)	Subagent (interactive)
Use case	Bulk import from existing Q&A pairs	Ad-hoc knowledge creation/refinement
Input	CSV file with `question`, `sql` columns	Free-form message in REPL
Gold SQL handling	Passed directly as structured fields	Parsed from user message by LLM
DB save	Automatic (no confirmation)	User confirms via hook prompt
Verification	Automatic retry loop	Automatic retry loop
Best for	Initial KB build, CI/CD pipelines	Exploring, debugging, single-entry fixes

Tips¶

Emphasize knowledge usage when verifying: When testing in the CLI, add "Please search the knowledge base first" to your question to ensure the agent uses stored knowledge rather than relying solely on its own reasoning.
Pre-build subject tree: Create the subject tree structure in advance via @subject. Subsequent runs without --subject_tree will automatically reuse existing categories (learning mode), giving you control over taxonomy without requiring subject_path in every CSV row.
Iterate for stability: Run bootstrap multiple times with incremental mode. Each run may produce improved knowledge entries, and the upsert mechanism ensures existing entries are updated with better content.