Skip to content

Contextual Data Engineering

What's Contextual Data Engineering

Contextual Data Engineering is a new paradigm that transforms how data systems are built, maintained, and used in the age of AI. Instead of delivering static tables and pipelines, it focuses on building evolvable context — a living, intelligent layer for data that integrates metadata, referenced SQLs, semantic models, and metrics into a unified system that both humans and AI agents can understand.

Data Engineering Pipeline Comparison

  • In traditional data engineering, pipelines end with data delivery.
  • In contextual data engineering, the pipeline itself becomes a knowledge graph of your data system, continuously learning from historical SQL, feedback loops, and human corrections.

It is not just about moving data and build tables — it's about understanding and evolving the context around the data.

Why It Matters

LLMs hallucinate without context

Data context is a vast and complex space. We need data engineers — the ones who know the data best — to build reusable, AI-ready contexts that ground every query and response.

Static tables don't scale for dynamic needs

Modern business questions change daily. Ad-hoc data fetch requests consume half of a data engineer's time, but the knowledge behind those queries is rarely captured or reused.

Traditional data engineering is not evolvable

The focus has long been on the data consumer side (analytics and dashboards), not the producer side where context and accuracy are built. Contextual data engineering shifts that focus — empowering engineers to produce living context rather than static artifacts.

Why Datus

Automatic Context Capture

Datus automatically captures, stores, and recalls historical SQL, table structures, metrics, and semantic layers on demand — turning every interaction into long-term knowledge.

Enhanced Long-term Memory

Dual recall mechanisms (Tree + Vector) allow the system to remember not just exact matches, but semantically related queries and patterns — forming a continuously growing "context graph" for your data.

Evolving Context Engineering

The system learns from both machine generation and human feedback, incrementally refining its context over time. Every correction, benchmark, or success story becomes part of a self-improving data memory.

Core Concepts

Long-Term Memory

We model Data Engineering Context (Long-Term Memory) as two trees:

Context Tree Structure

  • In Datus CLI, you can browse and edit them via @catalog and @subject
  • Use datus-agent bootstrap-kb to batch-initialize and cold-start the knowledge base
  • With subagents, you can define a scoped context — a curated subset of the global store that enables precise, domain-aware delivery

Interactive Context Building

Co-authored context

LLMs draft semantic models and metrics from tables and reference SQL, while engineers refine labels, metadata, and the subject tree.

Command-driven iteration

Commands like /gen_semantic_model, /gen_metrics, and /gen_sql_summary create and update assets. The @catalog and @subject screens support in-place edits.

Feedback drives continuous improvement

Exploration with /chat, success-story writebacks, and issue/benchmark loops convert usage into durable, reusable knowledge.

Subagent System

Subagent System

Scoped, domain-aware subagents

Package description, rules, and scoped context to unify tables, SQL patterns, metrics, and constraints for specific business scenarios.

Configurable tools and MCPs

Configure tools per scenario. Built-ins include DB tools, context-search tools, and filesystem tools. Enable and compose as needed.

RL-ready architecture

The subagent's scoped context forms an ideal RL environment (environment + question + SQL) for continuous training and evaluation.

Tools and Components

Datus CLI

An interactive CLI for data engineers with context-aware compression and search tools. Provides three "magic commands":

  • / to chat and orchestrate
  • @ to view and recall context
  • ! to execute node/tool actions

Datus Agent

A benchmarking and bootstrap utility — the batch companion to the CLI. Used to:

  • Build initial context from historical data
  • Run benchmarks and evaluations
  • Expose corresponding APIs

Datus Chat

A lightweight web chatbot for analysts and business users, supporting:

  • Multi-turn conversations
  • Built-in feedback (upvotes, issue reports, success stories)

User Journey

Initial Exploration

Kick off with /chat

Probe the database and quickly validate end-to-end Q&A: schema understanding, simple aggregations, and table lookups.

Refine with targeted hints

Use /question + @table @file to provide examples, join rules, column notes, or query templates that ground the model in the right context.

Tight feedback loop

Try → inspect result → add clarifications (e.g., "use PK join between orders and users", "status ∈ {A,B,C}") → retry.

Example:

/chat What were last week's DAU and retention for the new region?
/question @table users events @file ./notes/retention_rules.md

Building Context

Import Reference SQL

Capture proven patterns and edge-case handling. Promote high-value snippets into reusable blocks. See Reference SQL Tracking for details.

Generate first-pass assets

  • /gen_semantic_model @table ... — draft YAML for dimensions/measures
  • /gen_metrics @sql @semantic_model ... — define business metrics and tests
  • /gen_sql_summary @file ./sql/ — summarize intent and dependencies

Human-in-the-loop curation

Refine domain → layer_1 → layer_2 descriptions, align naming conventions, and edit metadata directly in @catalog / @subject.

Example:

/gen_semantic_model @table fact_orders dim_users
/gen_metrics @sql ./sql/retention.sql @semantic_model ./models/game.yaml "30-day retention"

Creating a Subagent

Define a domain-aware chatbot

Use .subagent add <name> for a specific scenario (e.g., "commercialization analytics").

Package the essentials

Include description, rules (join keys, filters, calculation rules), scoped context (tables, SQL snippets, metrics), and allowed tools.

Constrain the search space

Limit to selected catalogs/tables and enable only relevant tools/MCPs (DB tools, context search, filesystem where needed).

Example:

.subagent add california_schools

Delivering and Iterating

Ship to the web

Serve the subagent as a lightweight chatbot UI for analysts to perform multi-turn analysis and report preview.

datus-agent --namespace schools --web

You can then access http://localhost:8501/?subagent=california_schools (change localhost to your IP address if you're deploying the subagent to your stakeholders).

Collect and write back feedback

Analysts upvote good results and report issues with traceable session links. Export successful runs as success stories.

Close the loop

When data engineers receive issue links from customers, they can revise SQL, update rules and metadata, build more metrics, or expand the scoped context. The subagent improves continuously as knowledge is captured.

Next Steps

  • Knowledge Base


    Explore detailed context management with metadata, metrics, and reference SQL.

    Learn more

  • Workflow Integration


    Integrate context into automated data pipelines and orchestration.

    Explore workflows

  • CLI Context Commands


    Master the CLI with hands-on context management commands.

    View commands

  • Configuration


    Configure advanced settings for agents, namespaces, and storage.

    Configuration guide