Data Engineering Quickstart¶
This guide walks through a complete local Datus workflow using the open DAComp data-engineering dataset. You will inspect the warehouse design, build layered tables interactively in a local DuckDB workbench file, generate ETL jobs, produce marts data, submit a daily Airflow job, and publish the result to Superset.
The local open-source quickstart does not require Iceberg, MinIO, or S3. The SaaS Studio tour uses a managed DuckDB + Iceberg lakehouse instead; see SaaS Studio Tour Variant for the namespace model.
Step 0: Download the Quickstart Data¶
DAComp is not bundled with datus-agent. This tutorial uses a small
quickstart package derived from the DAComp Lever example, so you do not need to
download the full DAComp archive.
First create and enter the working directory:
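Any empty directory works; the name below is only a placeholder:

```bash
# Any empty directory works; the name is just an example.
mkdir -p ~/datus-quickstart
cd ~/datus-quickstart
```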
Run the bash block below — it downloads and unpacks the quickstart data and
local Docker stack, creates a writable DuckDB workbench, exports DACOMP_HOME
/ DATUS_QUICKSTART_STACK, and finally prints the two export statements so
you can paste them into another shell:
```bash
curl -L -o datus-de-lever-quickstart-v1.zip \
  https://github.com/Datus-ai/datus-quickstart-data/releases/download/data-engineering-v1/datus-de-lever-quickstart-v1.zip
curl -L -o datus-data-engineering-quickstart-stack-v1.zip \
  https://github.com/Datus-ai/datus-quickstart-data/releases/download/data-engineering-v1/datus-data-engineering-quickstart-stack-v1.zip
unzip -o datus-de-lever-quickstart-v1.zip
unzip -o datus-data-engineering-quickstart-stack-v1.zip
export DACOMP_HOME="$(pwd)/datus-de-lever-quickstart"
export DATUS_QUICKSTART_STACK="$(pwd)/datus-data-engineering-quickstart-stack"
cp "$DACOMP_HOME/lever_start.duckdb" "$DACOMP_HOME/lever_workbench.duckdb"
cd "$DACOMP_HOME"
echo "export DACOMP_HOME=$DACOMP_HOME"
echo "export DATUS_QUICKSTART_STACK=$DATUS_QUICKSTART_STACK"
```
The rest of this guide assumes the example directory contains:
- `docs/data_contract.yaml`
- `config/layer_dependencies.yaml`
- `lever_start.duckdb`
Step 1: Understand the Warehouse Layers¶
The DAComp example already encodes a classic warehouse layout:
| Layer | Tables | Purpose |
|---|---|---|
| `staging` | 24 | Clean raw ATS records and normalize types and formats |
| `intermediate` | 17 | Join entities and apply reusable business logic |
| `marts` | 14 | Publish analytics-ready outputs for dashboards and reporting |
The two files that drive the design are:
- `docs/data_contract.yaml` - row-level cleanup, validation, and normalization rules
- `config/layer_dependencies.yaml` - layer order and table dependencies
Read those first so the prompts you give to the agent stay aligned with the intended warehouse design.
Step 2: Start the Local Quickstart Stack¶
The downloaded stack includes the local demo services used by this walkthrough.
Start Superset:
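The exact Compose file layout depends on the unpacked stack; assuming it ships a `superset/` subdirectory, a typical invocation looks like this:

```bash
cd "$DATUS_QUICKSTART_STACK/superset"
docker compose up -d   # or docker-compose up -d on older Docker installs
```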
Start Airflow:
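The `dags_folder` path used in Step 3 implies an `airflow/` subdirectory in the stack; assuming the Airflow Compose file lives there:

```bash
cd "$DATUS_QUICKSTART_STACK/airflow"
docker compose up -d
```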
Default local endpoints:
- Superset: `http://127.0.0.1:8088`, username `admin`, password `admin`
- Airflow: `http://127.0.0.1:8080`, username `admin`, password `admin`
For this quickstart, the Superset compose file uses local demo defaults for the metadata database and admin user.
The Airflow compose file mounts ${DACOMP_HOME} into the container and exposes
an Airflow connection named duckdb_dacomp_lever, which points to
/workspace/lever_workbench.duckdb.
Step 3: Configure agent.yml¶
Merge the following service configuration into the existing agent: section in
~/.datus/conf/agent.yml. Keep any existing agent.providers settings; the
/model command uses those credentials. The paths use the DACOMP_HOME and DATUS_QUICKSTART_STACK
environment variables from Step 0.
```yaml
agent:
  services:
    datasources:
      lever_duckdb:
        type: duckdb
        uri: "duckdb:///${DACOMP_HOME}/lever_workbench.duckdb"
        default: true
      superset_serving:
        type: postgresql
        host: 127.0.0.1
        port: 5433
        database: superset_examples
        schema: public
        username: superset
        password: superset
    bi_platforms:
      superset:
        type: superset
        api_base_url: http://127.0.0.1:8088
        username: admin
        password: admin
        dataset_db:
          datasource_ref: superset_serving
          bi_database_name: examples
    schedulers:
      airflow_prod:
        type: airflow
        api_base_url: http://127.0.0.1:8080/api/v1
        username: admin
        password: admin
        dags_folder: "${DATUS_QUICKSTART_STACK}/airflow/dags"
        connections:
          duckdb_dacomp_lever: DAComp Lever DuckDB
    semantic_layer:
      metricflow:
        type: metricflow
  agentic_nodes:
    gen_dashboard:
      bi_platform: superset
    scheduler:
      scheduler_service: airflow_prod
```
Then start Datus with the lever_duckdb datasource, which points at the
writable workbench file:
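The exact entry point and flags depend on your installed Datus version; one possible invocation, assuming the CLI selects the datasource via a `--namespace` flag, is:

```bash
# Assumption: the CLI binary is datus-cli and the datasource is selected with
# --namespace; check `datus-cli --help` for the flags in your installed version.
datus-cli --namespace lever_duckdb
```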
If the CLI says no model is configured, configure one before continuing:
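Inside the running CLI session, run the model command:

```
/model
```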
Choose a provider/model and enter credentials if prompted. /model writes
provider credentials under agent.providers in ~/.datus/conf/agent.yml and
writes the active provider/model for this project to ./.datus/config.yml.
Here dags_folder is the host-side directory where Datus writes generated DAG files. The Airflow compose file mounts that directory into the Airflow container as /opt/airflow/dags, so newly generated DAGs are picked up automatically.
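For reference, the relevant mappings in the Airflow compose file look roughly like this (the service name and exact syntax are illustrative; only the paths come from the descriptions above):

```yaml
services:
  airflow:
    volumes:
      # generated DAG files land here on the host and appear inside the container
      - ${DATUS_QUICKSTART_STACK}/airflow/dags:/opt/airflow/dags
      # the workbench file becomes /workspace/lever_workbench.duckdb, which is
      # what the duckdb_dacomp_lever connection points at
      - ${DACOMP_HOME}:/workspace
```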
Step 4: Create the Required Staging Tables¶
For natural-language agent tasks, avoid starting the message with a raw SQL verb
such as CREATE or COPY; the CLI uses those leading keywords to detect direct
SQL.
Ask the agent to create the target schemas:
Please set up the target schemas staging, intermediate, and marts in the current DuckDB database. Keep the existing raw schema unchanged.
This walkthrough builds a narrow but complete dependency chain for
marts.lever__requisition_enhanced. Use docs/data_contract.yaml as the source
of truth for field selection, renames, and business logic.
Ask the agent to create the staging tables required by the source_models
listed for lever__requisition_enhanced and
intermediate.int_lever__requisition_users. The agent will route the request to
the table-generation workflow:
Read ./docs/data_contract.yaml and create the staging tables needed for marts.lever__requisition_enhanced: staging.stg_lever__requisition from raw.requisition, staging.stg_lever__user from raw.user, staging.stg_lever__requisition_posting from raw.requisition_posting, and staging.stg_lever__requisition_offer from raw.requisition_offer. Use the field design and source-to-target mapping from the contract.
These four staging tables are the minimum raw-to-staging inputs for the requisition-enhancement example.
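To confirm the staging layer before moving on, a message that begins with SELECT is run as direct SQL (per the note above), for example:

```sql
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'staging'
ORDER BY table_name;
```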
Step 5: Build the Intermediate and Marts Tables¶
Build the intermediate model first. It should combine requisition fields with
user fields according to the int_lever__requisition_users entry in
docs/data_contract.yaml.
Create the intermediate table:
Read ./docs/data_contract.yaml and create intermediate.int_lever__requisition_users from staging.stg_lever__requisition and staging.stg_lever__user. Use the contract's field design, joins, and source-to-target mapping.
Then create the marts table that is ready for downstream analytics. The contract
defines marts.lever__requisition_enhanced as one row per requisition_id,
using:
- `intermediate.int_lever__requisition_users`
- `staging.stg_lever__requisition_posting`
- `staging.stg_lever__requisition_offer`
Create the marts table:
Read ./docs/data_contract.yaml and create marts.lever__requisition_enhanced from intermediate.int_lever__requisition_users, staging.stg_lever__requisition_posting, and staging.stg_lever__requisition_offer. Use the contract's business logic: keep all base requisition rows, count posting and offer links by requisition_id, fill missing counts with 0, and add has_posting and has_offer flags.
The intended order is always staging → intermediate → marts, building on the untouched raw schema.
After the marts table is built, validate it directly:
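For example, a quick grain and completeness check (column names follow the contract's business logic described above):

```sql
SELECT
  COUNT(*)                                   AS row_count,
  COUNT(DISTINCT requisition_id)             AS distinct_requisition_ids,
  COUNT(*) FILTER (WHERE count_postings > 0) AS with_posting,
  COUNT(*) FILTER (WHERE count_offers > 0)   AS with_offer
FROM marts.lever__requisition_enhanced;
```

The row count and distinct requisition_id count should match, since the contract defines one row per requisition_id.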
Step 6: Submit a Daily Airflow Job¶
Ask the agent to operationalize a daily marts refresh. The Airflow quickstart environment already exposes the duckdb_dacomp_lever connection.
Submit a daily SQL job at 8 AM that rebuilds the same contract-derived chain:
Submit a daily SQL job named daily_lever_requisition_enhanced that refreshes staging.stg_lever__requisition, staging.stg_lever__user, staging.stg_lever__requisition_posting, staging.stg_lever__requisition_offer, intermediate.int_lever__requisition_users, and marts.lever__requisition_enhanced at 8am every day using the duckdb_dacomp_lever connection. Use the SQL generated and validated from docs/data_contract.yaml in the previous steps.
Then trigger it once for validation:
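A prompt along these lines works; the wording is illustrative:

Trigger the daily_lever_requisition_enhanced job once now and report the resulting run status.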
What to expect:
- a DAG file appears under `${DATUS_QUICKSTART_STACK}/airflow/dags`
- the same file is visible inside the Airflow container as `/opt/airflow/dags/<dag_id>.py`
- Airflow returns a `job_id`
- the job becomes visible in the Airflow UI
Step 7: Promote the Marts Table to the Superset Serving DB¶
The marts table above was built through the lever_duckdb datasource. Before
dashboard generation can create Superset assets, copy that table into the
BI-registered superset_serving Postgres datasource referenced by
dataset_db.datasource_ref. These names are Datus datasource names from
agent.yml, not physical database or catalog names inside DuckDB or Postgres.
Please copy the source table marts.lever__requisition_enhanced from the lever_duckdb datasource into the superset_serving datasource as public.lever__requisition_enhanced, replacing the target table if it already exists. Then verify the source and target row counts.
The transfer tool creates public.lever__requisition_enhanced from the source
result columns if it does not already exist.
After this step, the table exists in the same database Superset knows as
bi_database_name: examples.
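If you want to double-check outside the agent, you can query the serving database directly with the credentials from agent.yml (assuming psql is installed locally):

```bash
PGPASSWORD=superset psql -h 127.0.0.1 -p 5433 -U superset -d superset_examples \
  -c "SELECT COUNT(*) FROM public.lever__requisition_enhanced;"
```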
Step 8: Create a Superset Dashboard¶
Once the marts table exists in superset_serving, ask the agent to build the dashboard.
Please create a requisition operations dashboard in Superset from public.lever__requisition_enhanced. Include KPI tiles for total requisitions, open requisitions, requisitions with postings, requisitions with offers, and total requested headcount. Add charts by status, team, location, employment_status, count_postings, and count_offers.
Data preparation is a separate ETL / scheduler step. Dashboard generation expects the table or SQL dataset to already be available in the BI-registered database.
Step 9: Verify the End-to-End Result¶
You should now have:
- `staging`, `intermediate`, and `marts` schemas in `lever_workbench.duckdb`
- `marts.lever__requisition_enhanced` built from raw data through the staging and intermediate layers
- a daily Airflow job visible in the scheduler UI
- a Superset dashboard URL returned by the dashboard generation flow
SaaS Studio Tour Variant¶
The hosted SaaS tour uses the same Lever workflow, but it does not use the
local lever_workbench.duckdb file. Instead, the platform provides a shared
DuckDB + Iceberg lakehouse:
- shared read-only raw namespace: `lake.demo_raw`
- per-workspace writable namespace: `lake.ws_<workspace_id>`
- SaaS Airflow connection: `duckdb_lever_workbench`
Every user should run the tour in a separate workspace. The backend renders the
seeded docs/data_contract.yaml for that workspace, so outputs target
lake.ws_<workspace_id> while sources stay in lake.demo_raw. Prompts and SQL
should use fully qualified table names such as:
lake.demo_raw.requisition
lake.ws_<workspace_id>.stg_lever__requisition
lake.ws_<workspace_id>.int_lever__requisition_users
lake.ws_<workspace_id>.marts_lever__requisition_enhanced
Do not use unqualified physical schemas such as raw.*, staging.*,
intermediate.*, or marts.* in the SaaS tour. Those names are logical layers
only; the physical write boundary is the workspace namespace.
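For example, the Step 4 staging prompt in the SaaS tour would read along these lines (the workspace id is a placeholder):

Read the seeded data_contract.yaml and create lake.ws_<workspace_id>.stg_lever__requisition from lake.demo_raw.requisition, using the field design and source-to-target mapping from the contract.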
If a demo project or Airflow DAG was generated before the workspace-namespace
change, reset or recreate the demo project and regenerate the job so the DAG
uses lake.ws_<workspace_id> instead of an old hard-coded namespace.