Storage¶
Configure storage settings for Datus Agent's embedding models and vector databases. The storage configuration manages how metadata, documents, and metrics are embedded and stored for efficient retrieval during schema linking and knowledge search.
Storage Configuration Structure¶
The storage configuration defines the base path for the vector databases and the embedding model used for each data type:
storage:
  base_path: data  # RAG storage base path
  embedding_device_type: cpu  # Device type for embedding models
  # Database metadata and sample data embedding
  database:
    registry_name: openai  # Embedding provider
    model_name: text-embedding-v3-small
    dim_size: 1024
    batch_size: 10
    target_model: openai
  # Document embedding configuration
  document:
    model_name: all-MiniLM-L6-v2  # Local embedding model
    dim_size: 384
  # Metrics embedding configuration
  metric:
    model_name: all-MiniLM-L6-v2  # Local embedding model
    dim_size: 384
Base Configuration¶
Storage Path¶
Data is stored under base_path, with one directory per namespace. With base_path: data, the final data paths are:
- data/datus_db_<namespace_name> for each configured namespace
- Example: data/datus_db_snowflake, data/datus_db_local_sqlite
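The naming pattern above can be sketched in a few lines of Python. This is only an illustration of the path convention, not Datus Agent's own code; the helper name `namespace_db_path` is hypothetical.

```python
from pathlib import PurePosixPath


def namespace_db_path(base_path: str, namespace: str) -> str:
    """Return the storage directory for one namespace (hypothetical helper)."""
    # Pattern taken from the list above: <base_path>/datus_db_<namespace_name>
    return str(PurePosixPath(base_path) / f"datus_db_{namespace}")


for ns in ("snowflake", "local_sqlite"):
    print(namespace_db_path("data", ns))
# data/datus_db_snowflake
# data/datus_db_local_sqlite
```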
Device Configuration¶
Device Options:
- cpu: Force CPU usage for embedding models
- cuda: Use NVIDIA GPU (if available)
- mps: Use Apple Metal Performance Shaders (Apple Silicon)
- auto: Automatically select the best available device
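As a rough illustration of how these options could be resolved to a concrete device, consider the sketch below. The function `resolve_device`, the preference order used for auto (cuda, then mps, then cpu), and the fallback to cpu when a requested accelerator is missing are all assumptions made for this example; Datus Agent's actual behavior may differ. Availability is passed in as plain booleans so that the sketch does not depend on any ML framework.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map an embedding_device_type value to a concrete device (hypothetical helper)."""
    if requested not in {"cpu", "cuda", "mps", "auto"}:
        raise ValueError(f"unknown embedding_device_type: {requested!r}")

    if requested == "auto":
        # Assumed preference order: NVIDIA GPU, then Apple Silicon, then CPU.
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
        return "cpu"

    # Assumed fallback: a requested accelerator that is not present degrades to CPU.
    if requested == "cuda" and not cuda_available:
        return "cpu"
    if requested == "mps" and not mps_available:
        return "cpu"
    return requested


print(resolve_device("auto", cuda_available=False, mps_available=True))  # mps
```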
Embedding Model Configuration¶
Database Embeddings¶
For table metadata, schema information, and sample data:
database:
  registry_name: openai  # openai or sentence-transformers
  model_name: text-embedding-v3-small
  dim_size: 1024
  batch_size: 10
  target_model: openai  # Reference to agent.models
Configuration Parameters:
- registry_name: Embedding provider type (openai or sentence-transformers)
- model_name: Specific embedding model to use
- dim_size: Output embedding dimension size
- batch_size: Number of texts to process in each batch
- target_model: LLM model key from agent.models (for OpenAI embeddings)
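To make the effect of batch_size concrete, here is a minimal sketch of how a list of texts might be split into batches before being sent to an embedding model. The helper `iter_batches` is hypothetical and is not part of Datus Agent.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


texts = [f"table_{i}" for i in range(25)]
print([len(batch) for batch in iter_batches(texts, batch_size=10)])  # [10, 10, 5]
```

With batch_size: 10, 25 texts would be embedded in three requests. A smaller value means more requests, each with a smaller payload.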
Document Embeddings¶
For knowledge base documents and extended documentation:
document:
  model_name: all-MiniLM-L6-v2  # Lightweight model (~100MB)
  dim_size: 384  # Smaller dimension for efficiency
Metric Embeddings¶
For business metrics and KPI definitions:
metric:
  model_name: all-MiniLM-L6-v2  # Consistent with document embeddings
  dim_size: 384  # Matching dimension size
Embedding Provider Options¶
OpenAI Embeddings (Cloud)¶
For high-quality embeddings through a cloud API:
database:
  registry_name: openai
  model_name: text-embedding-v3-small  # or text-embedding-v3-large
  dim_size: 1536  # 1536 for v3-small, 3072 for v3-large
  batch_size: 10  # Adjust based on rate limits
  target_model: openai  # Must reference a valid model in agent.models
Environment Variables
Ensure your OpenAI API key is configured:
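A quick way to check this from Python is sketched below. It assumes the key is supplied through the conventional `OPENAI_API_KEY` environment variable. Your agent.models entry may read the key from a different variable, so adjust the name accordingly. The helper `has_openai_key` is hypothetical.

```python
import os
from typing import Mapping, Optional


def has_openai_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a non-empty OPENAI_API_KEY is present (assumed variable name)."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; OpenAI embedding requests will likely fail.")
```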
Sentence Transformers (Local)¶
For local embedding models without external API calls:
database:
  registry_name: sentence-transformers  # Default local provider
  model_name: all-MiniLM-L6-v2  # Lightweight option
  dim_size: 384
Alternative Local Models
Consider these high-quality alternatives:
- intfloat/multilingual-e5-large-instruct: 1.2GB, 1024 dimensions, multilingual
- BAAI/bge-large-en-v1.5: 1.2GB, 1024 dimensions (English optimized)
- BAAI/bge-large-zh-v1.5: 1.2GB, 1024 dimensions (Chinese optimized)
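Because dim_size has to match the dimension the chosen model actually produces, a small sanity check can catch mismatches before any data is embedded. The sketch below uses only the local models and dimensions listed in this section. The names `KNOWN_DIMS` and `check_dim_sizes` are hypothetical, and the check operates on a plain dictionary that mirrors the storage section rather than on Datus Agent's own configuration objects.

```python
from typing import Dict, List

# Output dimensions for the local models mentioned in this section.
KNOWN_DIMS: Dict[str, int] = {
    "all-MiniLM-L6-v2": 384,
    "intfloat/multilingual-e5-large-instruct": 1024,
    "BAAI/bge-large-en-v1.5": 1024,
    "BAAI/bge-large-zh-v1.5": 1024,
}


def check_dim_sizes(storage: dict) -> List[str]:
    """Return one message per section whose dim_size disagrees with KNOWN_DIMS."""
    problems = []
    for section in ("database", "document", "metric"):
        cfg = storage.get(section) or {}
        model = cfg.get("model_name")
        expected = KNOWN_DIMS.get(model)
        # Models missing from the table are skipped rather than guessed at.
        if expected is not None and cfg.get("dim_size") != expected:
            problems.append(
                f"{section}: {model} outputs {expected} dimensions, "
                f"but dim_size is {cfg.get('dim_size')}"
            )
    return problems


storage = {
    "document": {"model_name": "all-MiniLM-L6-v2", "dim_size": 384},
    "metric": {"model_name": "BAAI/bge-large-en-v1.5", "dim_size": 384},
}
print(check_dim_sizes(storage))
```

In this example only the metric section is reported, since BAAI/bge-large-en-v1.5 is listed above with 1024 dimensions.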
Model Selection Guidelines¶
Optimized for speed and minimal resource usage:
Good balance of speed and retrieval quality:
Complete Configuration Examples¶
Fast local embeddings optimized for development:
storage:
  base_path: data
  embedding_device_type: auto  # Use best available device
  # Fast local embeddings for all data types
  database:
    registry_name: sentence-transformers
    model_name: all-MiniLM-L6-v2
    dim_size: 384
  document:
    model_name: all-MiniLM-L6-v2
    dim_size: 384
  metric:
    model_name: all-MiniLM-L6-v2
    dim_size: 384
Best For
- Development environments
- Resource-constrained systems
- Offline deployment requirements
Combines cloud quality for critical data with local efficiency:
storage:
  base_path: data
  embedding_device_type: cpu
  # High-quality cloud embeddings for database metadata
  database:
    registry_name: openai
    model_name: text-embedding-v3-small
    dim_size: 1536
    batch_size: 10
    target_model: openai
  # Local embeddings for documents and metrics
  document:
    model_name: intfloat/multilingual-e5-large-instruct
    dim_size: 1024
  metric:
    model_name: intfloat/multilingual-e5-large-instruct
    dim_size: 1024
Best For
- Production environments with mixed requirements
- Cost-conscious deployments
- Balancing quality and performance
Maximum quality embeddings for production systems:
storage:
  base_path: /opt/datus/embeddings
  embedding_device_type: cuda  # Use GPU acceleration
  # High-quality embeddings across all data types
  database:
    registry_name: openai
    model_name: text-embedding-v3-large
    dim_size: 3072
    batch_size: 5  # Smaller batches for large model
    target_model: openai
  document:
    model_name: BAAI/bge-large-en-v1.5
    dim_size: 1024
  metric:
    model_name: BAAI/bge-large-en-v1.5
    dim_size: 1024
Best For
- Enterprise production environments
- High-accuracy requirements
- Systems with GPU acceleration
Integration with Other Components¶
Metrics Configuration¶
The storage configuration works with the metrics section to embed business metrics: