Benchmark
Evaluate Datus Agent's performance and capabilities using industry-standard benchmarks. Run comprehensive tests against datasets like BIRD and Spider 2.0-Snow to assess accuracy, execution success rate, and query generation quality.
Overview
Datus Agent benchmark mode enables you to:
- Measure Accuracy: Evaluate how well the agent generates correct SQL from natural language
- Track Success Rates: Monitor query execution success across different database types
- Compare Results: Validate generated queries against expected outputs
- Identify Improvements: Discover areas for optimization and refinement
Quick Start with Docker
Get started quickly with pre-configured Docker containers that include benchmark datasets.
Step 1: Pull the Docker Image
Tip: Ensure Docker is installed and running on your system before proceeding.
Step 2: Launch the Docker Container
Tip: Demo datasets are preloaded, so you can explore Datus capabilities without additional setup.
Step 3: Run Benchmark Tests
Warning: Each task may take several minutes to complete. Running all tasks may take hours or days, depending on your system configuration.
BIRD Dataset
Info: Task IDs range from 0 to 1533.
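Because a full BIRD run can take a long time, it may help to split the task ID range into smaller batches and run them one at a time. The following is a minimal sketch, not part of Datus itself: the helper name `task_id_batches` and the batch size of 100 are hypothetical, and the range 0-1533 is assumed to be inclusive at both ends.

```python
from typing import Iterator

FIRST_TASK_ID = 0
LAST_TASK_ID = 1533  # assumption: the documented range is inclusive


def task_id_batches(first: int, last: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive ranges of task IDs covering first..last inclusive."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(first, last + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield range(start, min(start + batch_size, last + 1))


# Hypothetical batch size; pick whatever suits your time budget.
batches = list(task_id_batches(FIRST_TASK_ID, LAST_TASK_ID, 100))
for batch in batches:
    print(f"{batch[0]}-{batch[-1]} ({len(batch)} tasks)")
```

Each printed range can then be used to decide which task IDs to run in a given session.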
Spider 2.0-Snow Dataset
Info: You can find the task ID (instance ID) in the spider2-snow.jsonl file.
Note: Ensure you start the Docker container with the Snowflake environment parameters configured.
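To look up Spider 2.0-Snow task IDs, you can read spider2-snow.jsonl line by line. The sketch below assumes each line is a JSON object and that the ID is stored under a key named `instance_id`; that key name is an assumption based on the "instance ID" wording above, so adjust it if your file differs. The function name `list_instance_ids` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def list_instance_ids(jsonl_path: str, key: str = "instance_id") -> List[str]:
    """Return the value of `key` from every JSON object in a JSON Lines file.

    The default key name is an assumption, not confirmed by the Datus docs.
    """
    ids: List[str] = []
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if key in record:
                ids.append(str(record[key]))
    return ids


# Example usage (the path depends on where the file lives in your setup):
# for instance_id in list_instance_ids("spider2-snow.jsonl"):
#     print(instance_id)
```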
Step 4: Review Benchmark Results
Tip: Each benchmark run prints a performance summary with overall metrics and a per-task breakdown. The sample report below covers a single BIRD task (ID 14).
================================================================================
BENCHMARK ACCURACY EVALUATION REPORT
================================================================================
Generated Time: 2025-09-18 12:07:30
EXECUTIVE SUMMARY
----------------------------------------
Total tasks analyzed: 1
Execution success rate: 100.0%
Result comparison match rate: 100.0%
DETAILED STATISTICS
----------------------------------------
Total comparisons performed: 1
Successful matches: 1
Mismatches: 0
Comparison errors: 0
Empty result errors: 0
Mismatch rate: 0.0%
Error rate: 0.0%
TASK BREAKDOWN BY CATEGORY
----------------------------------------
Matched tasks (1):
14
Mismatched tasks (0):
None
Failed tasks (0):
None
ADDITIONAL STATISTICS
----------------------------------------
Overall success rate: 100.0%
Successful tasks: 1
Failed tasks: 0
Mismatched tasks: 0
================================================================================ [datus.utils.benchmark_utils]
2025-09-18 12:07:30 [info ]
Final Result: {'status': 'success', 'generated_time': '2025-09-18T12:07:30.497861', 'summary': {'total_files': 1, 'total_output_nodes': 1, 'total_output_success': 1, 'total_output_failure': 0, 'success_rate': 100.0, 'comparison_summary': {'total_comparisons': 1, 'successful_matches': 1, 'mismatches': 0, 'comparison_errors': 0, 'empty_result_errors': 0, 'match_rate': 100.0}}, 'task_ids': {'failed_task_ids': '', 'matched_task_ids': '14', 'mismatched_task_ids': '', 'empty_result_task_ids': ''}, 'details': {'14': {'total_nodes': 6, 'output_nodes': 1, 'output_success': 1, 'output_failure': 0, 'errors': [], 'node_types': {'start': 1, 'schema_linking': 1, 'generate_sql': 1, 'execute_sql': 1, 'reflect': 1, 'output': 1}, 'completion_time': 1758197249.7893646, 'status': 'completed', 'comparison_results': [{'task_id': '14', 'actual_file_exists': True, 'gold_file_exists': True, 'actual_path': 'output/bird_sqlite/14.csv', 'gold_path': '/app/benchmark/dev_20240627/gold/exec_result/14.csv', 'comparison': {'match': True, 'actual_file_exists': True, 'expected_file_exists': True, 'actual_shape': (5, 1), 'expected_shape': (5, 1), 'actual_preview': 'NCESSchool\n ----------\n 11707\n 4653\n 8283\n ...', 'expected_preview': 'NCESSchool\n ----------\n 11707\n 4653\n 8283\n ...', 'error': None}}]}}} [__main__]
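The percentages in the report can be cross-checked against the counts in the `summary` part of the Final Result dictionary. The sketch below is one plausible way to do that, not the actual Datus implementation: the helper name `summarize` is hypothetical, and the formulas (in particular, counting both comparison errors and empty-result errors toward the error rate) are assumptions. The input is the `summary` data copied from the sample log above.

```python
from typing import Any, Dict

# Summary portion of the sample "Final Result" shown above.
final_result: Dict[str, Any] = {
    "status": "success",
    "summary": {
        "total_files": 1,
        "total_output_nodes": 1,
        "total_output_success": 1,
        "total_output_failure": 0,
        "success_rate": 100.0,
        "comparison_summary": {
            "total_comparisons": 1,
            "successful_matches": 1,
            "mismatches": 0,
            "comparison_errors": 0,
            "empty_result_errors": 0,
            "match_rate": 100.0,
        },
    },
}


def pct(part: int, whole: int) -> float:
    """Percentage of part in whole; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(result: Dict[str, Any]) -> Dict[str, float]:
    """Recompute headline rates from raw counts (assumed formulas)."""
    summary = result["summary"]
    cmp_summary = summary["comparison_summary"]
    total = cmp_summary["total_comparisons"]
    errors = cmp_summary["comparison_errors"] + cmp_summary["empty_result_errors"]
    return {
        "execution_success_rate": pct(
            summary["total_output_success"], summary["total_output_nodes"]
        ),
        "match_rate": pct(cmp_summary["successful_matches"], total),
        "mismatch_rate": pct(cmp_summary["mismatches"], total),
        "error_rate": pct(errors, total),
    }


rates = summarize(final_result)
for name, value in rates.items():
    print(f"{name}: {value:.1f}%")
```

For the sample run, the recomputed execution success and match rates agree with the `success_rate` and `match_rate` fields reported in the log.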