This guide explains the consolidated benchmarking approach that leverages the existing comprehensive evaluation infrastructure rather than duplicating functionality.
This project already contains a mature, comprehensive evaluation framework with:
- `evaluation_orchestrator.py` - End-to-end evaluation orchestration system
- `real_production_evaluation.py` - Production-ready evaluation using real infrastructure
- `comparative_analysis_system.py` - Comprehensive pipeline comparison
- `ragas_metrics_framework.py` - Full RAGAS metrics implementation with statistical analysis
- `statistical_evaluation_methodology.py` - Statistical testing and power analysis
- `biomedical_question_generator.py` - Domain-specific question generation
- `visualization_engine.py` - Advanced visualization and reporting
- `benchmarks/` - Performance benchmark results and infrastructure
- `tests/test_comprehensive_pipeline_validation_e2e_fixed.py` - Pipeline validation with infrastructure assessment
- Multiple evaluation reports and results in `evaluation_framework/outputs/`
- Mature pipeline implementations (BasicRAG, CRAG, GraphRAG, BasicRAGReranking)
- Production-ready connection management and configuration
- Real vector search with IRIS database
- Comprehensive error handling and validation
Instead of creating duplicate functionality, the consolidation provides:
```python
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    experiment_name="my_benchmark",
    num_queries=100,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)

benchmark = UnifiedRAGBenchmark(config)
results = benchmark.run_full_benchmark()
```

The same benchmarks can be run from the command line with the runner script:

```bash
# Quick test
python scripts/run_benchmark.py --quick

# Full benchmark
python scripts/run_benchmark.py --full

# Custom benchmark
python scripts/run_benchmark.py --num-queries 200 --pipelines BasicRAGPipeline CRAGPipeline
```

Benefits of this approach:

- Uses battle-tested evaluation components
- Avoids code duplication
- Maintains consistency with existing evaluation patterns
- Clean, simple API for common benchmarking tasks
- Multiple convenience methods (quick, full, performance-only; see the sketch after this list)
- Preserves access to advanced features when needed
- Uses real infrastructure (IRIS database, OpenAI LLMs)
- Comprehensive error handling and logging
- Statistical rigor and reproducible results
- Easy to add new benchmark types
- Configuration-driven approach
- Integrates seamlessly with existing workflows
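The convenience methods mentioned above are sketched below for orientation only. The method names `run_quick_benchmark()` and `run_performance_benchmark()` are assumptions, not confirmed API; only `BenchmarkConfig` and `run_full_benchmark()` appear in the quick-start example, so check `unified_rag_benchmark` for the actual names.

```python
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

# BenchmarkConfig fields follow the quick-start example above.
config = BenchmarkConfig(
    experiment_name="dev_smoke_test",
    num_queries=50,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
benchmark = UnifiedRAGBenchmark(config)

# Hypothetical convenience methods; names are assumptions, verify against the module.
quick_results = benchmark.run_quick_benchmark()        # assumed: fast, performance-metrics-only run
perf_results = benchmark.run_performance_benchmark()   # assumed: latency/throughput measurement only
```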
```bash
python scripts/run_benchmark.py --quick
```

- 50 queries across 2 pipelines
- Fast execution for development/testing
- Performance metrics only
```bash
python scripts/run_benchmark.py --full
```

- 100 queries across all 4 pipelines
- Performance + quality evaluation
- Complete RAGAS metrics
- Statistical analysis and reporting
```bash
python scripts/run_benchmark.py --performance-only --num-queries 25
```

- Fast performance measurement
- Latency, throughput, success rate
- Minimal infrastructure requirements
```bash
# Custom pipeline selection
python scripts/run_benchmark.py --pipelines BasicRAGPipeline CRAGPipeline --num-queries 150

# Custom output directory
python scripts/run_benchmark.py --output-dir my_results --num-queries 200
```

For advanced evaluation needs, use the existing evaluation framework directly:
```python
# Use existing comparative analysis system
from evaluation_framework.comparative_analysis_system import create_comparative_analysis_system

# Use existing RAGAS framework
from evaluation_framework.ragas_metrics_framework import create_biomedical_ragas_framework

# Use existing orchestrator for complex experiments
from evaluation_framework.evaluation_orchestrator import create_evaluation_orchestrator
```
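A minimal sketch of wiring these factories together, assuming each can be called with no arguments; the real factory signatures may require configuration objects, so treat this as orientation rather than a working recipe.

```python
from evaluation_framework.comparative_analysis_system import create_comparative_analysis_system
from evaluation_framework.ragas_metrics_framework import create_biomedical_ragas_framework
from evaluation_framework.evaluation_orchestrator import create_evaluation_orchestrator

# Assumed zero-argument factories; check each module for the actual signatures.
ragas_framework = create_biomedical_ragas_framework()
comparative_analysis = create_comparative_analysis_system()

# The orchestrator is the end-to-end entry point, so complex experiments
# would typically be configured and launched through it.
orchestrator = create_evaluation_orchestrator()
```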
Benchmark and evaluation outputs are organized as follows:

```
outputs/
├── unified_benchmark/                  # Unified benchmark results
│   ├── benchmark_results_*.json        # JSON results
│   ├── benchmark_report_*.html         # HTML reports
│   └── benchmark_*.log                 # Execution logs
├── evaluation_framework/               # Existing evaluation outputs
│   ├── production_evaluation/
│   ├── real_production_evaluation/
│   └── evaluation_results/
└── benchmarks/                         # Legacy benchmark results
    ├── performance_report_*.json
    └── ...
```
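For programmatic analysis of the JSON results, a small loader like the following needs nothing beyond the standard library; it assumes only the `outputs/unified_benchmark/benchmark_results_*.json` naming shown above and makes no assumptions about the schema inside each file.

```python
import glob
import json
import os

# Locate the most recent unified benchmark result file, using the
# outputs/unified_benchmark/benchmark_results_*.json layout shown above.
result_files = sorted(
    glob.glob("outputs/unified_benchmark/benchmark_results_*.json"),
    key=os.path.getmtime,
)
if not result_files:
    raise SystemExit("No benchmark results found; run scripts/run_benchmark.py first.")

with open(result_files[-1]) as f:
    results = json.load(f)

# The internal schema is not documented here, so just inspect the top level.
print(f"Loaded {result_files[-1]}")
if isinstance(results, dict):
    print("Top-level keys:", sorted(results))
else:
    print("Top-level type:", type(results).__name__)
```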
If you have existing benchmark scripts, migrate by:
- Replace custom benchmark logic with calls to `UnifiedRAGBenchmark`
- Use the existing evaluation framework for advanced metrics
- Leverage configuration classes instead of hardcoded parameters
- Use the runner script for common use cases
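As an illustration of the first migration step, a hand-rolled benchmark loop (shown schematically in the comments) collapses into a few lines against the unified interface:

```python
# Before (typical custom script, shown schematically):
#     for query in load_queries():
#         for pipeline in pipelines:
#             time_and_record(pipeline, query)
#     write_custom_report(...)

# After: delegate orchestration, metrics, and reporting to the unified interface.
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    experiment_name="migrated_benchmark",
    num_queries=100,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
results = UnifiedRAGBenchmark(config).run_full_benchmark()
```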
The consolidated approach integrates with:
- Make targets - Add benchmark targets that use the runner script
- CI/CD pipelines - Use `--performance-only` for fast validation
- Development workflow - Use `--quick` for rapid testing
- Production evaluation - Use existing `evaluation_framework` tools
The consolidation offers three levels of access:

- Runner script for common benchmarking tasks
- Unified interface for programmatic access
- Existing evaluation framework for advanced analysis
When choosing a benchmark type:

- Performance benchmarks are fast, good for development
- Quality benchmarks use RAGAS metrics, better for validation
- Full benchmarks provide comprehensive analysis
For configuration:

- Use `BenchmarkConfig` for consistency
- Environment variables for infrastructure settings
- Configuration files for complex experiments
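A sketch combining the two sources: infrastructure settings read from the environment (apart from `OPENAI_API_KEY`, the variable names here are placeholders for whatever your `.env` actually defines) and experiment settings carried in `BenchmarkConfig` rather than hardcoded constants.

```python
import os

from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

# Fail fast if infrastructure settings are missing. OPENAI_API_KEY is standard;
# the IRIS variable name below is a placeholder for your actual .env entry.
required_env = ["OPENAI_API_KEY", "IRIS_CONNECTION_STRING"]
missing = [name for name in required_env if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")

# Experiment settings live in BenchmarkConfig instead of hardcoded constants.
config = BenchmarkConfig(
    experiment_name=os.environ.get("BENCHMARK_NAME", "nightly_benchmark"),
    num_queries=int(os.environ.get("BENCHMARK_NUM_QUERIES", "100")),
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
results = UnifiedRAGBenchmark(config).run_full_benchmark()
```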
For results and reports:

- JSON files for programmatic analysis
- HTML reports for human review
- Existing visualization tools for advanced charts
Common issues:

- **Missing evaluation framework**

  ```bash
  # Ensure evaluation_framework/ directory exists
  ls evaluation_framework/
  ```

- **Import errors**

  ```bash
  # Check Python path includes the evaluation framework
  export PYTHONPATH="${PYTHONPATH}:./evaluation_framework"
  ```

- **Infrastructure connectivity**

  ```bash
  # Verify .env configuration
  # Check IRIS database connectivity
  # Validate OpenAI API keys
  ```
If problems persist:

- Check the existing evaluation framework documentation in `evaluation_framework/README.md`
- Review production evaluation logs in `evaluation_framework/outputs/`
- Use the `--verbose` flag for detailed logging
- Examine existing benchmark results in `benchmarks/` and `evaluation_framework/outputs/`
The consolidated approach provides a clean, simple interface to the existing mature evaluation infrastructure while avoiding code duplication and maintaining consistency with established patterns. It leverages the significant investment already made in comprehensive evaluation capabilities while making those capabilities more accessible for common benchmarking tasks.