
# RAG Pipeline Benchmarking Consolidation Guide

## Overview

This guide explains the consolidated benchmarking approach that leverages the existing comprehensive evaluation infrastructure rather than duplicating functionality.

## Existing Infrastructure Analysis

This project already contains a mature, comprehensive evaluation framework with:

### 📁 Evaluation Framework (`evaluation_framework/`)

- `evaluation_orchestrator.py` - end-to-end evaluation orchestration system
- `real_production_evaluation.py` - production-ready evaluation using real infrastructure
- `comparative_analysis_system.py` - comprehensive pipeline comparison
- `ragas_metrics_framework.py` - full RAGAS metrics implementation with statistical analysis
- `statistical_evaluation_methodology.py` - statistical testing and power analysis
- `biomedical_question_generator.py` - domain-specific question generation
- `visualization_engine.py` - advanced visualization and reporting

### 📁 Existing Benchmarking Components

- `benchmarks/` - performance benchmark results and infrastructure
- `tests/test_comprehensive_pipeline_validation_e2e_fixed.py` - pipeline validation with infrastructure assessment
- Multiple evaluation reports and results in `evaluation_framework/outputs/`

### 🏗️ Pipeline Infrastructure

- Mature pipeline implementations (BasicRAG, CRAG, GraphRAG, BasicRAGReranking)
- Production-ready connection management and configuration
- Real vector search against the IRIS database
- Comprehensive error handling and validation

## Consolidation Approach

Instead of creating duplicate functionality, the consolidation provides:

### 1. Unified Interface (`scripts/unified_rag_benchmark.py`)

```python
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    experiment_name="my_benchmark",
    num_queries=100,
    pipelines=["BasicRAGPipeline", "CRAGPipeline"],
)

benchmark = UnifiedRAGBenchmark(config)
results = benchmark.run_full_benchmark()
```

### 2. Simple Runner (`scripts/run_benchmark.py`)

```bash
# Quick test
python scripts/run_benchmark.py --quick

# Full benchmark
python scripts/run_benchmark.py --full

# Custom benchmark
python scripts/run_benchmark.py --num-queries 200 --pipelines BasicRAGPipeline CRAGPipeline
```

## Key Benefits of Consolidation

### ✅ Leverages Existing Mature Infrastructure

- Uses battle-tested evaluation components
- Avoids code duplication
- Maintains consistency with existing evaluation patterns

### ✅ Simplified Interface

- Clean, simple API for common benchmarking tasks
- Multiple convenience methods (quick, full, performance-only)
- Preserves access to advanced features when needed

### ✅ Production-Ready

- Uses real infrastructure (IRIS database, OpenAI LLMs)
- Comprehensive error handling and logging
- Statistical rigor and reproducible results

### ✅ Extensible

- Easy to add new benchmark types
- Configuration-driven approach
- Integrates seamlessly with existing workflows

## Usage Examples

### Quick Performance Test

```bash
python scripts/run_benchmark.py --quick
```

- 50 queries across 2 pipelines
- Fast execution for development and testing
- Performance metrics only

### Full Comprehensive Benchmark

```bash
python scripts/run_benchmark.py --full
```

- 100 queries across all 4 pipelines
- Performance + quality evaluation
- Complete RAGAS metrics
- Statistical analysis and reporting

### Performance-Only Benchmark

```bash
python scripts/run_benchmark.py --performance-only --num-queries 25
```

- Fast performance measurement
- Latency, throughput, and success rate
- Minimal infrastructure requirements

### Custom Benchmarks

```bash
# Custom pipeline selection
python scripts/run_benchmark.py --pipelines BasicRAGPipeline CRAGPipeline --num-queries 150

# Custom output directory
python scripts/run_benchmark.py --output-dir my_results --num-queries 200
```

## Advanced Usage

For advanced evaluation needs, use the existing evaluation framework directly:

```python
# Use existing comparative analysis system
from evaluation_framework.comparative_analysis_system import create_comparative_analysis_system

# Use existing RAGAS framework
from evaluation_framework.ragas_metrics_framework import create_biomedical_ragas_framework

# Use existing orchestrator for complex experiments
from evaluation_framework.evaluation_orchestrator import create_evaluation_orchestrator
```

## Output Structure

```text
outputs/
├── unified_benchmark/           # Unified benchmark results
│   ├── benchmark_results_*.json # JSON results
│   ├── benchmark_report_*.html  # HTML reports
│   └── benchmark_*.log          # Execution logs
├── evaluation_framework/        # Existing evaluation outputs
│   ├── production_evaluation/
│   ├── real_production_evaluation/
│   └── evaluation_results/
└── benchmarks/                  # Legacy benchmark results
    ├── performance_report_*.json
    └── ...
```

## Migration from Legacy Scripts

If you have existing benchmark scripts, migrate by:

1. Replacing custom benchmark logic with calls to `UnifiedRAGBenchmark`
2. Using the existing evaluation framework for advanced metrics
3. Using configuration classes instead of hardcoded parameters
4. Using the runner script for common use cases
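As a minimal sketch of step 3, a configuration class replaces hardcoded parameters. The class below is illustrative only, mirroring the `BenchmarkConfig` fields used in this guide; the real class lives in `scripts/unified_rag_benchmark.py` and may differ:

```python
from dataclasses import dataclass, field

# Illustrative stand-in mirroring the BenchmarkConfig fields shown in this
# guide (experiment_name, num_queries, pipelines) -- not the project's
# actual class.
@dataclass
class BenchmarkConfigSketch:
    experiment_name: str
    num_queries: int = 100
    pipelines: list = field(default_factory=lambda: ["BasicRAGPipeline", "CRAGPipeline"])


def migrate_legacy_params(legacy: dict) -> BenchmarkConfigSketch:
    """Map a legacy script's hardcoded parameters onto a config object."""
    return BenchmarkConfigSketch(
        experiment_name=legacy.get("name", "migrated_benchmark"),
        num_queries=legacy.get("queries", 100),
        pipelines=legacy.get("pipelines", ["BasicRAGPipeline"]),
    )


config = migrate_legacy_params({"name": "legacy_run", "queries": 50})
```

Once parameters live in one config object, switching between quick and full runs becomes a matter of constructing a different config rather than editing script internals.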

## Integration with Existing Workflows

The consolidated approach integrates with:

- **Make targets** - add benchmark targets that use the runner script
- **CI/CD pipelines** - use `--performance-only` for fast validation
- **Development workflow** - use `--quick` for rapid testing
- **Production evaluation** - use existing `evaluation_framework` tools
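For CI/CD, a `--performance-only` run can feed a simple pass/fail gate. The result keys and threshold values below are assumptions for illustration; adapt them to the metrics your runs actually emit:

```python
# Hypothetical CI gate over performance-only results (latency, throughput,
# success rate). Key names and thresholds are example assumptions.
def passes_ci_gate(results: dict,
                   max_p95_latency_s: float = 5.0,
                   min_success_rate: float = 0.95) -> bool:
    """Return True when a performance run is good enough to merge."""
    return (results["p95_latency_s"] <= max_p95_latency_s
            and results["success_rate"] >= min_success_rate)


ok = passes_ci_gate({"p95_latency_s": 2.1, "success_rate": 0.98})
```

A CI job would run the benchmark, parse the resulting JSON, call this gate, and exit non-zero on failure.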

## Best Practices

### 🎯 Choose the Right Tool

- Runner script for common benchmarking tasks
- Unified interface for programmatic access
- Existing evaluation framework for advanced analysis

### 📊 Performance vs. Quality

- Performance benchmarks are fast and suit development
- Quality benchmarks use RAGAS metrics and suit validation
- Full benchmarks provide comprehensive analysis

### 🔧 Configuration Management

- Use `BenchmarkConfig` for consistency
- Use environment variables for infrastructure settings
- Use configuration files for complex experiments
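A sketch of the environment-variable pattern for infrastructure settings follows. The variable names (`IRIS_HOST`, `IRIS_PORT`, `OPENAI_API_KEY`) are assumptions; check the project's `.env` for the names it actually uses:

```python
import os

# Hypothetical environment variable names -- verify against the project's
# .env before relying on them.
def load_infra_settings(env=None):
    """Read infrastructure settings from the environment, with defaults."""
    env = os.environ if env is None else env
    return {
        "iris_host": env.get("IRIS_HOST", "localhost"),
        "iris_port": int(env.get("IRIS_PORT", "1972")),
        "openai_api_key": env.get("OPENAI_API_KEY", ""),
    }


settings = load_infra_settings({"IRIS_PORT": "31972"})
```

Keeping infrastructure settings out of `BenchmarkConfig` lets the same experiment definition run against development and production databases unchanged.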

### 📈 Results Analysis

- JSON files for programmatic analysis
- HTML reports for human review
- Existing visualization tools for advanced charts
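For programmatic analysis, the `benchmark_results_*.json` files can be collected into a per-pipeline summary. The `results`/`pipeline`/`metrics` keys below are assumptions about the result schema; inspect an actual output file first:

```python
import json
from pathlib import Path

def summarize_results(results_dir):
    """Collect per-pipeline metrics from benchmark_results_*.json files.

    Assumes each file holds {"results": [{"pipeline": ..., "metrics": {...}}]};
    verify against a real output file before use.
    """
    summary = {}
    for path in sorted(Path(results_dir).glob("benchmark_results_*.json")):
        data = json.loads(path.read_text())
        for entry in data.get("results", []):
            summary[entry["pipeline"]] = entry.get("metrics", {})
    return summary
```

From here the summary dict can go straight into a DataFrame or the existing visualization tools.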

## Troubleshooting

### Common Issues

1. **Missing evaluation framework**

   ```bash
   # Ensure the evaluation_framework/ directory exists
   ls evaluation_framework/
   ```

2. **Import errors**

   ```bash
   # Check that the Python path includes the evaluation framework
   export PYTHONPATH="${PYTHONPATH}:./evaluation_framework"
   ```

3. **Infrastructure connectivity**

   - Verify `.env` configuration
   - Check IRIS database connectivity
   - Validate OpenAI API keys
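The three checks above can be rolled into a small preflight function. The `IRIS_HOST` variable name is an assumption, and a real IRIS connectivity test needs a live database; this sketch only validates configuration:

```python
import os
from pathlib import Path

def preflight(project_root, env=None):
    """Return a list of readable problems; an empty list means ready to run.

    Only checks configuration -- it does not open a database connection.
    IRIS_HOST is an assumed variable name; check the project's .env.
    """
    env = os.environ if env is None else env
    problems = []
    if not Path(project_root, "evaluation_framework").is_dir():
        problems.append("evaluation_framework/ directory not found")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not env.get("IRIS_HOST"):
        problems.append("IRIS_HOST is not set (check your .env)")
    return problems
```

Running this before a long benchmark turns a mid-run failure into an immediate, explicit error list.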

## Getting Help

- Check the evaluation framework documentation in `evaluation_framework/README.md`
- Review production evaluation logs in `evaluation_framework/outputs/`
- Use the `--verbose` flag for detailed logging
- Examine existing benchmark results in `benchmarks/` and `evaluation_framework/outputs/`

## Conclusion

The consolidation provides a clean, simple interface to the existing mature evaluation infrastructure while avoiding code duplication and staying consistent with established patterns. It leverages the significant investment already made in comprehensive evaluation capabilities and makes them more accessible for common benchmarking tasks.