This guide explains the consolidated benchmarking approach that leverages the existing comprehensive evaluation infrastructure rather than duplicating functionality.
This project already contains a mature, comprehensive evaluation framework with:
- `evaluation_orchestrator.py` - End-to-end evaluation orchestration system
- `real_production_evaluation.py` - Production-ready evaluation using real infrastructure
- `comparative_analysis_system.py` - Comprehensive pipeline comparison
- `ragas_metrics_framework.py` - Full RAGAS metrics implementation with statistical analysis
- `statistical_evaluation_methodology.py` - Statistical testing and power analysis
- `biomedical_question_generator.py` - Domain-specific question generation
- `visualization_engine.py` - Advanced visualization and reporting
- `benchmarks/` - Performance benchmark results and infrastructure
- `tests/test_comprehensive_pipeline_validation_e2e_fixed.py` - Pipeline validation with infrastructure assessment
- Multiple evaluation reports and results in `evaluation_framework/outputs/`
- Mature pipeline implementations (BasicRAG, CRAG, GraphRAG, BasicRAGReranking)
- Production-ready connection management and configuration
- Real vector search with IRIS database
- Comprehensive error handling and validation
Instead of creating duplicate functionality, the consolidation provides:
```python
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    experiment_name="my_benchmark",
    num_queries=100,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)

benchmark = UnifiedRAGBenchmark(config)
results = benchmark.run_full_benchmark()
```

The same benchmarks can be run from the command line with the runner script:

```bash
# Quick test
python scripts/run_benchmark.py --quick

# Full benchmark
python scripts/run_benchmark.py --full

# Custom benchmark
python scripts/run_benchmark.py --num-queries 200 --pipelines BasicRAGPipeline CRAGPipeline
```

Benefits of this approach:

- Uses battle-tested evaluation components
- Avoids code duplication
- Maintains consistency with existing evaluation patterns
- Clean, simple API for common benchmarking tasks
- Multiple convenience methods (quick, full, performance-only; see the sketch after this list)
- Preserves access to advanced features when needed
- Uses real infrastructure (IRIS database, OpenAI LLMs)
- Comprehensive error handling and logging
- Statistical rigor and reproducible results
- Easy to add new benchmark types
- Configuration-driven approach
- Integrates seamlessly with existing workflows
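The convenience methods mentioned above are sketched below for orientation only. The method names `run_quick_benchmark()` and `run_performance_benchmark()` are assumptions, not confirmed API; only `BenchmarkConfig` and `run_full_benchmark()` appear in the quick-start example, so check `unified_rag_benchmark` for the actual names.

```python
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

# BenchmarkConfig fields follow the quick-start example above.
config = BenchmarkConfig(
    experiment_name="dev_smoke_test",
    num_queries=50,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
benchmark = UnifiedRAGBenchmark(config)

# Hypothetical convenience methods; names are assumptions, verify against the module.
quick_results = benchmark.run_quick_benchmark()        # assumed: fast, performance-metrics-only run
perf_results = benchmark.run_performance_benchmark()   # assumed: latency/throughput measurement only
```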
```bash
python scripts/run_benchmark.py --quick
```

- 50 queries across 2 pipelines
- Fast execution for development/testing
- Performance metrics only
```bash
python scripts/run_benchmark.py --full
```

- 100 queries across all 4 pipelines
- Performance + quality evaluation
- Complete RAGAS metrics
- Statistical analysis and reporting
```bash
python scripts/run_benchmark.py --performance-only --num-queries 25
```

- Fast performance measurement
- Latency, throughput, success rate
- Minimal infrastructure requirements
```bash
# Custom pipeline selection
python scripts/run_benchmark.py --pipelines BasicRAGPipeline CRAGPipeline --num-queries 150

# Custom output directory
python scripts/run_benchmark.py --output-dir my_results --num-queries 200
```

For advanced evaluation needs, use the existing evaluation framework directly:
```python
# Use existing comparative analysis system
from evaluation_framework.comparative_analysis_system import create_comparative_analysis_system

# Use existing RAGAS framework
from evaluation_framework.ragas_metrics_framework import create_biomedical_ragas_framework

# Use existing orchestrator for complex experiments
from evaluation_framework.evaluation_orchestrator import create_evaluation_orchestrator
```
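A minimal sketch of wiring these factories together, assuming each can be called with no arguments; the real factory signatures may require configuration objects, so treat this as orientation rather than a working recipe.

```python
from evaluation_framework.comparative_analysis_system import create_comparative_analysis_system
from evaluation_framework.ragas_metrics_framework import create_biomedical_ragas_framework
from evaluation_framework.evaluation_orchestrator import create_evaluation_orchestrator

# Assumed zero-argument factories; check each module for the actual signatures.
ragas_framework = create_biomedical_ragas_framework()
comparative_analysis = create_comparative_analysis_system()

# The orchestrator is the end-to-end entry point, so complex experiments
# would typically be configured and launched through it.
orchestrator = create_evaluation_orchestrator()
```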
Benchmark and evaluation outputs are organized as follows:

```
outputs/
├── unified_benchmark/                  # Unified benchmark results
│   ├── benchmark_results_*.json        # JSON results
│   ├── benchmark_report_*.html         # HTML reports
│   └── benchmark_*.log                 # Execution logs
├── evaluation_framework/               # Existing evaluation outputs
│   ├── production_evaluation/
│   ├── real_production_evaluation/
│   └── evaluation_results/
└── benchmarks/                         # Legacy benchmark results
    ├── performance_report_*.json
    └── ...
```
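For programmatic analysis of the JSON results, a small loader like the following needs nothing beyond the standard library; it assumes only the `outputs/unified_benchmark/benchmark_results_*.json` naming shown above and makes no assumptions about the schema inside each file.

```python
import glob
import json
import os

# Locate the most recent unified benchmark result file, using the
# outputs/unified_benchmark/benchmark_results_*.json layout shown above.
result_files = sorted(
    glob.glob("outputs/unified_benchmark/benchmark_results_*.json"),
    key=os.path.getmtime,
)
if not result_files:
    raise SystemExit("No benchmark results found; run scripts/run_benchmark.py first.")

with open(result_files[-1]) as f:
    results = json.load(f)

# The internal schema is not documented here, so just inspect the top level.
print(f"Loaded {result_files[-1]}")
if isinstance(results, dict):
    print("Top-level keys:", sorted(results))
else:
    print("Top-level type:", type(results).__name__)
```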
If you have existing benchmark scripts, migrate by:
- Replace custom benchmark logic with calls to `UnifiedRAGBenchmark`
- Use the existing evaluation framework for advanced metrics
- Leverage configuration classes instead of hardcoded parameters
- Use the runner script for common use cases
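As an illustration of the first migration step, a hand-rolled benchmark loop (shown schematically in the comments) collapses into a few lines against the unified interface:

```python
# Before (typical custom script, shown schematically):
#     for query in load_queries():
#         for pipeline in pipelines:
#             time_and_record(pipeline, query)
#     write_custom_report(...)

# After: delegate orchestration, metrics, and reporting to the unified interface.
from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    experiment_name="migrated_benchmark",
    num_queries=100,
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
results = UnifiedRAGBenchmark(config).run_full_benchmark()
```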
The consolidated approach integrates with:
- Make targets - Add benchmark targets that use the runner script
- CI/CD pipelines - Use `--performance-only` for fast validation
- Development workflow - Use `--quick` for rapid testing
- Production evaluation - Use existing `evaluation_framework` tools
The consolidation offers three levels of access:

- Runner script for common benchmarking tasks
- Unified interface for programmatic access
- Existing evaluation framework for advanced analysis
When choosing a benchmark type:

- Performance benchmarks are fast, good for development
- Quality benchmarks use RAGAS metrics, better for validation
- Full benchmarks provide comprehensive analysis
For configuration:

- Use `BenchmarkConfig` for consistency
- Environment variables for infrastructure settings
- Configuration files for complex experiments
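A sketch combining the two sources: infrastructure settings read from the environment (apart from `OPENAI_API_KEY`, the variable names here are placeholders for whatever your `.env` actually defines) and experiment settings carried in `BenchmarkConfig` rather than hardcoded constants.

```python
import os

from unified_rag_benchmark import UnifiedRAGBenchmark, BenchmarkConfig

# Fail fast if infrastructure settings are missing. OPENAI_API_KEY is standard;
# the IRIS variable name below is a placeholder for your actual .env entry.
required_env = ["OPENAI_API_KEY", "IRIS_CONNECTION_STRING"]
missing = [name for name in required_env if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")

# Experiment settings live in BenchmarkConfig instead of hardcoded constants.
config = BenchmarkConfig(
    experiment_name=os.environ.get("BENCHMARK_NAME", "nightly_benchmark"),
    num_queries=int(os.environ.get("BENCHMARK_NUM_QUERIES", "100")),
    pipelines=['BasicRAGPipeline', 'CRAGPipeline']
)
results = UnifiedRAGBenchmark(config).run_full_benchmark()
```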
For results and reports:

- JSON files for programmatic analysis
- HTML reports for human review
- Existing visualization tools for advanced charts
Common issues:

- **Missing evaluation framework**

  ```bash
  # Ensure evaluation_framework/ directory exists
  ls evaluation_framework/
  ```

- **Import errors**

  ```bash
  # Check Python path includes the evaluation framework
  export PYTHONPATH="${PYTHONPATH}:./evaluation_framework"
  ```

- **Infrastructure connectivity**

  ```bash
  # Verify .env configuration
  # Check IRIS database connectivity
  # Validate OpenAI API keys
  ```
If problems persist:

- Check the existing evaluation framework documentation in `evaluation_framework/README.md`
- Review production evaluation logs in `evaluation_framework/outputs/`
- Use the `--verbose` flag for detailed logging
- Examine existing benchmark results in `benchmarks/` and `evaluation_framework/outputs/`
The consolidated approach provides a clean, simple interface to the existing mature evaluation infrastructure while avoiding code duplication and maintaining consistency with established patterns. It leverages the significant investment already made in comprehensive evaluation capabilities while making those capabilities more accessible for common benchmarking tasks.