Commit c362cd1

djukicmilica, ventselartur, AleksanderGladkov, haoranpb, MilicaDjukic
authored
Add evaluation analysis scripts (#605)
Co-authored-by: ventselartur <ventselartur@microsoft.com>
Co-authored-by: Aleksandr Gladkov <scorpio.szk@gmail.com>
Co-authored-by: Haoran Sun (Business Central) <haoransun@microsoft.com>
Co-authored-by: MilicaDjukic <milicadjukic@microsoft.com>
1 parent 28d0916 commit c362cd1

5 files changed

Lines changed: 1285 additions & 0 deletions

.github/copilot-instructions.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ This is a benchmark for evaluating coding agents on real-world Business Central
 - **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
 - **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
 - **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
+- **Tools** (`tools/`): Standalone scripts for downloading and analyzing GitHub Actions artifacts
 - **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
 - **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
 - **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results

tools/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Tools
+
+Standalone scripts for downloading and analyzing GitHub Actions artifacts.
+
+## `altest/`
+
+Scripts for analyzing AL test results from BC-Bench GitHub Actions runs:
+
+- **`Get-WorkflowSummary.ps1`** — Fetches workflow run summaries from GitHub Actions, downloads run artifacts, and extracts JSONL result files (even from nested zips).
+- **`bcbench_analyze_artifacts.py`** — Extracts, collects, and summarizes test results from downloaded artifact zips or pre-extracted folders. Outputs failure rankings, error variations, and extracted test code.
+- **`group_errors_from_summary.py`** — Groups error messages from `errors_summary.csv` into high-level categories for easier triage.
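
The README says `group_errors_from_summary.py` groups error messages into high-level categories, but this page does not show the script itself. The sketch below is one way such grouping could work, not the committed implementation: the category names, the regex rules, and the helper names `categorize` and `group_errors` are all hypothetical. It takes plain strings because the column layout of `errors_summary.csv` is not given here.

```python
import re
from collections import Counter

# Hypothetical rules, checked in order: (category name, pattern).
# None of these categories are taken from the commit.
CATEGORY_RULES = [
    ("compile_error", re.compile(r"\bAL\d{4}\b|compil", re.IGNORECASE)),
    ("assertion_failure", re.compile(r"assert", re.IGNORECASE)),
    ("timeout", re.compile(r"timed? ?out", re.IGNORECASE)),
]


def categorize(message: str) -> str:
    """Return the first matching category name, or "other" if none match."""
    for name, pattern in CATEGORY_RULES:
        if pattern.search(message):
            return name
    return "other"


def group_errors(messages) -> Counter:
    """Count how many messages fall into each category."""
    return Counter(categorize(m) for m in messages)


# Illustrative messages only; not real benchmark output.
sample = [
    "error AL0118: The name 'Foo' does not exist in the current context",
    "Assert.AreEqual failed. Expected:<10>. Actual:<12>.",
    "Test execution timed out after 600 seconds",
    "Unhandled exception in codeunit",
]
counts = group_errors(sample)
```

Ordered first-match rules keep the triage predictable: a message that matches several patterns always lands in the earliest category, and anything unmatched falls through to `other` instead of being dropped.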