Commit 6d77920

MilicaDjukic committed
Merge branch 'main' into private/milicadjukic/ALTestAgent
2 parents c97d967 + c362cd1 commit 6d77920

48 files changed

Lines changed: 3627 additions & 1113 deletions

.github/actions/setup-bc-container/action.yml

Lines changed: 7 additions & 1 deletion
@@ -14,6 +14,10 @@ inputs:
   github-token:
     description: GitHub token for accessing public repositories
     required: true
+  skip-container:
+    description: Skip BC container setup (only clone repository)
+    required: false
+    default: "false"

 outputs:
   repo_path:
@@ -24,6 +28,7 @@ runs:
   using: composite
   steps:
     - name: Generate BC container name and credentials
+      if: inputs.skip-container != 'true'
       run: |
         # Generate a 32-character random password using Get-Random
         # The password is short-lived and only used for the duration of the workflow
@@ -38,6 +43,7 @@ runs:
       shell: pwsh

     - name: Install BcContainerHelper module
+      if: inputs.skip-container != 'true'
       run: Install-Module -Name BcContainerHelper -Force -AllowClobber -AllowPrerelease
       shell: pwsh

@@ -59,5 +65,5 @@ runs:
         $env:ADO_TOKEN = az account get-access-token --resource "499b84ac-1321-427f-aa17-267ca6975798" --query accessToken -o tsv
         Write-Output "::add-mask::$env:ADO_TOKEN"

-        .\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}"
+        .\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
       shell: pwsh
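
The new `skip-container` input can be exercised from a calling workflow roughly as follows. This is a minimal sketch, not part of the commit: the workflow name, job name, step ids, and the `instance-id` value are hypothetical placeholders, and the action may need other inputs or prior login steps that this diff does not show. Only `instance-id`, `github-token`, `skip-container`, and the `repo_path` output are taken from the diff above.

```yaml
# Hypothetical caller workflow; names and values are illustrative assumptions.
name: clone-only-example
on: workflow_dispatch

jobs:
  clone-only:
    runs-on: [GitHub-BCBench]
    steps:
      - uses: actions/checkout@v5

      - name: Clone repository without a BC container
        id: setup
        uses: ./.github/actions/setup-bc-container
        with:
          instance-id: "example-instance-id"  # placeholder, not a real dataset entry
          github-token: ${{ secrets.GITHUB_TOKEN }}
          skip-container: "true"  # the action compares this input against the string 'true'

      - name: Show where the repository was cloned
        run: Write-Output "${{ steps.setup.outputs.repo_path }}"
        shell: pwsh
```

Passing the value as the quoted string `"true"` matches the action's `inputs.skip-container != 'true'` conditions, which compare against a string.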

.github/actions/setup-python-uv/action.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ runs:
   using: composite
   steps:
     - name: Install uv
-      uses: astral-sh/setup-uv@v7
+      uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57
       with:
         enable-cache: true

.github/copilot-instructions.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,7 @@ This is a benchmark for evaluating coding agents on real-world Business Central
 - **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
 - **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
 - **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
+- **Tools** (`tools/`): Standalone scripts for downloading and analyzing GitHub Actions artifacts
 - **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
 - **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
 - **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results
@@ -14,6 +15,9 @@ This is a benchmark for evaluating coding agents on real-world Business Central
 - Uses `uv` for dependency management: e.g. `uv add <package>` to add packages, `uv run <command>` to run commands
 - Uses `pre-commit` for code quality checks (ruff linting/formatting, trailing whitespace, etc.)

+## Categories
+BC-Bench is category-based and designed to grow over time. It currently has two categories, `bug-fix` and `test-generation`. They share the same dataset tasks and execution-based setup, but use different prompts, expected outputs, and evaluation pipelines. Future categories such as `code-review` can be added within the same overall benchmark structure, though they may require different inputs, setup, or evaluation methods.
+
 ## Coding Patterns and Guidelines

 - Prefer strong typing and type hints

.github/workflows/claude-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ jobs:
       category: ${{ inputs.category }}

   evaluate-with-claude-code:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}

.github/workflows/copilot-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ jobs:
       category: ${{ inputs.category }}

   evaluate-with-copilot-cli:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}

.github/workflows/copilot-setup-steps.yml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ jobs:
         uses: actions/checkout@v5

       - name: Install uv
-        uses: astral-sh/setup-uv@v7
+        uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57
         with:
           enable-cache: true

.github/workflows/dataset-validation.yml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ jobs:
       category: "bug-fix"

   verify-build-and-tests:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     if: needs.get-entries.outputs.entries != '[]'
     environment:

.github/workflows/mini-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ jobs:
       category: ${{ inputs.category }}

   evaluate-with-mini-agent:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}
