Commit 25e58d1

Merge branch 'main' into private/milicadjukic/BCBenchScript2

Author: MilicaDjukic
2 parents 3d38770 + ea49c9a commit 25e58d1

42 files changed

Lines changed: 999 additions & 516 deletions

Some content is hidden

.github/actions/setup-bc-container/action.yml

Lines changed: 7 additions & 1 deletion
@@ -14,6 +14,10 @@ inputs:
   github-token:
     description: GitHub token for accessing public repositories
     required: true
+  skip-container:
+    description: Skip BC container setup (only clone repository)
+    required: false
+    default: "false"
 
 outputs:
   repo_path:
@@ -24,6 +28,7 @@ runs:
   using: composite
   steps:
     - name: Generate BC container name and credentials
+      if: inputs.skip-container != 'true'
       run: |
         # Generate a 32-character random password using Get-Random
         # The password is short-lived and only used for the duration of the workflow
@@ -38,6 +43,7 @@ runs:
       shell: pwsh
 
     - name: Install BcContainerHelper module
+      if: inputs.skip-container != 'true'
       run: Install-Module -Name BcContainerHelper -Force -AllowClobber -AllowPrerelease
       shell: pwsh
 
@@ -59,5 +65,5 @@ runs:
         $env:ADO_TOKEN = az account get-access-token --resource "499b84ac-1321-427f-aa17-267ca6975798" --query accessToken -o tsv
         Write-Output "::add-mask::$env:ADO_TOKEN"
 
-        .\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}"
+        .\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
       shell: pwsh
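
The last hunk uses the GitHub Actions expression idiom `cond && a || b` to append `-SkipContainer` only when the `skip-container` input equals the string `true`. As a rough illustration of that selection logic, here is a minimal Python sketch. The function name `build_setup_args` is hypothetical and exists only to mirror the expression; it is not part of the repository.

```python
def build_setup_args(instance_id: str, skip_container: str = "false") -> list[str]:
    """Mirror the workflow expression: add -SkipContainer only when the input is 'true'."""
    args = [r".\scripts\Setup-ContainerAndRepository.ps1", "-InstanceId", instance_id]
    # Action inputs arrive as strings, so the check is against the text 'true',
    # not a boolean. GitHub expressions ignore case when comparing strings,
    # hence the lower() call here.
    flag = "-SkipContainer" if skip_container.lower() == "true" else ""
    if flag:
        # An empty flag contributes nothing to the command line.
        args.append(flag)
    return args


if __name__ == "__main__":
    print(build_setup_args("example-instance", "true"))
    print(build_setup_args("example-instance"))
```

The `cond && a || b` idiom only behaves like a ternary when `a` is truthy. That holds here because `'-SkipContainer'` is a non-empty string.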

.github/actions/setup-python-uv/action.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ runs:
   using: composite
   steps:
     - name: Install uv
-      uses: astral-sh/setup-uv@v7
+      uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57
       with:
         enable-cache: true
 
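
This change replaces the mutable tag `v7` with a full 40-character commit SHA. A small check like the following could flag `uses:` references that are not pinned this way. It is a sketch only: `is_pinned_to_sha` is a hypothetical helper, not something the repository is known to contain.

```python
import re

# A full-length Git commit SHA is 40 lowercase hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_to_sha(uses: str) -> bool:
    """Return True when an `owner/repo@ref` string pins a full commit SHA."""
    _, sep, ref = uses.partition("@")
    # No '@' means no ref at all; a tag such as 'v7' fails the SHA pattern.
    return bool(sep) and FULL_SHA.fullmatch(ref) is not None


if __name__ == "__main__":
    for ref in (
        "astral-sh/setup-uv@v7",
        "astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57",
    ):
        print(ref, is_pinned_to_sha(ref))
```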

.github/copilot-instructions.md

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,9 @@ This is a benchmark for evaluating coding agents on real-world Business Central
 - Uses `uv` for dependency management: e.g. `uv add <package>` to add packages, `uv run <command>` to run commands
 - Uses `pre-commit` for code quality checks (ruff linting/formatting, trailing whitespace, etc.)
 
+## Categories
+BC-Bench is category-based and designed to grow over time. It currently has two categories, `bug-fix` and `test-generation`. They share the same dataset tasks and execution-based setup, but use different prompts, expected outputs, and evaluation pipelines. Future categories such as `code-review` can be added within the same overall benchmark structure, though they may require different inputs, setup, or evaluation methods.
+
 ## Coding Patterns and Guidelines
 
 - Prefer strong typing and type hints
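
The new Categories paragraph names two categories, `bug-fix` and `test-generation`. One way such a closed set might be modelled with strong typing is an enum plus a parsing helper. The `Category` class and `parse_category` function below are assumptions made for illustration; the diff does not show how the repository actually represents categories.

```python
from enum import Enum


class Category(str, Enum):
    """Hypothetical enum; the values are the category names from the paragraph above."""

    BUG_FIX = "bug-fix"
    TEST_GENERATION = "test-generation"


def parse_category(value: str) -> Category:
    """Convert a CLI or workflow string into a Category, rejecting unknown names."""
    try:
        return Category(value)
    except ValueError:
        known = ", ".join(c.value for c in Category)
        raise ValueError(f"Unknown category {value!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(parse_category("bug-fix"))
```

Under this sketch, adding a future category such as `code-review` would mean adding one enum member, and unknown strings fail early with a message listing the valid names.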

.github/workflows/claude-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ jobs:
       category: ${{ inputs.category }}
 
   evaluate-with-claude-code:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}

.github/workflows/copilot-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ jobs:
       category: ${{ inputs.category }}
 
   evaluate-with-copilot-cli:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}

.github/workflows/copilot-setup-steps.yml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ jobs:
         uses: actions/checkout@v5
 
       - name: Install uv
-        uses: astral-sh/setup-uv@v7
+        uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57
         with:
           enable-cache: true
 

.github/workflows/dataset-validation.yml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ jobs:
       category: "bug-fix"
 
   verify-build-and-tests:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     if: needs.get-entries.outputs.entries != '[]'
     environment:

.github/workflows/mini-evaluation.yml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ jobs:
       category: ${{ inputs.category }}
 
   evaluate-with-mini-agent:
-    runs-on: [self-hosted, 1ES.Pool=GitHub-BCBench]
+    runs-on: [GitHub-BCBench]
     needs: get-entries
     outputs:
       results-dir: ${{ env.EVALUATION_RESULTS_DIR }}

notebooks/bug-fix/overview.ipynb

Lines changed: 2 additions & 2 deletions
@@ -269,7 +269,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "id": "8b5bb1be",
    "metadata": {},
    "outputs": [
@@ -291,7 +291,7 @@
     "merged_df[\"image_bin\"] = pd.cut(merged_df[\"image_count\"], bins=bins, labels=labels)\n",
     "\n",
     "# Add problem statement char count\n",
-    "ps_chars = {entry.instance_id: len(entry.get_task(transform_image_paths=False)) for entry in bcbench_dataset}\n",
+    "ps_chars = {entry.instance_id: len(entry.get_task()) for entry in bcbench_dataset}\n",
     "\n",
     "merged_df[\"ps_chars\"] = merged_df[\"instance_id\"].map(ps_chars)\n",
     "\n",
     "instance_df = (\n",

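The changed notebook line builds a mapping from `instance_id` to the character count of each entry's task text, now calling `get_task()` without arguments. The sketch below reproduces that mapping with plain Python objects. The `Entry` dataclass is a hypothetical stand-in, since the real dataset entry class and the actual behavior of `get_task()` are not shown in this diff.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a dataset entry; the real class is not shown here."""

    instance_id: str
    problem_statement: str

    def get_task(self) -> str:
        # Assumed behavior: return the task text whose length is measured.
        return self.problem_statement


def problem_statement_chars(entries: list[Entry]) -> dict[str, int]:
    """Map each instance_id to the character count of its task text."""
    return {entry.instance_id: len(entry.get_task()) for entry in entries}


if __name__ == "__main__":
    sample = [Entry("example-1", "Posting fails"), Entry("example-2", "")]
    print(problem_statement_chars(sample))
```

In the notebook, the resulting dictionary is then applied to the `instance_id` column with `Series.map` to produce the `ps_chars` column.
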
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ analysis = [
     "plotly>=6.5.0",
 ]
 dev = [
-    "pytest>=8.0",
+    "pytest>=9.0.3",
     "pytest-cov>=7.0",
     "ruff>=0.13.0",
     "pre-commit>=4.3.0",
