diff --git a/autonomous/agent-loop.sh b/autonomous/agent-loop.sh
index e9fab30..88036dd 100644
--- a/autonomous/agent-loop.sh
+++ b/autonomous/agent-loop.sh
@@ -133,6 +133,16 @@ case "$STATE" in
 $(cat /tmp/plan-backup/done/task.md)
+"
+  fi
+  if [ -f /tmp/plan-backup/benchmarks/comparison.md ]; then
+    PR_BODY="${PR_BODY}
+
+<details>
+<summary>Benchmark comparison</summary>
+
+$(cat /tmp/plan-backup/benchmarks/comparison.md)
+</details>
 "
   fi
   PR_BODY="${PR_BODY}
@@ -249,7 +259,7 @@ EOF
   # If plan state doesn't exist on the branch, re-initialise
   if [ ! -d ./plan ]; then
     echo "No plan state found on branch, starting fresh"
-    mkdir -p ./plan/planning ./plan/ready ./plan/done
+    mkdir -p ./plan/planning ./plan/ready ./plan/done ./plan/benchmarks
     ISSUE_BODY=$(gh issue view "$ISSUE_NUM" --repo "$UPSTREAM_REPO" \
       --json body,title --jq '"# Issue #'"$ISSUE_NUM"': " + .title + "\n\n" + .body')
@@ -264,7 +274,7 @@ EOF
   echo "Starting fresh: agent/${ISSUE_NUM}-${BRANCH_SLUG}"
   git fetch upstream
   git checkout -b "agent/${ISSUE_NUM}-${BRANCH_SLUG}" "upstream/$UPSTREAM_BASE_BRANCH"
-  mkdir -p ./plan/planning ./plan/ready ./plan/done
+  mkdir -p ./plan/planning ./plan/ready ./plan/done ./plan/benchmarks
   ISSUE_BODY=$(gh issue view "$ISSUE_NUM" --repo "$UPSTREAM_REPO" \
     --json body,title --jq '"# Issue #'"$ISSUE_NUM"': " + .title + "\n\n" + .body')
diff --git a/autonomous/prompts/create-tasks.md b/autonomous/prompts/create-tasks.md
index 900af7d..c7b4906 100644
--- a/autonomous/prompts/create-tasks.md
+++ b/autonomous/prompts/create-tasks.md
@@ -21,4 +21,36 @@ Include tasks for:
 Use `[ ]` for each task. Validate the task list is complete by cross-referencing every acceptance criterion in the brief — each criterion must be covered by at least one task.
-Any task that attempts to alter the ./.github folder will likely fail due to permissions restrictions. These changes should accompany the PR as an attached file with clear direction on the manual intervention required to complete the work.
\ No newline at end of file
+Any task that attempts to alter the ./.github folder will likely fail due to permissions restrictions. These changes should accompany the PR as an attached file with clear direction on the manual intervention required to complete the work.
+
+If the brief category is `performance`, follow the benchmark-driven development process
+in spec/tech-standards/testing-standards.md.
+The task ordering MUST be:
+
+Phase 1 — Benchmark scaffolding (before any implementation changes):
+- [ ] Create benchmark class(es) in HdrHistogram.Benchmarking/
+  - Micro-benchmarks for the specific operations being optimised
+  - End-to-end benchmarks that exercise the realistic user workflow
+  - Add [MemoryDiagnoser] to all benchmark classes
+  - Register new benchmarks in Program.cs BenchmarkSwitcher
+- [ ] Build and verify benchmarks compile:
+  `dotnet build HdrHistogram.Benchmarking/ -c Release`
+- [ ] Run baseline benchmarks on the UNMODIFIED code
+  - Run: `dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*BenchmarkClass*' --exporters json`
+  - Save formatted results table to `plan/benchmarks/baseline.md`
+  - Include: Mean, StdDev, Allocated, Op/s for each benchmark method
+
+Phase 2 — Implementation (the actual code changes):
+- [ ] (implementation tasks as normal)
+- [ ] (unit tests as normal)
+
+Phase 3 — Benchmark validation (after implementation is complete):
+- [ ] Run post-change benchmarks with identical configuration
+  - Save results to `plan/benchmarks/post-change.md`
+- [ ] Generate comparison in `plan/benchmarks/comparison.md` containing:
+  - Side-by-side table: Benchmark | Baseline | Post-Change | Delta | Delta %
+  - Summary: which metrics improved, which regressed, which unchanged
+  - Verdict: does the data support the change?
+
+Create the directory `plan/benchmarks/` for storing results.
+
+If the brief category is `functional`, use the current ordering (implementation, tests, docs).
diff --git a/autonomous/prompts/execute-tasks.md b/autonomous/prompts/execute-tasks.md
index 72ef5fb..a8af6ce 100644
--- a/autonomous/prompts/execute-tasks.md
+++ b/autonomous/prompts/execute-tasks.md
@@ -24,4 +24,25 @@ After all tasks are marked `[x]`:
 2. If the review identifies issues, append new `[ ]` tasks to task.md describing each fix.
 3. If the review is clean, move brief.md and task.md to ./plan/done/
-Process as many tasks as you can in this iteration.
\ No newline at end of file
+Process as many tasks as you can in this iteration.
+
+Special handling for benchmark tasks (applies when brief category is `performance`):
+
+Baseline capture (Phase 1 benchmark tasks):
+- After creating benchmark classes, commit ONLY the benchmark files:
+  `git add HdrHistogram.Benchmarking/ && git commit -m "bench: add benchmarks for baseline capture"`
+- Run benchmarks in Release configuration. Use `--filter` to target only the relevant benchmarks.
+- BenchmarkDotNet outputs markdown tables to stdout and detailed results to `BenchmarkDotNet.Artifacts/`.
+  Copy the results table into `plan/benchmarks/baseline.md`.
+- Do NOT proceed to Phase 2 implementation tasks until baseline results are captured and saved.
+
+Post-change validation (Phase 3 benchmark tasks):
+- Run the exact same benchmark command used for the baseline.
+- Save results to `plan/benchmarks/post-change.md`.
+- Generate `plan/benchmarks/comparison.md` by reading both files and computing deltas.
+- If any benchmark shows a regression, flag it in comparison.md and add a new task to investigate.
+
+Benchmark execution notes:
+- Both Phase 1 (baseline) and Phase 3 (final comparison) MUST use default BenchmarkDotNet settings — do NOT use `--job short` for these runs.
+- Use `--job short` ONLY for ad-hoc iteration during Phase 2 development.
+- Benchmarks that fail to compile or run must be fixed before proceeding.
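The execute-tasks prompt above tells the agent to generate `plan/benchmarks/comparison.md` "by reading both files and computing deltas" but leaves the computation unspecified. A minimal sketch of that step, under the assumption that each results file has first been reduced to simple `BenchmarkName MeanValue` pairs (real BenchmarkDotNet tables carry more columns, and the `bench_delta` helper name is hypothetical):

```shell
# Hypothetical sketch: emit "| name | baseline | post | delta | delta% |" rows
# from two files of "BenchmarkName MeanValue" pairs.
bench_delta() {
  # $1 = baseline file, $2 = post-change file
  awk 'NR==FNR { base[$1] = $2; next }            # first pass: load baseline values
       $1 in base {
         delta = $2 - base[$1]
         pct = (base[$1] != 0) ? 100 * delta / base[$1] : 0
         printf "| %s | %s | %s | %+g | %+.1f%% |\n", $1, base[$1], $2, delta, pct
       }' "$1" "$2"
}
```

For example, a baseline line `Encode 100` against a post-change line `Encode 80` produces `| Encode | 100 | 80 | -20 | -20.0% |`, which is the shape of the side-by-side table the Phase 3 tasks ask for.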
diff --git a/autonomous/prompts/pick-issue.md b/autonomous/prompts/pick-issue.md
index d7d33bd..032d637 100644
--- a/autonomous/prompts/pick-issue.md
+++ b/autonomous/prompts/pick-issue.md
@@ -22,3 +22,13 @@ Using the exploration results, create ./plan/planning/brief.md containing:
 - Acceptance criteria derived from the issue
 - Test strategy: which tests to add or modify
 - Risks or open questions
+- Category: classify as either `functional` or `performance`
+  - `performance` if the issue mentions: allocation, memory, throughput, latency,
+    GC pressure, benchmark, hot path, serialisation performance, or similar
+  - `functional` for all other issues (bugs, features, refactors)
+- Benchmark strategy (required for `performance` issues, optional for `functional`):
+  Follow the benchmark-driven development process in spec/tech-standards/testing-standards.md.
+  Identify:
+  - Which existing benchmarks are relevant (check HdrHistogram.Benchmarking/)
+  - What new micro-benchmarks and end-to-end benchmarks are needed
+  - Which metrics matter: throughput, allocation, GC collections
diff --git a/autonomous/prompts/review-brief.md b/autonomous/prompts/review-brief.md
index 10565d1..433a1e3 100644
--- a/autonomous/prompts/review-brief.md
+++ b/autonomous/prompts/review-brief.md
@@ -12,6 +12,11 @@ Review the brief for:
 - Feasibility: Do the proposed changes align with what the code actually looks like?
 - Test strategy: Are there specific test cases identified?
 - Acceptance criteria: Are they measurable and verifiable?
+- Category: Is the classification correct? Performance issues MUST be marked `performance`.
+- Benchmark strategy (for `performance` issues, per spec/tech-standards/testing-standards.md):
+  - Are both micro-benchmarks AND end-to-end benchmarks identified?
+  - Do the benchmarks measure what the issue actually claims to improve?
+  - Are the benchmarks testing the realistic hot path, not just the changed code in isolation?
 
 If changes are needed: create ./plan/planning/brief-review.md with specific, actionable suggestions.
diff --git a/spec/README.md b/spec/README.md
index 0ad3c07..9ed9af2 100644
--- a/spec/README.md
+++ b/spec/README.md
@@ -41,6 +41,7 @@ This file provides guidance on where to find standards and specifications for th
 | encoding, LEB128, DEFLATE, compression | [Histogram Encoding](./tech-standards/histogram-encoding.md) |
 | log format, V2, persistence | [Histogram Encoding](./tech-standards/histogram-encoding.md) |
 | xUnit, test, FluentAssertions | [Testing Standards](./tech-standards/testing-standards.md) |
+| benchmark, performance, allocation, BenchmarkDotNet | [Testing Standards](./tech-standards/testing-standards.md) |
 | naming convention, XML docs, style | [Coding Standards](./tech-standards/coding-standards.md) |
 | build, NuGet, AppVeyor, CI/CD | [Build System](./tech-standards/build-system.md) |
 | milestone, issue, PR, GitHub | [GitHub CLI Reference](./tech-standards/github.md) |
diff --git a/spec/tech-standards/testing-standards.md b/spec/tech-standards/testing-standards.md
index 21b93f2..340aba9 100644
--- a/spec/tech-standards/testing-standards.md
+++ b/spec/tech-standards/testing-standards.md
@@ -171,13 +171,77 @@ Resource types:
 ## Performance Testing
 
-Run tests in Release mode for accurate performance measurements:
+### Benchmarking Framework
+
+The `HdrHistogram.Benchmarking/` project uses BenchmarkDotNet for performance measurement.
+All benchmarks must run in Release mode for accurate results.
+
+| Component | Details |
+|-----------|---------|
+| Framework | BenchmarkDotNet |
+| Targets | net10.0, net9.0, net8.0 |
+| Entry point | `Program.cs` with `BenchmarkSwitcher` |
+| Diagnostics | `[MemoryDiagnoser]` on all benchmark classes |
+
+### Benchmark-Driven Development
+
+Performance issues follow a benchmark-first methodology, analogous to test-driven development.
+Benchmarks are the tests for non-functional requirements.
+
+**Phase 1 — Benchmark scaffolding (before any implementation changes):**
+
+1. Create benchmark classes covering both levels:
+   - **Micro-benchmarks** — isolate the specific operation being optimised (e.g. `PutLong`/`GetLong`)
+   - **End-to-end benchmarks** — exercise the realistic user workflow that the micro-operation feeds into (e.g. histogram encode/decode round-trip)
+2. Register benchmarks in `Program.cs` `BenchmarkSwitcher`
+3. Run benchmarks against the **unmodified code** to establish a baseline
+4. Record baseline results (Mean, StdDev, Allocated, Op/s)
+
+**Phase 2 — Implementation:**
+
+5. Apply the optimisation
+6. Add or update unit tests as normal
+
+**Phase 3 — Validation (after implementation is complete):**
+
+7. Run the same benchmarks with identical configuration
+8. Compare against baseline: generate a side-by-side table with deltas
+9. Document the results — including if the change shows no meaningful improvement
+
+Both levels of benchmark are required because:
+
+- Micro-benchmarks prove the isolated operation improved
+- End-to-end benchmarks prove the improvement matters in a realistic context
+- A micro-optimisation with no observable end-to-end impact may not justify the change
+
+### Benchmark Design Guidelines
+
+- Use `[MemoryDiagnoser]` on every benchmark class to surface allocation counts
+- Reset buffer positions or state in each benchmark method, not in setup (matches BDN iteration model)
+- Pre-allocate buffers in `[GlobalSetup]` so setup allocation is excluded from measurements
+- Use realistic data sizes and distributions that match production usage
+- Name benchmark methods clearly: `Encode`, `Decode`, `EncodeCompressed`, not `Test1`, `Bench_A`
+
+### Running Benchmarks
 
 ```bash
-dotnet test --configuration Release
+# Build in Release mode (required)
+dotnet build HdrHistogram.Benchmarking/ -c Release
+
+# Run specific benchmarks
+dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*'
+
+# Export results as JSON for comparison
+dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*' --exporters json
+
+# Quick iteration (fewer iterations, faster feedback)
+dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*' --job short
 ```
-Separate benchmarking project (`HdrHistogram.Benchmarking`) uses BenchmarkDotNet for micro-benchmarks.
+
+### Existing Benchmarks
+
+Check `HdrHistogram.Benchmarking/` for the current set of benchmark classes.
+Each subdirectory groups benchmarks by area (e.g. `LeadingZeroCount/`, `Recording/`).
 
 ## Test Organization Guidelines
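The patch repeatedly asks for the results table to be saved into `plan/benchmarks/`, and the execute-tasks prompt notes that BenchmarkDotNet prints its summary table (pipe-delimited markdown rows) to stdout. A sketch of capturing that table from a logged run, assuming the stated pipe-delimited output; the `extract_results_table` helper name and `bench.log` file are illustrative, not part of the patch:

```shell
# Hypothetical helper: keep only the markdown table rows (lines starting
# with '|') from a captured BenchmarkDotNet console log.
extract_results_table() {
  grep '^|' "$1"
}

# Possible usage, reusing the command shape from the standards above:
#   dotnet run -c Release --project HdrHistogram.Benchmarking/ -- \
#     --filter '*ClassName*' | tee bench.log
#   mkdir -p plan/benchmarks
#   extract_results_table bench.log > plan/benchmarks/baseline.md
```

Running the identical extraction for the baseline and post-change logs keeps the two saved tables directly comparable, which is what the Phase 3 comparison step depends on.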