14 changes: 12 additions & 2 deletions autonomous/agent-loop.sh
@@ -133,6 +133,16 @@ case "$STATE" in

$(cat /tmp/plan-backup/done/task.md)

</details>"
fi
if [ -f /tmp/plan-backup/benchmarks/comparison.md ]; then
PR_BODY="${PR_BODY}

<details>
<summary>Benchmark comparison</summary>

$(cat /tmp/plan-backup/benchmarks/comparison.md)

</details>"
fi
PR_BODY="${PR_BODY}
@@ -249,7 +259,7 @@ EOF
# If plan state doesn't exist on the branch, re-initialise
if [ ! -d ./plan ]; then
echo "No plan state found on branch, starting fresh"
mkdir -p ./plan/planning ./plan/ready ./plan/done
mkdir -p ./plan/planning ./plan/ready ./plan/done ./plan/benchmarks

ISSUE_BODY=$(gh issue view "$ISSUE_NUM" --repo "$UPSTREAM_REPO" \
--json body,title --jq '"# Issue #'"$ISSUE_NUM"': " + .title + "\n\n" + .body')
@@ -264,7 +274,7 @@ EOF
echo "Starting fresh: agent/${ISSUE_NUM}-${BRANCH_SLUG}"
git fetch upstream
git checkout -b "agent/${ISSUE_NUM}-${BRANCH_SLUG}" "upstream/$UPSTREAM_BASE_BRANCH"
mkdir -p ./plan/planning ./plan/ready ./plan/done
mkdir -p ./plan/planning ./plan/ready ./plan/done ./plan/benchmarks

ISSUE_BODY=$(gh issue view "$ISSUE_NUM" --repo "$UPSTREAM_REPO" \
--json body,title --jq '"# Issue #'"$ISSUE_NUM"': " + .title + "\n\n" + .body')
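The conditional-append pattern in the hunk above can be exercised in isolation. The sketch below uses a stand-in directory and file (`/tmp/demo-benchmarks/comparison.md` is hypothetical) rather than the real `/tmp/plan-backup` state:

```shell
# Stand-in for the real /tmp/plan-backup/benchmarks state (hypothetical path).
mkdir -p /tmp/demo-benchmarks
printf '| Benchmark | Baseline | Post-Change |\n| Encode | 1520 ns | 1180 ns |\n' \
  > /tmp/demo-benchmarks/comparison.md

PR_BODY="Summary of changes"
# Same guard as the agent loop: only append the section when the file exists.
if [ -f /tmp/demo-benchmarks/comparison.md ]; then
  PR_BODY="${PR_BODY}

<details>
<summary>Benchmark comparison</summary>

$(cat /tmp/demo-benchmarks/comparison.md)

</details>"
fi
printf '%s\n' "$PR_BODY"
```

Because the whole table is inlined via command substitution, the PR body stays a single string that `gh pr create --body` can take as-is.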
34 changes: 33 additions & 1 deletion autonomous/prompts/create-tasks.md
@@ -21,4 +21,36 @@ Include tasks for:
Use `[ ]` for each task.
Validate the task list is complete by cross-referencing every acceptance criterion in the brief — each criterion must be covered by at least one task.

Any task that attempts to alter the ./.github folder will likely fail due to permissions restrictions. These changes should accompany the PR as an attached file with clear direction on the manual intervention required to complete the work.

If the brief category is `performance`, follow the benchmark-driven development process
in spec/tech-standards/testing-standards.md. The task ordering MUST be:

Phase 1 — Benchmark scaffolding (before any implementation changes):
- [ ] Create benchmark class(es) in HdrHistogram.Benchmarking/
- Micro-benchmarks for the specific operations being optimised
- End-to-end benchmarks that exercise the realistic user workflow
- Add [MemoryDiagnoser] to all benchmark classes
- Register new benchmarks in Program.cs BenchmarkSwitcher
- [ ] Build and verify benchmarks compile:
`dotnet build HdrHistogram.Benchmarking/ -c Release`
- [ ] Run baseline benchmarks on the UNMODIFIED code
- Run: `dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*BenchmarkClass*' --exporters json`
- Save formatted results table to `plan/benchmarks/baseline.md`
- Include: Mean, StdDev, Allocated, Op/s for each benchmark method

Phase 2 — Implementation (the actual code changes):
- [ ] (implementation tasks as normal)
- [ ] (unit tests as normal)

Phase 3 — Benchmark validation (after implementation is complete):
- [ ] Run post-change benchmarks with identical configuration
- Save results to `plan/benchmarks/post-change.md`
- [ ] Generate comparison in `plan/benchmarks/comparison.md` containing:
- Side-by-side table: Benchmark | Baseline | Post-Change | Delta | Delta %
- Summary: which metrics improved, which regressed, which unchanged
- Verdict: does the data support the change?

Create the directory `plan/benchmarks/` for storing results.
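The delta computation behind `comparison.md` can be sketched with `awk`. The benchmark names and mean times (ns) below are made-up stand-ins for values parsed out of `baseline.md` and `post-change.md`; the parsing itself is elided:

```shell
# Made-up input: benchmark name, baseline mean (ns), post-change mean (ns).
printf 'Encode 1520 1180\nDecode 2310 2295\n' | awk '
  BEGIN { print "| Benchmark | Baseline (ns) | Post-Change (ns) | Delta | Delta % |" }
  {
    d = $3 - $2
    printf "| %s | %s | %s | %+d | %+.1f%% |\n", $1, $2, $3, d, d / $2 * 100
  }'
```

A negative Delta % means the post-change run was faster; the signed format makes regressions stand out at a glance.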

If the brief category is `functional`, use the current ordering (implementation, tests, docs).
23 changes: 22 additions & 1 deletion autonomous/prompts/execute-tasks.md
@@ -24,4 +24,25 @@ After all tasks are marked `[x]`:
2. If the review identifies issues, append new `[ ]` tasks to task.md describing each fix.
3. If the review is clean, move brief.md and task.md to ./plan/done/

Process as many tasks as you can in this iteration.

Special handling for benchmark tasks (applies when brief category is `performance`):

Baseline capture (Phase 1 benchmark tasks):
- After creating benchmark classes, commit ONLY the benchmark files:
`git add HdrHistogram.Benchmarking/ && git commit -m "bench: add benchmarks for baseline capture"`
- Run benchmarks in Release configuration. Use `--filter` to target only the relevant benchmarks.
- BenchmarkDotNet outputs markdown tables to stdout and detailed results to `BenchmarkDotNet.Artifacts/`.
Copy the results table into `plan/benchmarks/baseline.md`.
- Do NOT proceed to Phase 2 implementation tasks until baseline results are captured and saved.

Post-change validation (Phase 3 benchmark tasks):
- Run the exact same benchmark command used for the baseline.
- Save results to `plan/benchmarks/post-change.md`.
- Generate `plan/benchmarks/comparison.md` by reading both files and computing deltas.
- If any benchmark shows a regression, flag it in comparison.md and add a new task to investigate.

Benchmark execution notes:
- Both Phase 1 (baseline) and Phase 3 (final comparison) MUST use default BenchmarkDotNet settings — do NOT use `--job short` for these runs.
- Use `--job short` ONLY for ad-hoc iteration during Phase 2 development.
- Benchmarks that fail to compile or run must be fixed before proceeding.
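The "copy the results table" steps can be scripted. By default BenchmarkDotNet writes a GitHub-flavoured markdown report per benchmark class under `BenchmarkDotNet.Artifacts/results/`; the sketch below fakes one such report (the class name is hypothetical) so the copy step can be shown end to end:

```shell
# Fake report standing in for a real BenchmarkDotNet run (hypothetical class name).
mkdir -p BenchmarkDotNet.Artifacts/results plan/benchmarks
printf '| Method | Mean | Allocated |\n| Encode | 1.18 us | 320 B |\n' \
  > BenchmarkDotNet.Artifacts/results/DemoBenchmarks-report-github.md

# Collect every per-class report table into the baseline file.
cat BenchmarkDotNet.Artifacts/results/*-report-github.md > plan/benchmarks/baseline.md
```

The same one-liner with `post-change.md` as the target covers the Phase 3 capture, keeping the two files directly comparable.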
10 changes: 10 additions & 0 deletions autonomous/prompts/pick-issue.md
@@ -22,3 +22,13 @@ Using the exploration results, create ./plan/planning/brief.md containing:
- Acceptance criteria derived from the issue
- Test strategy: which tests to add or modify
- Risks or open questions
- Category: classify as either `functional` or `performance`
- `performance` if the issue mentions: allocation, memory, throughput, latency,
GC pressure, benchmark, hot path, serialisation performance, or similar
- `functional` for all other issues (bugs, features, refactors)
- Benchmark strategy (required for `performance` issues, optional for `functional`):
Follow the benchmark-driven development process in spec/tech-standards/testing-standards.md.
Identify:
- Which existing benchmarks are relevant (check HdrHistogram.Benchmarking/)
- What new micro-benchmarks and end-to-end benchmarks are needed
- Which metrics matter: throughput, allocation, GC collections
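As a rough illustration of the classification rule above, the keyword test can be approximated with a `grep` (the issue text and variable names are hypothetical; the real classification is a judgement call by the agent, not a pattern match):

```shell
# Hypothetical issue title/body; the keyword list mirrors the guidance above.
ISSUE_TEXT="Reduce allocation in the encoding hot path"
if printf '%s' "$ISSUE_TEXT" | grep -qiE 'allocation|memory|throughput|latency|GC pressure|benchmark|hot path'; then
  CATEGORY=performance
else
  CATEGORY=functional
fi
echo "Category: $CATEGORY"
```

Anything the heuristic does not catch defaults to `functional`, which matches the "all other issues" rule.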
5 changes: 5 additions & 0 deletions autonomous/prompts/review-brief.md
@@ -12,6 +12,11 @@ Review the brief for:
- Feasibility: Do the proposed changes align with what the code actually looks like?
- Test strategy: Are there specific test cases identified?
- Acceptance criteria: Are they measurable and verifiable?
- Category: Is the classification correct? Performance issues MUST be marked `performance`.
- Benchmark strategy (for `performance` issues, per spec/tech-standards/testing-standards.md):
- Are both micro-benchmarks AND end-to-end benchmarks identified?
- Do the benchmarks measure what the issue actually claims to improve?
- Are the benchmarks testing the realistic hot path, not just the changed code in isolation?

If changes are needed: create ./plan/planning/brief-review.md with specific,
actionable suggestions.
1 change: 1 addition & 0 deletions spec/README.md
@@ -41,6 +41,7 @@ This file provides guidance on where to find standards and specifications for th
| encoding, LEB128, DEFLATE, compression | [Histogram Encoding](./tech-standards/histogram-encoding.md) |
| log format, V2, persistence | [Histogram Encoding](./tech-standards/histogram-encoding.md) |
| xUnit, test, FluentAssertions | [Testing Standards](./tech-standards/testing-standards.md) |
| benchmark, performance, allocation, BenchmarkDotNet | [Testing Standards](./tech-standards/testing-standards.md) |
| naming convention, XML docs, style | [Coding Standards](./tech-standards/coding-standards.md) |
| build, NuGet, AppVeyor, CI/CD | [Build System](./tech-standards/build-system.md) |
| milestone, issue, PR, GitHub | [GitHub CLI Reference](./tech-standards/github.md) |
70 changes: 67 additions & 3 deletions spec/tech-standards/testing-standards.md
@@ -171,13 +171,77 @@ Resource types:

## Performance Testing

### Benchmarking Framework

The `HdrHistogram.Benchmarking/` project uses BenchmarkDotNet for performance measurement.
All benchmarks must run in Release mode for accurate results.

| Component | Details |
|-----------|---------|
| Framework | BenchmarkDotNet |
| Targets | net10.0, net9.0, net8.0 |
| Entry point | `Program.cs` with `BenchmarkSwitcher` |
| Diagnostics | `[MemoryDiagnoser]` on all benchmark classes |

### Benchmark-Driven Development

Performance issues follow a benchmark-first methodology, analogous to test-driven development.
Benchmarks are the tests for non-functional requirements.

**Phase 1 — Benchmark scaffolding (before any implementation changes):**

1. Create benchmark classes covering both levels:
- **Micro-benchmarks** — isolate the specific operation being optimised (e.g. `PutLong`/`GetLong`)
- **End-to-end benchmarks** — exercise the realistic user workflow that the micro-operation feeds into (e.g. histogram encode/decode round-trip)
2. Register benchmarks in `Program.cs` `BenchmarkSwitcher`
3. Run benchmarks against the **unmodified code** to establish a baseline
4. Record baseline results (Mean, StdDev, Allocated, Op/s)

**Phase 2 — Implementation:**

5. Apply the optimisation
6. Add or update unit tests as normal

**Phase 3 — Validation (after implementation is complete):**

7. Run the same benchmarks with identical configuration
8. Compare against baseline: generate a side-by-side table with deltas
9. Document the results, even when the change shows no meaningful improvement

Both levels of benchmark are required because:

- Micro-benchmarks prove the isolated operation improved
- End-to-end benchmarks prove the improvement matters in a realistic context
- A micro-optimisation with no observable end-to-end impact may not justify the change

### Benchmark Design Guidelines

- Use `[MemoryDiagnoser]` on every benchmark class to surface allocation counts
- Reset buffer positions or state inside each benchmark method, not in setup (this matches the BenchmarkDotNet iteration model, where the method body runs many times per setup call)
- Pre-allocate buffers in `[GlobalSetup]` so setup allocation is excluded from measurements
- Use realistic data sizes and distributions that match production usage
- Name benchmark methods clearly: `Encode`, `Decode`, `EncodeCompressed`, not `Test1`, `Bench_A`

### Running Benchmarks

```bash
# Build in Release mode (required)
dotnet build HdrHistogram.Benchmarking/ -c Release

# Run specific benchmarks
dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*'

# Export results as JSON for comparison
dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*' --exporters json

# Quick iteration (fewer iterations, faster feedback)
dotnet run -c Release --project HdrHistogram.Benchmarking/ -- --filter '*ClassName*' --job short
```

### Existing Benchmarks

Check `HdrHistogram.Benchmarking/` for the current set of benchmark classes.
Each subdirectory groups benchmarks by area (e.g. `LeadingZeroCount/`, `Recording/`).

## Test Organization Guidelines
