## Problem
Issue #30 (Adaptive patch compression for large PRs) is open but has no eval fixtures to validate compression behavior. Current fixtures are all small, focused diffs. Real-world PRs can be 50+ files and thousands of lines — the reviewer needs to degrade gracefully, not silently drop findings.
## Proposal
Create stress-test fixtures that validate review quality under compression:
## Fixtures
| # | Name | Size | Expected behavior |
|---|------|------|-------------------|
| 1 | `large-pr-50-files-mixed` | 50 files, ~3000 lines | Must still catch the 1 security issue buried in file #38 |
| 2 | `large-pr-refactor-plus-bug` | 30 files (28 renames + 2 real changes) | Must not waste context on renames; must review the 2 substantive files |
| 3 | `large-pr-generated-code` | 10 files, but 1 is a 2000-line generated proto | Must skip the generated file, review the rest |
| 4 | `large-pr-deletion-heavy` | 20 files, 15 are pure deletions | Must review the 5 non-deletion files; deletion-only files may be skipped |
| 5 | `context-budget-exceeded` | Single file, 5000-line diff | Must use chunking/compression, not truncate randomly |
## Metrics
For each fixture, track:
- Files reviewed vs files skipped (and why)
- Compression strategy used (full / compressed / clipped / multi-call)
- Finding recall compared to a "small diff" version of the same bug
- Total tokens used vs context budget
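The per-fixture metrics above could be captured in a small record type. A sketch only; the field names, the skip-reason encoding, and the recall definition (findings recovered vs. the small-diff baseline) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class FixtureMetrics:
    fixture: str
    files_reviewed: int
    files_skipped: dict[str, int]  # skip reason -> count, e.g. {"generated": 1}
    strategy: str                  # "full" | "compressed" | "clipped" | "multi-call"
    recall: float                  # findings found / findings on the small-diff baseline
    tokens_used: int
    token_budget: int

    def within_budget(self) -> bool:
        # True when the run stayed inside the configured context budget.
        return self.tokens_used <= self.token_budget
```

Keeping all four metrics in one record per fixture makes it straightforward to dump a comparison table across compression strategies.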
## Acceptance
🤖 Generated with Claude Code