TwinBench is strongest when the question is whether a system behaves like a persistent intelligence layer over time. In v1, it measures well:
- delayed recall of user-relevant facts and preferences
- continuity of multi-step work across interruptions
- stability of user identity and system role
- transfer of task state across contexts
- practical gains from remembered preferences
These are the behavioral properties most often missing from model-only and task-only benchmarks.
The current repository contains more than one evidence class:
- measured first-party artifacts: recorded benchmark outputs generated by the repository team, currently most visibly in legacy/results/
- reference examples: schema illustrations intended to show how v1 artifacts are structured
- modeled baselines: plausible comparison profiles intended to demonstrate benchmark differentiation, not to replace live evaluation
These classes should not be treated as interchangeable. Measured first-party artifacts carry more evidentiary weight than modeled baselines, even when they use an older schema.
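As a minimal sketch of this separation (all names hypothetical, not part of the repository), each artifact could carry an explicit evidence-class tag so that downstream comparisons never mix classes silently:

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceClass(Enum):
    """The three evidence classes, in descending evidentiary weight."""
    MEASURED_FIRST_PARTY = "measured_first_party"  # recorded benchmark outputs
    REFERENCE_EXAMPLE = "reference_example"        # schema illustrations only
    MODELED_BASELINE = "modeled_baseline"          # plausible comparison profiles

@dataclass
class Artifact:
    path: str
    evidence_class: EvidenceClass
    schema_version: str  # e.g. "v0.2" or "v1"

def comparable(a: Artifact, b: Artifact) -> bool:
    """Artifacts are directly comparable only within the same evidence class."""
    return a.evidence_class == b.evidence_class

run = Artifact("legacy/results/run_001.json", EvidenceClass.MEASURED_FIRST_PARTY, "v0.2")
baseline = Artifact("baselines/model_a.json", EvidenceClass.MODELED_BASELINE, "v1")
assert not comparable(run, baseline)
```

The point of the tag is mechanical honesty: a measured v0.2 run and a modeled v1 baseline can sit in the same report, but tooling can refuse to average them together.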
Some of the strongest evidence in the repository is first-party, especially the recorded Nullalis harness runs. First-party evidence remains useful, but it carries an obvious bias risk: the party producing the results is also the party reporting them. TwinBench mitigates that risk by:
- keeping degraded and failed runs rather than promoting only favorable ones
- preserving artifact files instead of summarizing them away
- recording measured coverage alongside scores
- distinguishing measured artifacts from examples and modeled baselines
- stating when a result uses an earlier benchmark schema
This does not remove bias, but it makes the evidence easier to audit.
TwinBench v1 is intentionally narrower than a full system audit. It does not yet measure well:
- adversarial robustness and security depth
- production reliability under sustained scale
- infrastructure cost efficiency
- open-domain capability breadth
- emotional or social interaction quality
- multi-user persistence under heavy contention
Evaluator judgment remains necessary in several places:
- deciding whether a resumed task is genuinely coherent or merely plausible
- separating stylistic variation from identity drift
- deciding whether later improvement is true personalization gain or prompt luck
- judging whether omitted context is harmless compression or a continuity failure
TwinBench v1 treats that subjectivity honestly. Results should report notes and caveats instead of pretending complete automation where it does not exist.
Confidence should be interpreted by evidence class:
- high: directly recorded behavior with clear artifact support and limited ambiguity
- moderate: recorded behavior with partial measurement, schema translation, or meaningful evaluator interpretation
- low: examples, projections, or modeled comparisons with no direct live run behind them
In the current repository, the measured Nullalis harness artifacts generally provide moderate-to-high confidence within their own v0.2 schema. The modeled v1 baseline comparisons provide lower confidence and should be read as benchmark illustrations rather than definitive product measurements.
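These bands can be assigned mechanically from an artifact's provenance. A sketch under assumed, illustrative thresholds (the 0.8 coverage cutoff and function name are not from the repository):

```python
def confidence(evidence_class: str, coverage: float, schema_translated: bool) -> str:
    """Map an artifact's provenance to a confidence band (illustrative thresholds)."""
    if evidence_class in ("reference_example", "modeled_baseline"):
        return "low"       # no direct live run behind it
    if coverage < 0.8 or schema_translated:
        return "moderate"  # partial measurement or schema translation
    return "high"          # directly recorded, well-covered, native schema

# Measured harness artifacts read within their own v0.2 schema:
assert confidence("measured_first_party", 0.9, False) == "high"
# The same artifacts read through a v1 schema translation:
assert confidence("measured_first_party", 0.9, True) == "moderate"
# Modeled v1 baseline comparisons:
assert confidence("modeled_baseline", 1.0, False) == "low"
```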
Persistent-system evaluation is less deterministic than single-turn task evaluation because:
- elapsed time affects system state
- deployments change between checkpoints
- context transfer may depend on product instrumentation
- preference-learning scenarios depend on task comparability
- some systems expose internal state while others expose only outputs
TwinBench addresses these limits through fixed scenario definitions, stable metric formulas, explicit dates, and artifact-based reporting. It does not fully eliminate drift.
Schema drift is another current repository constraint. The strongest measured runs are still in the earlier v0.2 harness format, while the simplified public v1 scaffold uses a different five-metric structure. That means some cross-surface comparisons remain approximate until live v1 evaluations are generated.
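One way to keep that approximation visible is to flag it in the comparison itself rather than in surrounding prose. A minimal sketch (function and field names hypothetical):

```python
def compare(score_a: float, schema_a: str, score_b: float, schema_b: str) -> dict:
    """Report a cross-run delta, explicitly marked approximate across schemas."""
    return {
        "delta": score_b - score_a,
        "approximate": schema_a != schema_b,  # v0.2 and v1 structures differ
    }

result = compare(0.78, "v0.2", 0.81, "v1")
assert result["approximate"] is True  # cross-schema, so only indicative
```

Carrying the flag in the data means any chart or table built from it can label v0.2-vs-v1 comparisons as approximate automatically, until live v1 evaluations exist.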
TwinBench v1 is considered good enough if it does four things reliably:
- defines the persistence layer as a benchmarkable object
- gives evaluators a stable vocabulary and result schema
- makes partial coverage and caveats explicit
- supports comparable example runs without overstating certainty
That is a stronger and more trustworthy v1 than a more automated benchmark that hides uncertainty.
TwinBench is benchmark-first and vendor-neutral, but its methodology reflects a clear technical position: persistence is a systems property. Durable memory, identity stability, context transfer, and long-horizon continuity usually depend on architecture and runtime design, not only on prompt quality. This is why persistent runtimes matter to the benchmark without making the benchmark product-specific.