
TwinBench Methodology Notes

What TwinBench Measures Well

TwinBench is strongest when the question is whether a system behaves like a persistent intelligence layer over time. In v1, it measures well:

  • delayed recall of user-relevant facts and preferences
  • continuity of multi-step work across interruptions
  • stability of user identity and system role
  • transfer of task state across contexts
  • practical gains from remembered preferences

These are the behavioral properties most often missing from model-only and task-only benchmarks.
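The first property above, delayed recall, can be sketched as a minimal probe: seed a fact in one session, then check whether a later response surfaces it. This is an illustrative assumption, not the benchmark's actual scoring logic; the function name and the naive substring check are hypothetical.

```python
# Hypothetical sketch of a delayed-recall probe. A fact is seeded in an
# early session; a later session's output is checked for that fact.
def delayed_recall_hit(seeded_fact: str, later_response: str) -> bool:
    # Naive case-insensitive substring match. Real evaluation still needs
    # evaluator judgment to separate genuine recall from coincidence.
    return seeded_fact.lower() in later_response.lower()
```

A real harness would pair this with elapsed time and interruption steps, since those are the conditions under which model-only benchmarks stop measuring anything.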

Evidence Classes In This Repository

The current repository contains more than one evidence class:

  • measured first-party artifacts: recorded benchmark outputs generated by the repository team, currently concentrated in legacy/results/
  • reference examples: schema illustrations intended to show how v1 artifacts are structured
  • modeled baselines: plausible comparison profiles intended to demonstrate benchmark differentiation, not to replace live evaluation

These classes should not be treated as interchangeable. Measured first-party artifacts carry more evidentiary weight than modeled baselines, even when they use an older schema.
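One way to make the ordering above explicit is to tag each artifact with its evidence class and rank by that tag. This is a sketch under assumptions: the enum names, the `Artifact` fields, and the numeric weights are hypothetical, not the repository's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceClass(Enum):
    """Evidence classes described above, in decreasing evidentiary weight."""
    MEASURED_FIRST_PARTY = 1   # recorded outputs generated by the repo team
    REFERENCE_EXAMPLE = 2      # schema illustrations, not live runs
    MODELED_BASELINE = 3       # plausible comparison profiles, no live run

@dataclass
class Artifact:
    path: str
    evidence_class: EvidenceClass
    schema_version: str        # e.g. "v0.2" for legacy harness runs

def evidentiary_weight(a: Artifact) -> int:
    # Lower value = stronger evidence. Note the schema version is not part
    # of the ranking: a measured v0.2 artifact still outranks a modeled
    # v1 baseline.
    return a.evidence_class.value
```

The design point is that evidence class and schema version are independent axes, which is exactly the distinction the paragraph above asks evaluators to preserve.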

First-Party Evaluation Handling

Some of the strongest evidence in the repository is first-party, especially the recorded Nullalis harness runs. First-party evidence is still useful, but it carries obvious bias risk. TwinBench handles that risk by:

  • keeping degraded and failed runs rather than promoting only favorable ones
  • preserving artifact files instead of summarizing them away
  • recording measured coverage alongside scores
  • distinguishing measured artifacts from examples and modeled baselines
  • stating when a result uses an earlier benchmark schema

This does not remove bias, but it makes the evidence easier to audit.

What TwinBench Does Not Yet Measure Well

TwinBench v1 is intentionally narrower than a full system audit. It does not yet measure well:

  • adversarial robustness and security depth
  • production reliability under sustained scale
  • infrastructure cost efficiency
  • open-domain capability breadth
  • emotional or social interaction quality
  • multi-user persistence under heavy contention

Where Evaluator Judgment Is Required

Evaluator judgment remains necessary in several places:

  • deciding whether a resumed task is genuinely coherent or merely plausible
  • separating stylistic variation from identity drift
  • deciding whether later improvement is true personalization gain or prompt luck
  • judging whether omitted context is harmless compression or a continuity failure

TwinBench v1 treats that subjectivity honestly. Results should report notes and caveats rather than implying a level of automation that does not exist.

Confidence Levels

Confidence should be interpreted by evidence class:

  • high: directly recorded behavior with clear artifact support and limited ambiguity
  • moderate: recorded behavior with partial measurement, schema translation, or meaningful evaluator interpretation
  • low: examples, projections, or modeled comparisons with no direct live run behind them

In the current repository, the measured Nullalis harness artifacts generally provide moderate-to-high confidence within their own v0.2 schema. The modeled v1 baseline comparisons provide lower confidence and should be read as benchmark illustrations rather than definitive product measurements.
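The three-level rubric above can be expressed as a small decision function. This is a hypothetical codification, and the thresholds (full coverage, schema translation) are assumptions about where the moderate/high boundary sits, not a documented rule of the benchmark.

```python
def confidence(evidence_class: str, coverage: float,
               schema_translated: bool) -> str:
    """Hypothetical rubric mapping the evidence classes above to
    confidence labels."""
    if evidence_class in ("reference_example", "modeled_baseline"):
        # No direct live run behind the result.
        return "low"
    if coverage < 1.0 or schema_translated:
        # Measured, but partially covered or translated across schemas.
        return "moderate"
    return "high"
```

Under this sketch, the measured v0.2 Nullalis runs land at moderate or high depending on coverage, while modeled v1 baselines are pinned at low regardless of how complete they look.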

Reproducibility Limits

Persistent-system evaluation is less deterministic than single-turn task evaluation because:

  • elapsed time affects system state
  • deployments change between checkpoints
  • context transfer may depend on product instrumentation
  • preference-learning scenarios depend on task comparability
  • some systems expose internal state while others expose only outputs

TwinBench addresses these limits through fixed scenario definitions, stable metric formulas, explicit dates, and artifact-based reporting. It does not fully eliminate drift.
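The mitigations named above, fixed scenario definitions, stable metric formulas, and explicit dates, can be sketched as a frozen scenario record plus a pure metric function. The fields and the example recall formula are illustrative assumptions, not the benchmark's published definitions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: scenario definitions are fixed, never
class Scenario:          # mutated per run, so runs stay comparable
    scenario_id: str
    description: str
    recorded_on: date    # explicit date, since elapsed time affects state

def recall_rate(retained: int, probed: int) -> float:
    # Example of a stable metric formula: a pure function of recorded
    # counts, with no hidden dependence on deployment or wall-clock state.
    if probed == 0:
        raise ValueError("no probes defined for this scenario")
    return retained / probed
```

Fixing the definition and the formula does not make the system under test deterministic, but it ensures that any drift shows up in the recorded counts rather than in the benchmark itself.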

Schema drift is another current repository constraint. The strongest measured runs are still in the earlier v0.2 harness format, while the simplified public v1 scaffold uses a different five-metric structure. That means some cross-surface comparisons remain approximate until live v1 evaluations are generated.

What “Good Enough v1” Means

TwinBench v1 is considered good enough if it does four things reliably:

  1. defines the persistence layer as a benchmarkable object
  2. gives evaluators a stable vocabulary and result schema
  3. makes partial coverage and caveats explicit
  4. supports comparable example runs without overstating certainty

That is a stronger and more trustworthy v1 than a more automated benchmark that hides uncertainty.

Why the Systems Thesis Matters

TwinBench is benchmark-first and vendor-neutral, but its methodology reflects a clear technical position: persistence is a systems property. Durable memory, identity stability, context transfer, and long-horizon continuity usually depend on architecture and runtime design, not only on prompt quality. This is why persistent runtimes matter to the benchmark without making the benchmark product-specific.