
TwinBench Methodology Notes

What TwinBench Measures Well

TwinBench is strongest when the question is whether a system behaves like a persistent intelligence layer over time. In v1, it measures well:

  • delayed recall of user-relevant facts and preferences
  • continuity of multi-step work across interruptions
  • stability of user identity and system role
  • transfer of task state across contexts
  • practical gains from remembered preferences

These are the behavioral properties most often missing from model-only and task-only benchmarks.
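The first property above, delayed recall, can be sketched as a minimal probe: seed a fact in one session, then check whether a later response surfaces it. This is an illustrative assumption, not the benchmark's actual scoring logic; the function name and the naive substring check are hypothetical.

```python
# Hypothetical sketch of a delayed-recall probe. A fact is seeded in an
# early session; a later session's output is checked for that fact.
def delayed_recall_hit(seeded_fact: str, later_response: str) -> bool:
    # Naive case-insensitive substring match. Real evaluation still needs
    # evaluator judgment to separate genuine recall from coincidence.
    return seeded_fact.lower() in later_response.lower()
```

A real harness would pair this with elapsed time and interruption steps, since those are the conditions under which model-only benchmarks stop measuring anything.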

Evidence Classes In This Repository

The current repository contains more than one evidence class:

  • measured first-party artifacts: recorded benchmark outputs generated by the repository team, currently concentrated in legacy/results/
  • reference examples: schema illustrations intended to show how v1 artifacts are structured
  • modeled baselines: plausible comparison profiles intended to demonstrate benchmark differentiation, not to replace live evaluation

These classes should not be treated as interchangeable. Measured first-party artifacts carry more evidentiary weight than modeled baselines, even when they use an older schema.
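One way to make the ordering above explicit is to tag each artifact with its evidence class and rank by that tag. This is a sketch under assumptions: the enum names, the `Artifact` fields, and the numeric weights are hypothetical, not the repository's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceClass(Enum):
    """Evidence classes described above, in decreasing evidentiary weight."""
    MEASURED_FIRST_PARTY = 1   # recorded outputs generated by the repo team
    REFERENCE_EXAMPLE = 2      # schema illustrations, not live runs
    MODELED_BASELINE = 3       # plausible comparison profiles, no live run

@dataclass
class Artifact:
    path: str
    evidence_class: EvidenceClass
    schema_version: str        # e.g. "v0.2" for legacy harness runs

def evidentiary_weight(a: Artifact) -> int:
    # Lower value = stronger evidence. Note the schema version is not part
    # of the ranking: a measured v0.2 artifact still outranks a modeled
    # v1 baseline.
    return a.evidence_class.value
```

The design point is that evidence class and schema version are independent axes, which is exactly the distinction the paragraph above asks evaluators to preserve.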

First-Party Evaluation Handling

Some of the strongest evidence in the repository is first-party, especially the recorded Nullalis harness runs. First-party evidence is still useful, but it carries obvious bias risk. TwinBench handles that risk by:

  • keeping degraded and failed runs rather than promoting only favorable ones
  • preserving artifact files instead of summarizing them away
  • recording measured coverage alongside scores
  • distinguishing measured artifacts from examples and modeled baselines
  • stating when a result uses an earlier benchmark schema

This does not remove bias, but it makes the evidence easier to audit.

What TwinBench Does Not Yet Measure Well

TwinBench v1 is intentionally narrower than a full system audit. It does not yet measure well:

  • adversarial robustness and security depth
  • production reliability under sustained scale
  • infrastructure cost efficiency
  • open-domain capability breadth
  • emotional or social interaction quality
  • multi-user persistence under heavy contention

Where Evaluator Judgment Is Required

Evaluator judgment remains necessary in several places:

  • deciding whether a resumed task is genuinely coherent or merely plausible
  • separating stylistic variation from identity drift
  • deciding whether later improvement is true personalization gain or prompt luck
  • judging whether omitted context is harmless compression or a continuity failure

TwinBench v1 treats that subjectivity honestly. Results should report notes and caveats rather than implying a level of automation that does not exist.

Confidence Levels

Confidence should be interpreted by evidence class:

  • high: directly recorded behavior with clear artifact support and limited ambiguity
  • moderate: recorded behavior with partial measurement, schema translation, or meaningful evaluator interpretation
  • low: examples, projections, or modeled comparisons with no direct live run behind them

In the current repository, the measured Nullalis harness artifacts generally provide moderate-to-high confidence within their own v0.2 schema. The modeled v1 baseline comparisons provide lower confidence and should be read as benchmark illustrations rather than definitive product measurements.
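The three-level rubric above can be expressed as a small decision function. This is a hypothetical codification, and the thresholds (full coverage, schema translation) are assumptions about where the moderate/high boundary sits, not a documented rule of the benchmark.

```python
def confidence(evidence_class: str, coverage: float,
               schema_translated: bool) -> str:
    """Hypothetical rubric mapping the evidence classes above to
    confidence labels."""
    if evidence_class in ("reference_example", "modeled_baseline"):
        # No direct live run behind the result.
        return "low"
    if coverage < 1.0 or schema_translated:
        # Measured, but partially covered or translated across schemas.
        return "moderate"
    return "high"
```

Under this sketch, the measured v0.2 Nullalis runs land at moderate or high depending on coverage, while modeled v1 baselines are pinned at low regardless of how complete they look.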

Reproducibility Limits

Persistent-system evaluation is less deterministic than single-turn task evaluation because:

  • elapsed time affects system state
  • deployments change between checkpoints
  • context transfer may depend on product instrumentation
  • preference-learning scenarios depend on task comparability
  • some systems expose internal state while others expose only outputs

TwinBench addresses these limits through fixed scenario definitions, stable metric formulas, explicit dates, and artifact-based reporting. It does not fully eliminate drift.
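The mitigations named above, fixed scenario definitions, stable metric formulas, and explicit dates, can be sketched as a frozen scenario record plus a pure metric function. The fields and the example recall formula are illustrative assumptions, not the benchmark's published definitions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: scenario definitions are fixed, never
class Scenario:          # mutated per run, so runs stay comparable
    scenario_id: str
    description: str
    recorded_on: date    # explicit date, since elapsed time affects state

def recall_rate(retained: int, probed: int) -> float:
    # Example of a stable metric formula: a pure function of recorded
    # counts, with no hidden dependence on deployment or wall-clock state.
    if probed == 0:
        raise ValueError("no probes defined for this scenario")
    return retained / probed
```

Fixing the definition and the formula does not make the system under test deterministic, but it ensures that any drift shows up in the recorded counts rather than in the benchmark itself.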

Schema drift is another current repository constraint. The strongest measured runs are still in the earlier v0.2 harness format, while the simplified public v1 scaffold uses a different five-metric structure. That means some cross-surface comparisons remain approximate until live v1 evaluations are generated.

What “Good Enough v1” Means

TwinBench v1 is considered good enough if it does four things reliably:

  1. defines the persistence layer as a benchmarkable object
  2. gives evaluators a stable vocabulary and result schema
  3. makes partial coverage and caveats explicit
  4. supports comparable example runs without overstating certainty

That is a stronger and more trustworthy v1 than a more automated benchmark that hides uncertainty.

Why the Systems Thesis Matters

TwinBench is benchmark-first and vendor-neutral, but its methodology reflects a clear technical position: persistence is a systems property. Durable memory, identity stability, context transfer, and long-horizon continuity usually depend on architecture and runtime design, not only on prompt quality. This is why persistent runtimes matter to the benchmark without making the benchmark product-specific.