From 03fe4af627cb8ecf29139d4d2ebbc2b5d9bc786d Mon Sep 17 00:00:00 2001 From: Richard Abrich Date: Sun, 1 Mar 2026 23:42:30 -0500 Subject: [PATCH 1/3] docs: reframe positioning with multi-pillar strategy and honest scoping - README: Replace "Demo-Conditioned Prompting" with "Trajectory-Conditioned Disambiguation" showing the 2x2 experimental matrix (prompting validated, fine-tuning in progress). Add OpenCUA industry validation. - Landing page strategy: Lead with capture-to-deployment pipeline, add specialization pillar, update competitor table for March 2026 landscape (Agent S3, OpenCUA, Browser Use, CUA/Bytebot). Add honesty notes for proof points. Co-Authored-By: Claude Opus 4.6 --- README.md | 20 ++++--- docs/design/landing-page-strategy.md | 86 +++++++++++++++++----------- 2 files changed, 63 insertions(+), 43 deletions(-) diff --git a/README.md b/README.md index a247644fc..6bef02b2d 100644 --- a/README.md +++ b/README.md @@ -164,18 +164,20 @@ OpenAdapt follows a streamlined **Demonstrate → Learn → Execute** pipeline: - **Act**: Execute validated actions with safety gates - **Evaluate**: Measure success with `openadapt-evals` and feed results back for improvement -### Core Approach: Demo-Conditioned Prompting +### Core Approach: Trajectory-Conditioned Disambiguation -OpenAdapt explores **demonstration-conditioned automation** - "show, don't tell": +Zero-shot VLMs fail on GUI tasks not due to lack of capability, but due to **ambiguity in UI affordances**. OpenAdapt resolves this by conditioning agents on human demonstrations — "show, don't tell." 
-| Traditional Agent | OpenAdapt Agent | -|-------------------|-----------------| -| User writes prompts | User records demonstration | -| Ambiguous instructions | Grounded in actual UI | -| Requires prompt engineering | Reduced prompt engineering | -| Context-free | Context from similar demos | +| | No Retrieval | With Retrieval | +|---|---|---| +| **No Fine-tuning** | 33–47% (zero-shot baseline) | **100%** (validated, n=45) | +| **Fine-tuning** | Standard SFT baseline | **Demo-conditioned FT** (core goal) | -**Retrieval powers BOTH training AND evaluation**: Similar demonstrations are retrieved as context for the VLM. In early experiments on a controlled macOS benchmark, this improved first-action accuracy from 46.7% to 100% - though all 45 tasks in that benchmark share the same navigation entry point. See the [publication roadmap](docs/publication-roadmap.md) for methodology and limitations. +The bottom-right cell is OpenAdapt's unique value: training models to **use** demonstrations they haven't seen before, combining retrieval with fine-tuning for maximum accuracy. Phase 2 (retrieval-only prompting) is validated; Phase 3 (demo-conditioned fine-tuning) is in progress. + +**Validated result**: On a controlled macOS benchmark (45 System Settings tasks sharing a common navigation entry point), demo-conditioned prompting improved first-action accuracy from 46.7% to 100%. A length-matched control (+11.1 pp only) confirms the benefit is semantic, not token-length. See the [research thesis](https://github.com/OpenAdaptAI/openadapt-ml/blob/main/docs/research_thesis.md) for methodology and the [publication roadmap](docs/publication-roadmap.md) for limitations. + +**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) built their cross-platform capture tool on OpenAdapt, but uses demos only for training — not runtime conditioning. 
No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator. ### Key Concepts diff --git a/docs/design/landing-page-strategy.md b/docs/design/landing-page-strategy.md index 543fe48d7..1d5fcbc4e 100644 --- a/docs/design/landing-page-strategy.md +++ b/docs/design/landing-page-strategy.md @@ -40,7 +40,9 @@ OpenAdapt has evolved from a monolithic application (v0.46.0) to a **modular met 4. **Open Source (MIT License)**: Full transparency, no vendor lock-in **Key Innovation**: -- **Trajectory-conditioned disambiguation of UI affordances** - validated experiment showing 33% -> 100% first-action accuracy with demo conditioning +- **Trajectory-conditioned disambiguation of UI affordances** — the only open-source CUA framework that conditions agents on recorded demonstrations at runtime (validated: 46.7% → 100% first-action accuracy) +- **Specialization over scale** — fine-tuned Qwen3-VL-2B outperforms Claude Sonnet 4.5 and GPT-5.1 on action accuracy (42.9% vs 11.2% vs 23.2%) on an internal benchmark +- **Capture-to-deployment pipeline** — record → retrieve → train → deploy, used by [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) as foundation for their capture tooling - **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates ### 1.2 Current Landing Page Assessment @@ -218,25 +220,31 @@ Why: Clear 3-step process, action-oriented ### 3.3 Key Differentiators to Emphasize -1. **Demonstration-Based Learning** - - Not: "Use natural language to describe tasks" - - But: "Just do the task and OpenAdapt learns from watching" - - Proof: 33% -> 100% first-action accuracy with demo conditioning +1. **Capture-to-Deployment Pipeline** + - Not: "Prompt the AI to do your task" + - But: "Record the task once. OpenAdapt handles the rest — retrieval, training, deployment." 
+ - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) built on OpenAdapt's capture infrastructure -2. **Model Agnostic** +2. **Demonstration-Conditioned Agents** + - Not: "Zero-shot reasoning about what to click" + - But: "Agents conditioned on relevant demos — at inference AND during training" + - Proof: 46.7% → 100% first-action accuracy with demo conditioning (validated, n=45). No other open-source CUA framework does runtime demo conditioning. + - Note: This is first-action accuracy on tasks sharing a navigation entry point. Multi-step and cross-domain evaluation is ongoing on Windows Agent Arena. + +3. **Specialization Over Scale** + - Not: "Use the biggest model available" + - But: "A 2B model fine-tuned on your workflows outperforms frontier models" + - Proof: Qwen3-VL-2B (42.9%) vs Claude Sonnet 4.5 (11.2%) on action accuracy (internal benchmark, synthetic login task) + +4. **Model Agnostic** - Not: "Works with [specific AI]" - But: "Your choice: Claude, GPT-4V, Gemini, Qwen, or custom models" - Proof: Adapters for multiple VLM backends -3. **Runs Anywhere** - - Not: "Cloud-powered automation" - - But: "Run locally, in the cloud, or hybrid" - - Proof: CLI-based, works offline - -4. **Open Source** - - Not: "Try our free tier" - - But: "MIT licensed, fully transparent, community-driven" - - Proof: GitHub, PyPI, active Discord +5. **Runs Anywhere & Open Source** + - Not: "Cloud-powered automation" / "Try our free tier" + - But: "Run locally, in the cloud, or hybrid. MIT licensed, fully transparent." + - Proof: CLI-based, works offline; GitHub, PyPI, active Discord ### 3.4 Messaging Framework @@ -256,14 +264,17 @@ Why: Clear 3-step process, action-oriented ## 4. 
Competitive Positioning -### 4.1 Primary Competitors +### 4.1 Primary Competitors (Updated March 2026) | Competitor | Strengths | Weaknesses | Our Advantage | |------------|-----------|------------|---------------| -| **Anthropic Computer Use** | First-mover, Claude integration, simple API | Proprietary, cloud-only, no customization | Open source, model-agnostic, trainable | -| **UI-TARS (ByteDance)** | Strong benchmark scores, research backing | Closed source, not productized | Open source, deployable, extensible | -| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, large ecosystems | Brittle selectors, no AI reasoning, expensive | AI-first, learns from demos, affordable | -| **GPT-4V + Custom Code** | Powerful model, flexibility | Requires building everything, no structure | Ready-made SDK, training pipeline, benchmarks | +| **Anthropic Computer Use** | 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally | +| **Agent S3 (Simular)** | 72.6% OSWorld (superhuman), open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline | +| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique), OpenCUA built on our capture tool | +| **Browser Use** | 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library | +| **UI-TARS (ByteDance)** | Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval | +| **CUA / Bytebot** | Container infra, YC-backed | Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy | +| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, UiPath Screen Agent #1 on OSWorld 
| Brittle selectors, expensive ($10K+/yr), requires scripting | AI-first, learns from demos, open source | ### 4.2 Positioning Statement @@ -352,21 +363,19 @@ Show it once. Let it handle the rest. ``` ## Why OpenAdapt? -### Demonstration-Based Learning -No prompt engineering required. OpenAdapt learns from how you actually do tasks. -[Stat: 33% -> 100% first-action accuracy with demo conditioning] +### Record Once, Automate Forever +Capture any workflow. OpenAdapt retrieves relevant demos to guide agents +AND trains specialized models on your recordings. +[Stat: 46.7% → 100% first-action accuracy with demo conditioning] -### Model Agnostic -Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own. -Not locked to any single provider. +### Small Models, Big Results +A 2B model fine-tuned on your workflows outperforms frontier models. +Specialization beats scale for GUI tasks. +[Stat: 42.9% action accuracy (Qwen 2B FT) vs 11.2% (Claude Sonnet 4.5)] -### Run Anywhere -CLI-based, works offline. Deploy locally, in the cloud, or hybrid. -Your data stays where you want it. - -### Fully Open Source -MIT licensed. Transparent, auditable, community-driven. -No vendor lock-in, ever. +### Model Agnostic & Open Source +Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own. +MIT licensed. Run locally, in the cloud, or hybrid. ``` ### 5.5 For Developers Section @@ -486,13 +495,22 @@ Example: Onboarding guides for complex internal tools. 
### 6.3 Proof Points to Include -- "33% -> 100% first-action accuracy with demonstration conditioning" +- "46.7% → 100% first-action accuracy with demo conditioning (n=45, same model, no training)" +- "Fine-tuned 2B model outperforms Claude Sonnet 4.5 on action accuracy (42.9% vs 11.2%, internal benchmark)" +- "OpenCUA (NeurIPS 2025 Spotlight) built their capture tool on OpenAdapt" +- "Only open-source CUA framework with runtime demo-conditioned inference" - "[X,XXX] PyPI downloads this month" (dynamic) - "[XXX] GitHub stars" (dynamic) - "7 modular packages, 1 unified CLI" - "Integrated with Windows Agent Arena, WebArena, OSWorld benchmarks" - "MIT licensed, fully open source" +**Honesty notes for proof points**: +- The 46.7%→100% result is first-action only on 45 macOS tasks sharing the same navigation entry point +- The 42.9% vs 11.2% result is on a controlled internal synthetic login benchmark (~3 UI elements) +- Multi-step episode success on real-world benchmarks (WAA) is under active evaluation +- Frame these as "validated signal" not "production-proven" + --- ## 7. Wireframe Concepts From 667a9bb0f2826979eb51e54a2b437af914e9688c Mon Sep 17 00:00:00 2001 From: Richard Abrich Date: Mon, 2 Mar 2026 00:16:37 -0500 Subject: [PATCH 2/3] fix: correct OpenCUA attribution to macOS a11y code reuse OpenCUA reused OpenAdapt's macOS accessibility tree capture code (AX API traversal functions + oa_atomacos dependency), not the full capture-to-deployment pipeline. The recorder architecture came from DuckTrack. Updated README, landing page strategy, competitor table, and proof points to reflect this accurately. Evidence: arxiv.org/html/2508.09123v3 Section 2.2, OpenCUA README "Acknowledge" section. 
Co-Authored-By: Claude Opus 4.6 --- README.md | 2 +- docs/design/landing-page-strategy.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 6bef02b2d..09c929cb9 100644 --- a/README.md +++ b/README.md @@ -177,7 +177,7 @@ The bottom-right cell is OpenAdapt's unique value: training models to **use** de **Validated result**: On a controlled macOS benchmark (45 System Settings tasks sharing a common navigation entry point), demo-conditioned prompting improved first-action accuracy from 46.7% to 100%. A length-matched control (+11.1 pp only) confirms the benefit is semantic, not token-length. See the [research thesis](https://github.com/OpenAdaptAI/openadapt-ml/blob/main/docs/research_thesis.md) for methodology and the [publication roadmap](docs/publication-roadmap.md) for limitations. -**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) built their cross-platform capture tool on OpenAdapt, but uses demos only for training — not runtime conditioning. No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator. +**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) [reused OpenAdapt's macOS accessibility capture code](https://arxiv.org/html/2508.09123v3) in their AgentNetTool, but uses demos only for model training — not runtime conditioning. No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator. 
### Key Concepts diff --git a/docs/design/landing-page-strategy.md b/docs/design/landing-page-strategy.md index 1d5fcbc4e..17693b550 100644 --- a/docs/design/landing-page-strategy.md +++ b/docs/design/landing-page-strategy.md @@ -42,7 +42,7 @@ OpenAdapt has evolved from a monolithic application (v0.46.0) to a **modular met **Key Innovation**: - **Trajectory-conditioned disambiguation of UI affordances** — the only open-source CUA framework that conditions agents on recorded demonstrations at runtime (validated: 46.7% → 100% first-action accuracy) - **Specialization over scale** — fine-tuned Qwen3-VL-2B outperforms Claude Sonnet 4.5 and GPT-5.1 on action accuracy (42.9% vs 11.2% vs 23.2%) on an internal benchmark -- **Capture-to-deployment pipeline** — record → retrieve → train → deploy, used by [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) as foundation for their capture tooling +- **Capture-to-deployment pipeline** — record → retrieve → train → deploy. [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) [reused OpenAdapt's macOS accessibility capture code](https://arxiv.org/html/2508.09123v3) in their AgentNetTool - **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates ### 1.2 Current Landing Page Assessment @@ -223,7 +223,7 @@ Why: Clear 3-step process, action-oriented 1. **Capture-to-Deployment Pipeline** - Not: "Prompt the AI to do your task" - But: "Record the task once. OpenAdapt handles the rest — retrieval, training, deployment." - - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) built on OpenAdapt's capture infrastructure + - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) [reused OpenAdapt's macOS a11y capture code](https://arxiv.org/html/2508.09123v3) 2. 
**Demonstration-Conditioned Agents** - Not: "Zero-shot reasoning about what to click" @@ -270,7 +270,7 @@ Why: Clear 3-step process, action-oriented |------------|-----------|------------|---------------| | **Anthropic Computer Use** | 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally | | **Agent S3 (Simular)** | 72.6% OSWorld (superhuman), open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline | -| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique), OpenCUA built on our capture tool | +| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique); OpenCUA reused our macOS a11y code | | **Browser Use** | 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library | | **UI-TARS (ByteDance)** | Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval | | **CUA / Bytebot** | Container infra, YC-backed | Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy | @@ -497,7 +497,7 @@ Example: Onboarding guides for complex internal tools. 
- "46.7% → 100% first-action accuracy with demo conditioning (n=45, same model, no training)" - "Fine-tuned 2B model outperforms Claude Sonnet 4.5 on action accuracy (42.9% vs 11.2%, internal benchmark)" -- "OpenCUA (NeurIPS 2025 Spotlight) built their capture tool on OpenAdapt" +- "OpenCUA (NeurIPS 2025 Spotlight) reused OpenAdapt's macOS accessibility capture code in AgentNetTool" - "Only open-source CUA framework with runtime demo-conditioned inference" - "[X,XXX] PyPI downloads this month" (dynamic) - "[XXX] GitHub stars" (dynamic) From e061daa46d4de0059cc5cd45cd9a9eeedc48615c Mon Sep 17 00:00:00 2001 From: Richard Abrich Date: Mon, 2 Mar 2026 00:37:20 -0500 Subject: [PATCH 3/3] =?UTF-8?q?fix:=20review=20fixes=20=E2=80=94=20accurac?= =?UTF-8?q?y,=20claims,=20and=20add=20builders=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Use 46.7% consistently (not 33-47% range) - Change "core goal" to "planned" in 2x2 matrix - Drop "superhuman" for Agent S3 (barely above human baseline) - Fix possessive "our" to "OpenAdapt's" in competitor table - Add "Built for Builders" section for non-technical users - Renumber subsequent sections Co-Authored-By: Claude Opus 4.6 --- README.md | 4 ++-- docs/design/landing-page-strategy.md | 34 ++++++++++++++++++++++++---- 2 files changed, 31 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 09c929cb9..be1a4103a 100644 --- a/README.md +++ b/README.md @@ -170,8 +170,8 @@ Zero-shot VLMs fail on GUI tasks not due to lack of capability, but due to **amb | | No Retrieval | With Retrieval | |---|---|---| -| **No Fine-tuning** | 33–47% (zero-shot baseline) | **100%** (validated, n=45) | -| **Fine-tuning** | Standard SFT baseline | **Demo-conditioned FT** (core goal) | +| **No Fine-tuning** | 46.7% (zero-shot baseline) | **100%** (validated, n=45) | +| **Fine-tuning** | Standard SFT (baseline) | **Demo-conditioned FT** (planned) | The bottom-right cell is OpenAdapt's 
unique value: training models to **use** demonstrations they haven't seen before, combining retrieval with fine-tuning for maximum accuracy. Phase 2 (retrieval-only prompting) is validated; Phase 3 (demo-conditioned fine-tuning) is in progress. diff --git a/docs/design/landing-page-strategy.md b/docs/design/landing-page-strategy.md index 17693b550..53b6dca74 100644 --- a/docs/design/landing-page-strategy.md +++ b/docs/design/landing-page-strategy.md @@ -269,8 +269,8 @@ Why: Clear 3-step process, action-oriented | Competitor | Strengths | Weaknesses | Our Advantage | |------------|-----------|------------|---------------| | **Anthropic Computer Use** | 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally | -| **Agent S3 (Simular)** | 72.6% OSWorld (superhuman), open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline | -| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique); OpenCUA reused our macOS a11y code | +| **Agent S3 (Simular)** | 72.6% OSWorld, open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline | +| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique); OpenCUA reused OpenAdapt's macOS a11y code | | **Browser Use** | 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library | | **UI-TARS (ByteDance)** | Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval | | **CUA / Bytebot** | Container infra, YC-backed 
| Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy | @@ -378,7 +378,31 @@ Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own. MIT licensed. Run locally, in the cloud, or hybrid. ``` -### 5.5 For Developers Section +### 5.5 For Builders Section + +```` +## Built for Builders + +### Show it once. Done. +Record yourself doing a task. OpenAdapt handles the rest. +No code, no prompts, no configuration. + +### Three commands +```bash +pip install openadapt +openadapt capture start --name my-task # Record +openadapt run --capture my-task # Replay with AI +``` + +### Works with the AI you already use +Claude, GPT-4V, Gemini, Qwen — pick your model. +Or let OpenAdapt train a small one that runs on your laptop. + +### Your data stays yours +Everything runs locally. Nothing leaves your machine unless you want it to. +```` + +### 5.6 For Developers Section ```` ## Built for Developers @@ -415,7 +439,7 @@ Compare your models against published baselines. [View Documentation] [GitHub Repository] ```` -### 5.6 For Enterprise Section +### 5.7 For Enterprise Section ``` ## Enterprise-Ready Automation @@ -438,7 +462,7 @@ Custom development, training, and support packages available. [Contact Sales: sales@openadapt.ai] ``` -### 5.7 Use Cases Section (Refined) +### 5.8 Use Cases Section (Refined) **Current**: Generic industry grid
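The mechanism these patches describe — retrieve the recorded demonstrations most similar to the current task and condition the agent on them at inference time, rather than prompting zero-shot — can be sketched roughly as below. This is an illustrative toy, not OpenAdapt's actual API: the demo store, the `embed` bag-of-words encoder, and all function names are assumptions; a real system would use a learned embedding model and richer demo trajectories.

```python
# Toy sketch of demo-conditioned prompting as described above: retrieve the
# k most similar recorded demos and prepend them to the agent's prompt.
# All names here are hypothetical -- NOT OpenAdapt's real API.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(task: str, demos: list[dict], k: int = 2) -> list[dict]:
    """Return the k demonstrations whose task description is most similar."""
    q = embed(task)
    return sorted(demos, key=lambda d: cosine(q, embed(d["task"])), reverse=True)[:k]


def build_prompt(task: str, demos: list[dict]) -> str:
    """Condition the agent on retrieved demos instead of prompting zero-shot."""
    lines = ["You are a GUI agent. Similar recorded demonstrations:"]
    for d in retrieve(task, demos):
        lines.append(f"- Task: {d['task']} | First action: {d['first_action']}")
    lines.append(f"Current task: {task}")
    lines.append("Predict the first action.")
    return "\n".join(lines)


demos = [
    {"task": "enable dark mode in system settings", "first_action": "click('Appearance')"},
    {"task": "change desktop wallpaper", "first_action": "click('Wallpaper')"},
    {"task": "install a python package", "first_action": "type('pip install ...')"},
]
prompt = build_prompt("turn on dark mode", demos)
```

The point of the sketch is the shape of the pipeline, not the retrieval method: the same `build_prompt` step is what distinguishes runtime demo conditioning (OpenAdapt's claimed differentiator) from systems like OpenCUA that consume demonstrations only as training data.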