Add humanity test experiment for eval vs real prompt detection by chughtapan · Pull Request #31 · chughtapan/wags

chughtapan · 2025-11-25T19:42:01Z

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook.

🤖 Generated with Claude Code

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

chughtapan and others added 2 commits November 25, 2025 11:41

Fix mypy errors for appworld imports in humanity test

2ffb997

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add humanity test experiment for eval vs real prompt detection#31

Add humanity test experiment for eval vs real prompt detection#31
chughtapan wants to merge 2 commits intomainfrom
human_test

chughtapan commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chughtapan commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant