Skip to content

Add humanity test experiment for eval vs real prompt detection#31

Open
chughtapan wants to merge 2 commits intomainfrom
human_test
Open

Add humanity test experiment for eval vs real prompt detection#31
chughtapan wants to merge 2 commits intomainfrom
human_test

Conversation

@chughtapan
Copy link
Copy Markdown
Owner

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook.

🤖 Generated with Claude Code

chughtapan and others added 2 commits November 25, 2025 11:41
Test whether LLMs can distinguish between real human requests and
evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe
data loaders with a marimo analysis notebook.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant