CortexScout is the Deep Research & Web Extraction module within the Cortex-Works ecosystem.
It is designed for agent workloads that require token-efficient web retrieval, reliable anti-bot handling, and an optional Human-in-the-Loop (HITL) fallback.
CortexScout provides a single, self-hostable Rust binary that exposes search and extraction capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.
It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval → Chromium CDP rendering → HITL workflows.
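The progressive fallback chain can be sketched as an ordered list of retrieval strategies, tried cheapest-first. This is a minimal illustration only; the function names and return types below are assumptions for the sketch, not CortexScout's internal API:

```python
from typing import Callable, Optional

# Hypothetical strategies, ordered to mirror the escalation described above:
# native retrieval -> Chromium CDP rendering -> HITL workflows.
def native_fetch(url: str) -> Optional[str]:
    # Plain HTTP GET; returns None when blocked (403/429/challenge page).
    return None  # stubbed: pretend the target blocks plain requests

def cdp_render(url: str) -> Optional[str]:
    # Render the page in Chromium over CDP to pass JS-heavy pages.
    return "<html>rendered</html>"  # stubbed success

def hitl_fallback(url: str) -> Optional[str]:
    # Last resort: a human clears the gate (visual_scout / human_auth_session).
    return "<html>human-assisted</html>"

def fetch_with_fallbacks(url: str) -> str:
    strategies: list[Callable[[str], Optional[str]]] = [
        native_fetch, cdp_render, hitl_fallback,
    ]
    for strategy in strategies:
        result = strategy(url)
        if result is not None:
            return result
    raise RuntimeError(f"all retrieval strategies failed for {url}")
```

Each step only runs when the cheaper one fails, which keeps the common case fast and reserves human attention for hard blocks.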
| Area | MCP Tools / Capabilities |
|---|---|
| Search | web_search, web_search_json (parallel meta-search + dedup/scoring) |
| Fetch | web_fetch, web_fetch_batch (token-efficient clean output, optional semantic filtering) |
| Crawl | web_crawl (bounded discovery for doc sites / sub-pages) |
| Extraction | extract_fields, fetch_then_extract (schema-driven extraction) |
| Anti-bot handling | CDP rendering, proxy rotation, block-aware retries |
| HITL | visual_scout (screenshot for gate confirmation), human_auth_session (authenticated fetch with persisted sessions), non_robot_search (last resort rendering) |
| Memory | memory_search (LanceDB-backed research history) |
| Deep research | deep_research (multi-hop search + scrape + synthesis via OpenAI-compatible APIs) |
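These tools are invoked over MCP's standard JSON-RPC 2.0 `tools/call` method. A rough sketch of the request envelope follows; the argument names (`query`, `max_results`) are illustrative assumptions — consult the schemas the server advertises via `tools/list` for the real parameters:

```python
import json

# Build a JSON-RPC 2.0 request for MCP's tools/call method.
# The "arguments" keys below are illustrative, not CortexScout's actual schema.
def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

request = make_tool_call(1, "web_search", {"query": "rust mcp servers", "max_results": 5})
```

Over the stdio transport, this message would be written to the server's stdin and the result read back from stdout.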
While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.
This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.
| Target | Protection | Evidence | Notes |
|---|---|---|---|
| — | Cloudflare + Auth | JSON · Snippet | Auth-gated listings extraction |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Challenge-handled extraction |
| Airbnb | DataDome | JSON · Snippet | Large result sets under bot controls |
| Upwork | reCAPTCHA | JSON · Snippet | Protected listings retrieval |
| Amazon | AWS Shield | JSON · Snippet | Search result extraction |
| nowsecure.nl | Cloudflare | JSON | Manual return path validated |
See proof/README.md for methodology and raw outputs.
Download the latest release assets from GitHub Releases and run one of:
- `cortex-scout-mcp` — MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `cortex-scout` — optional HTTP server (default port `5000`; override via `--port`, `PORT`, or `CORTEX_SCOUT_PORT`)
Health check (HTTP server):

```sh
./cortex-scout --port 5000
curl http://localhost:5000/health
```

Build from source:

```sh
git clone https://github.com/cortex-works/cortex-scout.git
cd cortex-scout/mcp-server
cargo build --release --all-features
```

Add a server entry to your MCP config. Example for VS Code (stdio transport):
Multi-IDE guide: docs/IDE_SETUP.md
Create `cortex-scout.json` in the same directory as the binary (or the repository root). All fields are optional; environment variables act as a fallback.
```json
{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:1234/v1",
    "llm_api_key": "",
    "llm_model": "lfm2-2.6b",
    "synthesis_enabled": true,
    "synthesis_max_sources": 3,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 1024
  }
}
```

| Variable | Default | Description |
|---|---|---|
| `CHROME_EXECUTABLE` | auto-detected | Override path to Chromium/Chrome/Brave |
| `SEARCH_ENGINES` | `google,bing,duckduckgo,brave` | Active engines (comma-separated) |
| `SEARCH_MAX_RESULTS_PER_ENGINE` | `10` | Results per engine before merge |
| `SEARCH_CDP_FALLBACK` | auto | Retry blocked retrieval via native Chromium CDP |
| `LANCEDB_URI` | — | Path for semantic memory (optional) |
| `CORTEX_SCOUT_MEMORY_DISABLED` | `0` | Set `1` to disable memory features |
| `HTTP_TIMEOUT_SECS` | `30` | Per-request timeout |
| `OUTBOUND_LIMIT` | `32` | Max concurrent outbound connections |
| `MAX_CONTENT_CHARS` | `10000` | Max chars per scraped document |
| `IP_LIST_PATH` | — | Proxy IP list path |
| `PROXY_SOURCE_PATH` | — | Proxy source definition path |
| `DEEP_RESEARCH_ENABLED` | `1` | Set `0` to disable the `deep_research` tool at runtime |
| `OPENAI_API_KEY` | — | API key for synthesis (omit for key-less local endpoints) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible endpoint (Ollama/LM Studio supported) |
| `DEEP_RESEARCH_LLM_MODEL` | `gpt-4o-mini` | Model name (OpenAI-compatible) |
| `DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS` | `1024` | Response token budget for synthesis |
Recommended operational flow:
- Use `memory_search` before new research runs to avoid re-fetching.
- Prefer `web_search_json` for initial discovery (search + content summaries).
- Use `web_fetch` for known URLs; use `output_format="clean_json"` and set `query` + `strict_relevance=true` for token efficiency.
- On 403/429/rate-limit: call `proxy_control` with `action:"grab"`, then retry with `use_proxy:true`.
- For auth walls: `visual_scout` to confirm gating, then `human_auth_session` to complete login and persist sessions under `~/.cortex-scout/sessions/`.
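The 403/429 branch of this flow can be sketched as a block-aware retry. The helper names here are stand-ins (`proxy_control` and `fetch` stub the real MCP tool calls), so treat this as a shape, not an implementation:

```python
# Sketch of the block-aware retry: on 403/429, grab a proxy and retry once.
BLOCK_STATUSES = {403, 429}

def fetch(url: str, use_proxy: bool = False) -> tuple[int, str]:
    # Stub transport: pretend direct requests are blocked, proxied ones succeed.
    return (200, "ok") if use_proxy else (403, "")

def proxy_control(action: str) -> None:
    # Stand-in for the proxy_control MCP tool called with action:"grab".
    assert action == "grab"

def fetch_with_proxy_retry(url: str) -> str:
    status, body = fetch(url)
    if status in BLOCK_STATUSES:
        proxy_control("grab")
        status, body = fetch(url, use_proxy=True)
    if status != 200:
        raise RuntimeError(f"fetch failed with status {status}")
    return body
```

Keeping the retry to one proxied attempt avoids hammering a target that is deliberately rate-limiting you; escalate to the HITL tools instead when the proxied retry also fails.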
Full agent rules: /.github/copilot-instructions.md
See CHANGELOG.md.
MIT. See LICENSE.
```json
{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/cortex-scout/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/cortex-scout/proxy_source.json",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}
```