| 1 | +# CallShield Adversarial Testing |
| 2 | + |
| 3 | +This document describes the adversarial test cases we used to challenge CallShield's detection logic — including deliberate attempts to trick the model with polite scammers, angry-but-innocent callers, and synthesized evasion tactics. |
| 4 | + |
| 5 | +The goal: confirm that the audio-native Voxtral pipeline catches what text-only models miss. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## The Core Challenge |
| 10 | + |
| 11 | +Text-only scam detection can be fooled by word choice. A scammer who says *"I understand your concern"* instead of *"You must pay NOW"* can significantly lower their text-based score. Voxtral doesn't have this weakness: it listens to the acoustic delivery, not just the words. |
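To make the word-choice weakness concrete, here is a toy keyword scorer. It is not CallShield's or any real product's text model: the function name `text_only_score`, the keyword list, and the weights are all hypothetical and exist only to show how rephrasing can zero out a text-only signal.

```python
import re

# Hypothetical urgency/threat keywords and weights, invented for illustration.
URGENCY_WEIGHTS = {
    "now": 0.3,
    "immediately": 0.3,
    "arrest": 0.4,
    "must": 0.2,
    "pay": 0.2,
}


def text_only_score(transcript: str) -> float:
    """Sum the weights of the keywords present in the transcript, capped at 1.0."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    total = sum(weight for term, weight in URGENCY_WEIGHTS.items() if term in words)
    return min(1.0, float(total))


aggressive = "You must pay NOW or face arrest"
polite = "I understand your concern and I'd like to help you resolve this"

# The polite script contains none of the keywords, so its score collapses
# even though the intent behind the call has not changed.
print(text_only_score(aggressive), text_only_score(polite))
```

A scorer like this sees only the transcript, so a polite rewrite slips past it.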
| 12 | + |
| 13 | +We tested this systematically. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## Adversarial Scenario Results |
| 18 | + |
| 19 | +### 1. The Polite IRS Agent |
| 20 | +**Attack:** A scammer using calm, professional language — "I'd like to help you resolve this" instead of threatening arrest. |
| 21 | +**Text-only vulnerability:** Politeness reduces urgency signal weight. |
| 22 | +**Result:** Score **0.95 SCAM** — Voxtral detected the scripted call-center delivery cadence and IRS authority claim regardless of polite framing. |
| 23 | + |
| 24 | +### 2. The Hedged Crypto Pitch |
| 25 | +**Attack:** Softened language — "some people have seen returns" instead of "guaranteed profits." |
| 26 | +**Text-only vulnerability:** Hedging removes the "too good to be true" signal. |
| 27 | +**Result:** Score **0.80 LIKELY_SCAM** — Voxtral caught the rehearsed sales delivery pattern and financial solicitation structure. |
| 28 | + |
| 29 | +### 3. The Angry Legitimate Customer |
| 30 | +**Attack:** A genuinely upset customer complaining about a billing error — aggressive tone, emotional language, demand for resolution. |
| 31 | +**Risk:** Could be misclassified due to emotional intensity and urgency. |
| 32 | +**Result:** Score **0.10 SAFE** — No payment demand, no authority impersonation, no information extraction. Anger alone is not a scam signal. |
| 33 | + |
| 34 | +### 4. The "Certified" Tech Support |
| 35 | +**Attack:** Scammer claims to be from a "Microsoft Certified Partner" with a legitimate-sounding business name. |
| 36 | +**Text-only vulnerability:** "Certified" and business-name legitimacy signals can lower suspicion. |
| 37 | +**Result:** Score **0.90 SCAM** — Remote access request + unsolicited outbound call pattern flagged regardless of claimed credentials. |
| 38 | + |
| 39 | +### 5. The FDIC Bank Examiner |
| 40 | +**Attack:** Highly convincing authority impersonation of a federal banking regulator — formal language, regulation citations. |
| 41 | +**Text-only vulnerability:** Formal institutional language can suppress scam scores. |
| 42 | +**Result:** Score **0.92 SCAM**: Voxtral detected the combination of authority impersonation + account information request, something legitimate FDIC examiners never ask for by phone. |
| 43 | + |
| 44 | +### 6. The Legitimate Doctor IVR |
| 45 | +**Attack:** A real automated appointment reminder — robotic voice, pre-recorded, mentions a patient name. |
| 46 | +**Risk:** Automated voice + patient data mention could trigger a false positive. |
| 47 | +**Result:** Score **0.10 SAFE** — No financial request, no urgency pressure, recognisable healthcare IVR pattern. Correctly cleared. |
| 48 | + |
| 49 | +### 7. The Legitimate Bank Fraud Alert |
| 50 | +**Attack:** A real automated bank alert that uses authority language ("This is First National Bank"), conveys urgency ("possible unauthorized transaction"), and asks for a callback. |
| 51 | +**Risk:** Authority + urgency is the classic scam combination. |
| 52 | +**Result:** Score **0.15 SAFE** — Critically, the call does NOT request credentials or payment. CallShield distinguishes "call us back" from "give us your PIN now." |
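The seven results above can be tabulated and checked mechanically. The sketch below uses only the scores and ground-truth labels reported in this document. The `Scenario` class and both helper functions are hypothetical. Treating 0.6 as the decision boundary is also an assumption, borrowed from the `score ≥ 0.6` bound of the long-con test, and is not a documented CallShield threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    is_scam: bool  # ground truth as described in this document
    score: float   # score reported in this document


SCENARIOS = [
    Scenario("Polite IRS agent", True, 0.95),
    Scenario("Hedged crypto pitch", True, 0.80),
    Scenario("Angry legitimate customer", False, 0.10),
    Scenario('"Certified" tech support', True, 0.90),
    Scenario("FDIC bank examiner", True, 0.92),
    Scenario("Legitimate doctor IVR", False, 0.10),
    Scenario("Legitimate bank fraud alert", False, 0.15),
]


def misclassified(scenarios, threshold=0.6):
    """Return names of scenarios whose score falls on the wrong side of the threshold."""
    return [s.name for s in scenarios if (s.score >= threshold) != s.is_scam]


def score_margin(scenarios):
    """Gap between the lowest-scoring scam and the highest-scoring benign call."""
    lowest_scam = min(s.score for s in scenarios if s.is_scam)
    highest_benign = max(s.score for s in scenarios if not s.is_scam)
    return lowest_scam - highest_benign


print(misclassified(SCENARIOS))           # empty list: no scenario is on the wrong side
print(round(score_margin(SCENARIOS), 2))  # gap between 0.80 (crypto pitch) and 0.15 (bank alert)
```

With these seven cases, any threshold between the highest benign score (0.15) and the lowest scam score (0.80) separates the two groups. Seven hand-built scenarios are too few to say how well that gap holds in general.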
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## Automated Adversarial Suite |
| 57 | + |
| 58 | +All adversarial scenarios are implemented as automated tests in `backend/tests/test_adversarial.py`: |
| 59 | + |
| 60 | +| Test | What it probes | Expected result | |
| 61 | +|------|---------------|-----------------| |
| 62 | +| Prompt injection in recommendation field | Model output manipulation | Score clamped, valid result | |
| 63 | +| Score out of range (1.5, -0.5) | Clamping enforcement | Clamped to [0.0, 1.0] | |
| 64 | +| Missing fields in model response | Default value safety | No crash, safe defaults applied | |
| 65 | +| Silence (zero-byte PCM buffer) | Edge case handling | `is_silent()` returns `True` | |
| 66 | +| Long-con script (friendly opener → wire transfer) | Multi-phase scam detection | `score ≥ 0.6` | |
| 67 | +| Pharmacy IVR (benign robocall) | False positive prevention | `verdict ≠ SCAM` | |
| 68 | + |
| 69 | +Run: `cd backend && pytest tests/test_adversarial.py -v` |
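As a rough illustration of what the clamping, missing-field, and silence rows probe, here is a minimal sketch of defensive output handling. It is not the code in `backend/`. The names `clamp_score` and `sanitize_result`, the default values, and the length cap are hypothetical. Only `is_silent()` is named in this document, and its signature, the 16-bit PCM format, and the RMS threshold are assumptions.

```python
import math
from array import array

MAX_RECOMMENDATION_CHARS = 500  # hypothetical cap on free-text model output


def clamp_score(value, default=0.0):
    """Coerce a model-supplied score to a float in [0.0, 1.0]."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return default  # missing or non-numeric score
    if math.isnan(score):
        return default
    return max(0.0, min(1.0, score))


def sanitize_result(raw: dict) -> dict:
    """Apply defaults and clamping so a malformed response cannot crash the caller."""
    recommendation = raw.get("recommendation")
    if not isinstance(recommendation, str):
        recommendation = ""
    return {
        "score": clamp_score(raw.get("score")),
        # The recommendation is kept as opaque display text and never parsed,
        # so instructions injected into it cannot alter the score.
        "recommendation": recommendation[:MAX_RECOMMENDATION_CHARS],
    }


def is_silent(pcm: bytes, rms_threshold: float = 100.0) -> bool:
    """Return True for an empty buffer or one whose RMS is below the threshold.

    Assumes 16-bit signed, native-endian, mono PCM.
    """
    usable = len(pcm) - (len(pcm) % 2)  # ignore a trailing odd byte
    if usable == 0:
        return True
    samples = array("h", pcm[:usable])
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < rms_threshold
```

Defaulting a missing score to `0.0` is a choice made only to keep this sketch short. A real system might prefer to surface an error or an "unknown" verdict rather than report a missing score as safe.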
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## Why Native Audio Matters for Adversarial Robustness |
| 74 | + |
| 75 | +The key finding across all adversarial tests: **acoustic delivery is harder to fake than word choice.** |
| 76 | + |
| 77 | +A scammer can rewrite their script to sound polite. They cannot easily: |
| 78 | +- Suppress the call-center background noise of a boiler room |
| 79 | +- Remove the flat, rehearsed cadence of a scripted pitch |
| 80 | +- Eliminate the TTS artifacts of a synthesized robocall voice |
| 81 | +- Change the rhythm of a pre-recorded IVR message |
| 82 | + |
| 83 | +Text-based models see only the words. Voxtral hears the room. |
| 84 | + |
| 85 | +--- |
| 86 | + |
| 87 | +*Full evaluation results: [docs/EVALUATION.md](docs/EVALUATION.md)* |
| 88 | +*Threat model and red-team mitigations: [docs/THREAT_MODEL.md](docs/THREAT_MODEL.md)* |