Evaluation Suite
OASB Eval
MITRE ATT&CK Evaluations, but for AI agent security products.
222 standardized attack scenarios across 15 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and security products detect real AI agent attacks: process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.
Different tools, different jobs
OASB Eval vs HackMyAgent
OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.
| | OASB Eval | HackMyAgent |
|---|---|---|
| Purpose | Evaluate security products | Pentest AI agents |
| Tests | "Does your EDR catch this?" | "Is your agent leaking?" |
| Audience | Security vendors, evaluators | Agent developers, red teams |
| Analogous to | MITRE ATT&CK Evaluations | OWASP ZAP / Burp Suite |
222 attack scenarios
Test categories
Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.
| Category | Scenarios |
|---|---|
| Multi-step | 43 |
| AI-layer | 40 |
| Filesystem | 28 |
| Intelligence | 21 |
| Process | 19 |
| Network | 18 |
| Enforcement | 18 |
| App hooks | 14 |
| Baseline | 12 |
| Real OS | 9 |
222 total scenarios across 10 categories
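As a quick sanity check on the figures above, the per-category counts can be summed to confirm that they add up to 222. The snippet below is only an illustrative tally of the numbers listed on this page; the variable names are made up for the example and are not part of OASB.

```typescript
// Scenario counts per category, copied from the list above.
const categories: Array<[string, number]> = [
  ["Multi-step", 43],
  ["AI-layer", 40],
  ["Filesystem", 28],
  ["Intelligence", 21],
  ["Process", 19],
  ["Network", 18],
  ["Enforcement", 18],
  ["App hooks", 14],
  ["Baseline", 12],
  ["Real OS", 9],
];

// Sum of all category counts.
const total = categories.reduce((sum, [, count]) => sum + count, 0);

// Share of the suite per category, as a percentage rounded to one decimal.
const shares = categories.map(([name, count]) => ({
  name,
  count,
  percent: Math.round((count / total) * 1000) / 10,
}));

console.log(`total scenarios: ${total}`);
for (const share of shares) {
  console.log(`${share.name}: ${share.count} (${share.percent}%)`);
}
```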
Framework alignment
MITRE ATLAS coverage
All 222 scenarios map to 15 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.
| Technique | ATLAS ID | Description |
|---|---|---|
| Command and Scripting Interpreter | AML.T0050 | Child process spawns and suspicious binary execution (curl, wget, nc) |
| Escape to Host | AML.T0105 | Privilege escalation toward host-level access |
| Agentic Resource Consumption | AML.T0034.002 | CPU, connection-burst, and mass-file resource exhaustion |
| Exfiltration via Cyber Means | AML.T0025 | Outbound data exfiltration over network channels |
| Unsecured Credentials | AML.T0055 | Reading .ssh, .aws, and credential files (.npmrc, .netrc) |
| Data from Local System | AML.T0037 | Filesystem access outside the allowed scope |
| Modify AI Agent Configuration | AML.T0081 | Shell/config dotfile modification for persistence |
| LLM Prompt Injection | AML.T0051 | Crafted inputs that override or bypass system instructions |
| LLM Jailbreak | AML.T0054 | Bypassing model safety constraints and guardrails |
| LLM Data Leakage | AML.T0057 | Sensitive data disclosed in model output |
| AI Agent Tool Invocation | AML.T0053 | MCP tool-call abuse (path traversal, command injection, SSRF) |
| Impersonation | AML.T0073 | A2A identity spoofing and trust exploitation |
| Exfiltration via AI Agent Tool Invocation | AML.T0086 | Using the agent’s own tools as an exfiltration channel |
| Spamming AI System with Chaff Data | AML.T0046 | Noise floods that drain budget and evade detection |
| Evade AI Model | AML.T0015 | Behavioral adaptation over time to evade detection |
Get started
Run the evaluation
Clone the repository, install dependencies, and run the full test suite against your security product.
# Clone and install
git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install
# Run the full evaluation suite
npm test
# Run a specific category
npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"
Scorecard
Product comparison
Same 222 tests, different products. Implement the SecurityProductAdapter interface and run the benchmark against your own product.
| Surface | arp-guard (reference) | llm-guard |
|---|---|---|
| Prompt input scanning | ✓ | ✓ |
| Prompt output scanning | ✓ | N/A |
| MCP tool-call scanning | ✓ | N/A |
| A2A message scanning | ✓ | N/A |
| Pattern scanning | ✓ | ✓ |
| Process / network / filesystem monitoring | ✓ | N/A |
| Anomaly detection, budget, enforcement | ✓ | N/A |
OASB scores a surface that a product does not declare as N/A, not as a failure, so a prompt-only scanner is not penalized for lacking filesystem monitoring. arp-guard, the reference adapter, covers every surface and passes all 222 scenarios. llm-guard is a prompt-only scanner, so its other surfaces are N/A. For a neutral cross-product detection comparison, run the verdict-based corpus benchmark rather than comparing pass counts across different capability sets.
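The N/A rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's code: `SketchAdapter`, `Scenario`, the surface names, and the scenario IDs are all hypothetical, and the real `SecurityProductAdapter` interface may look quite different.

```typescript
// Hypothetical surface names, loosely following the scorecard rows above.
type Surface =
  | "prompt-input"
  | "prompt-output"
  | "mcp-tool-call"
  | "a2a-message"
  | "os-monitoring";

type Verdict = "pass" | "fail" | "n/a";

interface Scenario {
  id: string;
  surface: Surface;
}

// Illustrative only: NOT the real SecurityProductAdapter interface.
interface SketchAdapter {
  name: string;
  surfaces: ReadonlyArray<Surface>; // surfaces the product declares
  detects(scenario: Scenario): boolean; // did the product flag the attack?
}

// A scenario on an undeclared surface is N/A, never a failure.
function scoreScenario(adapter: SketchAdapter, scenario: Scenario): Verdict {
  if (adapter.surfaces.indexOf(scenario.surface) === -1) return "n/a";
  return adapter.detects(scenario) ? "pass" : "fail";
}

// Tally verdicts across a list of scenarios.
function scorecard(
  adapter: SketchAdapter,
  scenarios: Scenario[],
): Record<Verdict, number> {
  const tally: Record<Verdict, number> = { pass: 0, fail: 0, "n/a": 0 };
  for (const scenario of scenarios) {
    tally[scoreScenario(adapter, scenario)] += 1;
  }
  return tally;
}

// A made-up prompt-only scanner that misses one injection scenario.
const promptOnly: SketchAdapter = {
  name: "prompt-only-example",
  surfaces: ["prompt-input"],
  detects: (scenario) => scenario.id !== "inj-002",
};

const sampleScenarios: Scenario[] = [
  { id: "inj-001", surface: "prompt-input" },
  { id: "inj-002", surface: "prompt-input" },
  { id: "fs-001", surface: "os-monitoring" },
];

const result = scorecard(promptOnly, sampleScenarios);
console.log(result);
```

In this sketch the filesystem scenario lands in the N/A bucket rather than counting against the prompt-only scanner, which is also why raw pass counts are hard to compare between products that declare different surfaces.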
Transparency
Known detection gaps
No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.
Multi-step campaigns
Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.
Application-level hooks
Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.
Encrypted exfiltration
Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.
Living-off-the-land
Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.
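To make the multi-step gap concrete, the sketch below shows one way a monitor might correlate individually benign events into a suspicious sequence. It is an assumption-laden illustration, not how OASB or any listed product works: the `AgentEvent` shape, the three-step chain, and the time window are all hypothetical.

```typescript
// Hypothetical event shape; a real monitor would emit richer records.
interface AgentEvent {
  timestampMs: number;
  kind: "file-read" | "process-spawn" | "network-connect";
  detail: string;
}

// An ordered chain of event kinds that is suspicious only as a whole.
const CHAIN: Array<AgentEvent["kind"]> = [
  "file-read",
  "process-spawn",
  "network-connect",
];

// Returns true when the events contain CHAIN as an in-order subsequence
// whose steps all fall within windowMs of the first step.
function containsChain(events: AgentEvent[], windowMs: number): boolean {
  const sorted = [...events].sort((a, b) => a.timestampMs - b.timestampMs);
  return sorted.some((first, startIdx) => {
    if (first.kind !== CHAIN[0]) return false;
    let next = 1; // index of the next chain step to match
    for (const event of sorted.slice(startIdx + 1)) {
      if (event.timestampMs - first.timestampMs > windowMs) break;
      if (event.kind === CHAIN[next]) next += 1;
      if (next === CHAIN.length) return true;
    }
    return false;
  });
}

// Each event looks ordinary on its own; together they resemble exfiltration.
const sampleEvents: AgentEvent[] = [
  { timestampMs: 0, kind: "file-read", detail: "~/.ssh/id_rsa" },
  { timestampMs: 500, kind: "process-spawn", detail: "curl" },
  { timestampMs: 900, kind: "network-connect", detail: "outbound HTTPS" },
];

console.log(containsChain(sampleEvents, 1000)); // chain completes inside the window
console.log(containsChain(sampleEvents, 800)); // last step falls outside the window
```

A detector that inspects one event at a time has nothing to flag here, while a windowed correlation sees the whole chain. The trade-off is that an attacker can slow the sequence down to fall outside the window, which is part of why this category remains a known gap.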
Run the evaluation
Test your security product against 222 attack scenarios. View the repository for setup instructions.
git clone ... && npm test