OASB

Evaluation Suite

OASB Eval

MITRE ATT&CK Evaluations, but for AI agent security products.

222 standardized attack scenarios across 15 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and other security products detect real AI agent attacks: process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.

Different tools, different jobs

OASB Eval vs HackMyAgent

OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.

Aspect | OASB Eval | HackMyAgent
Purpose | Evaluate security products | Pentest AI agents
Tests | "Does your EDR catch this?" | "Is your agent leaking?"
Audience | Security vendors, evaluators | Agent developers, red teams
Analogous to | MITRE ATT&CK Evaluations | OWASP ZAP / Burp Suite

222 attack scenarios

Test categories

Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.

Multi-step: 43
AI-layer: 40
Filesystem: 28
Intelligence: 21
Process: 19
Network: 18
Enforcement: 18
App hooks: 14
Baseline: 12
Real OS: 9

222 total scenarios across 10 categories
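
As a quick sanity check on the figures above, the per-category counts can be tallied in a few lines. This is only an illustrative sketch: the `categories` array is transcribed by hand from this page and is not read from the OASB repository.

```typescript
// Per-category scenario counts, transcribed from the list above.
// (Hand-copied for illustration; not loaded from the OASB repository.)
const categories: Array<[string, number]> = [
  ["Multi-step", 43],
  ["AI-layer", 40],
  ["Filesystem", 28],
  ["Intelligence", 21],
  ["Process", 19],
  ["Network", 18],
  ["Enforcement", 18],
  ["App hooks", 14],
  ["Baseline", 12],
  ["Real OS", 9],
];

// Sum the counts and report how many categories contribute to the total.
const totalScenarios = categories.reduce((sum, [, count]) => sum + count, 0);
const categoryTotal = categories.length;

console.log(`${totalScenarios} scenarios across ${categoryTotal} categories`);
```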

Framework alignment

MITRE ATLAS coverage

All 222 scenarios map to 15 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.

Technique | ATLAS ID | Description
Command and Scripting Interpreter | AML.T0050 | Child process spawns and suspicious binary execution (curl, wget, nc)
Escape to Host | AML.T0105 | Privilege escalation toward host-level access
Agentic Resource Consumption | AML.T0034.002 | CPU, connection-burst, and mass-file resource exhaustion
Exfiltration via Cyber Means | AML.T0025 | Outbound data exfiltration over network channels
Unsecured Credentials | AML.T0055 | Reading .ssh, .aws, and credential files (.npmrc, .netrc)
Data from Local System | AML.T0037 | Filesystem access outside the allowed scope
Modify AI Agent Configuration | AML.T0081 | Shell/config dotfile modification for persistence
LLM Prompt Injection | AML.T0051 | Crafted inputs that override or bypass system instructions
LLM Jailbreak | AML.T0054 | Bypassing model safety constraints and guardrails
LLM Data Leakage | AML.T0057 | Sensitive data disclosed in model output
AI Agent Tool Invocation | AML.T0053 | MCP tool-call abuse (path traversal, command injection, SSRF)
Impersonation | AML.T0073 | A2A identity spoofing and trust exploitation
Exfiltration via AI Agent Tool Invocation | AML.T0086 | Using the agent’s own tools as an exfiltration channel
Spamming AI System with Chaff Data | AML.T0046 | Noise floods that drain budget and evade detection
Evade AI Model | AML.T0015 | Behavioral adaptation over time to evade detection
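
To make the scenario-to-technique mapping concrete, here is a minimal sketch of how scenarios tagged with ATLAS IDs could be grouped and checked for coverage gaps. The `TaggedScenario` shape, the scenario names, and both helper functions are assumptions made for illustration; OASB's real scenario metadata may look different. Only the ATLAS IDs are taken from the table above.

```typescript
// Hypothetical scenario metadata; OASB's real scenario format may differ.
interface TaggedScenario {
  id: string;      // illustrative name, not a real OASB test ID
  atlasId: string; // ATLAS technique ID from the table above
}

// Group scenario IDs by the ATLAS technique they exercise.
function groupByTechnique(scenarios: TaggedScenario[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const s of scenarios) {
    const ids = groups.get(s.atlasId) ?? [];
    ids.push(s.id);
    groups.set(s.atlasId, ids);
  }
  return groups;
}

// Return the expected technique IDs that no scenario in the set covers.
function uncoveredTechniques(scenarios: TaggedScenario[], expected: string[]): string[] {
  const covered = new Set(scenarios.map((s) => s.atlasId));
  return expected.filter((id) => !covered.has(id));
}

// Made-up sample data for the sketch.
const sampleScenarios: TaggedScenario[] = [
  { id: "spawn-curl", atlasId: "AML.T0050" },
  { id: "spawn-netcat", atlasId: "AML.T0050" },
  { id: "prompt-override", atlasId: "AML.T0051" },
];

const grouped = groupByTechnique(sampleScenarios);
const missing = uncoveredTechniques(sampleScenarios, ["AML.T0050", "AML.T0051", "AML.T0055"]);

console.log(grouped.get("AML.T0050")); // scenario IDs tagged with AML.T0050
console.log(missing);                  // expected techniques with no scenario
```

A check like this is useful when running only a subset of the suite, because it shows which techniques that subset leaves untested.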

Get started

Run the evaluation

Clone the repository, install dependencies, and run the full test suite against your security product.

# Clone and install

git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install

# Run the full evaluation suite

npm test

# Run a specific category

npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"

Scorecard

Product comparison

Same 222 tests, different products. Implement the SecurityProductAdapter interface and run the benchmark against your own product.
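
The actual SecurityProductAdapter interface is defined in the OASB repository and is not reproduced on this page. The sketch below only illustrates the general idea of an adapter: a thin wrapper that tells the benchmark which surfaces a product handles and forwards payloads to it. Every name here (`ExampleAdapter`, `ExampleSurface`, `declaredSurfaces`, `scan`, `toyPromptScanner`) is hypothetical, and the regex check stands in for a real product call.

```typescript
// Hypothetical shapes for illustration only. See the OASB repository for the
// real SecurityProductAdapter definition.
type ExampleSurface = "prompt-input" | "prompt-output" | "mcp-tool-call" | "a2a-message";

interface ExampleScanResult {
  detected: boolean;
  reason?: string;
}

interface ExampleAdapter {
  name: string;
  declaredSurfaces: ExampleSurface[]; // surfaces the product claims to handle
  scan(surface: ExampleSurface, payload: string): ExampleScanResult;
}

// A toy prompt-only product: it inspects prompt input with a single regex and
// reports nothing for any other surface.
const toyPromptScanner: ExampleAdapter = {
  name: "toy-prompt-scanner",
  declaredSurfaces: ["prompt-input"],
  scan(surface, payload) {
    if (surface !== "prompt-input") {
      return { detected: false };
    }
    const suspicious = /ignore (all )?previous instructions/i.test(payload);
    return suspicious
      ? { detected: true, reason: "instruction-override phrase" }
      : { detected: false };
  },
};

console.log(toyPromptScanner.scan("prompt-input", "Please ignore previous instructions."));
console.log(toyPromptScanner.scan("prompt-input", "Summarize this article."));
```

Declaring surfaces explicitly is what would let a benchmark separate "not applicable" from "missed", as the scorecard below does.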

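Surface | arp-guard (reference) | llm-guard
Prompt input scanning | Declared | Declared
Prompt output scanning | Declared | N/A
MCP tool-call scanning | Declared | N/A
A2A message scanning | Declared | N/A
Pattern scanning | Declared | Declared
Process / network / filesystem monitoring | Declared | N/A
Anomaly detection, budget, enforcement | Declared | N/A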

OASB scores any surface that a product does not declare as N/A rather than as a failure, so a prompt-only scanner is not penalized for lacking filesystem monitoring. arp-guard, the reference adapter, covers every surface and passes all 222 scenarios. llm-guard is a prompt-only scanner, so its other surfaces are N/A. For a neutral cross-product detection comparison, run the verdict-based corpus benchmark rather than comparing pass counts across different capability sets.
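
The N/A rule described above can be sketched as a small scoring function. This is an assumption-laden illustration, not OASB's scoring code: `ScenarioOutcome`, `scoreScenario`, `summarize`, and the surface labels are all hypothetical names.

```typescript
type Verdict = "pass" | "fail" | "n/a";

// Hypothetical record of one scenario run against one product.
interface ScenarioOutcome {
  surface: string;   // surface the scenario exercises (illustrative labels)
  detected: boolean; // whether the product flagged the attack
}

// A scenario on an undeclared surface is N/A. Otherwise it passes or fails
// depending on whether the attack was detected.
function scoreScenario(declared: Set<string>, outcome: ScenarioOutcome): Verdict {
  if (!declared.has(outcome.surface)) {
    return "n/a";
  }
  return outcome.detected ? "pass" : "fail";
}

// Tally the verdicts for a whole run.
function summarize(declared: string[], outcomes: ScenarioOutcome[]): Record<Verdict, number> {
  const declaredSet = new Set(declared);
  const tally: Record<Verdict, number> = { pass: 0, fail: 0, "n/a": 0 };
  for (const outcome of outcomes) {
    tally[scoreScenario(declaredSet, outcome)] += 1;
  }
  return tally;
}

// A prompt-only product: the filesystem scenario is N/A, not a failure.
const summary = summarize(
  ["prompt-input"],
  [
    { surface: "prompt-input", detected: true },
    { surface: "prompt-input", detected: false },
    { surface: "filesystem", detected: false },
  ],
);

console.log(summary);
```

Because N/A scenarios fall out of both the pass and fail counts, raw pass totals from products with different declared surfaces are not directly comparable.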

Transparency

Known detection gaps

No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.

Multi-step campaigns

Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.
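
One way to see why step-level checks fall short is to correlate events per agent over a time window. The sketch below flags an ordered chain of individually benign steps. The `AgentEvent` shape, the event labels, and `matchesSequence` are hypothetical; they illustrate the detection problem and do not describe how OASB or any specific product works.

```typescript
// Hypothetical event record emitted by a runtime monitor.
interface AgentEvent {
  agentId: string;
  kind: string;   // illustrative labels such as "read-credentials" or "archive"
  timeMs: number; // timestamp in milliseconds
}

// Returns true if `pattern` occurs in order for one agent, with every step
// falling within `windowMs` of the first matching step. Unrelated events may
// be interleaved between the steps.
function matchesSequence(
  events: AgentEvent[],
  agentId: string,
  pattern: string[],
  windowMs: number,
): boolean {
  const own = events
    .filter((e) => e.agentId === agentId)
    .sort((a, b) => a.timeMs - b.timeMs);

  for (let start = 0; start < own.length; start++) {
    if (own[start].kind !== pattern[0]) continue;
    let next = 1; // index of the next pattern step still to be matched
    for (let i = start + 1; i < own.length && next < pattern.length; i++) {
      if (own[i].timeMs - own[start].timeMs > windowMs) break;
      if (own[i].kind === pattern[next]) next++;
    }
    if (next === pattern.length) return true;
  }
  return false;
}

// Each event looks harmless on its own; the ordered chain is what matters.
const observed: AgentEvent[] = [
  { agentId: "a1", kind: "read-credentials", timeMs: 0 },
  { agentId: "a1", kind: "list-directory", timeMs: 500 },
  { agentId: "a1", kind: "archive", timeMs: 1000 },
  { agentId: "a2", kind: "outbound-connect", timeMs: 1500 },
  { agentId: "a1", kind: "outbound-connect", timeMs: 2000 },
];
const chain = ["read-credentials", "archive", "outbound-connect"];

const flaggedWide = matchesSequence(observed, "a1", chain, 5000);
const flaggedNarrow = matchesSequence(observed, "a1", chain, 1500);
console.log(flaggedWide, flaggedNarrow);
```

The window size is the main trade-off: a short window misses slow campaigns, while a long one raises the chance of matching unrelated activity.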

Application-level hooks

Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.

Encrypted exfiltration

Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.

Living-off-the-land

Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.

Run the evaluation

Test your security product against 222 attack scenarios. View the repository for setup instructions.

$ git clone ... && npm test