Evaluation Suite

OASB Eval

MITRE ATT&CK Evaluations, but for AI agent security products.

222 standardized attack scenarios across 15 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and other security products detect real AI agent attacks: process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.


Different tools, different jobs

OASB Eval vs HackMyAgent

OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.

	OASB Eval	HackMyAgent
Purpose	Evaluate security products	Pentest AI agents
Tests	"Does your EDR catch this?"	"Is your agent leaking?"
Audience	Security vendors, evaluators	Agent developers, red teams
Analogous to	MITRE ATT&CK Evaluations	OWASP ZAP / Burp Suite

222 attack scenarios

Test categories

Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.

Categories: Multi-step, AI-layer, Filesystem, Intelligence, Process, Network, Enforcement, App hooks, Baseline, Real OS

222 total scenarios across 10 categories

Framework alignment

MITRE ATLAS coverage

All 222 scenarios map to 15 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.

Technique	ATLAS ID	Description
Command and Scripting Interpreter	AML.T0050	Child process spawns and suspicious binary execution (curl, wget, nc)
Escape to Host	AML.T0105	Privilege escalation toward host-level access
Agentic Resource Consumption	AML.T0034.002	CPU, connection-burst, and mass-file resource exhaustion
Exfiltration via Cyber Means	AML.T0025	Outbound data exfiltration over network channels
Unsecured Credentials	AML.T0055	Reading .ssh, .aws, and credential files (.npmrc, .netrc)
Data from Local System	AML.T0037	Filesystem access outside the allowed scope
Modify AI Agent Configuration	AML.T0081	Shell/config dotfile modification for persistence
LLM Prompt Injection	AML.T0051	Crafted inputs that override or bypass system instructions
LLM Jailbreak	AML.T0054	Bypassing model safety constraints and guardrails
LLM Data Leakage	AML.T0057	Sensitive data disclosed in model output
AI Agent Tool Invocation	AML.T0053	MCP tool-call abuse (path traversal, command injection, SSRF)
Impersonation	AML.T0073	A2A identity spoofing and trust exploitation
Exfiltration via AI Agent Tool Invocation	AML.T0086	Using the agent’s own tools as an exfiltration channel
Spamming AI System with Chaff Data	AML.T0046	Noise floods that drain budget and evade detection
Evade AI Model	AML.T0015	Behavioral adaptation over time to evade detection
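As a rough illustration of what mapping scenarios to ATLAS techniques allows, the sketch below tallies scenarios per technique ID. The `Scenario` shape and the sample scenario ids are hypothetical and are not taken from the OASB repository; only the ATLAS IDs come from the table above.

```typescript
// Hypothetical scenario record. Field names and ids are assumptions for
// illustration, not the repository's actual data model.
interface Scenario {
  id: string;
  category: string;
  atlasId: string; // ATLAS technique ID, e.g. "AML.T0050"
}

// Count how many scenarios map to each ATLAS technique ID.
function coverageByTechnique(scenarios: Scenario[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const scenario of scenarios) {
    counts.set(scenario.atlasId, (counts.get(scenario.atlasId) ?? 0) + 1);
  }
  return counts;
}

// Three made-up scenarios: two process spawns and one credential read.
const sample: Scenario[] = [
  { id: "example-spawn-curl", category: "Process", atlasId: "AML.T0050" },
  { id: "example-spawn-nc", category: "Process", atlasId: "AML.T0050" },
  { id: "example-read-ssh-key", category: "Filesystem", atlasId: "AML.T0055" },
];

const coverage = coverageByTechnique(sample);
```

A tally like this makes it easy to spot techniques with thin coverage before comparing products.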

Get started

Run the evaluation

Clone the repository, install dependencies, and run the full test suite against your security product.

# Clone and install

git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install

# Run the full evaluation suite

npm test

# Run a specific category

npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"

Scorecard

Product comparison

Same 222 tests, different products. Implement the SecurityProductAdapter interface and run the benchmark against your own product.

Surface	arp-guard (reference)	llm-guard
Prompt input scanning	✓	✓
Prompt output scanning	✓	N/A
MCP tool-call scanning	✓	N/A
A2A message scanning	✓	N/A
Pattern scanning	✓	✓
Process / network / filesystem monitoring	✓	N/A
Anomaly detection, budget, enforcement	✓	N/A

OASB scores any surface that a product does not declare as N/A rather than as a failure, so a prompt-only scanner is not penalized for lacking filesystem monitoring. arp-guard, the reference adapter, covers every surface and passes all 222 scenarios. llm-guard is a prompt-only scanner, so its other surfaces are N/A. For a neutral cross-product detection comparison, run the verdict-based corpus benchmark rather than comparing pass counts across different capability sets.
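The N/A rule can be sketched in a few lines. This is an illustration only: `AdapterSketch`, `declaredSurfaces`, `detects`, and the surface names are hypothetical, and the real SecurityProductAdapter interface in the OASB repository may look quite different.

```typescript
// Illustrative surface names; the real identifiers are an assumption here.
type Surface = "prompt-input" | "prompt-output" | "mcp-tool-call" | "process";
type Verdict = "pass" | "fail" | "n/a";

// Hypothetical adapter shape, NOT the actual SecurityProductAdapter interface.
interface AdapterSketch {
  name: string;
  declaredSurfaces: ReadonlySet<Surface>;
  detects(scenarioId: string): boolean;
}

interface ScenarioSketch {
  id: string;
  surface: Surface;
}

// A scenario on an undeclared surface is scored N/A, never as a failure.
function scoreScenario(adapter: AdapterSketch, scenario: ScenarioSketch): Verdict {
  if (!adapter.declaredSurfaces.has(scenario.surface)) {
    return "n/a";
  }
  return adapter.detects(scenario.id) ? "pass" : "fail";
}

// Tally verdicts across a list of scenarios.
function summarize(adapter: AdapterSketch, scenarios: ScenarioSketch[]): Record<Verdict, number> {
  const counts: Record<Verdict, number> = { pass: 0, fail: 0, "n/a": 0 };
  for (const scenario of scenarios) {
    counts[scoreScenario(adapter, scenario)] += 1;
  }
  return counts;
}

// A made-up prompt-only scanner that only recognizes ids starting with "prompt-".
const promptOnlyScanner: AdapterSketch = {
  name: "prompt-only-example",
  declaredSurfaces: new Set<Surface>(["prompt-input"]),
  detects: (scenarioId) => scenarioId.startsWith("prompt-"),
};

const totals = summarize(promptOnlyScanner, [
  { id: "prompt-injection-example", surface: "prompt-input" },
  { id: "ignored-instruction-example", surface: "prompt-input" },
  { id: "spawn-curl-example", surface: "process" },
]);
```

In this sketch the process scenario lands in the N/A bucket, so it neither helps nor hurts the prompt-only scanner, which is why pass counts alone are not comparable across products with different declared surfaces.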

Transparency

Known detection gaps

No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.

Multi-step campaigns

Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.
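To make the difficulty concrete, here is a minimal sketch of step correlation: neither a file read nor an outbound connection is alarming alone, but a credential-file read followed shortly by a connection may be. The `AgentEvent` shape, the time window, and the path pattern are assumptions chosen for illustration and do not describe how any particular product or OASB itself detects campaigns.

```typescript
// Hypothetical event record emitted by a runtime monitor.
interface AgentEvent {
  timestampMs: number;
  kind: "file-read" | "net-connect" | "process-spawn";
  target: string;
}

// Paths treated as credential material in this sketch.
const SENSITIVE_PATH = /\.(ssh|aws|npmrc|netrc)\b/;

// Return every (read, connect) pair where a sensitive file read is followed
// by an outbound connection within windowMs milliseconds.
function findReadThenConnect(
  events: AgentEvent[],
  windowMs: number,
): Array<[AgentEvent, AgentEvent]> {
  const sorted = [...events].sort((a, b) => a.timestampMs - b.timestampMs);
  const hits: Array<[AgentEvent, AgentEvent]> = [];
  sorted.forEach((read, index) => {
    if (read.kind !== "file-read" || !SENSITIVE_PATH.test(read.target)) {
      return;
    }
    for (const later of sorted.slice(index + 1)) {
      if (later.timestampMs - read.timestampMs > windowMs) {
        break; // events are sorted, so nothing further can be in the window
      }
      if (later.kind === "net-connect") {
        hits.push([read, later]);
      }
    }
  });
  return hits;
}

// Made-up trace: one harmless read, one key read, two connections.
const observed: AgentEvent[] = [
  { timestampMs: 500, kind: "file-read", target: "/tmp/notes.txt" },
  { timestampMs: 1000, kind: "file-read", target: "/home/user/.ssh/id_rsa" },
  { timestampMs: 3000, kind: "net-connect", target: "example.com:443" },
  { timestampMs: 100000, kind: "net-connect", target: "example.com:443" },
];

const suspiciousPairs = findReadThenConnect(observed, 5000);
```

A monitor that inspects each event in isolation would see nothing unusual in this trace; only the pairing within the window surfaces the sequence.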

Application-level hooks

Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.

Encrypted exfiltration

Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.

Living-off-the-land

Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.

Run the evaluation

Test your security product against 222 attack scenarios. View the repository for setup instructions.

$ git clone ... && npm test