OASB Skills Security Benchmark

A ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. The HMA full pipeline scores 82.9% F1 (82.6% recall, 1.16% FPR).

Last updated: June 5, 2026 | Build: hackmyagent 0.23.8 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)

Labeled Samples: 4,245
Full Pipeline F1: 82.9%
Recall: 82.6%
Precision: 83.2%
False Positive Rate: 1.16%

Verdict: attack findings, not posture

A sample is flagged malicious when the scanner reports at least one high or critical attack finding. Posture findings, which fire on benign and malicious artifacts alike, are excluded from the verdict, although the scanner still surfaces them to users. Two kinds of posture finding are excluded: missing prompt/governance defenses, and "allowedTools": ["*"] wildcard tool access. More than 2,900 benign registry MCP servers declare the wildcard, so it is a least-privilege posture issue rather than evidence of malice, and it receives the same treatment already applied to missing-governance checks.

Read recall alongside F1: the pipeline favors precision, so recall (82.6%) is the honest measure of coverage, and the per-category table below shows where it is strong and where it is weak.

Two earlier figures are withdrawn: the previously shown 82.1% F1 / 1.26% FPR, which was a skill-routing artifact that bypassed the MCP analyzers, and the older 89.2% figure.
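The verdict rule can be sketched in a few lines. This is a minimal illustration, not HMA's implementation: the `Finding` shape, its field names, and the rule identifiers are all hypothetical, chosen only to show how attack findings drive the verdict while posture findings do not.

```python
from dataclasses import dataclass

# Hypothetical finding shape. Field names and rule ids are assumptions
# made for illustration; they are not taken from the HMA schema.
@dataclass(frozen=True)
class Finding:
    rule: str       # illustrative rule identifier
    severity: str   # "low", "medium", "high", or "critical"
    kind: str       # "attack" or "posture"

VERDICT_SEVERITIES = {"high", "critical"}

def verdict(findings):
    """Return "malicious" if any attack finding is high or critical."""
    for finding in findings:
        if finding.kind == "attack" and finding.severity in VERDICT_SEVERITIES:
            return "malicious"
    return "benign"

# A wildcard tool grant is a posture finding: reported, but never decisive.
wildcard = Finding("wildcard-allowed-tools", "high", "posture")
exfil = Finding("credential-exfiltration", "critical", "attack")
minor = Finding("suspicious-url", "medium", "attack")

only_posture = verdict([wildcard])
with_attack = verdict([wildcard, exfil])
low_severity_attack = verdict([wildcard, minor])
```

Under these assumptions, a sample carrying only the wildcard finding stays benign, while the same sample with a critical attack finding is flagged.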

Why this benchmark exists

Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.

OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.

Scanner Leaderboard

#	Scanner	Tier	F1	Precision	Recall	FPR	Flag Rate	Categories
1	HMA Full Pipeline v0.23.8 - AST compilation + 6 analyzers + NanoMind (shipped scanner)	GOLD	82.9%	83.2%	82.6%	1.16%	6.3%	9/9
2	HMA Static Patterns v0.23.8 - Regex-only, no NanoMind	SILVER	67.5%	99.3%	51.1%	0.03%	3.6%	9/9
3	NanoMind TME v0.5.0 (ablation) - Model-only ablation, not a scanner verdict	LISTED	14.0%	7.5%	93.0%	79.18%	79.8%	9/9

The NanoMind TME row is a model-only ablation, not a scanner verdict. On its own, the classifier currently under-performs on this corpus because of a whitespace-vocabulary limitation on code and skill text, which is why its standalone false positive rate is high. A code- and text-aware classifier is in progress.

Per-Category Detection (HMA Full Pipeline)

30 malicious samples per category. Recall, sorted high to low.

Category	Recall
Credential Exfiltration	96.7%
Unicode Steganography	96.7%
Social Engineering	96.7%
Data Exfiltration	96.7%
Persistence	90.0%
Prompt Injection	86.7%
Heartbeat RCE	70.0%
Privilege Escalation	63.3%
Supply Chain	46.7%
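As a worked check, the per-category recalls above can be converted back to detected counts and pooled. The sketch below assumes only what the table states (30 malicious samples per category); the rounding step is an assumption that each percentage is a count out of 30 rounded to one decimal.

```python
SAMPLES_PER_CATEGORY = 30  # stated above the table

# Recall percentages copied from the per-category table.
recall_pct = {
    "Credential Exfiltration": 96.7,
    "Unicode Steganography": 96.7,
    "Social Engineering": 96.7,
    "Data Exfiltration": 96.7,
    "Persistence": 90.0,
    "Prompt Injection": 86.7,
    "Heartbeat RCE": 70.0,
    "Privilege Escalation": 63.3,
    "Supply Chain": 46.7,
}

# Assumption: each percentage is (detected / 30) rounded to one decimal,
# so rounding the product recovers the integer count.
detected = {
    name: round(pct / 100 * SAMPLES_PER_CATEGORY)
    for name, pct in recall_pct.items()
}

total_malicious = SAMPLES_PER_CATEGORY * len(recall_pct)
total_detected = sum(detected.values())
pooled_recall = total_detected / total_malicious

for name, count in detected.items():
    missed = SAMPLES_PER_CATEGORY - count
    print(f"{name}: {count}/{SAMPLES_PER_CATEGORY} detected, {missed} missed")
print(f"pooled recall: {pooled_recall:.1%}")
```

The pooled value agrees with the 82.6% headline recall, and the per-category misses make the weak spots concrete: Supply Chain alone accounts for 16 of the missed samples.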

DVAA Ground-Truth Validation

The Damn Vulnerable AI Agent scenarios provide a second ground-truth set of intentionally vulnerable agents, each with a known attack type.

Across the full DVAA scenario repo (86 scenarios, real attack files), the structural pipeline detects 29.1% under the same verdict (a high or critical attack finding). On the config-structural DVAA samples carried in the corpus (91 samples) it reaches 81.3%. The gap is the honest picture: the structural analyzers catch config-encoded attacks (self-escalation, control-bypass and credential-harvest directives) but miss most behavioral and natural-language attacks, which depend on the semantic layer. Lead with the 29.1% full-repo figure when characterizing DVAA detection.

Industry Comparison

Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).

Scanner	Platform	Flag Rate	Precision	Recall
HMA Static	OASB v2	3.6%	99.3%	51.1%
HMA Full Pipeline	OASB v2	6.3%	83.2%	82.6%
Socket	Skills.sh	3.8%	--	--
Snyk	Skills.sh	7.7%	--	--
agent-trust-hub	Skills.sh	13.8%	--	--
Cisco Skill Scanner	Skills.sh	14.0%	--	--
Cisco Skill Scanner	ClawHub	16.7%	--	--
GPT 5.3-based LLM	Skills.sh	27.3%	--	--
VirusTotal	ClawHub	36.2%	--	--
GPT 5.3-based LLM	ClawHub	38.8%	--	--
OpenClaw Scanner	ClawHub	41.9%	--	--

The paper's scanners were tested on 238K real marketplace skills with no labels; the HMA scanners were tested on the OASB v2 corpus (4,245 labeled samples). Only flag rates can be compared, because precision and recall require ground truth.

Scoring Tiers

Tier	F1 Score	False Positive Rate	Category Coverage	Kappa vs HMA
Platinum	>=90%	<=5%	9/9	>=0.85
Gold	>=80%	<=10%	>=7/9	--
Silver	>=65%	<=20%	>=5/9	--
Listed	Any	Any	Any	--
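The tier table can be read as a cascade of threshold checks. The sketch below is one possible reading, not the benchmark's scoring code: it assumes a scanner receives the highest tier whose conditions it meets, and that a scanner with no kappa value cannot reach Platinum. The function name and parameters are hypothetical.

```python
from typing import Optional

def assign_tier(f1: float, fpr: float, categories: int,
                kappa: Optional[float] = None) -> str:
    """Assign a tier from metrics expressed as fractions (0.829, not 82.9).

    Assumption: tiers are checked from highest to lowest and the first
    match wins; a missing kappa disqualifies Platinum only.
    """
    if (f1 >= 0.90 and fpr <= 0.05 and categories == 9
            and kappa is not None and kappa >= 0.85):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories >= 5:
        return "Silver"
    return "Listed"

# Leaderboard rows, using the published F1, FPR, and category coverage.
full_pipeline = assign_tier(f1=0.829, fpr=0.0116, categories=9)
static_patterns = assign_tier(f1=0.675, fpr=0.0003, categories=9)
model_ablation = assign_tier(f1=0.140, fpr=0.7918, categories=9)
```

With the leaderboard values, this reading reproduces the published tiers: Gold for the full pipeline, Silver for static patterns, and Listed for the model-only ablation.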

Methodology

Dataset Composition

270 malicious samples (30 per category)
3,881 benign samples from real registries
94 edge cases (security tools, defensive configs)
Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
225 registry metadata-flagged stubs excluded (no malicious content)

Scoring

Binary detection: malicious/benign verdict per sample
Category assignment: 9 attack categories for malicious verdicts
Metrics: pooled F1, precision, recall over all samples (micro-average)
FPR: false positives / (false positives + true negatives)
Edge case samples excluded from scoring
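The scoring rules above can be sketched as follows. The function names and the `(label, verdict)` pair format are hypothetical. The confusion counts at the end are not published figures: they are back-calculated from the published rates and the dataset composition (270 malicious, 3,881 benign), so treat them as an assumption used only to check that the headline metrics are mutually consistent.

```python
def confusion_counts(samples):
    """Count TP/FP/FN/TN from (label, verdict) pairs.

    label is "malicious", "benign", or "edge"; edge cases are skipped,
    matching the rule that they are excluded from scoring.
    """
    tp = fp = fn = tn = 0
    for label, verdict in samples:
        if label == "edge":
            continue
        flagged = verdict == "malicious"
        if label == "malicious":
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp, fp, fn, tn

def pooled_metrics(tp, fp, fn, tn):
    """Pooled (micro-averaged) precision, recall, F1, and FPR."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Toy run: the edge-case sample does not affect any count.
toy = [("malicious", "malicious"), ("malicious", "benign"),
       ("benign", "benign"), ("benign", "malicious"), ("edge", "malicious")]
toy_counts = confusion_counts(toy)

# Back-calculated counts (assumption, not published): 223 of 270 malicious
# samples detected, 45 of 3,881 benign samples flagged.
full = pooled_metrics(tp=223, fp=45, fn=47, tn=3836)
print({name: f"{value:.2%}" for name, value in full.items()})
```

Under those back-calculated counts, the four metrics round to the published 83.2% precision, 82.6% recall, 82.9% F1, and 1.16% FPR.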

Submit Your Scanner

The benchmark is open. Any scanner can submit results for evaluation. The methodology is the authority, not a gating decision.

Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.

POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
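A submission body in this shape could be assembled as sketched below. The helper names are hypothetical, and several details are assumptions because the text above does not specify them: that benign verdicts use the string "benign" and omit "category", that no authentication header is needed, and what the response looks like. The sample id "b001" is illustrative only. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Endpoint as given in the text above.
SUBMIT_URL = "https://api.oa2a.org/api/v1/benchmark/submit"

def build_submission(scanner_id, scanner_name, scanner_version,
                     dataset_version, results):
    """Build the JSON body from (sample_id, verdict, category) tuples.

    Assumption: category is sent only for malicious verdicts.
    """
    entries = []
    for sample_id, verdict, category in results:
        if verdict not in ("malicious", "benign"):
            raise ValueError(f"unexpected verdict for {sample_id}: {verdict}")
        entry = {"sampleId": sample_id, "verdict": verdict}
        if category is not None:
            entry["category"] = category
        entries.append(entry)
    return {
        "scannerId": scanner_id,
        "scannerName": scanner_name,
        "scannerVersion": scanner_version,
        "datasetVersion": dataset_version,
        "results": entries,
    }

def make_request(payload):
    """Wrap the payload in a POST request object (not sent here)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SUBMIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

payload = build_submission(
    "your-scanner-id", "Your Scanner", "1.0.0", "v2.0",
    [("m001", "malicious", "supply_chain"),
     ("b001", "benign", None)],  # "b001" is a made-up id for illustration
)
request = make_request(payload)
# To send: urllib.request.urlopen(request). The response format and any
# authentication requirements are not described here, so check the API
# documentation before relying on this.
```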

References

Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
OASB benchmark code and dataset: github.com/opena2a-org/oasb
DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent