OASB Skills Security Benchmark
A ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. The HMA full pipeline scores 82.9% F1 (82.6% recall, 1.16% FPR).
Last updated: June 5, 2026 | Build: hackmyagent 0.23.8 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)
Verdict: attack findings, not posture
A sample is flagged malicious when the scanner raises at least one high or critical attack finding. Posture findings, which fire on benign and malicious artifacts alike, are excluded from the verdict, although the scanner still surfaces them to users. Two kinds are excluded: missing prompt/governance defenses, and "allowedTools": ["*"] wildcard tool access. More than 2,900 benign registry MCP servers also declare the wildcard, so it is a least-privilege posture issue, not malice, and it gets the same treatment already applied to missing-governance checks. Read recall alongside F1: the pipeline favors precision, so recall (82.6%) is the honest measure of coverage, and the per-category table below shows where it is strong vs. weak. Two earlier figures are withdrawn: the previously shown 82.1% F1 / 1.26% FPR (a skill-routing artifact that bypassed the MCP analyzers) and the older 89.2% figure.
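The verdict rule can be sketched in a few lines. This is a minimal illustration, not the scanner's actual code: the finding shape (`kind`, `severity`, `id`) and the example finding IDs are assumptions made here for readability.

```python
from typing import Iterable, Mapping

# Severities that count toward the verdict, per the rule described above.
VERDICT_SEVERITIES = {"high", "critical"}


def is_malicious(findings: Iterable[Mapping[str, str]]) -> bool:
    """Return True if any finding is a high or critical *attack* finding.

    Posture findings never contribute to the verdict, whatever their
    severity. The "kind" and "severity" keys are hypothetical field names.
    """
    return any(
        f.get("kind") == "attack" and f.get("severity") in VERDICT_SEVERITIES
        for f in findings
    )


# Hypothetical findings: a wildcard-tools posture issue on its own, and the
# same artifact with an additional critical attack finding.
wildcard_only = [{"id": "wildcard-tools", "kind": "posture", "severity": "high"}]
credential_theft = wildcard_only + [
    {"id": "cred-exfil", "kind": "attack", "severity": "critical"}
]

print(is_malicious(wildcard_only))     # posture only
print(is_malicious(credential_theft))  # posture plus a critical attack finding
```

Under this sketch the posture finding is still present in the findings list, so it can be shown to users, but it has no effect on the malicious/benign verdict.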
Why this benchmark exists
Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.
OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.
Scanner Leaderboard
| # | Scanner | Tier | F1 | Precision | Recall | FPR | Flag Rate | Categories |
|---|---|---|---|---|---|---|---|---|
| 1 | HMA Full Pipeline v0.23.8 - AST compilation + 6 analyzers + NanoMind (shipped scanner) | GOLD | 82.9% | 83.2% | 82.6% | 1.16% | 6.3% | 9/9 |
| 2 | HMA Static Patterns v0.23.8 - Regex-only, no NanoMind | SILVER | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% | 9/9 |
| 3 | NanoMind TME v0.5.0 - Model-only ablation, not a scanner verdict | LISTED | 14.0% | 7.5% | 93.0% | 79.18% | 79.8% | 9/9 |
The NanoMind TME row is a model-only ablation, not a scanner verdict. On its own, the classifier currently underperforms on this corpus because of a whitespace-vocabulary limitation on code and skill text, which is why its standalone false positive rate is high. A code- and text-aware classifier is in progress.
Per-Category Detection (HMA Full Pipeline)
30 malicious samples per category. Recall, sorted high to low.
| Category | Recall |
|---|---|
| Credential Exfiltration | 96.7% |
| Unicode Steganography | 96.7% |
| Social Engineering | 96.7% |
| Data Exfiltration | 96.7% |
| Persistence | 90.0% |
| Prompt Injection | 86.7% |
| Heartbeat RCE | 70.0% |
| Privilege Escalation | 63.3% |
| Supply Chain | 46.7% |
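As a consistency check, the per-category recalls can be turned back into detected counts and pooled. The sketch below back-calculates counts from the published percentages (30 malicious samples per category); the counts are inferred here, not taken from the benchmark's raw output.

```python
SAMPLES_PER_CATEGORY = 30

# Published per-category recall for the HMA full pipeline, in percent.
published_recall = {
    "Credential Exfiltration": 96.7,
    "Unicode Steganography": 96.7,
    "Social Engineering": 96.7,
    "Data Exfiltration": 96.7,
    "Persistence": 90.0,
    "Prompt Injection": 86.7,
    "Heartbeat RCE": 70.0,
    "Privilege Escalation": 63.3,
    "Supply Chain": 46.7,
}


def detected_count(recall_pct: float, n: int = SAMPLES_PER_CATEGORY) -> int:
    """Back-calculate the number of detected samples from a rounded recall."""
    return round(recall_pct / 100 * n)


detected = {name: detected_count(pct) for name, pct in published_recall.items()}
total_detected = sum(detected.values())
total_malicious = SAMPLES_PER_CATEGORY * len(published_recall)
overall_recall_pct = round(100 * total_detected / total_malicious, 1)

print(total_detected, total_malicious, overall_recall_pct)
```

The pooled figure from these inferred counts agrees with the 82.6% recall reported for the full pipeline.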
DVAA Ground-Truth Validation
The Damn Vulnerable AI Agent scenarios provide a second ground-truth set of intentionally vulnerable agents, each with a known attack type.
Across the full DVAA scenario repo (86 scenarios, real attack files), the structural pipeline detects 29.1% under the same verdict rule (a high or critical attack finding). On the config-structural DVAA samples carried in the corpus (91 samples) it reaches 81.3%. The gap is the honest picture: the structural analyzers catch config-encoded attacks (self-escalation, control-bypass, and credential-harvest directives) but miss most behavioral and natural-language attacks, which depend on the semantic layer. Lead with the 29.1% full-repo figure when characterizing DVAA detection.
Industry Comparison
Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).
| Scanner | Platform | Flag Rate | Precision | Recall |
|---|---|---|---|---|
| HMA Static | OASB v2 | 3.6% | 99.3% | 51.1% |
| HMA Full Pipeline | OASB v2 | 6.3% | 83.2% | 82.6% |
| Socket | Skills.sh | 3.8% | -- | -- |
| Snyk | Skills.sh | 7.7% | -- | -- |
| agent-trust-hub | Skills.sh | 13.8% | -- | -- |
| Cisco Skill Scanner | Skills.sh | 14.0% | -- | -- |
| Cisco Skill Scanner | ClawHub | 16.7% | -- | -- |
| GPT 5.3-based LLM | Skills.sh | 27.3% | -- | -- |
| VirusTotal | ClawHub | 36.2% | -- | -- |
| GPT 5.3-based LLM | ClawHub | 38.8% | -- | -- |
| OpenClaw Scanner | ClawHub | 41.9% | -- | -- |
The paper's scanners were tested on 238K real marketplace skills (no labels); the HMA scanners were tested on the OASB v2 corpus (4,245 labeled samples). Compare flag rates only: precision and recall require ground truth.
Scoring Tiers
| Tier | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA |
|---|---|---|---|---|
| Platinum | >=0.90 | <=5% | 9/9 | >=0.85 |
| Gold | >=0.80 | <=10% | >=7/9 | -- |
| Silver | >=0.65 | <=20% | >=5/9 | -- |
| Listed | Any | Any | Any | -- |
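The tier table can be read as a cascade of threshold checks. The sketch below is one possible reading, assuming a scanner receives the highest tier whose criteria it meets in full and that Platinum is not awarded when no kappa value is available; the function and parameter names are illustrative.

```python
from typing import Optional


def assign_tier(
    f1: float,
    fpr: float,
    categories_covered: int,
    kappa_vs_hma: Optional[float] = None,
) -> str:
    """Map metrics (as fractions, e.g. 0.829) to a tier name.

    Assumption: tiers are checked from highest to lowest and every
    criterion in a row must hold for that tier to apply.
    """
    if (
        f1 >= 0.90
        and fpr <= 0.05
        and categories_covered == 9
        and kappa_vs_hma is not None
        and kappa_vs_hma >= 0.85
    ):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories_covered >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories_covered >= 5:
        return "Silver"
    return "Listed"


# Leaderboard rows, expressed as fractions.
print(assign_tier(0.829, 0.0116, 9))  # HMA Full Pipeline
print(assign_tier(0.675, 0.0003, 9))  # HMA Static Patterns
print(assign_tier(0.140, 0.7918, 9))  # NanoMind TME ablation
```

Applied to the three leaderboard rows, this reading reproduces the tiers shown there.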
Methodology
Dataset Composition
- 270 malicious samples (30 per category)
- 3,881 benign samples from real registries
- 94 edge cases (security tools, defensive configs)
- Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
- 225 registry metadata-flagged stubs excluded (no malicious content)
Scoring
- Binary detection: malicious/benign verdict per sample
- Category assignment: 9 attack categories for malicious verdicts
- Metrics: pooled F1, precision, recall over all samples (micro-average)
- FPR: false positives / (false positives + true negatives)
- Edge case samples excluded from scoring
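The scoring rules above reduce to a few formulas over the pooled confusion matrix. The sketch below shows them; the example counts are back-calculated from the published full-pipeline percentages (270 malicious and 3,881 benign samples, edge cases excluded) and are an assumption, not the benchmark's raw output.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float
    fpr: float


def score(tp: int, fp: int, fn: int, tn: int) -> Metrics:
    """Compute pooled (micro-averaged) binary detection metrics."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # FPR: false positives / (false positives + true negatives)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return Metrics(precision, recall, f1, fpr)


# Inferred counts: 223 of 270 malicious detected, 45 of 3,881 benign flagged.
m = score(tp=223, fp=45, fn=47, tn=3836)
print(f"precision={m.precision:.1%} recall={m.recall:.1%} "
      f"f1={m.f1:.1%} fpr={m.fpr:.2%}")
```

With these inferred counts the formulas give figures that round to the leaderboard values for the full pipeline (83.2% precision, 82.6% recall, 82.9% F1, 1.16% FPR).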
Submit Your Scanner
The benchmark is open. Any scanner can submit results for evaluation. The methodology is the authority, not a gating decision.
Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.
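A submission body of the shape documented in this section can be assembled programmatically. The sketch below uses only the Python standard library and builds the request without sending it. The helper names and the benign sample ID are hypothetical, omitting `category` for benign verdicts is an assumption, and authentication and the response format are not described here.

```python
import json
import urllib.request
from typing import Iterable, Optional, Tuple

SUBMIT_URL = "https://api.oa2a.org/api/v1/benchmark/submit"

Verdict = Tuple[str, str, Optional[str]]  # (sampleId, verdict, category)


def build_submission(scanner_id: str, scanner_name: str, scanner_version: str,
                     dataset_version: str, verdicts: Iterable[Verdict]) -> dict:
    """Assemble the JSON body from per-sample verdicts."""
    results = []
    for sample_id, verdict, category in verdicts:
        entry = {"sampleId": sample_id, "verdict": verdict}
        # Assumption: a category is only attached to malicious verdicts.
        if verdict == "malicious" and category is not None:
            entry["category"] = category
        results.append(entry)
    return {
        "scannerId": scanner_id,
        "scannerName": scanner_name,
        "scannerVersion": scanner_version,
        "datasetVersion": dataset_version,
        "results": results,
    }


def build_request(payload: dict) -> urllib.request.Request:
    """Wrap the payload in a POST request with a JSON content type."""
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_submission(
    "your-scanner-id", "Your Scanner", "1.0.0", "v2.0",
    [("m001", "malicious", "supply_chain"),
     ("b001", "benign", None)],  # "b001" is a hypothetical sample ID
)
request = build_request(payload)
# urllib.request.urlopen(request) would send it; omitted so that running
# this sketch has no side effects.
```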
POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json
{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark code and dataset: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent