OASB Skills Security Benchmark

A ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. The HMA full pipeline scores 82.9% F1 (82.6% recall, 1.16% FPR).

Last updated: June 5, 2026 | Build: hackmyagent 0.23.8 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)

Labeled Samples: 4,245
Full Pipeline F1: 82.9%
Recall: 82.6%
Precision: 83.2%
False Positive Rate: 1.16%

Verdict: attack findings, not posture

A sample is flagged malicious when the scanner reports a high or critical attack finding. Posture findings, which fire on benign and malicious artifacts alike, are excluded from the verdict, although the scanner still surfaces them to users. Two kinds are excluded: missing prompt/governance defenses, and "allowedTools": ["*"] wildcard tool access. More than 2,900 benign registry MCP servers also declare the wildcard, so it is a least-privilege posture issue, not malice. That is the same treatment already applied to missing-governance checks.

Read recall alongside F1: the pipeline favors precision, so recall (82.6%) is the honest measure of coverage, and the per-category table below shows where it is strong and where it is weak.

Two earlier figures are withdrawn: the previously shown 82.1% F1 / 1.26% FPR, which was a skill-routing artifact that bypassed the MCP analyzers, and the older 89.2% figure.
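The verdict rule above can be sketched in a few lines. This is a minimal illustration, not the scanner's actual code: the `Finding` type, its field names, and the check identifiers are hypothetical, and the only logic taken from the text is "high or critical attack finding, posture findings excluded".

```python
from dataclasses import dataclass

# Severities that count toward the verdict, per the rule described above.
VERDICT_SEVERITIES = {"high", "critical"}

@dataclass(frozen=True)
class Finding:
    check_id: str     # hypothetical identifier, for illustration only
    severity: str     # "low", "medium", "high", or "critical"
    is_posture: bool  # True for posture findings (e.g. wildcard tool access)

def verdict(findings):
    """Return "malicious" if any non-posture finding is high or critical."""
    for finding in findings:
        if not finding.is_posture and finding.severity in VERDICT_SEVERITIES:
            return "malicious"
    return "benign"

# A wildcard allowedTools declaration is still reported, but does not flip the verdict.
posture_only = [Finding("wildcard-allowed-tools", "high", is_posture=True)]
with_attack = posture_only + [Finding("credential-harvest", "critical", is_posture=False)]

print(verdict(posture_only))
print(verdict(with_attack))
```

Keeping posture findings in the list, rather than filtering them out upstream, mirrors the point that they are still shown to users even though they do not affect the verdict.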

Why this benchmark exists

Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.

OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.

Scanner Leaderboard

# | Scanner | Description | Tier | F1 | Precision | Recall | FPR | Flag Rate | Categories
1 | HMA Full Pipeline | v0.23.8 - AST compilation + 6 analyzers + NanoMind (shipped scanner) | GOLD | 82.9% | 83.2% | 82.6% | 1.16% | 6.3% | 9/9
2 | HMA Static Patterns | v0.23.8 - Regex-only, no NanoMind | SILVER | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% | 9/9
3 | NanoMind TME v0.5.0 (ablation) | v0.5.0 - Model-only ablation, not a scanner verdict | LISTED | 14.0% | 7.5% | 93.0% | 79.18% | 79.8% | 9/9

The NanoMind TME row is a model-only ablation, not a scanner verdict. On its own, the classifier currently under-performs on this corpus because of a whitespace-vocabulary limitation on code and skill text, which is why its standalone false positive rate is high. A code- and text-aware classifier is in progress.
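To see why a 93.0% recall still yields a low F1 for the ablation row, it helps to convert the rates into precision on this corpus's class mix. The sketch below assumes the scored set is the 270 malicious and 3,881 benign samples listed under Methodology; the function name is illustrative, and because it works from rounded percentages the result only approximates the published values.

```python
# Scored corpus composition, taken from the Methodology section.
MALICIOUS = 270
BENIGN = 3881

def precision_and_f1(recall, fpr, n_malicious=MALICIOUS, n_benign=BENIGN):
    """Convert recall and false positive rate into precision and F1 for a class mix."""
    true_positives = recall * n_malicious
    false_positives = fpr * n_benign
    precision = true_positives / (true_positives + false_positives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1

# NanoMind TME ablation row: 93.0% recall, 79.18% FPR.
precision, f1 = precision_and_f1(recall=0.930, fpr=0.7918)
print(f"precision ~ {precision:.3f}, F1 ~ {f1:.3f}")
```

Because benign samples outnumber malicious ones by roughly 14 to 1, a high false positive rate swamps the true positives, and precision lands near the leaderboard's 7.5% despite the high recall.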

Per-Category Detection (HMA Full Pipeline)

30 malicious samples per category. Recall, sorted high to low.

Category | Recall
Credential Exfiltration | 96.7%
Unicode Steganography | 96.7%
Social Engineering | 96.7%
Data Exfiltration | 96.7%
Persistence | 90.0%
Prompt Injection | 86.7%
Heartbeat RCE | 70.0%
Privilege Escalation | 63.3%
Supply Chain | 46.7%
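With 30 malicious samples per category, each recall figure above corresponds to a whole number of detected samples, and summing them should reproduce the pooled recall. The following sketch performs that cross-check; the detected counts are back-derived from the rounded percentages and are not published figures.

```python
SAMPLES_PER_CATEGORY = 30

# Per-category recall (%) for the HMA Full Pipeline, as listed above.
recall_pct = {
    "Credential Exfiltration": 96.7,
    "Unicode Steganography": 96.7,
    "Social Engineering": 96.7,
    "Data Exfiltration": 96.7,
    "Persistence": 90.0,
    "Prompt Injection": 86.7,
    "Heartbeat RCE": 70.0,
    "Privilege Escalation": 63.3,
    "Supply Chain": 46.7,
}

# Back-derive detected counts by rounding to the nearest whole sample.
detected = {
    name: round(pct / 100 * SAMPLES_PER_CATEGORY)
    for name, pct in recall_pct.items()
}

total_detected = sum(detected.values())
total_malicious = SAMPLES_PER_CATEGORY * len(recall_pct)
pooled_recall = total_detected / total_malicious

for name, count in detected.items():
    print(f"{name}: {count}/{SAMPLES_PER_CATEGORY}")
print(f"pooled recall: {total_detected}/{total_malicious} = {pooled_recall:.1%}")
```

The pooled value agrees with the 82.6% headline recall, and the per-category counts make the weak spots concrete: Supply Chain accounts for 16 of the missed samples.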

DVAA Ground-Truth Validation

The Damn Vulnerable AI Agent scenarios provide a second ground-truth set of intentionally vulnerable agents, each with a known attack type.

Across the full DVAA scenario repo (86 scenarios, real attack files), the structural pipeline detects 29.1% under the same verdict (a high or critical attack finding). On the config-structural DVAA samples carried in the corpus (91 samples) it reaches 81.3%. The gap is the honest picture: the structural analyzers catch config-encoded attacks (self-escalation, control-bypass and credential-harvest directives) but miss most behavioral and natural-language attacks, which depend on the semantic layer. Lead with the 29.1% full-repo figure when characterizing DVAA detection.

Industry Comparison

Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).

Scanner | Platform | Flag Rate | Precision | Recall
HMA Static | OASB v2 | 3.6% | 99.3% | 51.1%
HMA Full Pipeline | OASB v2 | 6.3% | 83.2% | 82.6%
Socket | Skills.sh | 3.8% | -- | --
Snyk | Skills.sh | 7.7% | -- | --
agent-trust-hub | Skills.sh | 13.8% | -- | --
Cisco Skill Scanner | Skills.sh | 14.0% | -- | --
Cisco Skill Scanner | ClawHub | 16.7% | -- | --
GPT 5.3-based LLM | Skills.sh | 27.3% | -- | --
VirusTotal | ClawHub | 36.2% | -- | --
GPT 5.3-based LLM | ClawHub | 38.8% | -- | --
OpenClaw Scanner | ClawHub | 41.9% | -- | --

Paper scanners tested on 238K real marketplace skills (no labels). HMA scanners tested on OASB v2 corpus (4,245 labeled samples). Flag rate comparison only; precision/recall requires ground truth.

Scoring Tiers

Tier | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA
Platinum | >=0.90 | <=5% | 9/9 | >=0.85
Gold | >=0.80 | <=10% | >=7/9 | --
Silver | >=0.65 | <=20% | >=5/9 | --
Listed | Any | Any | Any | --
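The tier table can be read as a cascade of threshold checks. The sketch below assumes that all conditions in a row must hold at once and that a scanner receives the highest tier it qualifies for; the function name and signature are illustrative, not part of the benchmark code.

```python
def assign_tier(f1, fpr, categories_covered, kappa_vs_hma=None):
    """Return the highest tier whose thresholds are all met.

    f1 and fpr are fractions (0.829, not 82.9); categories_covered is out of 9.
    kappa_vs_hma is only consulted for Platinum.
    """
    if (
        f1 >= 0.90
        and fpr <= 0.05
        and categories_covered == 9
        and kappa_vs_hma is not None
        and kappa_vs_hma >= 0.85
    ):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories_covered >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories_covered >= 5:
        return "Silver"
    return "Listed"

# Leaderboard rows, expressed as fractions.
print(assign_tier(0.829, 0.0116, 9))  # HMA Full Pipeline
print(assign_tier(0.675, 0.0003, 9))  # HMA Static Patterns
print(assign_tier(0.140, 0.7918, 9))  # NanoMind TME ablation
```

Under these assumptions the three leaderboard rows map to Gold, Silver, and Listed, matching the tiers shown in the leaderboard.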

Methodology

Dataset Composition

  • 270 malicious samples (30 per category)
  • 3,881 benign samples from real registries
  • 94 edge cases (security tools, defensive configs)
  • Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
  • 225 registry metadata-flagged stubs excluded (no malicious content)

Scoring

  • Binary detection: malicious/benign verdict per sample
  • Category assignment: 9 attack categories for malicious verdicts
  • Metrics: pooled F1, precision, recall over all samples (micro-average)
  • FPR: false positives / (false positives + true negatives)
  • Edge case samples excluded from scoring
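The scoring rules above amount to building one confusion matrix over all non-edge samples and deriving the metrics from it. The sketch below is one way to do that; the `score` function and the `(label, verdict)` tuple format are assumptions for illustration, and the example counts (223 true positives, 45 false positives) are back-derived from the published percentages rather than taken from the dataset.

```python
from collections import Counter

def score(samples):
    """Compute pooled metrics from (label, verdict) pairs.

    label is "malicious", "benign", or "edge"; verdict is "malicious" or "benign".
    Edge-case samples are skipped, as described in the scoring rules.
    """
    counts = Counter()
    for label, verdict in samples:
        if label == "edge":
            continue
        if label == "malicious":
            counts["tp" if verdict == "malicious" else "fn"] += 1
        else:
            counts["fp" if verdict == "malicious" else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Illustrative corpus: 270 malicious, 3,881 benign, 94 edge cases.
samples = (
    [("malicious", "malicious")] * 223
    + [("malicious", "benign")] * 47
    + [("benign", "malicious")] * 45
    + [("benign", "benign")] * 3836
    + [("edge", "malicious")] * 94  # ignored regardless of verdict
)

metrics = score(samples)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these inferred counts the function reproduces the headline figures for the full pipeline, which suggests the published numbers are internally consistent with the stated formulas.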

Submit Your Scanner

The benchmark is open. Any scanner can submit results for evaluation. The methodology is the authority, not a gating decision.

Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.

POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
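A submission in the shape shown above could be assembled with the Python standard library as follows. This is a sketch under assumptions: the endpoint and field names come from the example request, but the `build_submission` helper, the `b001` sample ID, and the choice to omit `category` for benign verdicts are hypothetical, and the response format is not documented here.

```python
import json
import urllib.request

SUBMIT_URL = "https://api.oa2a.org/api/v1/benchmark/submit"

def build_submission(scanner_id, scanner_name, scanner_version, dataset_version, verdicts):
    """Build a POST request from (sample_id, verdict, category) tuples.

    Assumption: category is only sent for malicious verdicts.
    """
    results = []
    for sample_id, verdict, category in verdicts:
        entry = {"sampleId": sample_id, "verdict": verdict}
        if verdict == "malicious" and category is not None:
            entry["category"] = category
        results.append(entry)

    payload = {
        "scannerId": scanner_id,
        "scannerName": scanner_name,
        "scannerVersion": scanner_version,
        "datasetVersion": dataset_version,
        "results": results,
    }
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_submission(
    "your-scanner-id",
    "Your Scanner",
    "1.0.0",
    "v2.0",
    [
        ("m001", "malicious", "supply_chain"),
        ("b001", "benign", None),  # hypothetical benign sample ID
    ],
)
print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data), indent=2))
```

The request is only constructed here, not sent; passing it to `urllib.request.urlopen(request)` would perform the actual submission.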

References

  • Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
  • OASB benchmark code and dataset: github.com/opena2a-org/oasb
  • DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent