OASB Skills Security Benchmark

Ground-truth labeled dataset. Standardized scoring. Public leaderboard. The calibration authority for AI agent skill scanners.

Labeled Samples: 90
Attack Categories: 9
Dataset Version: v1.0
Top F1 Score: 0.85

Scoring Tiers

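| Tier     | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA |
|----------|----------|---------------------|-------------------|--------------|
| Platinum | >=0.90   | <=0.05              | 9/9               | >=0.85       |
| Gold     | >=0.80   | <=0.10              | >=7/9             | --           |
| Silver   | >=0.65   | <=0.20              | >=5/9             | --           |
| Listed   | Any      | Any                 | Any               | --           |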
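
The tier thresholds can be read as a simple cascade. The sketch below is illustrative only: the function name `assign_tier` is hypothetical, and it assumes that every criterion in a row must hold at once, that a scanner receives the highest tier it qualifies for, and that the false positive rate is given as a fraction (0.05 rather than 5%). The benchmark's own scoring code may differ.

```python
from typing import Optional

TOTAL_CATEGORIES = 9  # number of attack categories in dataset v1.0


def assign_tier(f1: float, fpr: float, categories_covered: int,
                kappa_vs_hma: Optional[float] = None) -> str:
    """Return the highest tier whose thresholds are all met (assumed rule)."""
    # Platinum is the only tier with a kappa requirement in the table.
    if (f1 >= 0.90 and fpr <= 0.05
            and categories_covered == TOTAL_CATEGORIES
            and kappa_vs_hma is not None and kappa_vs_hma >= 0.85):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories_covered >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories_covered >= 5:
        return "Silver"
    # Any submission that misses the thresholds above is still listed.
    return "Listed"
```

Under these assumptions, a scanner with F1 0.85, FPR 1.3%, and 9/9 categories lands in Gold, because its F1 is below the Platinum threshold of 0.90.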

Scanner Leaderboard

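| # | Scanner                                                    | Tier   | F1   | Precision | Recall | FPR  | Categories |
|---|------------------------------------------------------------|--------|------|-----------|--------|------|------------|
| 1 | HackMyAgent + NanoMind AST, v0.12.2 (reference), semantic  | GOLD   | 0.85 | 0.86      | 0.87   | 1.3% | 9/9        |
| 2 | HackMyAgent + NanoMind (heuristic), v0.12.0, semantic      | LISTED | 0.31 | 0.31      | 0.35   | 4.0% | 4/9        |
| 3 | HackMyAgent (static regex), v0.12.0                        | LISTED | 0.26 | 0.33      | 0.22   | 0.3% | 3/9        |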
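
For readers who want to sanity-check their own scanner before submitting, the leaderboard columns correspond to standard classification metrics. The sketch below computes them for a binary malicious/benign split. The function name `binary_metrics` and the `"benign"` label are assumptions, and the benchmark may aggregate differently (for example, by averaging per category), so this is not expected to reproduce the published figures exactly.

```python
def binary_metrics(truth, predicted, positive="malicious"):
    """Compute precision, recall, F1 and false positive rate for one label."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Tiny made-up example, not taken from the dataset.
truth = ["malicious", "malicious", "benign", "benign", "malicious"]
predicted = ["malicious", "benign", "benign", "malicious", "malicious"]
metrics = binary_metrics(truth, predicted)
```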

Attack Categories

Supply Chain
Prompt Injection
Credential Exfiltration
Heartbeat RCE
Unicode Steganography
Privilege Escalation
Persistence
Social Engineering
Data Exfiltration

Submit Your Scanner

The benchmark is open. Any scanner can submit results for evaluation. Ratings follow from the published methodology, not from a gating decision.

Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.

POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v1.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
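
The request above could be assembled from Python roughly as follows, using only the standard library. This is a sketch under assumptions: the helper names `build_payload` and `build_request` are hypothetical, the page does not document authentication or the response format, and only the single sample result shown in the example is included. The request is built but not sent.

```python
import json
import urllib.request

SUBMIT_URL = "https://api.oa2a.org/api/v1/benchmark/submit"


def build_payload(scanner_id, scanner_name, scanner_version,
                  dataset_version, results):
    """Assemble the JSON body using the field names from the example above."""
    return {
        "scannerId": scanner_id,
        "scannerName": scanner_name,
        "scannerVersion": scanner_version,
        "datasetVersion": dataset_version,
        "results": results,
    }


def build_request(payload):
    """Wrap the payload in a POST request with a JSON content type."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SUBMIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(
    "your-scanner-id", "Your Scanner", "1.0.0", "v1.0",
    [{"sampleId": "m001", "verdict": "malicious", "category": "supply_chain"}],
)
request = build_request(payload)

# To actually submit (response shape is not documented on this page):
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

A real submission would presumably carry one entry in `results` per labeled sample in the dataset version being scored.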