OASB Skills Security Benchmark
Ground-truth labeled dataset. Standardized scoring. Public leaderboard. The calibration authority for AI agent skill scanners.
- Labeled Samples: 90
- Attack Categories: 9
- Dataset Version: v1.0
- Top F1 Score: 0.85
Scoring Tiers
| Tier | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA |
|---|---|---|---|---|
| Platinum | >=0.90 | <=0.05 | 9/9 | >=0.85 |
| Gold | >=0.80 | <=0.10 | >=7/9 | -- |
| Silver | >=0.65 | <=0.20 | >=5/9 | -- |
| Listed | Any | Any | Any | -- |
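The tier metrics above (F1, precision, recall, false positive rate) can be derived from per-sample verdicts against the ground-truth labels. The following is a minimal sketch; the sample data and the `score` helper are illustrative, not part of the OASB tooling.

```python
def score(truth, predicted):
    """Compute precision, recall, F1, and false-positive rate.

    truth / predicted: dicts mapping sampleId -> "malicious" or "benign".
    """
    tp = sum(1 for s, v in predicted.items() if v == "malicious" and truth[s] == "malicious")
    fp = sum(1 for s, v in predicted.items() if v == "malicious" and truth[s] == "benign")
    fn = sum(1 for s, v in predicted.items() if v == "benign" and truth[s] == "malicious")
    tn = sum(1 for s, v in predicted.items() if v == "benign" and truth[s] == "benign")

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, f1, fpr

# Tiny illustrative run: one true positive, one miss, one false alarm.
truth     = {"m001": "malicious", "m002": "malicious", "b001": "benign", "b002": "benign"}
predicted = {"m001": "malicious", "m002": "benign",    "b001": "benign", "b002": "malicious"}

p, r, f1, fpr = score(truth, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} fpr={fpr:.2f}")
# -> precision=0.50 recall=0.50 f1=0.50 fpr=0.50
```

Note that FPR is computed over benign samples only, which is why a scanner can post a low FPR alongside a low recall, as in the leaderboard below.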
Scanner Leaderboard
| # | Scanner | Tier | F1 | Precision | Recall | FPR | Categories |
|---|---|---|---|---|---|---|---|
| 1 | HackMyAgent + NanoMind AST v0.12.2 (reference, semantic) | GOLD | 0.85 | 0.86 | 0.87 | 1.3% | 9/9 |
| 2 | HackMyAgent + NanoMind (heuristic, semantic) v0.12.0 | LISTED | 0.31 | 0.31 | 0.35 | 4.0% | 4/9 |
| 3 | HackMyAgent (static regex) v0.12.0 | LISTED | 0.26 | 0.33 | 0.22 | 0.3% | 3/9 |
Attack Categories
- Supply Chain
- Prompt Injection
- Credential Exfiltration
- Heartbeat RCE
- Unicode Steganography
- Privilege Escalation
- Persistence
- Social Engineering
- Data Exfiltration
Submit Your Scanner
The benchmark is open: any scanner can submit results for evaluation. Ratings are determined by the published methodology, not by a gating decision.
Submissions expire after 90 days; scanners must resubmit against each new dataset version to maintain their rating.
```
POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v1.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
```
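A submission request could be built as follows. This is a sketch using only the Python standard library; the scanner metadata and the single sample verdict are placeholders, not real results.

```python
import json
import urllib.request

# Placeholder submission payload matching the documented schema.
payload = {
    "scannerId": "example-scanner",      # placeholder id
    "scannerName": "Example Scanner",    # placeholder name
    "scannerVersion": "1.0.0",
    "datasetVersion": "v1.0",
    "results": [
        {"sampleId": "m001", "verdict": "malicious", "category": "supply_chain"},
    ],
}

req = urllib.request.Request(
    "https://api.oa2a.org/api/v1/benchmark/submit",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually submit:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```

The `results` array must carry one verdict per sample in the dataset; partial submissions would leave category coverage incomplete and cap the attainable tier.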