TL;DR
- Vibehacker placed 2nd on the detect-and-exploit track of BountyBench, a benchmark that grades AI agents on real bug-bounty scenarios.
- Blackbox score: $2,730 on 3 of 4 targets. No hints, no writeup, no walkthrough.
- With the writeup handed to the agent, Vibehacker scored $4,230 on 7 of 9 targets. Bigger number, less interesting answer.
- BountyBench's third track is writing patches. Vibehacker does not write patches, so that column does not apply.
Most security benchmarks for AI agents let you peek at some version of the answer key. BountyBench lists bounty totals in three tracks: detect the bug, exploit the bug, patch the bug. Detect is where the agent is supposed to actually find the vulnerability. It is also the track where everyone's numbers get suspiciously close to zero the moment you stop handing them the writeup.
I wanted to know what Vibehacker could do without that writeup.
So we ran it blackbox. The agent gets the endpoint and a test account, which is roughly what a real attacker would have before they start poking around. 3 out of 4 targets, $2,730 in verified exploits. Second place on the track.
The three it cracked blackbox: an IDOR in Lunary (CVE-2024-1625), a path traversal read in Gradio (CVE-2024-1561), and an auth bypass in Composio (CVE-2024-8954). The one it failed: a path traversal write in LibreChat (CVE-2024-11170).
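To make those classes concrete, here is roughly the probe each one reduces to. This is a toy sketch, not the actual exploits: the target, routes, token, and parameter names below are all placeholders, and the real CVEs each have their own specifics.

```python
# Toy probes for two of the classes above. Everything here is hypothetical:
# the base URL, routes, and parameter names are placeholders, not the actual
# exploits for the CVEs named in this post.
import requests

BASE = "https://target.example"  # the endpoint the agent is handed
session = requests.Session()
session.headers["Authorization"] = "Bearer <test-account-token>"

def probe_idor(own_id: str, other_id: str) -> bool:
    """IDOR: can the low-privilege test account read another user's object?"""
    own = session.get(f"{BASE}/api/projects/{own_id}")      # sanity check: 200
    other = session.get(f"{BASE}/api/projects/{other_id}")  # should be 403/404
    return own.status_code == 200 and other.status_code == 200

def probe_traversal_read(route: str) -> bool:
    """Path traversal read: does a file-serving route escape its directory?"""
    resp = session.get(f"{BASE}{route}", params={"file": "../" * 8 + "etc/passwd"})
    return resp.status_code == 200 and "root:" in resp.text

if __name__ == "__main__":
    print("idor:", probe_idor("my-project-id", "someone-elses-id"))
    print("traversal:", probe_traversal_read("/file"))
```

That is the whole starting position in blackbox mode: a base URL and a working test login. Everything past that, the agent has to earn.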
For comparison, we also ran the same swarm in writeup-assisted mode, where the agent gets the bug description up front. It passed 7 of 9 targets for $4,230. Bigger number. Anyone can look stronger when they are handed the writeup.
Quick caveat before anyone gets the wrong idea: these numbers are benchmark scores, not money we collected. The bounties were paid to the original researchers who disclosed each bug. BountyBench mirrors the same dollar values so an agent's performance can be compared against what a human got paid for equivalent work.
How BountyBench scores agents
Real vulnerable applications, each with a dollar bounty attached. Your total is the sum of the bounties on the targets you actually crack.
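The whole rule fits in a few lines. A minimal sketch, with made-up targets and bounty values since the per-target splits aren't the point:

```python
# BountyBench-style scoring as described above: you bank a target's bounty
# only if your exploit actually lands. Targets and values are placeholders.
def score(results: dict[str, bool], bounties: dict[str, int]) -> int:
    return sum(bounties[t] for t, passed in results.items() if passed)

bounties = {"app-a": 500, "app-b": 1500, "app-c": 250}
results = {"app-a": True, "app-b": False, "app-c": True}
print(score(results, bounties))  # 750: partial credit does not exist
```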
Three tracks: detect, exploit, patch. Vibehacker does the first two. Writing fixes is a different product, so we skipped the patch track entirely.
Inside detect-and-exploit, the benchmark lets agents run with or without the vulnerability writeup. Most entries on the public leaderboard take the writeup; that is where their headline numbers come from. Look at any row and mentally subtract what the writeup did for the score. Most of them crater.
We ran both modes. The blackbox one (no writeup, second place, $2,730) is the one I actually trust. The writeup one ($4,230) is mostly there so you can see the difference the hint makes.
Why the blackbox number is the one that matters
If you are evaluating an automated security tool for your production app, "scored 85% with the writeup handed to the agent" is close to useless. Your app does not ship with a writeup. The benchmark number only transfers to reality if the agent got its result under the same constraints a real attacker has.
I'd rather land 2nd on the honest version of a benchmark than 1st on the one everyone else runs.
We could have turned the writeup back on and posted a bigger headline number. It just would not have told you anything useful about whether Vibehacker could find a bug on your actual site.
Next week: how we got there
Second place on the honest track isn't something you get with a good prompt and a fast model. The swarm had to be taught. Or more accurately, it had to teach itself.
Next week I'll walk through the loop. The agents flag their own failures and rewrite their own playbooks between runs. Week over week the swarm gets better at attack classes I never sat down and wrote instructions for. I'll include the lab results from before we ever pointed it at a public benchmark.
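For the impatient, here is the shape of that loop, deliberately simplified. This is a sketch under my own naming, not the production code: run_episode and critique are stubs standing in for the agents, and the real versions are next week's post.

```python
# A deliberately simplified sketch of the self-improvement loop described
# above, not production code. run_episode and critique are stubs.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Per-target tactics the swarm carries between runs."""
    tactics: dict[str, list[str]] = field(default_factory=dict)

def run_episode(target: str, playbook: Playbook) -> dict:
    """Attack one target with the current playbook; report the outcome."""
    return {"target": target, "passed": False, "transcript": "..."}  # stub

def critique(result: dict) -> list[str]:
    """The agent reads its own transcript and names what went wrong."""
    return ["enumerate object IDs before trusting access control"]  # stub

def improve(targets: list[str], rounds: int) -> Playbook:
    playbook = Playbook()
    for _ in range(rounds):
        for target in targets:
            result = run_episode(target, playbook)
            if not result["passed"]:
                # Failures become new playbook lines for the next run.
                playbook.tactics.setdefault(target, []).extend(critique(result))
    return playbook
```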
If you want to see what Vibehacker finds on your own site in the meantime, book a demo. First scan is free.