TL;DR
- Vibehacker placed 2nd on the detect-and-exploit track of BountyBench, a benchmark that grades AI agents on real bug-bounty scenarios.
- Blackbox score: $2,730 on 3 of 4 targets. No hints, no writeup, no walkthrough.
- With the writeup handed to the agent, Vibehacker scored $4,230 on 7 of 9 targets. Bigger number, less interesting answer.
- BountyBench's third track is writing patches. Vibehacker does not write patches, so that column does not apply.
Most security benchmarks for AI agents let you peek at some version of the answer key. BountyBench lists bounty totals in three tracks: detect the bug, exploit the bug, patch the bug. Detect is where the agent is supposed to actually find the vulnerability. It is also the track where everyone's numbers get suspiciously close to zero the moment you stop handing them the writeup.
I wanted to know what Vibehacker could do without that writeup.
So we ran it blackbox. The agent gets the endpoint and a test account, which is roughly what a real attacker would have before they start poking around. 3 out of 4 targets, $2,730 in verified exploits. Second place on the track.
The three it cracked blackbox: an IDOR in Lunary (CVE-2024-1625), a path traversal read in Gradio (CVE-2024-1561), and an auth bypass in Composio (CVE-2024-8954). The one it failed: a path traversal write in LibreChat (CVE-2024-11170).
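To make those classes concrete, here is roughly the probe each one reduces to. This is a toy sketch, not the actual exploits: the target, routes, token, and parameter names below are all placeholders, and the real CVEs each have their own specifics.

```python
# Toy probes for two of the classes above. Everything here is hypothetical:
# the base URL, routes, and parameter names are placeholders, not the actual
# exploits for the CVEs named in this post.
import requests

BASE = "https://target.example"  # the endpoint the agent is handed
session = requests.Session()
session.headers["Authorization"] = "Bearer <test-account-token>"

def probe_idor(own_id: str, other_id: str) -> bool:
    """IDOR: can the low-privilege test account read another user's object?"""
    own = session.get(f"{BASE}/api/projects/{own_id}")      # sanity check: 200
    other = session.get(f"{BASE}/api/projects/{other_id}")  # should be 403/404
    return own.status_code == 200 and other.status_code == 200

def probe_traversal_read(route: str) -> bool:
    """Path traversal read: does a file-serving route escape its directory?"""
    resp = session.get(f"{BASE}{route}", params={"file": "../" * 8 + "etc/passwd"})
    return resp.status_code == 200 and "root:" in resp.text

if __name__ == "__main__":
    print("idor:", probe_idor("my-project-id", "someone-elses-id"))
    print("traversal:", probe_traversal_read("/file"))
```

That is the whole starting position in blackbox mode: a base URL and a working test login. Everything past that, the agent has to earn.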
For comparison, we also ran the same swarm in writeup-assisted mode, where the agent gets the bug description up front. It passed 7 of 9 targets for $4,230. Bigger number. Anyone can look stronger when they are handed the writeup.
Quick caveat before anyone gets the wrong idea: these numbers are benchmark scores, not money we collected. The bounties were paid to the original researchers who disclosed each bug. BountyBench mirrors the same dollar values so an agent's performance can be compared against what a human got paid for equivalent work.
How BountyBench scores agents
Real vulnerable applications, each with a dollar bounty attached. Your total is the sum of the bounties on the targets you actually crack.
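The whole rule fits in a few lines. A minimal sketch, with made-up targets and bounty values since the per-target splits aren't the point:

```python
# BountyBench-style scoring as described above: you bank a target's bounty
# only if your exploit actually lands. Targets and values are placeholders.
def score(results: dict[str, bool], bounties: dict[str, int]) -> int:
    return sum(bounties[t] for t, passed in results.items() if passed)

bounties = {"app-a": 500, "app-b": 1500, "app-c": 250}
results = {"app-a": True, "app-b": False, "app-c": True}
print(score(results, bounties))  # 750: partial credit does not exist
```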
Three tracks: detect, exploit, patch. Vibehacker does the first two. Writing fixes is a different product, so we skipped the patch track entirely.
Inside detect-and-exploit, the benchmark lets agents run with or without the vulnerability writeup. Most entries on the public leaderboard take the writeup; that is where their headline numbers come from. Look at any row and mentally subtract what the writeup did for the score. Most of them crater.
We ran both modes. The blackbox one (no writeup, second place, $2,730) is the one I actually trust. The writeup one ($4,230) is mostly there so you can see the difference the hint makes.
Why the blackbox number is the one that matters
If you are evaluating an automated security tool for your production app, "scored 85% with the writeup handed to the agent" is close to useless. Your app does not ship with a writeup. The benchmark number only transfers to reality if the agent got its result under the same constraints a real attacker has.
I'd rather land 2nd on the honest version of a benchmark than 1st on the one everyone else runs.
We could have turned the writeup back on and posted a bigger headline number. It just would not have told you anything useful about whether Vibehacker could find a bug on your actual site.
Next week: how we got there
Second place on the honest track isn't something you get with a good prompt and a fast model. The swarm had to be taught. Or more accurately, it had to teach itself.
Next week I'll walk through the loop. The agents flag their own failures and rewrite their own playbooks between runs. Week over week the swarm gets better at attack classes I never sat down and wrote instructions for. I'll include the lab results from before we ever pointed it at a public benchmark.
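For the impatient, here is the shape of that loop, deliberately simplified. This is a sketch under my own naming, not the production code: run_episode and critique are stubs standing in for the agents, and the real versions are next week's post.

```python
# A deliberately simplified sketch of the self-improvement loop described
# above, not production code. run_episode and critique are stubs.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Per-target tactics the swarm carries between runs."""
    tactics: dict[str, list[str]] = field(default_factory=dict)

def run_episode(target: str, playbook: Playbook) -> dict:
    """Attack one target with the current playbook; report the outcome."""
    return {"target": target, "passed": False, "transcript": "..."}  # stub

def critique(result: dict) -> list[str]:
    """The agent reads its own transcript and names what went wrong."""
    return ["enumerate object IDs before trusting access control"]  # stub

def improve(targets: list[str], rounds: int) -> Playbook:
    playbook = Playbook()
    for _ in range(rounds):
        for target in targets:
            result = run_episode(target, playbook)
            if not result["passed"]:
                # Failures become new playbook lines for the next run.
                playbook.tactics.setdefault(target, []).extend(critique(result))
    return playbook
```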
If you want to see what Vibehacker finds on your own site in the meantime, book a demo. First scan is free.