Build a Multi-Agent AI Code Review Pipeline

Reda Aboutika

Security Researcher

8 Minutes Read

Published: 11 Jun 2026

A while back, I pointed a single "audit this" prompt at a ~5,000-line lending protocol — a Uniswap-V2-style AMM with a lending layer bolted on. It came back with three findings. The final report for that codebase had twenty-three. The one real issue my pass surfaced, it surfaced; the rest it either missed or talked itself out of.

Then I tried something dumb: I split the audit into a handful of agents, each scoped to one bug class, and ran them over the same scope, same model, same afternoon. Fifteen findings, nine of them flagged by more than one agent. I didn't formally score all fifteen against the final report, but it caught the critical my single pass had killed, plus a couple more it had skimmed straight past.

Same model, same code, same hours — the difference was structure. This is one comparison on one codebase, not a benchmark, so take the numbers as a hint, not a proof — and raw finding count is noisy anyway: fifteen mediocre findings are worse than one critical.

By the end of this, you'll have a small Python pipeline (a few hundred lines) you can run and extend. The code is on GitHub: multi-agent-audit-pipeline; here I'll walk through the parts that matter and the parts that don't work. It's built on Anthropic's API (Claude), but the model call lives in a single file, so pointing it at another provider is a contained change, not a rewrite.

Single Agent vs Multi-Agent: Why One Prompt Isn't Enough

A single "find the bugs" prompt has to hold the whole codebase, every vulnerability class, and the output format in its head at once. The model's attention gets spread thin, so nothing gets read hard. And writing a longer, more detailed prompt doesn't rescue it — you're just handing an already-crowded context more to juggle; more instructions buy more output, not a sharper read.

The failure mode that convinced me wasn't a miss — it was a kill. On that lending protocol, my single pass actually reached the bug that mattered: a borrow(to, amount, …) that minted debt to to but only validated solvency on msg.sender. It even floated a reentrancy hypothesis. Then it reasoned itself out: "the final solvency check will catch it." It doesn't — the check sees cumulative debt with no pre-callback baseline, so it passes exactly when it shouldn't. A broad pass is busy, so it rationalizes. An agent with one job has nothing to rationalize toward.

There's a mechanical side to this, not just an attention one. A crowded context — the whole codebase plus every bug class plus the output format, all live at once — is exactly where models drift and confabulate, the way that single pass talked itself out of a real finding. Give each agent its own context with a single job, and there's less to lose track of, so less to hallucinate about. They also don't contaminate each other: a wrong lead in the reentrancy pass can't bias the access-control pass, because they never shared a train of thought. It doesn't spend fewer tokens — several calls cost more than one — but you control where the spend goes: caching makes the shared code cheap to re-read, and the easy agents run on a cheap model, so the expensive reasoning is reserved for the ones that earn it.

KEY INSIGHT — One job per agent. The narrower the brief, the sharper the read — and the less room the model has to argue itself out of a real finding.

Multi-Agent Architecture: The Shape of the Pipeline

Four moving parts:

Diagram showing a Solidity codebase (one blob) fanning out to four agents — reentrancy (opus), access-control (sonnet), cross-function-auth (opus), and integer-overflow (haiku) — then merging into a dedup step and final report with confirmation tracking. — The pipeline shape: one codebase, four focused Claude calls, each on a model that fits the job, merged into a single report

The orchestrator handles LLM orchestration — it loads the code once, fans it out to each agent, and merges the findings. No framework, no message bus — a loop, a few prompts, and a merge step. That's the entire multi-agent system.

If Python isn't your first language

You don't need to be fluent to follow the rest — each file does one small job:

File structure of the multi-agent audit pipeline: run.py at the root, and a pipeline/ directory containing agents.py, codebase.py, llm.py, orchestrator.py, models.py, and report.py, each with a one-line description of its role. — Six files, one job each. The entire pipeline fits in a folder you can read in an afternoon

An LLM Agent Is a Focused Prompt + a Schema + a Model

Each LLM agent here isn't a class hierarchy. It's a name, a focused instruction, and a model tier:

# pipeline/agents.py
HARD = "claude-opus-4-7"    # multi-step / cross-function reasoning
MIDDLE = "claude-sonnet-4-6"  # moderate reasoning
SIMPLE = "claude-haiku-4-5"  # mechanical pattern scans

@dataclass(frozen=True)
class Agent:
    name: str
    focus: str   # the instruction; the codebase is shared separately
    model: str   # the tier that fits this agent's job

AGENTS = [
    Agent(
        name="reentrancy",
        model=HARD,
        focus=(
            "Hunt ONLY for reentrancy: external calls before state updates "
            "(checks-effects-interactions violations), missing guards on functions "
            "that move value. Ignore everything else. No reentrancy? Return an empty list."
        ),
    ),
    # access-control (MIDDLE) and integer-overflow (SIMPLE) follow the same shape
]

Every agent returns its findings as structured JSON against a fixed schema (title, severity, contract, location, explanation, recommendation). That's what makes the merge trivial later — you're combining data, not scraping prose.

The model tier is part of the agent because the jobs aren't equally hard. Reentrancy needs the model to follow a call into another contract and reason about ordering — that's Opus work. Scanning for unchecked blocks and narrowing casts is closer to a grep, so it runs on Haiku — overflow on a 0.8 codebase is mostly a commodity check now, and commodity work is exactly what belongs on the cheapest model, not on your most expensive one. Here's the part I didn't expect: the cheapest agent is often the one that earns its keep most reliably. Mechanical checks — argument order, boundary conditions — are where a narrow agent flatly beats a broad one, because there's no reasoning to dilute and nothing to rationalize away. Don't pay Opus prices for bookkeeping, and don't hand cross-function reasoning to Haiku and expect it to hold.

Where specialization actually paid off

Two bugs from that run make the case better than any diagram.

The first is the borrow issue above. Five things had to line up — debt minted to to, solvency checked on msg.sender, a callback window, cumulative state, no pre-callback baseline. The broad pass saw the parts and didn't assemble them. An agent scoped to just that one behavior, handed the same code, did — not because the prompt was clever, but because it wasn't doing anything else.

The second is duller and, to me, more convincing. A fee function was being called with two of its arguments swapped — calc(amount, currentReserve, referenceReserve) where the signature wanted (amount, referenceReserve, currentReserve). It compiles, it runs, it quietly charges the wrong fee. My broad pass saw the call site and marked it "probably fine." The agent that caught it does exactly one thing: at every call site, check that the arguments line up with the signature. It doesn't get to have an opinion. That's not reasoning — it's bookkeeping — and bookkeeping is precisely what a "find all bugs" prompt is too busy to do.

Neither case needed a cleverer prompt — just a narrower one, with nothing else competing for attention.

Here's the comparison, kept honest:

	Single broad pass	Focused agents, same scope
Findings surfaced	3	15
Matched the judges' final report	~1 of 23	Not formally re-scored*
Multi-agent confirmation	-	9 of 15
The critical that mattered	Reached, then dismissed	Recovered

Single broad pass

Focused agents, same scope

Findings surfaced

Matched the judges' final report

~1 of 23

Not formally re-scored*

Multi-agent confirmation

9 of 15

The critical that mattered

Reached, then dismissed

Recovered

*I scored the single pass against the final report; I didn't go back and formally re-score the multi-agent run, so read the 15 as surfaced, not confirmed. The honest signal here isn't the count — it's the bottom row.

Every agent reads the same code. Send it fresh each call and you pay for those tokens every time. Prompt caching fixes that — if you place things right.

Caching is a prefix match: everything up to a marked point is cached and reused as long as those bytes don't change. So the stable thing — the codebase — goes first, in the system prompt, with a cache_control marker. The thing that changes per agent — the focus — goes after it, in the user message.

system=[{
    "type": "text",
    "text": f"{SYSTEM_PREAMBLE}\n# Codebase under audit\n{codebase}",
    "cache_control": {"type": "ephemeral"},   # shared prefix, cached once
}],
messages=[{"role": "user", "content": agent.focus}],

PITFALL — Two things bite here. Caching only engages above a minimum prefix (a few thousand tokens), so on a toy contract it does nothing — you see the benefit on real codebases. And if you fan out fully in parallel, every agent starts before the first one finishes writing the cache, so they all pay full price. Warm it with one call, then fan out the rest.

Merge: dedup, and where it's dumb

Each agent hands back a list. Merging is mostly de-duplication, with one idea worth keeping:

# pipeline/orchestrator.py  (trimmed)
def _merge(findings):
    merged = {}
    for f in findings:
        key = (f.contract.lower(), f.location.lower())
        if key not in merged:
            merged[key] = f
        else:
            merged[key].confirmed_by.append(f.found_by)  # confirmation, not a dup
    return sorted(merged.values(), key=lambda f: SEVERITY_RANK[f.severity])

When two agents land on the same spot, don't drop one — two independent agents agreeing is a stronger signal than one, and "confirmed by two" is a useful thing to sort by later.

Now the honest part: that key is dumb. What you actually want is same root cause = one finding — and the key only approximates it, deduping on (contract, location) as lowercased strings. So two agents that describe the same root cause in different words, or pin it to slightly different line ranges, sail past each other as two findings. The real fix is a semantic merge — cluster by root cause, not by string — and I haven't written it, because over-merging silently hides bugs and I'd rather eyeball two near-duplicates than lose one. So I live with the dumb version. Your call.

Run it

There's a free dry-run that makes no API calls — it just prints the plan:

$ python run.py examples/ --dry-run
  • reentrancy [claude-opus-4-7]: Hunt ONLY for reentrancy.
  • access-control [claude-sonnet-4-6]: Hunt ONLY for access-control issues.
  • cross-function-auth [claude-opus-4-7]: Hunt ONLY for cross-function authorization gaps...
  • integer-overflow [claude-haiku-4-5]: Hunt ONLY for unsafe arithmetic...

Set a key and run it for real against the bundled `Vault.sol` — a real run on this tiny contract costs cents, not dollars:

$ ANTHROPIC_API_KEY=sk-ant-... python run.py examples/Vault.sol

Vault.sol has three planted bugs, and the run shows each agent doing its one job:

reentrancy → flags withdraw().
access-control + cross-function-auth → both flag borrowTo (debt credited to one address, checked against another). You catch the merge caveat live here: they word the location slightly differently, so the string key leaves them as two findings, not one confirmation — I'd still take that over a semantic merge that quietly drops a bug.
integer-overflow → nothing, and that's correct: this contract is Solidity 0.8, where arithmetic reverts on overflow. Point the same agent at a Rust codebase, though — where release builds wrap silently by default — and it stops being a commodity check.

An empty list is a good answer; reward agents for "finding something" and they'll invent things. Keep the frame honest — this is a triage aid, not a verifier. Treat every finding as a lead to confirm.

Where this still fails badly

This pipeline is a triage aid, not a full AI vulnerability scanner — and that distinction matters. Fan-out is good at local bugs, the ones you can see by reading a function and its neighbors. It's bad at everything that needs the whole protocol held in your head at once, and no prompt fixes that:

Multi-step economic exploits, where each step is individually fine and only the sequence is fatal.
Invariant violations that no single function breaks — the contract is wrong as a system.
Stateful assumptions that only fail after a specific history of transactions.
Cross-chain / messaging assumptions, and upgradability traps (storage layout, init order).

An agent that only ever sees one function will never see these, however good its prompt. Catching them is still the human's job — and honestly, it's the more interesting job. This skeleton won't get you there. That's the point of it: it clears the floor so you can spend your attention upstairs.

Recap, and what it's for

You've got an orchestrator that loads a codebase once and fans it out to focused agents, each one prompt + schema + a right-sized model, with a merge that records agreement.

I kept it deliberately primitive. The value isn't the orchestration — it's the agents you write. The plumbing is short and easy to replace; what's worth keeping is the agents, where you can encode expertise a generalist prompt can't match — an accounting-drift agent, a first-deposit-inflation agent, a cross-function-auth agent. The three I ship are textbook on purpose: they teach the mechanics, not the edge. The edge is yours to add.

Where I'd take it next: pipe high-severity findings into a Foundry test and try to actually trigger them; give agents tools (grep, a static analyzer, prior-art lookup) so they reason over facts; scope out tests and mocks and pin the commit.

This doesn't replace auditors. It replaces the low-signal exploratory pass — the part where you skim every function once looking for the obvious — and buys back the hours you'd rather spend on invariants and protocol reasoning, which is where the bugs that actually matter tend to live.

Code: multi-agent-audit-pipeline. AI-augmented security researchers — the ones who drive and build these tools to absorb emerging capabilities — are still an early crowd. It's not too late to be one of them; this is a cheap place to start.

Build a Multi-Agent AI Code Review Pipeline

Single Agent vs Multi-Agent: Why One Prompt Isn't Enough

Multi-Agent Architecture: The Shape of the Pipeline

If Python isn't your first language

An LLM Agent Is a Focused Prompt + a Schema + a Model

Where specialization actually paid off

Merge: dedup, and where it's dumb

Run it

Where this still fails badly

Recap, and what it's for

Read more on HackenProof Blog

Web2 Bugs Hiding in Web3: The Smart Contract Security Blind Spot

Move Smart Contract Arithmetic: How Precision Fails and Economics Break

Get Ready for this Decade's Biggest Opportunity: Bug Bounty Hunting

Single Agent vs Multi-Agent: Why One Prompt Isn't Enough

Multi-Agent Architecture: The Shape of the Pipeline

If Python isn't your first language

An LLM Agent Is a Focused Prompt + a Schema + a Model

Where specialization actually paid off

Share the codebase, cache it once

Merge: dedup, and where it's dumb

Run it

Where this still fails badly

Recap, and what it's for

Read more on HackenProof Blog

Web2 Bugs Hiding in Web3: The Smart Contract Security Blind Spot

Move Smart Contract Arithmetic: How Precision Fails and Economics Break

Get Ready for this Decade's Biggest Opportunity: Bug Bounty Hunting