Building a Multi-Agent Code Reviewer
Code Review Has a Consistency Problem
Code review is one of the most valuable practices in software engineering. It catches bugs, spreads knowledge, and maintains quality. But it has a fundamental limitation: reviewers are human.
Even great reviewers have off days. They get fatigued halfway through a large diff. They fixate on style nitpicks while a subtle race condition slips by. A reviewer who’s an expert in security might not notice an accessibility issue. Someone focused on architecture might miss a swallowed error three files deep. This isn’t a criticism — it’s just how attention works. A single pass through a changeset, by a single person, with a single perspective, will always have blind spots.
I kept running into the same limitation when using AI for code review. A single prompt asking “review this code” produces a generalist review: it catches obvious things but lacks depth in any one area. It’s the AI equivalent of skimming.
So I started thinking about the problem differently.
Specialized Agents Over Generalist Review
The key insight behind claude-deep-review is decomposition. Instead of one reviewer trying to do everything, break the review into specialized agents that each have a single mandate.
This mirrors how expert teams work in practice. When a company does a thorough security audit, they don’t send one person who’s “pretty good at everything.” They send a team: someone who focuses on authentication flows, someone who checks for injection vulnerabilities, someone who reviews cryptographic implementations. Each specialist sees things the others miss, because they’re looking through a different lens.
Claude Deep Review applies this principle to AI code review. It runs 15 specialized agents in parallel, each with a focused area of concern. A Code Reviewer checks against project guidelines and detects bugs. A Silent Failure Hunter audits every catch block for swallowed errors. A Cycle Detector traces circular dependencies. A Performance Analyzer looks for N+1 queries and algorithmic complexity issues. And so on.
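One way to picture the roster is as a list of specs, each pairing an agent name with its single mandate. This is an illustrative sketch, not the tool’s actual implementation; the mandates are paraphrased from the descriptions above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """One specialized reviewer agent with a single mandate."""
    name: str
    mandate: str

# Four of the 15 agents, as described above; mandates are paraphrased.
AGENTS = [
    AgentSpec("Code Reviewer", "Check against project guidelines and detect bugs."),
    AgentSpec("Silent Failure Hunter", "Audit every catch block for swallowed errors."),
    AgentSpec("Cycle Detector", "Trace circular dependencies."),
    AgentSpec("Performance Analyzer", "Find N+1 queries and complexity issues."),
]

def build_prompt(agent: AgentSpec, diff: str) -> str:
    """Compose the focused prompt for one agent's review pass."""
    return f"You are the {agent.name}. Your sole mandate: {agent.mandate}\n\n{diff}"
```

The narrow mandate in each prompt is what keeps an agent from drifting back into generalist review.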
The result is a single /deep-review command that produces a comprehensive, multi-perspective analysis of your code.
Deciding What Gets Its Own Agent
One of the more interesting design decisions was granularity. How do you decide what deserves its own agent versus what should be folded into an existing one?
The heuristic I landed on: an aspect gets its own agent when it requires a distinct mode of analysis. Security review requires tracing data flow from untrusted inputs to sensitive operations. Accessibility review requires checking ARIA attributes, keyboard navigation, and screen reader compatibility. These are fundamentally different tasks that benefit from dedicated attention.
Architecture is a good example of where granularity pays off. It’s represented by five agents: Dependency Mapper, Cycle Detector, Hotspot Analyzer, Pattern Scout, and Scale Assessor. At first glance, you might think “architecture” could be a single agent. But mapping module dependencies is a different task than detecting circular imports, which is different from identifying coupling hotspots or checking pattern consistency across modules. Each requires the agent to look at the codebase from a different angle. Collapsing them into one “architecture agent” would produce shallow analysis across all five dimensions.
On the other hand, there’s a cost to over-decomposition. Each agent consumes resources and adds latency to the synthesis step. The balance I struck: if two concerns naturally overlap in what they need to examine and how they examine it, they belong together. If they require fundamentally different analysis strategies, they’re separate agents.
Confidence Thresholds
Not every potential issue is worth reporting. One pattern that made the reviews significantly more useful was introducing confidence thresholds. The Code Reviewer, for example, only reports issues where its confidence is at or above 80%. This sounds simple, but it has a dramatic effect on signal-to-noise ratio. Without a threshold, you get a wall of “this might be a problem” that trains reviewers to ignore findings. With it, the issues that surface are ones worth paying attention to.
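The mechanism is simple enough to show in a few lines. A minimal sketch, assuming each finding carries a numeric confidence score (the real tool’s scoring format may differ):

```python
def filter_findings(findings, threshold=0.8):
    """Drop findings below the confidence cutoff.

    `findings` is a list of (description, confidence) pairs; the 0.8
    default mirrors the Code Reviewer's 80% reporting threshold.
    """
    return [(desc, conf) for desc, conf in findings if conf >= threshold]

raw = [
    ("unvalidated input reaches a SQL query", 0.9),
    ("this loop might be slow", 0.4),
    ("off-by-one in pagination bounds", 0.85),
]
reported = filter_findings(raw)  # keeps only the two high-confidence findings
```

The interesting part is not the filter but the discipline: low-confidence guesses never reach the reader, so the findings that do are worth acting on.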
New vs. Pre-existing: A Crucial Distinction
Here’s a scenario that kills review adoption: you open a PR that touches 50 lines in a file that has 500 lines of existing tech debt. A naive reviewer flags everything — your 50 lines and the surrounding 500. You’re now responsible for defending or fixing problems you didn’t create, in code you didn’t change. This is demoralizing, and it’s why teams stop trusting automated review tools.
Claude Deep Review separates findings into two categories: NEW issues (introduced in this changeset) and PRE-EXISTING issues (tech debt that was already there). New issues must be addressed before merge. Pre-existing issues are surfaced for awareness and tracking, but they don’t block the PR.
This distinction keeps reviews focused and actionable. It acknowledges reality — most codebases have tech debt, and most PRs aren’t the right place to fix all of it. By categorizing issues this way, the tool stays useful instead of becoming noise.
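One way to draw the NEW / PRE-EXISTING line is to check whether a finding lands on a line the changeset actually touched. A sketch under that assumption (the tool’s actual mechanism may differ):

```python
def classify(findings, changed_lines):
    """Split findings into NEW (on lines this changeset touched) and
    PRE-EXISTING (surrounding tech debt that predates the change).

    `changed_lines` maps file path -> set of line numbers in the diff.
    """
    new, pre_existing = [], []
    for finding in findings:
        touched = changed_lines.get(finding["file"], set())
        (new if finding["line"] in touched else pre_existing).append(finding)
    return new, pre_existing

findings = [
    {"file": "auth.py", "line": 12, "issue": "swallowed exception"},
    {"file": "auth.py", "line": 300, "issue": "duplicated validation logic"},
]
new, old = classify(findings, {"auth.py": {10, 11, 12}})
# Only the line-12 finding blocks the merge; line 300 is surfaced for tracking.
```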
Building It as a Skill
Claude Deep Review is built as a Claude Code skill (plugin), which means it integrates directly into the development workflow. You type /deep-review and the skill orchestrates the entire process.
The orchestration flow works like this:
- Scope detection determines what to review. It auto-detects whether you’re reviewing PR changes, uncommitted work, or a specific path.
- Agent selection picks which agents to run. By default it runs the core set, but you can request a full review or cherry-pick specific aspects.
- Parallel execution launches all selected agents simultaneously using Claude Code’s Task tool.
- Synthesis aggregates results into a single prioritized report.
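The four steps above reduce to a short pipeline. A sketch, with `ThreadPoolExecutor` standing in for Claude Code’s Task tool (which the sketch cannot reproduce) and `run_agent` as a hypothetical callback that launches one agent:

```python
from concurrent.futures import ThreadPoolExecutor

def deep_review(scope, agents, run_agent):
    """Run every selected agent against the scope in parallel,
    then synthesize the per-agent findings into one report."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        per_agent = list(pool.map(lambda a: run_agent(a, scope), agents))
    return synthesize(per_agent)

def synthesize(per_agent_results):
    """Flatten per-agent finding lists into one report,
    most severe first (lower rank = more severe)."""
    merged = [f for findings in per_agent_results for f in findings]
    return sorted(merged, key=lambda f: f["rank"])
```

The fan-out/fan-in shape is the whole trick: each agent works alone with its narrow mandate, and only the synthesis step sees the combined picture.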
The modular aspect system is worth highlighting. You don’t always need all 15 agents. A quick sanity check on a small change might only need code and errors. An accessibility-focused review can run just a11y. A deep architectural analysis can target arch. This keeps the tool fast when you want speed and thorough when you want depth.
```
# Full review of PR changes
/deep-review full --pr

# Quick code quality check on uncommitted work
/deep-review code errors --changes

# Architecture analysis of a specific directory
/deep-review arch src/features
```
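The aspect keywords in those commands resolve to subsets of the agent roster. An illustrative mapping; only the `arch` entry reflects the five architecture agents named earlier, and the other agent names here are assumptions:

```python
# Illustrative aspect-to-agent mapping. The "arch" agents are the five
# named earlier; the "a11y" agent name is an assumption.
ASPECTS = {
    "code":   ["Code Reviewer"],
    "errors": ["Silent Failure Hunter"],
    "a11y":   ["Accessibility Reviewer"],  # assumed name
    "arch":   ["Dependency Mapper", "Cycle Detector", "Hotspot Analyzer",
               "Pattern Scout", "Scale Assessor"],
}

def select_agents(requested_aspects):
    """Resolve aspect keywords like 'code errors' to the agents to run."""
    return [agent for aspect in requested_aspects for agent in ASPECTS[aspect]]
```

So `/deep-review code errors` would launch two agents while `/deep-review arch` launches five, which is why scoped runs stay fast.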
The output follows a consistent structure: Executive Summary, New Issues (categorized as Critical, Important, or Suggestions), Pre-existing Issues, Architecture Health, Strengths, and an Action Plan. The Action Plan is particularly useful — it gives you a prioritized list of what to fix first, so you’re not staring at a wall of findings trying to figure out where to start.
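The prioritization behind the Action Plan amounts to a severity sort. A minimal sketch, using the severity labels from the report structure above:

```python
# Severity labels from the report's New Issues section, most severe first.
SEVERITIES = ["Critical", "Important", "Suggestion"]

def action_plan(new_issues):
    """Order new issues most-severe-first, so the first item in the
    plan is the first thing to fix."""
    rank = {sev: i for i, sev in enumerate(SEVERITIES)}
    return sorted(new_issues, key=lambda issue: rank[issue["severity"]])

issues = [
    {"severity": "Suggestion", "issue": "rename helper for clarity"},
    {"severity": "Critical", "issue": "secret logged in plain text"},
]
plan = action_plan(issues)  # the Critical issue comes first
```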
How This Changes the Workflow
The practical impact is less about catching more bugs (though it does) and more about catching different kinds of bugs. A human reviewer and claude-deep-review find different things, and that’s the point. The tool is thorough on the mechanical, exhaustive work — tracing every error path, checking every dependency edge, auditing every catch block — while human reviewers focus on the things they’re best at: design intent, product context, and the “does this actually make sense” judgment calls.
It also shifts when issues get caught. Instead of discovering a circular dependency after it’s been merged and caused a build problem, it’s flagged in the PR. Instead of finding a swallowed error in production, the Silent Failure Hunter catches it during review. The earlier you catch problems, the cheaper they are to fix.
Try It
Claude Deep Review is open source on GitHub. If you’re using Claude Code, you can install it and start using /deep-review on your next PR. Start with the core review and expand to the full suite as you see how it fits your workflow.
The broader takeaway extends beyond this specific tool: when an AI task requires multiple kinds of expertise, decompose it into specialists. A team of focused agents will outperform a single generalist agent, for the same reason a team of expert reviewers outperforms one person trying to check everything at once.