Sonnet 4.5 replaced our first-pass code review and nobody complained
AI handles style violations and missing error handling. Human reviewers focus on architecture and business logic. Review turnaround dropped from 24 hours to 4.
Before Claude Sonnet 4.5, our code review process had a bottleneck that nobody wanted to talk about. Senior engineers spent 30-40% of their review time on mechanical issues: inconsistent error handling, missing null checks, style violations that the linter did not catch, and obvious performance patterns. This was boring, important work that ate into the time they should have spent reviewing architecture and business logic.
After Sonnet 4.5 launched, we integrated it as a CI step on every pull request. The model runs a first-pass review and leaves comments on the PR before any human sees it. The results changed how our team thinks about code review.
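To make the shape of that CI step concrete, here is a minimal sketch of the prompt-assembly piece. Everything here is illustrative: the `DiffHunk` shape, `buildReviewPrompt`, and the guideline text are hypothetical names, not our actual pipeline, and the model call and PR-comment posting are deliberately left out.

```typescript
// Hypothetical sketch of the first-pass review step that runs in CI.
// The real pipeline fetches the PR diff, calls the model, and posts
// comments; only the pure prompt-assembly part is shown here.

interface DiffHunk {
  file: string;
  patch: string; // unified-diff text for this file
}

// Assemble the review prompt from the PR diff plus calibrated
// instructions. Keeping this a pure function makes it easy to test.
function buildReviewPrompt(hunks: DiffHunk[], guidelines: string): string {
  const diffSection = hunks
    .map((h) => `### ${h.file}\n${h.patch}`)
    .join("\n\n");
  return [
    "You are a first-pass code reviewer. Flag only mechanical issues:",
    guidelines,
    "Do not comment on architecture or business logic.",
    "--- DIFF ---",
    diffSection,
  ].join("\n");
}
```

Keeping prompt assembly separate from the model call is what made the two weeks of calibration tolerable: we could iterate on the guidelines without touching the CI wiring.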
I write these AI posts from the far side of the honeymoon phase, and this one builds on what I learned earlier in “GPT-5 shipped and my team asked if we still need junior engineers.” The interesting question is no longer whether the models are impressive. It is where they meaningfully improve decision quality across real systems like portfolio search, aigw, jarvis, and the review loops around everyday engineering work.
What the AI Reviews
We spent two weeks calibrating the review prompt. The goal was not to replace human review. It was to handle the mechanical first pass so that human reviewers could focus on higher-order concerns. The AI reviews for:
- Inconsistent error handling patterns. Our codebase uses a specific Result type for all fallible operations. The AI flags any function that throws instead of returning a Result.
- Missing edge cases in input validation. The AI is surprisingly good at identifying inputs that the validation logic does not cover.
- Style inconsistencies that fall outside our ESLint config. Naming patterns, file organization, import ordering, and comment quality.
- Obvious performance issues. N+1 queries, unnecessary re-renders, missing database indexes in migration files.
- Test coverage gaps. Not just “this function has no test” but “this function has a test that does not cover the error path.”
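The first bullet is the easiest to show in code. Here is a simplified sketch of the convention the AI enforces; the `Result` type name comes from the post, but its exact shape and the `parsePort` examples are illustrative, not our real code.

```typescript
// Simplified version of the Result convention for fallible operations.
type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Flagged by the AI first pass: a fallible function that throws.
function parsePortBad(raw: string): number {
  const n = Number(raw);
  if (Number.isNaN(n)) throw new Error(`invalid port: ${raw}`);
  return n;
}

// Preferred: return a Result so the caller must handle the failure path.
function parsePort(raw: string): Result<number> {
  const n = Number(raw);
  if (Number.isNaN(n)) {
    return { ok: false, error: new Error(`invalid port: ${raw}`) };
  }
  return { ok: true, value: n };
}
```

A comment like "this throws instead of returning a Result" is exactly the kind of mechanical note a senior engineer used to write by hand.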
What the AI Does Not Review
We explicitly exclude certain categories from AI review because the model consistently gets them wrong or produces noise:
- Architecture decisions. The model does not have enough context about our system topology to evaluate whether a service boundary is in the right place.
- Business logic correctness. The model can verify that code does what it says, but it cannot verify that what it says is what the business actually needs.
- Security-sensitive code paths. Payment processing, authentication, and encryption deserve human eyes. The stakes of a false negative are too high.
- Database schema changes. Migration files affect production data. A human reviewer must verify every migration.
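In practice the exclusions above are a path filter applied before the AI ever sees the diff. The sketch below shows one way to implement that gate; the directory patterns are hypothetical examples, not our actual configuration.

```typescript
// Hypothetical path-based gate that keeps security-sensitive and
// schema-changing files out of the AI review queue.
const EXCLUDED_PATTERNS: RegExp[] = [
  /\/payments\//,   // payment processing
  /\/auth\//,       // authentication
  /\/crypto\//,     // encryption helpers
  /\/migrations\//, // database schema changes: humans only
];

// Return only the changed files that are eligible for AI review.
function eligibleForAiReview(changedPaths: string[]): string[] {
  return changedPaths.filter(
    (p) => !EXCLUDED_PATTERNS.some((re) => re.test(p)),
  );
}
```

Filtering at the file level is blunt, but it fails safe: a false exclusion costs one AI pass, while a false inclusion on a payment path could lull a reviewer into skimming code that deserves full attention.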
The Impact on Review Turnaround
The numbers tell the story:
- Average time from PR open to first review comment dropped from 24 hours to 4 hours. The AI comments appear within 3 minutes of the PR opening.
- Average number of human review iterations dropped from 2.8 to 1.4. By the time a human reviewer looks at the PR, the author has already addressed the mechanical issues the AI flagged.
- Senior engineer time spent on code review dropped by roughly 35%. They review fewer PRs total but spend more time on each one, focusing on the architectural and business logic concerns that matter.
- PR merge time dropped from an average of 3.2 days to 1.8 days.
How the Team Reacted
The most surprising outcome was the absence of pushback. I expected engineers to resist AI review as surveillance or to feel insulted by a robot commenting on their code. Instead, the team welcomed it. The reason was simple: nobody enjoys getting a human review comment that says “you forgot to handle the null case here.” It feels like a failure. Getting that same comment from an AI feels like a linter caught something. No ego, no social dynamics, just a tool doing its job.
Two senior engineers told me they actually enjoy code review more now. They spend their review time on interesting problems: evaluating design tradeoffs, questioning business assumptions, and mentoring junior engineers through architectural feedback. The mechanical checklist work is gone.
What I Would Do Differently
If I were setting this up again, I would start with a two-week silent mode where the AI comments are only visible to the PR author, not the reviewers. This gives the author time to fix mechanical issues before the PR is “officially” ready for review. We implemented this after the fact and it further reduced review iterations.
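One way to wire that author-only visibility is to key it to the PR's draft/ready state. This is a sketch under that assumption; the `PullRequest` fields and the `Audience` type are hypothetical names, not a real platform API.

```typescript
// Sketch of the "silent mode" gate: until the author marks the PR
// ready, AI comments are routed only to the author, so mechanical
// fixes land before reviewers ever see the diff.
interface PullRequest {
  isDraft: boolean;
  markedReadyForReview: boolean;
}

type Audience = "author-only" | "all-reviewers";

function aiCommentAudience(pr: PullRequest): Audience {
  if (pr.isDraft || !pr.markedReadyForReview) {
    return "author-only";
  }
  return "all-reviewers";
}
```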
This is the phase where individual scars finally turned into repeatable operating principles. I cared less about sounding clever and more about leaving behind a system that stayed sane without me in the room. That is how I build jarvis, alfred, and the portfolio RAG stack, too.
AI code review is not about replacing human reviewers. It is about freeing human reviewers to do the work that only humans can do: evaluate whether the code solves the right problem, not just whether it follows the right patterns.
The first-pass AI review is now as embedded in our workflow as linting and type checking. It runs on every PR and handles the mechanical checks that humans are bad at maintaining consistently: import ordering, naming conventions, error handling patterns, and test coverage gaps. That frees human reviewers for architectural decisions, business logic correctness, and long-term maintainability, the areas where human judgment still matters most. This is the pattern I expect to see across the industry within a year: AI handles the mechanical, humans handle the meaningful, and the teams that figure out the boundary between those two categories first will move fastest.