portfolio Anshul Bisen

Claude 3.7 Sonnet's extended thinking and what it means for code review at a small team

Extended thinking mode changed how I approach code review on a team too small for dedicated reviewers. Step-by-step deliberation catches subtle type issues that fast-pass models miss.

Claude 3.7 Sonnet shipped in late February 2025 with a feature Anthropic called extended thinking. Instead of generating a response in one pass, the model can spend time deliberating step by step before producing its answer, similar to how a human reviewer might read through a pull request, consider multiple angles, and then write their feedback. I integrated it into our code review workflow that same week and the difference was immediately noticeable.

Useful AI looks more like leverage than magic.

My stance on AI changed when the tools started surviving real delivery pressure instead of toy demos. It also builds on what I learned earlier in “DeepSeek R1 and the moment I realized open-source AI would change how we build internal tools.” Jarvis, Alfred, and the internal workflow experiments mattered because they made review, triage, and architecture discussions faster without pretending the human judgment disappeared.

The workflow, not the hype.

The Small Team Review Problem

At FinanceOps we have four engineers, including me. Everyone is busy shipping features. Code review is something we all know is important, but it often gets compressed into a five-minute scan before approval because the PR is blocking someone else. The reviews catch obvious issues: naming, missing tests, clear logic errors. They rarely catch subtle problems: type narrowing gaps, race conditions in async flows, or financial calculation precision issues.

Hiring a dedicated code reviewer makes no sense at our size. Having senior engineers spend two hours per PR is not sustainable when those same engineers are also the ones writing features. The result is that review quality fluctuates with team workload. During calm weeks, reviews are thorough. During crunch periods, they are rubber stamps with a thumbs-up emoji.

I had been experimenting with AI-assisted code review since Claude 3.5 Sonnet. The results were mixed. The model would flag real issues but also produce a lot of noise: stylistic suggestions that did not matter, false positives on patterns it misunderstood, and generic “consider adding error handling” comments that were not actionable. The signal-to-noise ratio was too low for engineers to take the reviews seriously.

Why Extended Thinking Changes the Equation

Extended thinking gives the model time to reason through complex code before producing feedback. Instead of generating comments in a single forward pass, it examines the code, considers the context, checks its assumptions, and then delivers a review. The practical difference is that the feedback is more precise and less noisy.

I tested it on a PR that modified our payment reconciliation matching logic. The change looked straightforward: updating a threshold for fuzzy matching from 0.8 to 0.85. A human reviewer would likely approve it after checking that the constant was used correctly. Claude 3.7 with extended thinking traced the threshold through the entire matching pipeline, identified that the threshold was compared against a similarity score computed with floating-point arithmetic, and flagged that a strict comparison against that score is fragile without an epsilon tolerance.

// What the PR changed
const MATCH_THRESHOLD = 0.85
// What extended thinking flagged
// similarity is computed as: (matched_fields / total_fields)
// With 6 fields, similarity = 5/6 = 0.8333...
// 0.8333 < 0.85, so a 5-of-6 field match now fails
// This changes behavior: previously 5/6 passed (0.833 > 0.8)
// Was this intentional?
// Also: a strict > against a floating-point score is fragile at the boundary
// Consider: similarity >= MATCH_THRESHOLD - Number.EPSILON

This is the kind of issue that takes a human reviewer 15 minutes of tracing to catch, and most human reviewers will not spend that time on what looks like a simple constant change. The extended thinking model spent about 30 seconds deliberating through the implications and surfaced the exact behavioral change the PR would cause.
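To make the flagged failure mode concrete, here is a minimal sketch of an epsilon-tolerant threshold check. The function name is hypothetical, not our production matcher; only the threshold value comes from the PR above.

```typescript
// Hypothetical sketch, not our production matching code.
const MATCH_THRESHOLD = 0.85;

// A strict `similarity > threshold` can reject a score that sits one
// floating-point rounding step below the threshold. Tolerating an epsilon
// at the boundary avoids that failure mode without changing intent.
function matchesThreshold(
  similarity: number,
  threshold: number = MATCH_THRESHOLD
): boolean {
  return similarity >= threshold - Number.EPSILON;
}

// The behavioral change the review surfaced: a 5-of-6 field match
// (0.8333...) passed the old 0.8 threshold but fails the new one.
const fiveOfSix = 5 / 6;
matchesThreshold(fiveOfSix); // false under the new 0.85 threshold
```

The epsilon does not rescue a genuinely failing score like 5/6; it only protects scores that should sit exactly at the boundary from rounding noise.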

My Workflow

I integrated Claude 3.7 Sonnet into our GitHub pull request process using a simple CI job that runs on every PR. The workflow is not fully automated. The model produces a review comment on the PR, and a human engineer reads it and decides what to act on.

  • CI job triggers on PR open and every push to the PR branch
  • The job sends the full diff plus the files changed to Claude 3.7 Sonnet with extended thinking enabled
  • The prompt instructs the model to focus on correctness, financial calculation safety, and TypeScript type issues
  • The model produces a review comment with specific line references
  • A human engineer reads the AI review alongside their own review and decides what is actionable
  • We track AI review findings weekly to calibrate prompt instructions and reduce false positives

The prompt engineering matters a lot. Early versions of the prompt produced generic feedback. The current version includes context about our domain, specifies that financial calculations require exact decimal handling, and instructs the model to skip stylistic suggestions entirely. We iterated on the prompt for two weeks before the reviews became consistently useful.
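Under the hood, the CI step is little more than a request builder plus a POST. Here is a hedged sketch of what it assembles, assuming the Anthropic Messages API's `thinking` parameter for extended thinking; the function name, token budgets, and prompt text are illustrative, not our exact production values.

```typescript
// Illustrative sketch of the CI job's request payload, assuming the
// Anthropic Messages API (POST /v1/messages) with extended thinking.
// Names and values here are examples, not our production configuration.

type ReviewRequest = {
  model: string;
  max_tokens: number;
  thinking: { type: "enabled"; budget_tokens: number };
  system: string;
  messages: { role: "user"; content: string }[];
};

function buildReviewRequest(diff: string, changedFiles: string[]): ReviewRequest {
  return {
    model: "claude-3-7-sonnet-20250219",
    // max_tokens must leave room for both the thinking budget and the reply
    max_tokens: 16000,
    thinking: { type: "enabled", budget_tokens: 8000 },
    // Domain context plus an explicit instruction to skip style nits,
    // which is what cut the false positive rate during prompt tuning
    system:
      "You review TypeScript pull requests for a payment reconciliation " +
      "service. Focus on correctness, financial calculation safety, and " +
      "type issues. Financial calculations require exact decimal handling. " +
      "Skip stylistic suggestions entirely. Reference specific lines.",
    messages: [
      {
        role: "user",
        content: `Files changed: ${changedFiles.join(", ")}\n\nDiff:\n${diff}`,
      },
    ],
  };
}
```

The CI job sends this payload with the usual API headers, then posts the model's review text back to the PR as a comment for a human to read.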

Results After Three Months

In three months of using AI-assisted code review, the model has flagged 14 issues that made it past human review. Three of those were financial calculation precision problems that would have caused incorrect reconciliation results in production. Two were TypeScript type narrowing gaps where a value could be undefined in a code path the developer did not consider. The rest were logic errors in edge cases.
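One of those narrowing gaps looked roughly like this (shapes and names hypothetical): `Array.prototype.filter` does not narrow the element type unless its callback is a type predicate, so a later access can still see `undefined` even after an inline check.

```typescript
// Hypothetical shape; settledAt is optional until reconciliation completes.
type Payment = { id: string; amount: number; settledAt?: string };

// The buggy version filtered with an inline `p.settledAt !== undefined`
// check, which does not narrow the element type in the subsequent .map;
// the compiler still saw `string | undefined`. A type predicate carries
// the narrowing through the filter:
function isSettled(p: Payment): p is Payment & { settledAt: string } {
  return p.settledAt !== undefined;
}

function settlementDates(payments: Payment[]): string[] {
  // After filter(isSettled), elements are narrowed, so settledAt is string
  return payments.filter(isSettled).map((p) => p.settledAt);
}
```

This is a compile-time gap, not a runtime crash in the happy path, which is exactly why tired human reviewers skim past it.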

  • Average AI review time: 30 to 45 seconds per PR
  • Actionable findings: roughly 2 per week across all PRs
  • False positive rate: dropped from 40 percent in week one to under 10 percent after prompt tuning
  • The three financial calculation catches alone justified the effort

The point is judgment, not novelty.

This is where the role stopped feeling like senior-engineer-plus. Every decision had a human system wrapped around it: founders, customers, auditors, tired teammates. The same systems thinking bled into Jarvis, Alfred, and the portfolio RAG stack, where defaults matter more than heroics.

AI code review is not a replacement for human review. It is a safety net that catches the issues human reviewers miss when they are tired, rushed, or unfamiliar with the code. Extended thinking is what makes the safety net actually work because it gives the model time to trace implications instead of pattern-matching surface-level issues.

I am still cautious about depending on AI review. The model does not understand our business context the way a human teammate does. It cannot evaluate whether a feature approach is the right one. It cannot judge whether a test covers the right scenarios. But for mechanical correctness, type safety, and numerical precision, it is already better than our average human review at 11 PM on a Thursday before a release.