Claude Opus 4 and Sonnet 4: the week AI coding tools stopped being novelties and became infrastructure
When Claude Opus 4 hit 72.5% on SWE-bench and solved a TypeScript generics issue that had stumped our team, the conversation shifted from "should we use AI" to "how do we integrate it."
On May 22, 2025, Anthropic released Claude Opus 4 and Sonnet 4. I had been using Claude models for code review and pair programming since the 3.5 Sonnet days, but this release felt different. Opus 4 scored 72.5 percent on SWE-bench, meaning it could independently resolve nearly three quarters of real GitHub issues from open-source projects. Sonnet 4 was faster and cheaper while still outperforming the previous generation on coding tasks. That week, the conversation at FinanceOps shifted from “should we experiment with AI tools” to “how do we make this part of our engineering workflow.”
My stance on AI changed when the tools started surviving real delivery pressure rather than just toy demos. That shift builds on what I learned earlier in "OpenAI o3 and o4-mini: reasoning models are getting good enough to replace junior code review." Jarvis, Alfred, and the internal workflow experiments mattered because they made review, triage, and architecture discussions faster without pretending that human judgment disappeared.
The TypeScript Generics Problem
The moment that changed my perspective was not a benchmark number. It was a specific problem that had blocked our team for two days. We were building a type-safe API client generator that needed to infer response types from our OpenAPI specification. The challenge was creating a TypeScript generic that could map a string literal endpoint path to its corresponding response type while handling path parameters, query parameters, and discriminated union error types.
Two engineers had spent a combined 16 hours on the type definition. They had a version that worked for simple cases but broke on endpoints with path parameters. The TypeScript compiler error was 40 lines of nested conditional type mismatches that none of us could fully parse.
```typescript
// What we needed: a type-safe API client
const response = await client.get('/transactions/:id')
// response should be typed as Transaction
// path params should be inferred and required
```

```typescript
// What Opus 4 produced after seeing our OpenAPI spec
type ExtractPathParams<T extends string> =
  T extends `${string}/:${infer Param}/${infer Rest}`
    ? { [K in Param | keyof ExtractPathParams<`/${Rest}`>]: string }
    : T extends `${string}/:${infer Param}`
      ? { [K in Param]: string }
      : Record<string, never>

// This compiled correctly on the first try
// Our version had been missing the recursive case
```

I pasted the problem description, the OpenAPI spec, and our broken type definition into Claude Opus 4. Within 30 seconds, it produced a working solution with a clear explanation of why our recursive type was failing: we were not handling the base case where the path ends with a parameter. It also suggested a simpler approach using template literal types that avoided the recursion entirely. The fix was conceptually obvious in hindsight, but none of us had seen it despite 16 hours of combined effort.
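To see what the type buys you in practice, here is a hedged sketch of wiring `ExtractPathParams` into a request helper. The `buildPath` function is illustrative, not part of our actual generator; the type is repeated so the example is self-contained.

```typescript
// Sketch only: ExtractPathParams repeated here so the example stands alone.
type ExtractPathParams<T extends string> =
  T extends `${string}/:${infer Param}/${infer Rest}`
    ? { [K in Param | keyof ExtractPathParams<`/${Rest}`>]: string }
    : T extends `${string}/:${infer Param}`
      ? { [K in Param]: string }
      : Record<string, never>;

// Runtime counterpart: substitute :params into a path template.
// Hypothetical helper, not our real client code.
function buildPath<P extends string>(path: P, params: ExtractPathParams<P>): string {
  return path.replace(/:([A-Za-z0-9_]+)/g, (_match, name: string) => {
    const value = (params as unknown as Record<string, string>)[name];
    if (value === undefined) throw new Error(`Missing path param: ${name}`);
    return encodeURIComponent(value);
  });
}

const url = buildPath('/transactions/:id', { id: 'txn_123' });
// buildPath('/transactions/:id', {}) fails to compile: 'id' is required.
console.log(url);
```

The payoff is that forgetting a path parameter becomes a compile-time error instead of a 404 discovered at runtime.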
From Tool to Infrastructure
Individual productivity gains from AI tools are impressive but not transformative. What is transformative is when AI tools change how the team operates. After the Opus 4 release, we formalized three integration points where AI became part of our engineering infrastructure rather than an optional tool individual engineers might or might not use.
- Code review: Every PR gets an automated review from Sonnet 4 focusing on correctness, type safety, and financial calculation precision. Findings appear as PR comments that human reviewers incorporate into their review.
- Pair programming: Complex TypeScript type definitions, database query optimization, and algorithm design start with an AI pair session. The engineer describes the problem, the model proposes solutions, the engineer evaluates and refines.
- Documentation generation: API endpoint documentation, runbook updates, and architecture decision records get a first draft from the model, which the engineer reviews and edits. Documentation that previously took an hour to write from scratch takes 15 minutes to review and polish.
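As a rough illustration of the first integration point, the mechanical core of an automated review pass is just assembling the diff and review guidelines into a prompt before sending it to the model. The function, interface, and guideline text below are hypothetical, not our production pipeline:

```typescript
// Hypothetical sketch of the prompt-assembly step in an automated PR review.
// Guideline text and names are illustrative, not our actual configuration.
interface PullRequest {
  title: string;
  description: string;
  diff: string; // unified diff text
}

const REVIEW_GUIDELINES = [
  'Flag correctness bugs and unhandled edge cases.',
  'Flag any weakening of TypeScript types (any, unchecked casts).',
  'Flag floating-point arithmetic in financial calculations.',
].join('\n- ');

function buildReviewPrompt(pr: PullRequest): string {
  return [
    'Review the following pull request.',
    `Title: ${pr.title}`,
    `Description: ${pr.description}`,
    `Focus on:\n- ${REVIEW_GUIDELINES}`,
    'Respond with one finding per line, prefixed by file and line number.',
    '--- DIFF ---',
    pr.diff,
  ].join('\n\n');
}
```

The model's findings then get posted as PR comments, where a human reviewer decides which ones matter.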
The Guardrails
Making AI tools infrastructure rather than novelties required guardrails that prevented bad patterns from emerging.
- No AI-generated code merges without human review. The PR process is the same regardless of whether a human or AI wrote the code.
- AI-generated code must pass the same test suite. No relaxation of coverage requirements or linting rules.
- Engineers must understand the code they commit. If you cannot explain why the AI solution works, you cannot commit it.
- Financial calculation code requires manual verification against test cases with known correct outputs, regardless of origin.
- We track the source of bugs found in production. If AI-generated code starts producing more bugs per line than human-generated code, we re-evaluate the integration.
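The fourth guardrail can be enforced with golden test cases: known inputs with independently verified outputs that any implementation, human- or AI-written, must reproduce exactly. The interest calculation and figures below are illustrative examples, not our actual financial code:

```typescript
// Illustrative golden-case check for financial code: integer cents, known outputs.
// The calculation and expected values are hypothetical examples.
function monthlyInterestCents(principalCents: number, annualRateBps: number): number {
  // annualRateBps: annual rate in basis points (1% = 100 bps).
  // Round to the nearest cent.
  return Math.round((principalCents * annualRateBps) / 10_000 / 12);
}

const goldenCases: Array<[number, number, number]> = [
  // [principalCents, annualRateBps, expectedInterestCents]
  [100_000, 1_200, 1_000], // $1,000.00 at 12% -> $10.00/month
  [250_000, 500, 1_042],   // $2,500.00 at 5%  -> $10.42/month
];

for (const [principal, rate, expected] of goldenCases) {
  const actual = monthlyInterestCents(principal, rate);
  if (actual !== expected) {
    throw new Error(`Golden case failed: got ${actual}, expected ${expected}`);
  }
}
```

Working in integer cents sidesteps floating-point drift, and the golden cases make "looks right" into "matches independently verified numbers."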
The third guardrail is the most important and the hardest to enforce. When an AI model produces a working solution to a problem you have been stuck on for two days, the temptation is to commit it and move on. But code you do not understand is a liability regardless of who wrote it. The model might be right today and wrong tomorrow when the underlying assumptions change.
What Changed in Our Velocity
In the month after integrating Opus 4 and Sonnet 4 into our workflow, our cycle time from PR open to merge dropped by 25 percent. Documentation coverage increased because the barrier to writing docs dropped. Code review quality improved because the AI reviewer catches mechanical issues that human reviewers skip when rushed.
The most significant change was not measurable in metrics. It was a shift in ambition. Problems that the team would have scoped down or deferred because the implementation was too complex became feasible. The type-safe API client was one example. A migration from callback-based error handling to Result types was another. These were improvements we knew we wanted but could not justify the engineering time for. With AI pair programming reducing the implementation time by roughly 40 percent, the cost-benefit math changed.
By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was managing the blast radius of those decisions across product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside Jarvis, Alfred, and the portfolio RAG stack.
AI coding tools become infrastructure when you stop asking “should we use this” and start asking “how do we integrate this safely.” The safety question is about guardrails, review processes, and understanding requirements. Once you answer it, the productivity gains are not optional. They are competitive advantages that compound over time.
I expect AI coding tools to be as fundamental to engineering teams as version control and CI within two years. The teams that integrate them early with proper guardrails will have a compounding advantage over teams that are still debating whether to use them.