Claude 3.5 Sonnet v2 and the week I mass-refactored our codebase with an AI pair programmer
I used Claude 3.5 Sonnet v2 to refactor our error handling layer, migrate 80 test files, and generate TypeScript types from our OpenAPI spec. This is a workflow journal.
Claude 3.5 Sonnet v2 shipped on October 22nd with what Anthropic called dramatically improved coding capabilities. I had been using the original Sonnet for code review and documentation but avoided using it for direct code generation because the error rate on complex TypeScript was too high. The v2 marketing promised better. I decided to test it the only way that matters: on real production code with real deadlines.
I spent the following week using Claude as a pair programmer for three large refactoring tasks that had been sitting in our backlog because they were tedious but important. This is not a product review. It is a workflow journal with specific prompts, specific corrections, and honest assessments of where AI saved hours and where it confidently generated code that would have caused production incidents.
Even these early AI experiments were less about trend-chasing than about buying back engineering time. The week also built on what I learned earlier in “Next.js 15, Turbopack stable, and the mass codebase migration nobody is talking about.” That approach became a pattern across jarvis, alfred, and eventually the portfolio RAG work: use models to accelerate judgment, not replace it.
Task 1: Error Handling Layer Refactor
Our error handling was inconsistent across 200 API endpoints. Some threw custom error classes, some returned error objects, and some called next(error) with raw strings. I wanted every endpoint to use our standardized AppError class with machine-readable codes.
I gave Claude the AppError class definition, three examples of correctly refactored endpoints, and asked it to refactor a batch of 20 endpoints at a time. The results were mixed in a specific and predictable way.
- Straightforward refactors (simple try-catch to AppError): 95% accuracy. Claude identified the error pattern, replaced it with AppError, and chose appropriate error codes. I accepted these with minimal review.
- Business logic errors (validation failures, authorization checks): 80% accuracy. Claude usually got the error code and message right but occasionally chose generic codes when a specific one existed in our error catalog.
- Edge cases (nested try-catch, error re-throwing, conditional error handling): 60% accuracy. Claude sometimes flattened nested error handling in ways that changed the control flow. These required careful review.
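The target pattern looked roughly like this. The AppError below is a simplified sketch for illustration, not our exact class, and the endpoint is a hypothetical example:

```typescript
// Simplified sketch of a standardized error class with machine-readable
// codes. The real AppError carries more metadata (request id, cause, etc.).
class AppError extends Error {
  readonly code: string;
  readonly statusCode: number;

  constructor(code: string, message: string, statusCode = 500) {
    super(message);
    this.name = "AppError";
    this.code = code;       // machine-readable, e.g. "INVOICE_ID_MISSING"
    this.statusCode = statusCode;
  }
}

// Before: a handler might call next("failed to load invoice") with a raw
// string. After: the same failure becomes a typed, catalogued error that
// middleware can map to a response.
function loadInvoice(id: string): { id: string } {
  if (!id) {
    throw new AppError("INVOICE_ID_MISSING", "An invoice id is required", 400);
  }
  return { id };
}
```

The "95% accuracy" bucket was exactly this shape: a bare throw or a raw next(error) string replaced with an AppError and a code chosen from the catalog.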
The net result: I refactored 200 endpoints in three days instead of the estimated two weeks. About 140 of those were accept-and-move-on. About 40 needed minor corrections. About 20 needed significant rewriting. Without AI, all 200 would have been manual work.
Task 2: Test Migration
We were migrating from Jest to Vitest. The API is almost identical but there are subtle differences: different mock implementations, different timer handling, and different module resolution. I had 80 test files to migrate.
I gave Claude one fully migrated test file as a reference and asked it to migrate each remaining file. This task had the highest success rate of the three.
- Import replacements (jest to vi, describe/it/expect unchanged): 100% accuracy. Mechanical find-and-replace that Claude did flawlessly.
- Mock migration (jest.fn() to vi.fn(), jest.mock() to vi.mock()): 95% accuracy. The only failures were in tests with complex mock factory functions where Vitest handles scope differently.
- Timer handling (jest.useFakeTimers to vi.useFakeTimers): 90% accuracy. The API is similar but the default behavior differs. Claude occasionally missed the configuration differences.
Eighty test files migrated in one day. The original estimate was a full week. Every migrated file passed the test suite on first run except for 6 that needed mock scope adjustments.
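The mechanical portion of the migration amounts to rewrites like the ones below. This is a hypothetical sketch, not the actual process (the real work was done file-by-file with Claude), but it shows the category of change that was 95-100% safe to accept:

```typescript
// Sketch of the mechanical Jest -> Vitest rewrites. Anything beyond these
// (mock factory scope, fake-timer configuration) needed human review.
const MECHANICAL_REWRITES: Array<[RegExp, string]> = [
  [/\bjest\.fn\(/g, "vi.fn("],
  [/\bjest\.mock\(/g, "vi.mock("],
  [/\bjest\.useFakeTimers\(/g, "vi.useFakeTimers("],
  [/\bjest\.useRealTimers\(/g, "vi.useRealTimers("],
];

function migrateTestSource(source: string): string {
  const rewritten = MECHANICAL_REWRITES.reduce(
    (s, [from, to]) => s.replace(from, to),
    source,
  );
  // Unlike Jest, Vitest does not inject `vi` as a global by default,
  // so add the import when the rewritten file uses it.
  if (/\bvi\./.test(rewritten) && !/from ["']vitest["']/.test(rewritten)) {
    return `import { vi } from "vitest";\n${rewritten}`;
  }
  return rewritten;
}
```

Giving Claude one fully migrated file as a reference effectively taught it this rewrite table plus the non-mechanical conventions around it.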
Task 3: OpenAPI Type Generation
Our payment processor provides an OpenAPI 3.1 spec. I wanted to generate TypeScript types from it so our webhook handlers would be type-safe against the processor’s actual API shape. There are tools for this (openapi-typescript) but the generated types needed post-processing to match our naming conventions and add custom utility types.
This was the task where Claude struggled the most. The OpenAPI spec was 3,000 lines of YAML with nested $ref references, discriminated unions, and nullable fields. Claude handled the straightforward types correctly but made three categories of mistakes.
- Nullable fields: Claude sometimes generated field?: Type instead of field: Type | null. In TypeScript strict mode, these have different semantics. Optional means the field can be missing. Nullable means the field exists but can be null. For webhook payloads, the distinction matters.
- Discriminated unions: The OpenAPI spec uses oneOf with a discriminator field. Claude sometimes generated plain union types instead of proper discriminated unions, losing the type narrowing capability.
- Nested references: Deeply nested $ref chains were where the context window broke down. Claude would resolve the first two levels of references correctly but inline the third level instead of following the reference.
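The first two failure modes are easy to see in a small example. The field and event names here are hypothetical, not the processor's real schema:

```typescript
// Optional vs nullable: different semantics under TypeScript strict mode.
interface WebhookMetadata {
  traceId?: string;             // key may be absent entirely
  failureReason: string | null; // key is always present, value may be null
}

// What an OpenAPI oneOf + discriminator should become: a discriminated
// union keyed on `type`, so the compiler can narrow each branch.
type PaymentEvent =
  | { type: "payment.succeeded"; amount: number }
  | { type: "payment.failed"; amount: number; failureCode: string };

function describeEvent(event: PaymentEvent): string {
  // Checking the discriminant narrows the union: `failureCode` is only
  // accessible inside the failed branch.
  if (event.type === "payment.failed") {
    return `failed: ${event.failureCode}`;
  }
  return `succeeded: ${event.amount}`;
}
```

A plain union with `type: string` compiles but loses this narrowing, which is exactly what Claude's incorrect output did: the code type-checked, so the mistake only surfaced on review.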
The Honest Assessment
Claude 3.5 Sonnet v2 is a genuine productivity multiplier for code refactoring. Not a replacement for engineering judgment, but a tool that handles the mechanical aspects of large-scale changes while the human focuses on correctness and edge cases.
The pattern that worked: give Claude a clear specification, concrete examples, and small batches. The pattern that failed: give Claude a large context, vague instructions, and expect it to make judgment calls about business logic. Every prompt that started with “refactor these 50 files” produced worse results than prompts that started with “here is one correctly refactored file, now do these 5 files the same way.”
- Total time saved across three tasks: approximately 8 working days.
- Time spent reviewing and correcting AI output: approximately 2 working days.
- Net time savings: 6 working days, or roughly 60% reduction in effort.
- Production bugs from AI-generated code: zero, because every change was reviewed before commit.
That was the pattern of my first months at FinanceOps: I did not have management scar tissue yet, so I earned trust by making technical decisions that stayed boring under pressure. The same bias toward strict defaults still shows up in jarvis, alfred, and the portfolio RAG stack today.
AI pair programming is not about trusting the AI. It is about using the AI to generate a first draft that is faster to review and correct than writing from scratch. The review step is not optional. Skip it and you will ship bugs that you confidently approved.