
Opus 4.6 has a million-token context window and I am still not sure what to do with it

Feeding entire codebases into a single prompt for architecture review produced mixed results. The practical sweet spots are narrower than the capability suggests.

Claude Opus 4.6 shipped with a million-token context window and 128K max output. The immediate reaction on my team was excitement: we could feed our entire codebase into a single prompt and ask the model to review our architecture, find bugs, or suggest improvements. We tried it. The results were more complicated than the excitement suggested.

Useful AI looks more like leverage than magic.

I write these AI posts from the far side of the honeymoon phase, and this one builds on what I learned earlier in “S3 Vectors at re:Invent made me reconsider our entire RAG architecture.” The interesting question is no longer whether the models are impressive. It is where they meaningfully improve decision quality across real systems like portfolio search, aigw, jarvis, and the review loops around everyday engineering work.

The workflow, not the hype.

The Experiment

We fed Opus 4.6 three different codebases of increasing size and asked for architecture review:

  • The portfolio site: ~15,000 tokens. Small enough that the model handled it perfectly. Accurate observations about component structure, identification of unused code, and sensible suggestions for simplification.
  • A microservice from FinanceOps: ~80,000 tokens. Good results. The model identified a circular dependency we had missed, flagged an inconsistency in our error handling patterns, and suggested a refactoring that we implemented.
  • The full FinanceOps monorepo: ~600,000 tokens. Mixed results. The model made several accurate high-level observations but also hallucinated two dependencies that do not exist, confused two similarly-named modules, and suggested a refactoring that would have broken a cross-service contract.
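Knowing which of these tiers a codebase falls into before you prompt is half the battle. A minimal sketch of the sizing step we ran first, using the rough 4-characters-per-token heuristic (the extension list and skip directories are assumptions, not our actual config):

```python
import os

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and code.
    return max(1, len(text) // 4)

def repo_token_estimate(root: str, exts=(".py", ".ts", ".tsx", ".md", ".json")) -> int:
    """Walk a source tree and sum rough token estimates for matching files."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune typical vendored/build directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules", "dist"}]
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total += estimate_tokens(f.read())
    return total
```

A run of this told us immediately whether we were in the ~15K "handles it perfectly" regime or the ~600K "verify everything" regime.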

Where the Context Window Helps

The million-token context window is genuinely valuable for focused analysis of large but coherent codebases. The sweet spots we found:

  • Single-service architecture review. Feed one service plus its tests and configuration. The model provides detailed, accurate analysis because the context is large but focused.
  • Cross-file refactoring. The model can see how a type is used across 50 files and suggest a change that accounts for all usage sites. This was previously impossible without multiple prompts and manual context management.
  • Documentation generation. The model reads the entire codebase and produces documentation that is more accurate than what an engineer would write from memory, because the model has literally read every file.
  • Test gap analysis. Feed the source and test files together. The model identifies untested code paths with surprising accuracy because it can cross-reference implementation with test coverage.
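For the test-gap case, we found it helps to do a cheap mechanical pass before spending tokens, so the prompt can point the model at suspect modules. A minimal sketch, assuming the common `test_<module>.py` naming convention (your test layout may differ):

```python
from pathlib import PurePosixPath

def find_untested_modules(source_files, test_files):
    """Pair source modules with tests via the test_<name>.py convention
    and return source files that have no matching test file."""
    tested = {
        PurePosixPath(t).name.removeprefix("test_")
        for t in test_files
        if PurePosixPath(t).name.startswith("test_")
    }
    return sorted(s for s in source_files if PurePosixPath(s).name not in tested)
```

The model then only has to cross-reference implementation and coverage for the modules this pre-pass flags, rather than rediscovering the pairing itself.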

Where It Falls Short

The failure modes were consistent and instructive:

  • Lost-in-the-middle degradation. At high token counts, the model’s attention to detail degrades: details in the middle of the context get less attention than details at the beginning and end. A bug buried in file 47 of 120 is more likely to be missed than a bug in file 1 or file 120.
  • Implicit contracts. When service A calls service B through an API, and both are in the context window, the model sometimes misunderstands the contract between them. Explicit interface definitions help; implicit assumptions about behavior do not.
  • Hallucinated dependencies. At 600K tokens, the model occasionally referenced packages, functions, or modules that do not exist in the codebase. The hallucination rate was low, maybe 2-3% of claims, but each one requires human verification.
  • Confidence calibration. The model is equally confident in its accurate observations and its hallucinations. There is no signal that distinguishes “this is definitely a bug” from “this might be a dependency that I am confusing with something from training data.”

The Practical Workflow

After two weeks of experimentation, we settled on a workflow that uses the large context window effectively:

For architecture review: Feed one service at a time, not the entire codebase. Include the service, its tests, its configuration, and the interface definitions of services it calls. Keep the context under 200K tokens. The quality of analysis is dramatically higher at this scale than at 600K.
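In practice we scripted this assembly step so nobody eyeballed the budget. A minimal sketch of a greedy context packer: sections arrive in priority order (service code first, then tests, config, interface definitions) and anything that would blow the 200K budget is dropped; the section labels and 4-chars-per-token estimate are assumptions for illustration:

```python
def pack_context(sections, budget_tokens=200_000):
    """Greedily pack prompt sections in priority order until the token
    budget is reached. Each section is a (label, text) pair; token counts
    use a rough 4-characters-per-token estimate."""
    packed, used = [], 0
    for label, text in sections:
        cost = max(1, len(text) // 4)
        if used + cost > budget_tokens:
            continue  # skip any section that would exceed the budget
        packed.append(f"### {label}\n{text}")
        used += cost
    return "\n\n".join(packed), used
```

Because the list is priority-ordered, the service itself always makes the cut and lower-priority material is what gets trimmed when space runs out.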

For cross-codebase analysis: Ask specific questions, not open-ended ones. “Find all places where we handle database connection errors” produces better results than “review this codebase.” The specific question focuses the model’s attention.

For refactoring: Use the large context window to understand the scope of a change, then execute the refactoring in smaller chunks. The model is better at analysis than at generating 128K tokens of modified code in one pass.
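The "smaller chunks" step can be mechanical too. A minimal sketch, assuming you already have rough token counts for the files the analysis pass said the refactor touches, that splits them into batches small enough for one focused generation pass each (the batch size is an illustrative default, not a tuned value):

```python
def batch_refactor(files, max_tokens_per_batch=8_000):
    """Split the files touched by a refactor into batches small enough
    for the model to rewrite in one focused pass. `files` maps
    path -> rough token count."""
    batches, current, size = [], [], 0
    for path, tokens in sorted(files.items()):
        if current and size + tokens > max_tokens_per_batch:
            batches.append(current)  # flush the full batch
            current, size = [], 0
        current.append(path)
        size += tokens
    if current:
        batches.append(current)
    return batches
```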

The Honest Assessment

The point is judgment, not novelty.

In the earlier post I was mostly trying to survive the blast radius myself. Here I was trying to design a workflow where the team did not need heroics in the first place. The same philosophy now shapes jarvis, alfred, and the portfolio RAG stack.

A million-token context window is a capability in search of a workflow. The capability is real. The workflows that fully exploit it are still being discovered. The most productive use today is focused analysis at moderate scale, not everything-in-one-prompt at maximum scale.

I expect the workflows to improve as models get better at maintaining attention across very large contexts. But today, the practical sweet spot for Opus 4.6 is not “feed it everything.” It is “feed it the right things, at the right scale, with the right questions.” The million-token window is a ceiling, not a target. Working well below that ceiling produces better results than pushing against it.

The million-token context window changes what is possible but not what is practical. Being able to load an entire codebase into context does not mean you should. The most effective use of large context windows is selective: loading the specific files, tests, and documentation relevant to a focused task rather than dumping everything and hoping the model finds what matters. The discipline of providing focused, relevant context beats brute-force inclusion, no matter how many tokens the model can handle.