What Happened
When Claude Sonnet 4 (codenamed "Fable 5" internally) launched, many developers and AI power users did what seemed logical: they flipped the switch and kept their existing prompts running. The benchmarks looked incredible. The reviews were glowing. So why did the model feel worse in practice?
One developer documented exactly this experience. After switching his orchestration pipeline to Claude Sonnet 4 on day one, he noticed something unsettling: a single run clocked in at 90 minutes and consumed 200,000 tokens. The same task on Claude Opus 4.8 had taken significantly less time and far fewer tokens.
The logs told the real story. The model was re-verifying information it had already been given. It was building elaborate plans where it should have been executing. It was spinning its wheels — not because the model was broken, but because the prompts were wrong for it.
### The Diagnosis Most People Get Wrong
The instinctive reaction is to blame the model. "It's still early, wait for a patch." But that's the wrong diagnosis entirely. Claude Sonnet 4 isn't a flawed version of Opus 4.8 — it's a fundamentally different model with different defaults, different tendencies, and different needs. Feeding it prompts optimized for Opus is like running diesel in a petrol engine. The problem isn't the engine.
Anthropic's own documentation confirms this. They now publish three separate prompting guides: one general guide, one specifically for Opus 4.8, and one specifically for Sonnet 4. Reading the two model-specific guides back to back reveals something striking: they frequently recommend opposite approaches for the same situations.
Why It Matters
This isn't a niche developer problem. It affects every entrepreneur, marketer, and creator who has built workflows, automations, or content pipelines on top of Claude. If you have saved prompts, skill templates, or system instructions written before mid-2025, there's a strong chance they are actively hurting your results with the newest models.
Anthropic's guidance is blunt: prompting skills written for previous models are "too prescriptive" for Sonnet 4 and measurably reduce output quality. OpenAI has issued similar warnings about GPT-5.5, explicitly telling developers to "treat GPT-5.5 as a new model family to tune for, not a drop-in replacement."
### The Specific Behaviors That Differ
Here are the concrete differences that matter most in practice:
- Sub-agent delegation: Opus 4.8 is conservative about spawning sub-agents, so you need to push it. Sonnet 4 delegates eagerly, so you need to constrain it, or it will over-delegate and bloat your token usage.
- Starting effort level: For Opus, the recommended starting effort is `xhigh`. For Sonnet 4, it's `high`. Starting too high with Sonnet 4 triggers over-planning behavior.
- Reasoning instructions: This one is critical. If your prompts contain phrases like "explain your reasoning" or "show your thinking," Sonnet 4 may trigger a `reasoning_extraction` refusal and fall back to Opus 4.8 automatically. That's a hidden cost and a hidden performance hit that most users would never notice without checking their logs.
### How the Two Vendors Approach the Problem Differently
Anthropic and OpenAI have taken divergent architectural approaches to model behavior control. OpenAI moves behavioral levers into API parameters: `reasoning_effort`, `verbosity`, and `phase` are all adjustable at the API level. Anthropic keeps behavioral control inside the prompt text itself, providing ready-made snippets for common issues.
Both vendors agree on one thing: `effort` is a meaningful lever. But everything else is handled differently, which means your prompting strategy needs to account for which platform you're on.
How to Use It Today
The practical solution that works across multiple models is a two-layer prompt architecture. The first layer is a vendor-agnostic core framework. The second layer is a short model-specific overlay.
### Build a Vendor-Agnostic Core
Your base prompt should contain only four elements that hold true regardless of which model processes it:
1. Goal — What outcome are you trying to achieve?
2. Success criteria — How will you know it worked?
3. Constraints — What must not happen?
4. Stop rules — When should the model stop and ask rather than continue?
This skeleton is stable. It doesn't fight any model's defaults because it doesn't assume any particular behavior.
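As a rough illustration, the four-element core could be kept as structured data and rendered to text only when a prompt is sent. This is a minimal sketch, not a prescribed format: the `CorePrompt` class, its field names, and the example task are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CorePrompt:
    """Vendor-agnostic core: goal, success criteria, constraints, stop rules."""

    goal: str
    success_criteria: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    stop_rules: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the four elements as plain-text sections, in a fixed order."""

        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items)

        sections = [
            ("Goal", self.goal),
            ("Success criteria", bullets(self.success_criteria)),
            ("Constraints", bullets(self.constraints)),
            ("Stop rules", bullets(self.stop_rules)),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Hypothetical example task, used only to show the rendered shape.
core = CorePrompt(
    goal="Summarize the attached incident report for the on-call team.",
    success_criteria=["Under 200 words", "Lists every affected service"],
    constraints=["Do not speculate about root cause"],
    stop_rules=["If the report is missing a timeline, stop and ask for it"],
)
print(core.render())
```

Note that nothing in the rendered text tells the model how to think or how to sequence its work, which is what keeps the core portable across models.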
### Add Thin Model Overlays
On top of that core, add no more than five lines of model-specific instruction. For Sonnet 4, this means removing any language that asks the model to explain its reasoning, capping delegation explicitly, and setting effort to `high` rather than `xhigh`. For Opus 4.8, you might do the opposite — encourage sub-agent use and set effort higher.
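One way to keep overlays thin is to store them separately from the core and enforce the five-line limit in code. The sketch below is an assumption-laden illustration: the model keys, the overlay wording, and the `compose` helper are hypothetical, and the effort values are simply the ones named above. How an effort setting is actually passed to a given vendor is deliberately left out.

```python
MAX_OVERLAY_LINES = 5  # the "no more than five lines" rule

# Hypothetical overlay wording that paraphrases the guidance above.
OVERLAYS = {
    "sonnet-4": [
        "Delegate to at most two sub-agents per task.",
        "Execute directly when the inputs are already verified.",
    ],
    "opus-4.8": [
        "Use sub-agents freely for independent subtasks.",
    ],
}

# Starting effort per model, as described in the text.
EFFORT = {"sonnet-4": "high", "opus-4.8": "xhigh"}


def compose(core_text: str, model: str) -> str:
    """Append the model's overlay to the core text, enforcing the size cap."""
    overlay = OVERLAYS.get(model, [])
    if len(overlay) > MAX_OVERLAY_LINES:
        raise ValueError(
            f"overlay for {model!r} has {len(overlay)} lines, "
            f"limit is {MAX_OVERLAY_LINES}"
        )
    if not overlay:
        # Unknown model: send the core unchanged rather than guess.
        return core_text
    notes = "\n".join(f"- {line}" for line in overlay)
    return f"{core_text}\n\nModel-specific notes:\n{notes}"


print(compose("Goal:\nSummarize the incident report.", "sonnet-4"))
print("effort:", EFFORT["sonnet-4"])
```

Keeping the overlay in a separate table means a model switch changes one dictionary key instead of the prompt body.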
If you're managing multiple AI tools and want a faster way to test and iterate on prompt structures without burning tokens, free tools at [mykreatool.com](https://mykreatool.com) can help you prototype prompt variations quickly before committing them to production pipelines.
### Audit Your Existing Prompts
Search your saved prompts and skill templates for these high-risk phrases:
- "explain your reasoning"
- "show your thinking"
- "check X, then check Y" (step-by-step verification chains)
- "report your progress as you go"
- "make sure to verify"
For Sonnet 4, these phrases are anchors, not accelerators. Remove or replace them.
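The audit can be partly automated with a small scanner. The sketch below searches prompt text for the phrases listed above. The "check X, then check Y" item has no fixed wording, so it is approximated with a regular expression, which is an assumption about how such chains are typically phrased and will miss variants. The `audit_prompt` function and the sample prompt are hypothetical.

```python
import re

# Literal phrases come from the list above. The "verification chain" pattern
# is an approximation of "check X, then check Y" and will not catch every form.
HIGH_RISK_PATTERNS = {
    "explain your reasoning": re.compile(r"explain your reasoning", re.IGNORECASE),
    "show your thinking": re.compile(r"show your thinking", re.IGNORECASE),
    "verification chain": re.compile(
        r"\bcheck\b[^.\n]*?,\s*then check\b", re.IGNORECASE
    ),
    "report your progress": re.compile(
        r"report your progress as you go", re.IGNORECASE
    ),
    "make sure to verify": re.compile(r"make sure to verify", re.IGNORECASE),
}


def audit_prompt(text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for each high-risk phrase found in text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in HIGH_RISK_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Hypothetical saved prompt used to exercise the scanner.
sample = """You are a research assistant.
Check the source list, then check the dates.
Always explain your reasoning before answering."""

for lineno, label in audit_prompt(sample):
    print(f"line {lineno}: {label}")
```

A scanner like this only flags candidates. Whether a flagged phrase should be removed or reworded still depends on which model the prompt is meant for.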
Who Benefits
This shift matters most for three groups:
Developers and technical founders running agentic pipelines will see the most dramatic token savings and speed improvements. The 200,000-token run described above is an extreme example, but even a 20-30% token reduction compounds fast at scale.
Marketers using AI automation for content workflows, research pipelines, or customer communication will notice more consistent output quality once prompts are tuned to the model actually running them.
Creators using AI assistants for writing, ideation, or production work will get better first drafts and fewer frustrating non-answers if they strip prescriptive micro-instructions from their templates.
Risks
### Prompt Debt Accumulates Silently
The biggest risk here is invisible degradation. If your prompts were written for an older model and you've upgraded without re-tuning, you may be getting worse results than you were six months ago — without any obvious error message telling you so. Token costs go up. Output quality drifts down. You assume the model is the problem.
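Because this kind of degradation produces no error message, one option is to compare token usage per run before and after a model switch. The sketch below is a simple heuristic under stated assumptions: the `token_drift` and `needs_audit` helpers, the 20% threshold, and the token counts are all hypothetical, and the counts are assumed to come from your own run logs for the same task.

```python
from statistics import median


def token_drift(baseline_runs: list[int], current_runs: list[int]) -> float:
    """Relative change in median tokens per run (0.25 means 25% more tokens)."""
    base = median(baseline_runs)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (median(current_runs) - base) / base


def needs_audit(
    baseline_runs: list[int], current_runs: list[int], threshold: float = 0.2
) -> bool:
    """Flag a prompt for review when median usage grows past the threshold."""
    return token_drift(baseline_runs, current_runs) > threshold


# Hypothetical token counts for the same task before and after an upgrade.
before = [40_000, 42_000, 38_000]
after = [60_000, 58_000, 61_000]

print(f"drift: {token_drift(before, after):.0%}")
print("audit needed:", needs_audit(before, after))
```

The median is used rather than the mean so that one runaway run does not trigger a review by itself. Token usage is only a proxy, so a flagged prompt still needs a manual look at output quality.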
### The Maintenance Burden Is Real
Maintaining separate prompt collections for Opus, Sonnet 4, and GPT-5.5 is genuinely more work. The two-layer architecture reduces this burden, but it doesn't eliminate it. Every major model release now requires a prompt audit. That's a new operational cost that teams need to budget for.
### Universal Prompts Are a Trap
The temptation to find one prompt that works everywhere is understandable but counterproductive. As Max Leiter of Vercel put it: "Prompts overfit to models the same way models overfit to data." Every "Be concise" or "DO NOT BE LAZY" instruction in your prompt is a patch for one model's defaults. Change the model, and half those patches become useless — and the other half start causing harm.
Conclusion
Claude Sonnet 4 is a genuinely more capable model than Opus 4.8 by most measures. But capability doesn't translate to results if your prompts are fighting the model's defaults instead of working with them. The 200,000-token run that opened this story wasn't a model failure — it was a prompt failure.
The fix is straightforward in principle: build a stable vendor-agnostic core, add thin model-specific overlays, and audit your existing prompts for phrases that trigger the wrong behaviors in newer models. Both Anthropic and OpenAI have now made it official policy that new models require new prompting strategies. The sooner you treat prompt maintenance as a recurring task rather than a one-time setup, the less money and time you'll waste on tokens that go nowhere.


