A few months ago, I asked an AI to draft a memo in my voice. I'd given it a page of instructions: "direct, data-first, no fluff, British English." The output was perfectly competent. It was also completely generic. It could have been written by any management consultant with a thesaurus and a deadline.
The frustrating part was that it followed every rule I gave it. Direct? Sure. Data-first? Technically, yes. No fluff? Debatable, but close. The problem wasn't compliance. The problem was that describing a voice is a fundamentally different task from demonstrating one.
That distinction turns out to be the single most important finding in the research.
I spent the last several months reading peer-reviewed papers, digging through Anthropic's and OpenAI's documentation, and building a production system to calibrate writing voice for AI tools. What I found was a field that has moved much faster than most people realise, with clear evidence about what works, what doesn't, and where the hard limits are. Most of it contradicts the generic advice you'll find in a typical "prompt engineering" guide. Four findings stand out:
- Showing an AI examples of your writing outperforms describing your style by a wide margin. Three good examples beat fifty rules.
- Showing the AI what not to write is more powerful than most people expect. Research suggests each "bad example" teaches roughly 10x more than each good one.
- Where you place instructions inside a prompt matters as much as what those instructions say. AI models have a well-documented blind spot for information in the middle.
- Even the best prompt-based approach hits a ceiling at roughly 75-85% voice fidelity. The rest requires either fine-tuning or human editing.
## Why AI writes the way it does
Before getting into solutions, it helps to understand why AI writing sounds so... AI. The answer is structural, not accidental.
Modern language models go through a training process called RLHF (reinforcement learning from human feedback). In plain terms: thousands of human raters compare pairs of AI outputs and pick the one they prefer. The model learns from those preferences. The problem is that those raters are a crowd, not an individual. The model learns to produce text that the average person would rate as "good," which is, by definition, text that has had every distinctive edge smoothed away.
Nathan Lambert, the post-training lead at the Allen Institute for AI and author of The RLHF Book, identified this as a core constraint. The training process optimises for aggregate satisfaction, and aggregate satisfaction penalises quirkiness. Models become verbose, hedge-heavy, and structurally predictable because those patterns score well on average, even if no individual human would choose to write that way.1
Good writing is, quite literally, orthogonal to how these models are trained.
This has practical implications. When you ask an AI to "write in a confident tone," it's fighting against a training baseline that rewards caution. When you ask it to "be concise," it's pushing against a reward function that associates length with quality. The sycophancy problem (where models tell you what you want to hear rather than what's accurate) is another symptom of the same root cause.2
Understanding this explains why voice calibration requires more than instructions. You're not configuring a tool. You're working against its default training.
## What actually works: a research-backed hierarchy
Across the papers and practitioner reports from 2024 to early 2026, a consistent hierarchy emerges for how to make an AI reproduce a specific writing voice. The techniques stack. The more you layer, the better the result.
The most effective approach is annotated contrastive pairs: a good example paired with a bad example and an explicit analysis of why one works and the other doesn't. Gao and Das (AAAI 2024) found this combination outperformed all individual techniques tested.
Below that, contrastive pairs alone (showing preferred vs. less preferred outputs) score highly. Critically, the AI's own default output works as the negative example. You don't need to write bad examples yourself.
Few-shot examples (3-5 real writing samples) come next. Anthropic recommends 3-5 diverse, relevant examples. Diminishing returns beyond 5 samples. One excellent example per content type beats multiple examples of the same type.
Explicit voice description (rules like "write in a direct, analytical tone") works for broad strokes but fails on nuance. The AI must interpret subjective terms, which introduces drift.
Anti-pattern lists ("never do this") rank lowest. Anthropic warns that negative instructions can backfire, making the model more likely to do the forbidden thing. Show bad examples instead. Phrase rules positively.
The headline finding is simple: showing beats telling. Three real writing samples will teach an AI more about your voice than fifty carefully worded rules. Anthropic's own documentation puts it bluntly: examples are "pictures worth a thousand words" for a language model.3
But the more interesting finding is about negative examples.
## The counterintuitive power of bad examples
In 2024, researchers Xiang Gao and Kamalika Das published a paper at AAAI (one of the top AI conferences) titled "Customizing Language Model Responses with Contrastive In-Context Learning." The core idea was deceptively simple: instead of only showing the AI what good output looks like, show it a good version and a bad version side by side, labelled "preferred" and "less preferred."4
The results were striking. Contrastive examples significantly outperformed standard few-shot prompting across style dimensions including formality, humour, conciseness, and dialect. They also used fewer tokens to do it.
Separately, Hamdan and Yuret (2025) found something even more surprising while studying how language models learn from negative examples: during a critical learning phase, each negative example improved accuracy roughly 10x more than each positive example. And "near-miss" negatives (text that was plausible but slightly wrong) were the most influential of all.5
For voice calibration, the "near-miss" is text that sounds almost right but has telltale AI patterns: the hedge words, the over-structured paragraphs, the corporate smoothness that no individual human would produce.
Here's what this looks like in practice. Consider a brief: "Write a short paragraph about why Africa's LP data is fragmented."
**Less preferred (the AI's default):**

> The landscape of limited partner data in Africa presents significant challenges for stakeholders seeking to navigate the continent's private capital markets. The fragmentation of data sources, combined with the absence of standardised reporting frameworks, creates a complex environment that requires careful analysis and strategic thinking.

**Preferred (the actual voice):**

> When we started mapping Africa's LP base, we expected one dataset. We found four different investor categories with almost no overlap in how they report commitments, fund sizes, or portfolio allocations. DFIs file to one standard. Family offices file to none. The data isn't missing. It's scattered across formats that were never designed to talk to each other.
The critical insight from the Gao and Das paper is that you don't need to write negative examples from scratch. The AI's own default output works perfectly as the "less preferred" example. Ask the AI to write something without voice guidance, label it "less preferred," then provide your actual writing labelled "preferred." Add a brief annotation explaining the differences.
Three of these pairs will teach the AI more about your voice than a full page of rules.
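Assembled into a prompt, a pair like this might look as follows. This is a minimal sketch: the XML-style tag names (`pair`, `less_preferred`, `preferred`, `why`) are illustrative conventions, not a required format, and the sample strings are condensed from the example above.

```python
# Sketch: assembling one annotated contrastive pair into a prompt block.
# Tag names are illustrative conventions, not a fixed API.

def contrastive_pair_block(preferred: str, less_preferred: str, annotation: str) -> str:
    """Pair a real writing sample with the AI's own default output,
    plus a short note on why the preferred version works."""
    return (
        "<pair>\n"
        f"<less_preferred>\n{less_preferred.strip()}\n</less_preferred>\n"
        f"<preferred>\n{preferred.strip()}\n</preferred>\n"
        f"<why>\n{annotation.strip()}\n</why>\n"
        "</pair>"
    )

block = contrastive_pair_block(
    preferred="The data isn't missing. It's scattered across incompatible formats.",
    less_preferred="The fragmentation of data sources presents significant challenges.",
    annotation="Preferred version uses short declaratives and concrete nouns; "
               "less preferred hedges with abstract filler.",
)
```

Concatenate two or three of these blocks ahead of the actual brief; the annotation in `<why>` is what turns a plain contrastive pair into the annotated variant that scored highest in the Gao and Das results.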
There is one caveat, though. While showing bad examples works, writing bad instructions ("Never use passive voice") can backfire. Anthropic's prompt engineering team has found that negative instructions can paradoxically increase the behaviour you're trying to suppress. The fix is to phrase constraints positively: instead of "Don't use passive voice," write "Use direct, active verbs."6
Show bad examples. Phrase rules positively. The distinction matters.
## Why structure matters more than content
Once you have good examples and clear rules, the next question is how to organise them. This is where most people's setups fall apart.
The instinct is to write one large document that contains everything the AI needs to know about your voice and hand it over as a single block. A monolithic voice guide. The research is clear that this is the wrong approach.
In 2024, Liu et al. published a now-famous paper called "Lost in the Middle" (Transactions of the Association for Computational Linguistics). They found that language models perform best when key information is at the beginning or end of their input window, and performance degrades significantly when important content is buried in the middle.7
The implications for voice calibration are direct: if you bury your most important voice rules in the middle of a 3,000-word style guide, the AI will pay less attention to them.
The practical response is to split your voice system into modules rather than keeping it as one document. Identity and personality go at the beginning (the "primacy zone"). Examples and reference material go in the middle. Critical constraints and format rules get repeated at the end (the "recency zone"), where they'll carry the most weight when the AI actually generates output.
This modular approach also solves a second problem: format bleed. If you've ever asked an AI to write a social media post and received something structured like a memo (with headers, bullet points, and a conclusion), you've encountered format bleed. It happens because a monolithic style guide treats all content types the same.
## A modular voice architecture
The solution is three layers assembled at runtime, not stored as one monolithic document.
**Base voice layer.** Always present. Who you are, how you write, what you never do. Placed in the primacy zone. Cached between sessions.

**Task layer.** Injected per content type. Structure, length, required elements, format-specific examples. One card per content format: social post, email, memo, article, report.

**Context layer.** Loaded on demand. Audience-specific adjustments, topic-specific rules, sourcing requirements. Changes per task without altering the base voice.
The base voice layer is the constant. It defines who you are as a writer. The task layer is the variable overlay: different structures and constraints for different content types. The context layer adds audience-specific adjustments. At generation time, all three combine. A 500-token assembled prompt built this way dramatically outperforms a 3,000-token monolithic one, precisely because the modular version puts information where the model can actually use it.
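A runtime assembler for the three layers can be sketched in a few lines. Everything here is an assumption for illustration: the field names, the sample strings, and the "re-check these constraints" wrapper are one possible convention, not a standard schema. What matters is the ordering: identity first, reference material in the middle, critical rules repeated last.

```python
# Sketch: runtime assembly of a three-layer voice prompt.
# Layer names and sample content are illustrative assumptions.

def assemble_prompt(base_voice: str, task_card: str, context_notes: str,
                    examples: list[str], critical_rules: list[str]) -> str:
    """Place identity first (primacy zone), reference material in the
    middle, and repeat critical constraints at the end (recency zone)."""
    parts = [
        base_voice,                      # primacy zone: identity, core voice
        task_card,                       # format-specific structure and length
        context_notes,                   # audience/topic adjustments
        "\n\n".join(examples),           # middle: reference examples
        "Before writing, re-check these constraints:\n- "
        + "\n- ".join(critical_rules),   # recency zone: repeated rules
    ]
    return "\n\n".join(p for p in parts if p)

prompt = assemble_prompt(
    base_voice="You write in short, direct sentences. British English.",
    task_card="Format: LinkedIn post, under 150 words, no headers.",
    context_notes="Audience: African private-capital investors.",
    examples=["<example>...</example>"],
    critical_rules=["No hedge words", "No bullet points"],
)
```

Swapping the `task_card` argument per content type is also what prevents format bleed: the memo card never touches the social-post card.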
## The platform problem
If you only use one AI tool, you can skip this section. If you use several (and most people do), this matters.
Claude, ChatGPT, and API-based tools all handle voice instructions differently. A prompt that works brilliantly on one platform can fail on another, even with the same text. The differences aren't superficial.
| Feature | Claude (Projects) | ChatGPT (Custom GPTs) | API / Automation |
|---|---|---|---|
| Instruction limit | 200K tokens | ~8,000 characters | Model context window |
| Best format | XML-tagged sections | Markdown with headers | Canonical Markdown |
| Examples | Inline + uploaded docs | Knowledge files (RAG) | Inline in system prompt |
| Key advantage | Full verbosity, no compression needed | Broad distribution via GPT Store | Full control over assembly |
| Key limitation | Platform-specific | 8K char cap forces trade-offs | No built-in adaptation |
Claude's Projects feature is currently the strongest option for voice calibration because its 200K-token context window means you can include the full voice guide, plus writing samples, plus task cards, without compressing anything. ChatGPT's Custom GPTs force hard trade-offs: at roughly 8,000 characters, you need to compress rules and move examples into separate Knowledge files, which are retrieval-based and not guaranteed to surface every session.
The practical workflow is to maintain one canonical voice guide in Markdown, then adapt it per platform. Full verbosity for Claude. Compressed with examples offloaded for ChatGPT. Conditional formatting for API workflows. Test with identical prompts across platforms using a voice rubric.
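One mechanical piece of that workflow can be sketched: compressing the canonical guide for ChatGPT's instruction cap by splitting rules from examples. The `## Examples` delimiter, the splitting heuristic, and the exact character limit are illustrative assumptions about how the canonical guide is laid out, not a documented API.

```python
# Sketch: adapting one canonical Markdown voice guide for ChatGPT.
# The delimiter, heuristic, and limit are illustrative assumptions.

CHATGPT_CHAR_LIMIT = 8000  # approximate Custom GPT instruction cap

def adapt_for_chatgpt(canonical_guide: str) -> tuple[str, str]:
    """Keep rules inline (within the cap) and offload examples to a
    separate Knowledge file for retrieval."""
    rules, _, examples = canonical_guide.partition("## Examples")
    instructions = rules.strip()[:CHATGPT_CHAR_LIMIT]
    knowledge_file = ("## Examples" + examples).strip() if examples else ""
    return instructions, knowledge_file

guide = "## Voice rules\nDirect. Data-first.\n\n## Examples\n<sample text>"
instructions, knowledge = adapt_for_chatgpt(guide)
```

For Claude, the same canonical guide ships unsplit; the adapter only exists because one platform forces the trade-off.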
## The 85% ceiling
Here's the part most voice calibration guides leave out: even with optimal prompting, examples, contrastive pairs, and modular architecture, you'll reach a ceiling at roughly 75-85% voice fidelity. The remaining gap is a structural artefact of how these models are trained.
*Ranges based on practitioner reports and Gao & Das (2024). "Voice fidelity" here is a composite of vocabulary match, structural pattern adherence, and tone consistency, not a single standardised metric.*
This is not a bug. It's a consequence of the RLHF process described earlier. The model has been trained on millions of examples to produce text that scores well with a general audience. Your voice, by definition, is not general. Every distinctive stylistic choice you make (a preference for fragments, an unusual vocabulary, a specific rhythm) is a deviation from what the model considers "optimal."
Fine-tuning (training the model on your own writing, typically 50+ samples) can push fidelity into the 90-95% range. But it's expensive, less flexible, and overkill for most use cases. The more common production approach is to accept the prompt-based ceiling and design for efficient human editing of the remaining gap.
Andrej Karpathy, formerly of OpenAI and Tesla, has reframed the entire discipline. What used to be called "prompt engineering" is now more accurately described as "context engineering," a term he and Shopify CEO Tobi Lutke popularised in mid-2025. The shift in language matters because it shifts the focus from writing a single clever instruction to designing an information architecture: getting the right voice signals to the model at the right time, in the right format, without dilution from competing instructions.8
## Six mistakes that break voice calibration
If you're building a voice system (or even a simple style guide for an AI tool), here are the failure modes I see most often, informed by the research.
### 1. Describing vibes instead of patterns
Instructions like "bold but approachable" are subjective. The AI has to interpret them, and it will interpret them differently every time. Effective rules specify measurable behaviour: "use direct verbs, average sentence length 14-18 words, no hedge words like 'might' or 'perhaps'."
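The difference between a vibe and a pattern is that a pattern can be checked mechanically. A tiny lint sketch makes the point; the thresholds and the hedge-word list are illustrative choices, not canonical values.

```python
# Sketch: turning "direct, no fluff" into measurable checks.
# Thresholds and the hedge-word list are illustrative, not canonical.
import re

HEDGES = {"might", "perhaps", "arguably", "somewhat", "potentially"}

def voice_lint(text: str) -> dict:
    """Report average sentence length and hedge-word usage."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    avg_len = len(words) / max(len(sentences), 1)
    hedges = [w for w in words if w in HEDGES]
    return {
        "avg_sentence_length": round(avg_len, 1),
        "hedge_words": hedges,
        "within_target": 14 <= avg_len <= 18 and not hedges,
    }

report = voice_lint("This might perhaps help. It is arguably somewhat useful.")
```

A rule the AI can fail measurably is also a rule a human reviewer can check in seconds, which matters for the editing workflow discussed later.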
### 2. Rules without examples
Teams that write 50 rules but include zero examples consistently get worse results than teams with 5 rules and 3 examples. Anthropic's documentation is direct on this point.3
### 3. Over-specifying edge cases
Anthropic warns against cramming every possible rule into a prompt. It consumes the model's attention budget, increases the chance of contradictory instructions, and makes the system fragile. A curated set of diverse examples covers the space better than an exhaustive rule set.9
### 4. Ignoring instruction decay
Voice rules placed at the start of a conversation degrade as the conversation grows. The model pays disproportionate attention to recent messages and "forgets" constraints buried thousands of tokens back. The fix is runtime reinforcement: appending critical voice rules to the end of each prompt, exploiting the recency effect.
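Runtime reinforcement is mechanically simple: wrap each outgoing user message so the critical rules always sit at the end of the context. The rule list and the "Voice check" wrapper below are illustrative assumptions about how you'd phrase the reminder.

```python
# Sketch: runtime reinforcement against instruction decay.
# Appends critical voice rules to each user turn so they land in the
# recency zone; rule list and wrapper phrasing are illustrative.

CRITICAL_RULES = [
    "British English spelling",
    "Average sentence length 14-18 words",
    "No hedge words",
]

def reinforce(user_message: str) -> str:
    """Append the critical rules to the end of a user turn."""
    reminder = "Voice check before you answer:\n- " + "\n- ".join(CRITICAL_RULES)
    return f"{user_message}\n\n{reminder}"

turn = reinforce("Draft a 100-word update on the Q3 fund pipeline.")
```

Because the reminder rides on every turn, it holds its position no matter how long the conversation grows.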
### 5. Not accounting for sycophancy
Models are biased toward flattery and validation, which undermines distinctive voice. They also can't reliably self-evaluate voice quality. External evaluation (human review, or comparison against a rubric) is essential.
### 6. Same intensity for all tasks
Complex tasks (data analysis, multi-step reasoning) consume the model's attention budget for execution, leaving less capacity for style. Simple tasks maintain voice better. Use lighter prompts for easy tasks and heavier voice reinforcement for complex ones.
## The last mile is human
There's a metaphor from photography that applies here. The camera captures roughly 80% of what you want. The remaining 20% happens in post-processing: the colour grading, the crop, the subtle adjustments that make an image feel intentional rather than accidental. No amount of camera technology eliminates that step. The best photographers don't fight it. They design their workflow around it.
Voice calibration with AI is the same. The research is clear about what works: show, don't tell. Use contrastive pairs. Structure for primacy and recency. Keep it modular. These techniques can get you to 80-85%, which is dramatically better than most people's current setup. But the remaining gap is where your voice actually is your voice: the specific word choices, the rhythm of a particular sentence, the instinct for when to break a rule. That last mile requires a human editor.
The systems that work best in production are the ones that acknowledge this ceiling honestly and design for it. They don't chase 100% automation. They build fast, structured workflows for the human review step so that the person writing can spend their time on the 15% that matters most, rather than rewriting everything from scratch.
The AI gets you 85% of the way there. Your job is to make the remaining 15% count.
That's not a limitation. That's a workflow.
1. Lambert, N. "Why AI writing is mid." *Interconnects*, November 2025.
2. Lambert, N. *The RLHF Book*, Chapter 14: Over-optimization. Manning, 2026.
3. Anthropic. "Use examples (multishot prompting) to guide Claude's behavior." Claude API Docs, 2025.
4. Gao, X. & Das, K. "Customizing Language Model Responses with Contrastive In-Context Learning." *Proceedings of AAAI 2024*, Paper #29760.
5. Hamdan, S. & Yuret, D. "How much do LLMs learn from negative examples?" arXiv:2503.14391, March 2025.
6. Anthropic. "Prompt engineering best practices." Claude blog, November 2025.
7. Liu, N.F. et al. "Lost in the Middle: How Language Models Use Long Contexts." *Transactions of the Association for Computational Linguistics*, 2024, Vol. 12, pp. 157-173.
8. Karpathy, A. Post on X/Twitter, 25 June 2025. Subsequently endorsed by Shopify CEO Tobi Lutke and adopted by Anthropic's engineering team.
9. Anthropic. "Effective context engineering for AI agents." Engineering blog, 2025.
