A few months ago, I asked an AI to draft a memo in my voice. I'd given it a page of instructions: "direct, data-first, no fluff, British English." The output was perfectly competent. It was also completely generic. It could have been written by any management consultant with a thesaurus and a deadline.
The frustrating part was that it followed every rule I gave it. Direct? Sure. Data-first? Technically, yes. No fluff? Debatable, but close. The problem wasn't compliance. The problem was that describing a voice is a fundamentally different task from demonstrating one.
That distinction turns out to be the single most important finding in the research.
I spent the last several months reading peer-reviewed papers, digging through Anthropic's and OpenAI's documentation, and building a production system to calibrate writing voice for AI tools. What I found was a field that has moved much faster than most people realise, with clear evidence about what works, what doesn't, and where the hard limits are. Most of it contradicts the generic advice you'll find in a typical "prompt engineering" guide. Four findings stand out:
- Showing an AI examples of your writing outperforms describing your style by a wide margin. Three good examples beat fifty rules.
- Showing the AI what not to write is more powerful than most people expect. Research suggests each "bad example" teaches roughly 10x more than each good one.
- Where you place instructions inside a prompt matters as much as what those instructions say. AI models have a well-documented blind spot for information in the middle.
- Even the best prompt-based approach hits a ceiling at roughly 75-85% voice fidelity. The rest requires either fine-tuning or human editing.
## Why AI writes the way it does
Before getting into solutions, it helps to understand why AI writing sounds so... AI. The answer is structural, not accidental.
Modern language models go through a training process called RLHF (reinforcement learning from human feedback). In plain terms: thousands of human raters compare pairs of AI outputs and pick the one they prefer. The model learns from those preferences. The problem is that those raters are a crowd, not an individual. The model learns to produce text that the average person would rate as "good," which is, by definition, text that has had every distinctive edge smoothed away.
Nathan Lambert, the post-training lead at the Allen Institute for AI and author of The RLHF Book, identified this as a core constraint. The training process optimises for aggregate satisfaction, and aggregate satisfaction penalises quirkiness. Models become verbose, hedge-heavy, and structurally predictable because those patterns score well on average, even if no individual human would choose to write that way.1
Good writing is, quite literally, orthogonal to how these models are trained.
This has practical implications. When you ask an AI to "write in a confident tone," it's fighting against a training baseline that rewards caution. When you ask it to "be concise," it's pushing against a reward function that associates length with quality. The sycophancy problem (where models tell you what you want to hear rather than what's accurate) is another symptom of the same root cause.2
Understanding this explains why voice calibration requires more than instructions. You're not configuring a tool. You're working against its default training.
## What actually works: a research-backed hierarchy
Across the papers and practitioner reports from 2024 to early 2026, a consistent hierarchy emerges for how to make an AI reproduce a specific writing voice. The techniques stack. The more you layer, the better the result.
The most effective approach is annotated contrastive pairs: a good example paired with a bad example and an explicit analysis of why one works and the other doesn't. Gao and Das (AAAI 2024) found this combination outperformed all individual techniques tested.
Below that, contrastive pairs alone (showing preferred vs. less preferred outputs) score highly. Critically, the AI's own default output works as the negative example. You don't need to write bad examples yourself.
Few-shot examples (3-5 real writing samples) come next. Anthropic recommends 3-5 diverse, relevant examples. Diminishing returns beyond 5 samples. One excellent example per content type beats multiple examples of the same type.
Explicit voice description (rules like "write in a direct, analytical tone") works for broad strokes but fails on nuance. The AI must interpret subjective terms, which introduces drift.
Anti-pattern lists ("never do this") rank lowest. Anthropic warns that negative instructions can backfire, making the model more likely to do the forbidden thing. Show bad examples instead. Phrase rules positively.
The headline finding is simple: showing beats telling. Three real writing samples will teach an AI more about your voice than fifty carefully worded rules. Anthropic's own documentation puts it bluntly: examples are "pictures worth a thousand words" for a language model.3
But the more interesting finding is about negative examples.
## The counterintuitive power of bad examples
In 2024, researchers Xiang Gao and Kamalika Das published a paper at AAAI (one of the top AI conferences) titled "Customizing Language Model Responses with Contrastive In-Context Learning." The core idea was deceptively simple: instead of only showing the AI what good output looks like, show it a good version and a bad version side by side, labelled "preferred" and "less preferred."4
The results were striking. Contrastive examples significantly outperformed standard few-shot prompting across style dimensions including formality, humour, conciseness, and dialect. They also used fewer tokens to do it.
Separately, Hamdan and Yuret (2025) found something even more surprising while studying how language models learn from negative examples: during a critical learning phase, each negative example improved accuracy roughly 10x more than each positive example. And "near-miss" negatives (text that was plausible but slightly wrong) were the most influential of all.5
For voice calibration, the "near-miss" is text that sounds almost right but has telltale AI patterns: the hedge words, the over-structured paragraphs, the corporate smoothness that no individual human would produce.
Here's what this looks like in practice. Consider a brief: "Write a short paragraph about why Africa's LP data is fragmented."
**Less preferred (the AI's default):**

> The landscape of limited partner data in Africa presents significant challenges for stakeholders seeking to navigate the continent's private capital markets. The fragmentation of data sources, combined with the absence of standardised reporting frameworks, creates a complex environment that requires careful analysis and strategic thinking.

**Preferred (the actual voice):**

> When we started mapping Africa's LP base, we expected one dataset. We found four different investor categories with almost no overlap in how they report commitments, fund sizes, or portfolio allocations. DFIs file to one standard. Family offices file to none. The data isn't missing. It's scattered across formats that were never designed to talk to each other.
The critical insight from the Gao and Das paper is that you don't need to write negative examples from scratch. The AI's own default output works perfectly as the "less preferred" example. Ask the AI to write something without voice guidance, label it "less preferred," then provide your actual writing labelled "preferred." Add a brief annotation explaining the differences.
Three of these pairs will teach the AI more about your voice than a full page of rules.
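Assembled into a prompt, a pair like this might look as follows. This is a minimal sketch: the XML-style tag names (`pair`, `less_preferred`, `preferred`, `why`) are illustrative conventions, not a required format, and the sample strings are condensed from the example above.

```python
# Sketch: assembling one annotated contrastive pair into a prompt block.
# Tag names are illustrative conventions, not a fixed API.

def contrastive_pair_block(preferred: str, less_preferred: str, annotation: str) -> str:
    """Pair a real writing sample with the AI's own default output,
    plus a short note on why the preferred version works."""
    return (
        "<pair>\n"
        f"<less_preferred>\n{less_preferred.strip()}\n</less_preferred>\n"
        f"<preferred>\n{preferred.strip()}\n</preferred>\n"
        f"<why>\n{annotation.strip()}\n</why>\n"
        "</pair>"
    )

block = contrastive_pair_block(
    preferred="The data isn't missing. It's scattered across incompatible formats.",
    less_preferred="The fragmentation of data sources presents significant challenges.",
    annotation="Preferred version uses short declaratives and concrete nouns; "
               "less preferred hedges with abstract filler.",
)
```

Concatenate two or three of these blocks ahead of the actual brief; the annotation in `<why>` is what turns a plain contrastive pair into the annotated variant that scored highest in the Gao and Das results.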
There is one caveat, though. While showing bad examples works, writing bad instructions ("Never use passive voice") can backfire. Anthropic's prompt engineering team has found that negative instructions can paradoxically increase the behaviour you're trying to suppress. The fix is to phrase constraints positively: instead of "Don't use passive voice," write "Use direct, active verbs."6
Show bad examples. Phrase rules positively. The distinction matters.
## Why structure matters more than content
Once you have good examples and clear rules, the next question is how to organise them. This is where most people's setups fall apart.
The instinct is to write one large document that contains everything the AI needs to know about your voice and hand it over as a single block. A monolithic voice guide. The research is clear that this is the wrong approach.
In 2024, Liu et al. published a now-famous paper called "Lost in the Middle" (Transactions of the Association for Computational Linguistics). They found that language models perform best when key information is at the beginning or end of their input window, and performance degrades significantly when important content is buried in the middle.7
The implications for voice calibration are direct: if you bury your most important voice rules in the middle of a 3,000-word style guide, the AI will pay less attention to them.
The practical response is to split your voice system into modules rather than keeping it as one document. Identity and personality go at the beginning (the "primacy zone"). Examples and reference material go in the middle. Critical constraints and format rules get repeated at the end (the "recency zone"), where they'll carry the most weight when the AI actually generates output.
This modular approach also solves a second problem: format bleed. If you've ever asked an AI to write a social media post and received something structured like a memo (with headers, bullet points, and a conclusion), you've encountered format bleed. It happens because a monolithic style guide treats all content types the same.
## A modular voice architecture
The solution is three layers assembled at runtime, not stored as one monolithic document.
**Base voice layer.** Always present. Who you are, how you write, what you never do. Placed in the primacy zone. Cached between sessions.

**Task layer.** Injected per content type. Structure, length, required elements, format-specific examples. One card per content format: social post, email, memo, article, report.

**Context layer.** Loaded on demand. Audience-specific adjustments, topic-specific rules, sourcing requirements. Changes per task without altering the base voice.
The base voice layer is the constant. It defines who you are as a writer. The task layer is the variable overlay: different structures and constraints for different content types. The context layer adds audience-specific adjustments. At generation time, all three combine. A 500-token assembled prompt built this way dramatically outperforms a 3,000-token monolithic one, precisely because the modular version puts information where the model can actually use it.
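A runtime assembler for the three layers can be sketched in a few lines. Everything here is an assumption for illustration: the field names, the sample strings, and the "re-check these constraints" wrapper are one possible convention, not a standard schema. What matters is the ordering: identity first, reference material in the middle, critical rules repeated last.

```python
# Sketch: runtime assembly of a three-layer voice prompt.
# Layer names and sample content are illustrative assumptions.

def assemble_prompt(base_voice: str, task_card: str, context_notes: str,
                    examples: list[str], critical_rules: list[str]) -> str:
    """Place identity first (primacy zone), reference material in the
    middle, and repeat critical constraints at the end (recency zone)."""
    parts = [
        base_voice,                      # primacy zone: identity, core voice
        task_card,                       # format-specific structure and length
        context_notes,                   # audience/topic adjustments
        "\n\n".join(examples),           # middle: reference examples
        "Before writing, re-check these constraints:\n- "
        + "\n- ".join(critical_rules),   # recency zone: repeated rules
    ]
    return "\n\n".join(p for p in parts if p)

prompt = assemble_prompt(
    base_voice="You write in short, direct sentences. British English.",
    task_card="Format: LinkedIn post, under 150 words, no headers.",
    context_notes="Audience: African private-capital investors.",
    examples=["<example>...</example>"],
    critical_rules=["No hedge words", "No bullet points"],
)
```

Swapping the `task_card` argument per content type is also what prevents format bleed: the memo card never touches the social-post card.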
## The platform problem
If you only use one AI tool, you can skip this section. If you use several (and most people do), this matters.
Claude, ChatGPT, and API-based tools all handle voice instructions differently. A prompt that works brilliantly on one platform can fail on another, even with the same text. The differences aren't superficial.
| Feature | Claude (Projects) | ChatGPT (Custom GPTs) | API / Automation |
|---|---|---|---|
| Instruction limit | 200K tokens | ~8,000 characters | Model context window |
| Best format | XML-tagged sections | Markdown with headers | Canonical Markdown |
| Examples | Inline + uploaded docs | Knowledge files (RAG) | Inline in system prompt |
| Key advantage | Full verbosity, no compression needed | Broad distribution via GPT Store | Full control over assembly |
| Key limitation | Platform-specific | 8K char cap forces trade-offs | No built-in adaptation |
Claude's Projects feature is currently the strongest option for voice calibration because its 200K-token context window means you can include the full voice guide, plus writing samples, plus task cards, without compressing anything. ChatGPT's Custom GPTs force hard trade-offs: at roughly 8,000 characters, you need to compress rules and move examples into separate Knowledge files, which are retrieval-based and not guaranteed to surface every session.
The practical workflow is to maintain one canonical voice guide in Markdown, then adapt it per platform. Full verbosity for Claude. Compressed with examples offloaded for ChatGPT. Conditional formatting for API workflows. Test with identical prompts across platforms using a voice rubric.
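One mechanical piece of that workflow can be sketched: compressing the canonical guide for ChatGPT's instruction cap by splitting rules from examples. The `## Examples` delimiter, the splitting heuristic, and the exact character limit are illustrative assumptions about how the canonical guide is laid out, not a documented API.

```python
# Sketch: adapting one canonical Markdown voice guide for ChatGPT.
# The delimiter, heuristic, and limit are illustrative assumptions.

CHATGPT_CHAR_LIMIT = 8000  # approximate Custom GPT instruction cap

def adapt_for_chatgpt(canonical_guide: str) -> tuple[str, str]:
    """Keep rules inline (within the cap) and offload examples to a
    separate Knowledge file for retrieval."""
    rules, _, examples = canonical_guide.partition("## Examples")
    instructions = rules.strip()[:CHATGPT_CHAR_LIMIT]
    knowledge_file = ("## Examples" + examples).strip() if examples else ""
    return instructions, knowledge_file

guide = "## Voice rules\nDirect. Data-first.\n\n## Examples\n<sample text>"
instructions, knowledge = adapt_for_chatgpt(guide)
```

For Claude, the same canonical guide ships unsplit; the adapter only exists because one platform forces the trade-off.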
## The 85% ceiling
Here's the part most voice calibration guides leave out: even with optimal prompting, examples, contrastive pairs, and modular architecture, you'll reach a ceiling at roughly 75-85% voice fidelity. The remaining gap is a structural artefact of how these models are trained.
*Ranges based on practitioner reports and Gao & Das (2024). "Voice fidelity" here is a composite of vocabulary match, structural pattern adherence, and tone consistency, not a single standardised metric.*
This is not a bug. It's a consequence of the RLHF process described earlier. The model has been trained on millions of examples to produce text that scores well with a general audience. Your voice, by definition, is not general. Every distinctive stylistic choice you make (a preference for fragments, an unusual vocabulary, a specific rhythm) is a deviation from what the model considers "optimal."
Fine-tuning (training the model on your own writing, typically 50+ samples) can push fidelity into the 90-95% range. But it's expensive, less flexible, and overkill for most use cases. The more common production approach is to accept the prompt-based ceiling and design for efficient human editing of the remaining gap.
Andrej Karpathy, formerly of OpenAI and Tesla, has reframed the entire discipline. What used to be called "prompt engineering" is now more accurately described as "context engineering," a term he and Shopify CEO Tobi Lutke popularised in mid-2025. The shift in language matters because it shifts the focus from writing a single clever instruction to designing an information architecture: getting the right voice signals to the model at the right time, in the right format, without dilution from competing instructions.8
## Six mistakes that break voice calibration
If you're building a voice system (or even a simple style guide for an AI tool), here are the failure modes I see most often, informed by the research.
### 1. Describing vibes instead of patterns
Instructions like "bold but approachable" are subjective. The AI has to interpret them, and it will interpret them differently every time. Effective rules specify measurable behaviour: "use direct verbs, average sentence length 14-18 words, no hedge words like 'might' or 'perhaps'."
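The difference between a vibe and a pattern is that a pattern can be checked mechanically. A tiny lint sketch makes the point; the thresholds and the hedge-word list are illustrative choices, not canonical values.

```python
# Sketch: turning "direct, no fluff" into measurable checks.
# Thresholds and the hedge-word list are illustrative, not canonical.
import re

HEDGES = {"might", "perhaps", "arguably", "somewhat", "potentially"}

def voice_lint(text: str) -> dict:
    """Report average sentence length and hedge-word usage."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    avg_len = len(words) / max(len(sentences), 1)
    hedges = [w for w in words if w in HEDGES]
    return {
        "avg_sentence_length": round(avg_len, 1),
        "hedge_words": hedges,
        "within_target": 14 <= avg_len <= 18 and not hedges,
    }

report = voice_lint("This might perhaps help. It is arguably somewhat useful.")
```

A rule the AI can fail measurably is also a rule a human reviewer can check in seconds, which matters for the editing workflow discussed later.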
### 2. Rules without examples
Teams that write 50 rules but include zero examples consistently get worse results than teams with 5 rules and 3 examples. Anthropic's documentation is direct on this point.3
### 3. Over-specifying edge cases
Anthropic warns against cramming every possible rule into a prompt. It consumes the model's attention budget, increases the chance of contradictory instructions, and makes the system fragile. A curated set of diverse examples covers the space better than an exhaustive rule set.9
### 4. Ignoring instruction decay
Voice rules placed at the start of a conversation degrade as the conversation grows. The model pays disproportionate attention to recent messages and "forgets" constraints buried thousands of tokens back. The fix is runtime reinforcement: appending critical voice rules to the end of each prompt, exploiting the recency effect.
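Runtime reinforcement is mechanically simple: wrap each outgoing user message so the critical rules always sit at the end of the context. The rule list and the "Voice check" wrapper below are illustrative assumptions about how you'd phrase the reminder.

```python
# Sketch: runtime reinforcement against instruction decay.
# Appends critical voice rules to each user turn so they land in the
# recency zone; rule list and wrapper phrasing are illustrative.

CRITICAL_RULES = [
    "British English spelling",
    "Average sentence length 14-18 words",
    "No hedge words",
]

def reinforce(user_message: str) -> str:
    """Append the critical rules to the end of a user turn."""
    reminder = "Voice check before you answer:\n- " + "\n- ".join(CRITICAL_RULES)
    return f"{user_message}\n\n{reminder}"

turn = reinforce("Draft a 100-word update on the Q3 fund pipeline.")
```

Because the reminder rides on every turn, it holds its position no matter how long the conversation grows.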
### 5. Not accounting for sycophancy
Models are biased toward flattery and validation, which undermines distinctive voice. They also can't reliably self-evaluate voice quality. External evaluation (human review, or comparison against a rubric) is essential.
### 6. Same intensity for all tasks
Complex tasks (data analysis, multi-step reasoning) consume the model's attention budget for execution, leaving less capacity for style. Simple tasks maintain voice better. Use lighter prompts for easy tasks and heavier voice reinforcement for complex ones.
## The last mile is human
There's a metaphor from photography that applies here. The camera captures roughly 80% of what you want. The remaining 20% happens in post-processing: the colour grading, the crop, the subtle adjustments that make an image feel intentional rather than accidental. No amount of camera technology eliminates that step. The best photographers don't fight it. They design their workflow around it.
Voice calibration with AI is the same. The research is clear about what works: show, don't tell. Use contrastive pairs. Structure for primacy and recency. Keep it modular. These techniques can get you to 80-85%, which is dramatically better than most people's current setup. But the remaining gap is where your voice actually is your voice: the specific word choices, the rhythm of a particular sentence, the instinct for when to break a rule. That last mile requires a human editor.
The systems that work best in production are the ones that acknowledge this ceiling honestly and design for it. They don't chase 100% automation. They build fast, structured workflows for the human review step so that the person writing can spend their time on the 15% that matters most, rather than rewriting everything from scratch.
The AI gets you 85% of the way there. Your job is to make the remaining 15% count.
That's not a limitation. That's a workflow.
1. Lambert, N. "Why AI writing is mid." *Interconnects*, November 2025.
2. Lambert, N. *The RLHF Book*, Chapter 14: Over-optimization. Manning, 2026.
3. Anthropic. "Use examples (multishot prompting) to guide Claude's behavior." Claude API Docs, 2025.
4. Gao, X. & Das, K. "Customizing Language Model Responses with Contrastive In-Context Learning." *Proceedings of AAAI 2024*, Paper #29760.
5. Hamdan, S. & Yuret, D. "How much do LLMs learn from negative examples?" arXiv:2503.14391, March 2025.
6. Anthropic. "Prompt engineering best practices." Claude blog, November 2025.
7. Liu, N.F. et al. "Lost in the Middle: How Language Models Use Long Contexts." *Transactions of the Association for Computational Linguistics*, 2024, Vol. 12, pp. 157-173.
8. Karpathy, A. Post on X/Twitter, 25 June 2025. Subsequently endorsed by Shopify CEO Tobi Lutke and adopted by Anthropic's engineering team.
9. Anthropic. "Effective context engineering for AI agents." Engineering blog, 2025.
