Last month I gave Claude the same brief twice. The first time, cold. No context, no history, no instructions beyond the task itself. The output was competent. Grammatically sound. Completely useless. It read like a well-educated stranger had overheard a conversation at a conference and tried to reconstruct the argument from memory.
The second time, I loaded a context document I'd been refining for weeks: my reasoning patterns, decision frameworks, the names of people I work with, the projects currently on my plate, the specific way I weigh trade-offs between speed and rigour. Same model, same prompt.
The difference was not incremental. It was categorical.
That experience sent me down a research path. I asked four different AI tools (Claude, ChatGPT, Grok, and Perplexity) the same set of questions about what it takes to build a system that teaches AI how a professional thinks and works. Not a chatbot personality. Not a tone-of-voice setting. A genuine cognitive architecture that compounds over time.
What came back was surprisingly convergent on some points and revealingly divergent on others. This piece synthesises what I found, cross-referenced against the academic literature and documented product patterns. It's opinionated. I'll tell you where I think the research is strong and where it's thin.
- AI personalisation has three distinct layers: how you think, what you do, and what's happening now. Most people stop at one.
- The "lost-in-the-middle" problem in long-context models means structure matters more than volume.
- Five to seven intake questions cross the threshold from "generic template" to "this knows my work."
- The biggest risk isn't bad setup. It's abandonment on day two.
The problem with "preferences"
Every major AI platform now offers some version of personalisation. Claude has memory and user preferences. ChatGPT has custom instructions and memory. Gemini has Gems. These are useful, and for most people, they're enough.
But they share a common ceiling: they capture what you like, not how you think.
Telling an AI that you prefer British English and concise responses is personalisation. Teaching it that you evaluate strategic decisions by weighing reversibility before expected value, or that when you see a trade-off between speed and thoroughness you default to speed unless the downside is reputational, is something else entirely. That's a cognitive layer.
Anthropic's own engineering team published a blog post in September 2025 framing this distinction. They called it "context engineering," the successor to prompt engineering, and defined it as the art of curating the optimal set of tokens during inference. The core insight: context is a finite resource with diminishing marginal returns. You can't dump everything about yourself into a prompt and hope the model sorts it out.
You need an architecture.
Three tiers, three update cadences
The four research reports I commissioned all converged on a layered architecture. The specific labels differed, but the structural logic was identical: separate what's stable from what changes, and load each layer only when the task demands it.
Here's how I think about it, as three tiers with distinct purposes, token budgets, and maintenance cadences.
Tier 1: How you think
The cognitive layer is the most underbuilt and the most valuable. It encodes your reasoning architecture: the decision frameworks you actually use (not the ones you learned in business school), your quality standards, your risk tolerance, your anti-patterns.
This is also where your voice lives. Not "tone of voice" in the marketing sense. The actual mechanics of how you construct an argument, whether you lead with data or narrative, how you handle concessions, your sentence rhythm, the words you'd never use.
The cognitive layer updates rarely: annually, or when your role changes. It's small: 800 to 1,200 tokens. And it loads on every single interaction, because it shapes how the AI approaches any task, not just what it produces.
I think of it like this: the cognitive layer is the difference between an AI that writes a memo for someone in your role and one that writes a memo the way you'd write it if you had the time.
Tier 2: What you do
The professional layer is contextual. It captures your role, your company, your domain expertise, your key relationships. A strategy director at a DFI (a development finance institution) needs different professional context than a commercial lead at a fintech, even if their cognitive architectures share patterns.
This layer loads per task, not universally. If you're drafting a client email, the model needs your stakeholder map. If you're running a scenario analysis, it needs your industry frameworks. The professional layer also includes what the ICMCI's competency framework calls "business insight": the domain-specific knowledge that separates a generalist from a practitioner.
Update cadence: quarterly. When your team structure shifts, when you take on a new client segment, when your competitive environment changes.
Tier 3: What's happening now
The operational layer is unbounded and on-demand. It includes your recent decisions, your open delegations, your active pipeline, your searchable knowledge base. This is the most volatile layer (it changes weekly or daily) and the most dangerous to load indiscriminately.
Why dangerous? Because of a well-documented phenomenon in language models called "lost-in-the-middle." Liu et al. published this in the Transactions of the Association for Computational Linguistics in 2024. Their finding: model performance follows a U-shaped curve. Information at the beginning and end of a context window gets attended to well. Information in the middle gets missed, sometimes dramatically. Performance can degrade by over 30% when critical information moves from the edges to the centre of the input.
The practical implication is clear: don't dump your entire operational context into a single prompt. Pull only what the specific task requires, and structure it so the model can parse it cleanly.
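To make that concrete, here is a minimal sketch of selective loading, in Python. The tier names, the tagging scheme, and the token budget are my own illustration rather than anything prescribed by the research; the one principle it borrows directly is keeping the cognitive layer at the front, the task at the end, and the volatile material out of the middle.

```python
from dataclasses import dataclass, field

@dataclass
class ContextBlock:
    tier: str                                # "cognitive" | "professional" | "operational"
    text: str
    tags: set = field(default_factory=set)   # tasks this block is relevant to

def assemble_context(blocks: list[ContextBlock], task: str, budget_tokens: int = 4000) -> str:
    """Load the cognitive layer first, then only the professional/operational
    blocks tagged for this task, stopping at the token budget so nothing
    important ends up buried deep in a long context."""
    cognitive = [b.text for b in blocks if b.tier == "cognitive"]
    situational = [
        b.text for b in blocks
        if b.tier in ("professional", "operational") and task in b.tags
    ]
    parts, used = [], 0
    for text in cognitive + situational:
        cost = len(text.split())             # crude word count as a stand-in for a tokeniser
        if used + cost > budget_tokens:
            break
        parts.append(text)
        used += cost
    return "\n\n".join(parts)

# The caller appends the task instruction last, so the most important material
# sits at the edges of the prompt rather than in the middle.
```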
What four AI tools agreed on (and where they split)
I gave Claude, ChatGPT, Grok, and Perplexity identical research briefs: six questions about building professional AI context systems. The convergences were striking. The disagreements were revealing.
| Dimension | Consensus | Key divergence | Verdict |
|---|---|---|---|
| Format | Structured templates beat blank canvases. All four cited scaffolding research. | Claude's research recommended markdown as the portable standard. ChatGPT and Perplexity leaned toward database-centric workspaces (Notion-style). | Strong |
| Domain layers | Common core + domain-specific layers. All four arrived here independently. | Grok and Perplexity were thinner on advisory. No equivalent professional body of knowledge exists for fractional executives. | Strong |
| Delivery | "Show the outcome first" was universal. Progressive disclosure over information dump. | Claude's research cited specific SaaS activation benchmarks (36% average, 54.8% for AI tools). Others were more anecdotal. | Strong |
| Personalisation | 5 to 10 intake questions. All four landed in this range. | Claude's research flagged the "uncanny valley of personalisation," the risk that over-reaching feels creepier than under-personalising. Others didn't mention it. | Partial |
| Architecture | Modular, hierarchical, chunked. All four cited RAG best practices. | Recommended chunk sizes varied: 200 to 512 tokens (Perplexity, Grok) versus 250 to 1,024 tokens (Claude). No single answer. | Partial |
| Retention | "Use must be maintenance." The system should update as a byproduct of real work. | Only Claude's research cited BJ Fogg's B=MAP model and framed the day-two problem as a behaviour design challenge rather than a feature gap. | Divergent |
The strongest consensus: structured templates outperform blank canvases. This isn't a new finding. Wood, Bruner, and Ross established the scaffolding principle in 1976, and it's been validated since. But the supporting data is more recent. A 2015 study by Van de Pol et al. (768 students, 30 teachers, published in Instructional Science) found that frequent, lightweight prompts outperformed comprehensive pre-filled content. In other words: show structure and nudge often, rather than doing all the work upfront.
The most interesting divergence was on retention. Grok and Perplexity treated the day-two problem as a feature question: add review checklists, send reminder emails. Claude's research went deeper, grounding the analysis in BJ Fogg's behavioural model from Stanford: for a behaviour to occur, motivation, ability, and a prompt must converge at the same moment. The critical insight from Fogg's work is that habit formation is a function of emotion, not repetition.
That reframes the entire retention challenge.
It means the question isn't "how do we remind people to update their context?" It's "how do we make each use of the context system feel like a small win?" If every interaction with the system produces something you'd actually use (a meeting brief, a decision draft, a stakeholder summary), the maintenance happens as a side effect.
The personalisation threshold
How much does the system need to know about you before it stops feeling generic?
Less than you'd think. McKinsey's consumer personalisation research from 2021 to 2023 found that 71% of consumers expect personalised interactions and 76% get frustrated when that doesn't happen. In SaaS onboarding, personalised flows show 65% higher completion rates than generic ones. But the threshold for "personalised enough" is surprisingly low.
Campaign Monitor's data shows that adding a recipient's first name to an email subject line increases open rates by 26%. The minimum viable personalisation isn't deep. It's accurate. Getting your domain language right (saying "engagement pipeline" instead of "deal pipeline" for a consultant) matters more than capturing every detail of your working style.
The research across all four tools converged on five to seven intake questions as the sweet spot. Chameleon.io found that three-step onboarding tours have a 72% completion rate while seven-step tours drop to 16%. SurveyMonkey's analysis of 100,000 surveys showed that completion falls with each additional question, with the sharpest drop-off occurring before the fifteen-question mark.
The five questions that emerged as highest-leverage across the research: your current role and who you serve (determines domain language), your primary objective right now (focuses the system), your biggest obstacle (enables pain-point targeting), how you prefer to receive information (determines output depth), and what tools or frameworks you already use (prevents redundancy).
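Those five questions are small enough to sit in a plain data structure and feed the professional layer directly. A minimal sketch, with field names that are my own invention:

```python
# Hypothetical intake form: five questions, each mapped to what it informs.
INTAKE_QUESTIONS = [
    ("role",      "What is your current role, and who do you serve?"),  # domain language
    ("objective", "What is your primary objective right now?"),         # focus
    ("obstacle",  "What is your biggest obstacle?"),                    # pain-point targeting
    ("depth",     "How do you prefer to receive information?"),         # output depth
    ("toolkit",   "Which tools or frameworks do you already use?"),     # avoids redundancy
]

def intake_to_context(answers: dict[str, str]) -> str:
    """Turn intake answers into a short professional-layer block."""
    lines = [f"- {question} {answers[key]}"
             for key, question in INTAKE_QUESTIONS if key in answers]
    return "Professional context:\n" + "\n".join(lines)
```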
But here's the concession-then-counter: those five questions get you to "useful." They don't get you to "indispensable." The gap between the two is the cognitive layer: the reasoning architecture, the decision patterns, the voice. And that layer can't be captured in an intake form. It has to be built over time, through use, through the compounding of interactions where the system observes how you actually work.
The structure problem
Even if you capture the right context, how you organise it matters enormously. This is where the "lost-in-the-middle" research has practical teeth.
NVIDIA published empirical chunking research in 2025, one of the most rigorous studies on this topic. Their key finding: page-level chunking achieved the highest average accuracy (0.648) with the lowest standard deviation. For factual queries, smaller chunks of 256 to 512 tokens worked well. For complex analytical queries, larger chunks of around 1,024 tokens performed better. Across all scenarios, a 15% overlap between chunks improved retrieval.
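A minimal sketch of that chunking pattern, assuming a whitespace split as a stand-in for a real tokeniser: the window size varies with the query type, and consecutive windows overlap by roughly 15%.

```python
def chunk_text(text: str, chunk_tokens: int = 512, overlap: float = 0.15) -> list[str]:
    """Split text into fixed-size token windows with fractional overlap.
    Roughly 256-512 tokens suits factual lookup; ~1,024 suits analytical material."""
    tokens = text.split()  # stand-in for a real tokeniser
    step = max(1, int(chunk_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```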
The practical architecture that emerged from the research is a four-layer separation by update frequency. Stable foundations (your professional identity, values, methodology) update annually. Semi-stable context (key relationships, recurring workflows) updates quarterly. Dynamic context (active projects, pipeline data, recent decisions) updates weekly to monthly. Ephemeral context (meeting-specific information, documents under review) loads per-session and doesn't persist.
This separation matters because it lets you update the volatile parts without disrupting the stable ones. And it keeps your total token footprint under the threshold where "lost-in-the-middle" degradation becomes a problem.
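One way to hold that separation is as plain configuration: each layer declares a review cadence and whether it persists between sessions, so a simple check can flag what's stale without touching the stable foundations. The layer names follow the article; the numbers and field names are my own assumptions.

```python
# Hypothetical layer registry: review cadence (days) and persistence per layer.
CONTEXT_LAYERS = {
    "stable_foundations": {"cadence_days": 365, "persists": True},   # identity, values, methodology
    "semi_stable":        {"cadence_days": 90,  "persists": True},   # relationships, workflows
    "dynamic":            {"cadence_days": 14,  "persists": True},   # projects, pipeline, decisions
    "ephemeral":          {"cadence_days": 0,   "persists": False},  # per-session only
}

def stale_layers(days_since_update: dict[str, int]) -> list[str]:
    """Return the persistent layers overdue for review."""
    return [
        name for name, cfg in CONTEXT_LAYERS.items()
        if cfg["persists"] and days_since_update.get(name, 0) > cfg["cadence_days"]
    ]
```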
The day-two problem
Setup isn't the hard part. Survival is.
Flexera reports that 93% of organisations have some shelfware. The personal knowledge management community has a name for the pattern: "productivity theatre." The first hundred notes are exciting, the next four hundred are a chore. Tiago Forte, who literally wrote the book on building a second brain, acknowledges that his own Weekly Review is "the most difficult habit to maintain."
The honest assessment: no amount of onboarding optimisation solves this problem alone. The evidence from SaaS activation benchmarks (Userpilot's 2025 report across 547 companies found a median time-to-first-value of one day, twelve hours, and twenty-three minutes) suggests that the window for proving value is narrow.
The only sustainable answer I've found in the research is the one Fogg's model predicts: anchor the system to an existing behaviour, make each use feel like a win, and design it so that maintenance is a byproduct, not a separate task. If the system saves you fifteen minutes of preparation for your most important meeting each day, the retention problem takes care of itself.
If it doesn't, no amount of gamification will rescue it.
What the research doesn't cover
I want to be direct about the gaps, because some of them are significant.
No peer-reviewed study directly measures the optimal "fill rate" for professional AI context templates. Claude's research inferred 50 to 60% from converging proxy evidence: form completion data, scaffolding research, and UX studies. Grok estimated 25 to 35%. Perplexity suggested 30 to 40%. ChatGPT flagged that defaults research (a meta-analysis of 73,000 participants across 58 studies) shows defaults can sometimes backfire depending on the domain. The honest answer: we don't know the right number, and it probably varies by professional seniority and AI familiarity.
Advisory work (the domain of independent advisors, board advisors, and fractional executives) is under-researched relative to consulting, strategy, and commercial roles. The IoD's Director Competency Framework and the ICMCI's consulting body of knowledge provide robust taxonomies. Advisory has no equivalent. All four tools acknowledged this gap explicitly.
And perhaps most critically: there's no published data on abandonment rates for AI context systems specifically. We're extrapolating from SaaS benchmarks, online course completion rates, and personal knowledge management community patterns. The assumption that these proxies transfer is reasonable but unvalidated.
The compounding asset
There's a metaphor from finance that I keep returning to. In investing, the most powerful returns come not from picking the right asset once, but from compounding: small, consistent gains that accumulate over time. The maths is well-understood. A 7% annual return doubles your money in roughly a decade. The mechanism is simple. The discipline is the hard part.
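The "roughly a decade" figure is just the doubling-time formula; a quick check, in the same spirit as the sketches above:

```python
import math

# Years for capital to double at a constant annual return r: (1 + r) ** t = 2
def years_to_double(annual_return: float) -> float:
    return math.log(2) / math.log(1 + annual_return)

print(round(years_to_double(0.07), 1))  # ~10.2 years at 7%
```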
A cognitive architecture for AI works the same way. The first week, your context document is thin and the outputs are only marginally better than a cold prompt. By the third month, if you've maintained it, if every meeting brief, decision memo, and stakeholder email has refined the system's understanding of how you think, the gap between a contextualised and a cold interaction becomes difficult to close. The system has compounded.
That's the aspiration, anyway. The research is clear on the architecture: three tiers, structured and modular, anchored to existing behaviours, updated through use rather than as a separate chore. The research is thin on whether senior professionals will actually sustain it.
My bet is that the answer depends entirely on whether the system earns its place in the daily routine, whether it feels less like a tool to configure and more like a colleague who's been paying attention.
We'll find out soon enough.
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.
- Anthropic Engineering, "Effective context engineering for AI agents," September 2025.
- BJ Fogg, Tiny Habits, 2020; Fogg Behavior Model, Stanford Behavior Design Lab.
- ICMCI, CMC Common Body of Knowledge v4.0, 2021.
- IoD, Director Competency Framework, 2024.
- IASP, Strategy Body of Knowledge.
- Userpilot, 2025 SaaS Benchmark Report.
- Amplitude, 2025 Product Benchmark Report.
- NVIDIA, chunking research, 2025.
- Van de Pol et al., Instructional Science, 2015.
- Wood, Bruner & Ross, "The role of tutoring in problem solving," 1976.
The four AI research reports were generated using Claude (Anthropic), ChatGPT (OpenAI), Grok (xAI), and Perplexity, each given identical prompts in March 2026. Full research documents available on request.
