Last month I gave Claude the same brief twice. The first time, cold. No context, no history, no instructions beyond the task itself. The output was competent. Grammatically sound. Completely useless. It read like a well-educated stranger had overheard a conversation at a conference and tried to reconstruct the argument from memory.
The second time, I loaded a context document I'd been refining for weeks: my reasoning patterns, decision frameworks, the names of people I work with, the projects currently on my plate, the specific way I weigh trade-offs between speed and rigour. Same model, same prompt.
The difference was not incremental. It was categorical.
That experience sent me down a research path. I asked four different AI tools (Claude, ChatGPT, Grok, and Perplexity) the same set of questions about what it takes to build a system that teaches AI how a professional thinks and works. Not a chatbot personality. Not a tone-of-voice setting. A genuine cognitive architecture that compounds over time.
What came back was surprisingly convergent on some points and revealingly divergent on others. This piece synthesises what I found, cross-referenced against the academic literature and documented product patterns. It's opinionated. I'll tell you where I think the research is strong and where it's thin.
- AI personalisation has three distinct layers: how you think, what you do, and what's happening now. Most people stop at one.
- The "lost-in-the-middle" problem in long-context models means structure matters more than volume.
- Five to seven intake questions cross the threshold from "generic template" to "this knows my work."
- The biggest risk isn't bad setup. It's abandonment on day two.
The problem with "preferences"
Every major AI platform now offers some version of personalisation. Claude has memory and user preferences. ChatGPT has custom instructions and memory. Gemini has Gems. These are useful, and for most people, they're enough.
But they share a common ceiling: they capture what you like, not how you think.
Telling an AI that you prefer British English and concise responses is personalisation. Teaching it that you evaluate strategic decisions by weighing reversibility before expected value, or that when you see a trade-off between speed and thoroughness you default to speed unless the downside is reputational, is something else entirely. That's a cognitive layer.
Anthropic's own engineering team published a blog post in September 2025 framing this distinction. They called it "context engineering," the successor to prompt engineering, and defined it as the art of curating the optimal set of tokens during inference. The core insight: context is a finite resource with diminishing marginal returns. You can't dump everything about yourself into a prompt and hope the model sorts it out.
You need an architecture.
Three tiers, three update cadences
The four research reports I commissioned all converged on a layered architecture. The specific labels differed, but the structural logic was identical: separate what's stable from what changes, and load each layer only when the task demands it.
Here's how I think about it, as three tiers with distinct purposes, token budgets, and maintenance cadences.
Tier 1: How you think
The cognitive layer is the most underbuilt and the most valuable. It encodes your reasoning architecture: the decision frameworks you actually use (not the ones you learned in business school), your quality standards, your risk tolerance, your anti-patterns.
This is also where your voice lives. Not "tone of voice" in the marketing sense. The actual mechanics of how you construct an argument, whether you lead with data or narrative, how you handle concessions, your sentence rhythm, the words you'd never use.
The cognitive layer updates rarely: annually, or when your role changes. It's small: 800 to 1,200 tokens. And it loads on every single interaction, because it shapes how the AI approaches any task, not just what it produces.
I think of it like this: the cognitive layer is the difference between an AI that writes a memo for someone in your role and one that writes a memo the way you'd write it if you had the time.
Tier 2: What you do
The professional layer is contextual. It captures your role, your company, your domain expertise, your key relationships. A strategy director at a DFI (a development finance institution) needs different professional context than a commercial lead at a fintech, even if their cognitive architectures share patterns.
This layer loads per task, not universally. If you're drafting a client email, the model needs your stakeholder map. If you're running a scenario analysis, it needs your industry frameworks. The professional layer also includes what the ICMCI's competency framework calls "business insight": the domain-specific knowledge that separates a generalist from a practitioner.
Update cadence: quarterly. When your team structure shifts, when you take on a new client segment, when your competitive environment changes.
Tier 3: What's happening now
The operational layer is unbounded and on-demand. It includes your recent decisions, your open delegations, your active pipeline, your searchable knowledge base. This is the most volatile layer (it changes weekly or daily) and the most dangerous to load indiscriminately.
Why dangerous? Because of a well-documented phenomenon in language models called "lost-in-the-middle." Liu et al. published this in the Transactions of the Association for Computational Linguistics in 2024. Their finding: model performance follows a U-shaped curve. Information at the beginning and end of a context window gets attended to well. Information in the middle gets missed, sometimes dramatically. Performance can degrade by over 30% when critical information moves from the edges to the centre of the input.
The practical implication is clear: don't dump your entire operational context into a single prompt. Pull only what the specific task requires, and structure it so the model can parse it cleanly.
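To make that concrete, here is a minimal sketch of selective loading, in Python. The tier names, the tagging scheme, and the token budget are my own illustration rather than anything prescribed by the research; the one principle it borrows directly is keeping the cognitive layer at the front, the task at the end, and the volatile material out of the middle.

```python
from dataclasses import dataclass, field

@dataclass
class ContextBlock:
    tier: str                                # "cognitive" | "professional" | "operational"
    text: str
    tags: set = field(default_factory=set)   # tasks this block is relevant to

def assemble_context(blocks: list[ContextBlock], task: str, budget_tokens: int = 4000) -> str:
    """Load the cognitive layer first, then only the professional/operational
    blocks tagged for this task, stopping at the token budget so nothing
    important ends up buried deep in a long context."""
    cognitive = [b.text for b in blocks if b.tier == "cognitive"]
    situational = [
        b.text for b in blocks
        if b.tier in ("professional", "operational") and task in b.tags
    ]
    parts, used = [], 0
    for text in cognitive + situational:
        cost = len(text.split())             # crude word count as a stand-in for a tokeniser
        if used + cost > budget_tokens:
            break
        parts.append(text)
        used += cost
    return "\n\n".join(parts)

# The caller appends the task instruction last, so the most important material
# sits at the edges of the prompt rather than in the middle.
```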
What four AI tools agreed on (and where they split)
I gave Claude, ChatGPT, Grok, and Perplexity identical research briefs: six questions about building professional AI context systems. The convergences were striking. The disagreements were revealing.
| Dimension | Consensus | Key divergence | Verdict |
|---|---|---|---|
| Format | Structured templates beat blank canvases. All four cited scaffolding research. | Claude's research recommended markdown as the portable standard. ChatGPT and Perplexity leaned toward database-centric workspaces (Notion-style). | Strong |
| Domain layers | Common core + domain-specific layers. All four arrived here independently. | Grok and Perplexity were thinner on advisory. No equivalent professional body of knowledge exists for fractional executives. | Strong |
| Delivery | "Show the outcome first" was universal. Progressive disclosure over information dump. | Claude's research cited specific SaaS activation benchmarks (36% average, 54.8% for AI tools). Others were more anecdotal. | Strong |
| Personalisation | 5 to 10 intake questions. All four landed in this range. | Claude's research flagged the "uncanny valley of personalisation," the risk that over-reaching feels creepier than under-personalising. Others didn't mention it. | Partial |
| Architecture | Modular, hierarchical, chunked. All four cited RAG best practices. | Recommended chunk sizes varied: 200 to 512 tokens (Perplexity, Grok) versus 250 to 1,024 tokens (Claude). No single answer. | Partial |
| Retention | "Use must be maintenance." The system should update as a byproduct of real work. | Only Claude's research cited BJ Fogg's B=MAP model and framed the day-two problem as a behaviour design challenge rather than a feature gap. | Divergent |
The strongest consensus: structured templates outperform blank canvases. This isn't a new finding. Wood, Bruner, and Ross established the scaffolding principle in 1976, and it's been validated since. But the supporting data is more recent. A 2015 study by Van de Pol et al. (768 students, 30 teachers, published in Instructional Science) found that frequent, lightweight prompts outperformed comprehensive pre-filled content. In other words: show structure and nudge often, rather than doing all the work upfront.
The most interesting divergence was on retention. Grok and Perplexity treated the day-two problem as a feature question: add review checklists, send reminder emails. Claude's research went deeper, grounding the analysis in BJ Fogg's behavioural model from Stanford: for a behaviour to occur, motivation, ability, and a prompt must converge at the same moment. The critical insight from Fogg's work is that habit formation is a function of emotion, not repetition.
That reframes the entire retention challenge.
It means the question isn't "how do we remind people to update their context?" It's "how do we make each use of the context system feel like a small win?" If every interaction with the system produces something you'd actually use (a meeting brief, a decision draft, a stakeholder summary), the maintenance happens as a side effect.
The personalisation threshold
How much does the system need to know about you before it stops feeling generic?
Less than you'd think. McKinsey's consumer personalisation research from 2021 to 2023 found that 71% of consumers expect personalised interactions and 76% get frustrated when that doesn't happen. In SaaS onboarding, personalised flows show 65% higher completion rates than generic ones. But the threshold for "personalised enough" is surprisingly low.
Campaign Monitor's data shows that adding a recipient's first name to an email subject line increases open rates by 26%. The minimum viable personalisation isn't deep. It's accurate. Getting your domain language right (saying "engagement pipeline" instead of "deal pipeline" for a consultant) matters more than capturing every detail of your working style.
The research across all four tools converged on five to seven intake questions as the sweet spot. Chameleon.io found that three-step onboarding tours have a 72% completion rate while seven-step tours drop to 16%. SurveyMonkey's analysis of 100,000 surveys showed that completion falls with each additional question, with the sharpest drop-off occurring before the fifteen-question mark.
The five questions that emerged as highest-leverage across the research: your current role and who you serve (determines domain language), your primary objective right now (focuses the system), your biggest obstacle (enables pain-point targeting), how you prefer to receive information (determines output depth), and what tools or frameworks you already use (prevents redundancy).
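Those five questions are small enough to sit in a plain data structure and feed the professional layer directly. A minimal sketch, with field names that are my own invention:

```python
# Hypothetical intake form: five questions, each mapped to what it informs.
INTAKE_QUESTIONS = [
    ("role",      "What is your current role, and who do you serve?"),  # domain language
    ("objective", "What is your primary objective right now?"),         # focus
    ("obstacle",  "What is your biggest obstacle?"),                    # pain-point targeting
    ("depth",     "How do you prefer to receive information?"),         # output depth
    ("toolkit",   "Which tools or frameworks do you already use?"),     # avoids redundancy
]

def intake_to_context(answers: dict[str, str]) -> str:
    """Turn intake answers into a short professional-layer block."""
    lines = [f"- {question} {answers[key]}"
             for key, question in INTAKE_QUESTIONS if key in answers]
    return "Professional context:\n" + "\n".join(lines)
```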
But here's the concession-then-counter: those five questions get you to "useful." They don't get you to "indispensable." The gap between the two is the cognitive layer: the reasoning architecture, the decision patterns, the voice. And that layer can't be captured in an intake form. It has to be built over time, through use, through the compounding of interactions where the system observes how you actually work.
The structure problem
Even if you capture the right context, how you organise it matters enormously. This is where the "lost-in-the-middle" research has practical teeth.
NVIDIA published empirical chunking research in 2025, one of the most rigorous studies on this topic. Their key finding: page-level chunking achieved the highest average accuracy (0.648) with the lowest standard deviation. For factual queries, smaller chunks of 256 to 512 tokens worked well. For complex analytical queries, larger chunks of around 1,024 tokens performed better. Across all scenarios, a 15% overlap between chunks improved retrieval.
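A minimal sketch of that chunking pattern, assuming a whitespace split as a stand-in for a real tokeniser: the window size varies with the query type, and consecutive windows overlap by roughly 15%.

```python
def chunk_text(text: str, chunk_tokens: int = 512, overlap: float = 0.15) -> list[str]:
    """Split text into fixed-size token windows with fractional overlap.
    Roughly 256-512 tokens suits factual lookup; ~1,024 suits analytical material."""
    tokens = text.split()  # stand-in for a real tokeniser
    step = max(1, int(chunk_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```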
The practical architecture that emerged from the research is a four-layer separation by update frequency. Stable foundations (your professional identity, values, methodology) update annually. Semi-stable context (key relationships, recurring workflows) updates quarterly. Dynamic context (active projects, pipeline data, recent decisions) updates weekly to monthly. Ephemeral context (meeting-specific information, documents under review) loads per-session and doesn't persist.
This separation matters because it lets you update the volatile parts without disrupting the stable ones. And it keeps your total token footprint under the threshold where "lost-in-the-middle" degradation becomes a problem.
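One way to hold that separation is as plain configuration: each layer declares a review cadence and whether it persists between sessions, so a simple check can flag what's stale without touching the stable foundations. The layer names follow the article; the numbers and field names are my own assumptions.

```python
# Hypothetical layer registry: review cadence (days) and persistence per layer.
CONTEXT_LAYERS = {
    "stable_foundations": {"cadence_days": 365, "persists": True},   # identity, values, methodology
    "semi_stable":        {"cadence_days": 90,  "persists": True},   # relationships, workflows
    "dynamic":            {"cadence_days": 14,  "persists": True},   # projects, pipeline, decisions
    "ephemeral":          {"cadence_days": 0,   "persists": False},  # per-session only
}

def stale_layers(days_since_update: dict[str, int]) -> list[str]:
    """Return the persistent layers overdue for review."""
    return [
        name for name, cfg in CONTEXT_LAYERS.items()
        if cfg["persists"] and days_since_update.get(name, 0) > cfg["cadence_days"]
    ]
```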
The day-two problem
Setup isn't the hard part. Survival is.
Flexera reports that 93% of organisations have some shelfware. The personal knowledge management community has a name for the pattern: "productivity theatre." The first hundred notes are exciting, the next four hundred are a chore. Tiago Forte, who literally wrote the book on building a second brain, acknowledges that his own Weekly Review is "the most difficult habit to maintain."
The honest assessment: no amount of onboarding optimisation solves this problem alone. The evidence from SaaS activation benchmarks (Userpilot's 2025 report across 547 companies found a median time-to-first-value of one day, twelve hours, and twenty-three minutes) suggests that the window for proving value is narrow.
The only sustainable answer I've found in the research is the one Fogg's model predicts: anchor the system to an existing behaviour, make each use feel like a win, and design it so that maintenance is a byproduct, not a separate task. If the system saves you fifteen minutes of preparation for your most important meeting each day, the retention problem takes care of itself.
If it doesn't, no amount of gamification will rescue it.
What the research doesn't cover
I want to be direct about the gaps, because some of them are significant.
No peer-reviewed study directly measures the optimal "fill rate" for professional AI context templates. Claude's research inferred 50 to 60% from converging proxy evidence: form completion data, scaffolding research, and UX studies. Grok estimated 25 to 35%. Perplexity suggested 30 to 40%. ChatGPT flagged that defaults research (a meta-analysis of 73,000 participants across 58 studies) shows defaults can sometimes backfire depending on the domain. The honest answer: we don't know the right number, and it probably varies by professional seniority and AI familiarity.
Advisory work (the domain of independent advisors, board advisors, and fractional executives) is under-researched relative to consulting, strategy, and commercial roles. The IoD's Director Competency Framework and the ICMCI's consulting body of knowledge provide robust taxonomies. Advisory has no equivalent. All four tools acknowledged this gap explicitly.
And perhaps most critically: there's no published data on abandonment rates for AI context systems specifically. We're extrapolating from SaaS benchmarks, online course completion rates, and personal knowledge management community patterns. The assumption that these proxies transfer is reasonable but unvalidated.
The compounding asset
There's a metaphor from finance that I keep returning to. In investing, the most powerful returns come not from picking the right asset once, but from compounding: small, consistent gains that accumulate over time. The maths is well-understood. A 7% annual return doubles your money in roughly a decade. The mechanism is simple. The discipline is the hard part.
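The "roughly a decade" figure is just the doubling-time formula; a quick check, in the same spirit as the sketches above:

```python
import math

# Years for capital to double at a constant annual return r: (1 + r) ** t = 2
def years_to_double(annual_return: float) -> float:
    return math.log(2) / math.log(1 + annual_return)

print(round(years_to_double(0.07), 1))  # ~10.2 years at 7%
```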
A cognitive architecture for AI works the same way. The first week, your context document is thin and the outputs are only marginally better than a cold prompt. By the third month, if you've maintained it, if every meeting brief, decision memo, and stakeholder email has refined the system's understanding of how you think, the gap between a contextualised and a cold interaction becomes difficult to close. The system has compounded.
That's the aspiration, anyway. The research is clear on the architecture: three tiers, structured and modular, anchored to existing behaviours, updated through use rather than as a separate chore. The research is thin on whether senior professionals will actually sustain it.
My bet is that the answer depends entirely on whether the system earns its place in the daily routine, whether it feels less like a tool to configure and more like a colleague who's been paying attention.
We'll find out soon enough.
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.
- Anthropic Engineering, "Effective context engineering for AI agents," September 2025.
- BJ Fogg, Tiny Habits, 2020; Fogg Behavior Model, Stanford Behavior Design Lab.
- ICMCI, CMC Common Body of Knowledge v4.0, 2021.
- IoD, Director Competency Framework, 2024.
- IASP, Strategy Body of Knowledge.
- Userpilot, 2025 SaaS Benchmark Report.
- Amplitude, 2025 Product Benchmark Report.
- NVIDIA, chunking research, 2025.
- Van de Pol et al., Instructional Science, 2015.
- Wood, Bruner & Ross, "The role of tutoring in problem solving," 1976.
The four AI research reports were generated using Claude (Anthropic), ChatGPT (OpenAI), Grok (xAI), and Perplexity, each given identical prompts in March 2026. Full research documents available on request.
