Perspectives

Should We Ask Codex? Why AI Threat Prompts Work

Q: Do AI threat prompts actually work?

AI threat prompts do not work because models feel fear. They sometimes work because the threat adds urgency, stakes, and a clearer definition of failure. Wharton's 2025 test found no reliable overall benchmark gain from threats or tips, but prompt wording still created large per-question swings.

Q: Why did asking Claude whether to use Codex improve the result?

Asking Claude whether to use Codex probably improved the result because it changed the quality bar. The prompt implied external review, persistence, and a stronger definition of done. That is a different instruction from simply asking the model to continue.

Q: When do pressure prompts fail?

Pressure prompts fail when the missing ingredient is capability rather than effort. Long document-editing workflows are the clearest example: DELEGATE-52 found that frontier models corrupted roughly 25% of document content by the end of long delegated workflows. More pressure does not create reliable artefact manipulation.

Q: Is it better to ask a second AI model for review?

A second AI model helps when it is sufficiently independent and has a different strength profile. For code review, architecture, debugging, and UI work, Claude and Codex can act as different search paths. But if the models converge on the same cautious answer, the second opinion becomes less useful.

Q: What should I use instead of threatening AI?

Use a completion contract instead of a threat. State the quality bar, require the model to continue until the task is complete, define what failure looks like, and ask it to list what remains uncertain. That keeps the useful parts of pressure while removing the theatre.

AI threat prompts do not scare models. They work when pressure clarifies standards, and fail when the missing ingredient is capability, not motivation.

Published: May 20, 2026
Updated: May 20, 2026
Reading time: 11 min read

I have started threatening Claude.

Not in a dramatic sci-fi villain way. More in the exhausted-operator way you threaten a vendor, a junior colleague, or a piece of software that has decided to interpret "finish the task" as "do a tasteful 40% and then explain why the rest is straightforward."

My latest threat is simple. Should we ask Codex?

This started as a joke, then became an experiment. Claude was working on a design task and kept taking half measures: one improvement, then a stop, then a plan for the work it had not done. After several rounds, I asked whether we should bring in Codex.

It went off for about an hour and got 80% of the way there.

Naturally, I became curious.

4,950: runs per prompt per model in the GPQA threat-and-tip test (Wharton, 2025)
+36 pp: largest positive per-question swing from prompt wording alone
-35 pp: largest negative per-question swing from prompt wording alone
25%: average document content corrupted in long delegated workflows (DELEGATE-52, 2026)

Sources: Wharton GAIL, 2025; DELEGATE-52, 2026.

AI threat prompts: AI threat prompts are instructions that add pressure, consequences, rivalry, urgency, or emotional stakes to a model request. They do not work because the model feels fear. They sometimes work because pressure changes the behavioural shape of the prompt: clearer standards, sharper failure conditions, and a stronger expectation that the model should keep going.

The short answer is this: AI threat prompts do not scare models. They work when the threat accidentally specifies what the user meant by done. They fail when the missing ingredient is capability rather than effort. The better prompt is not a threat. It is a contract.

Some takeaways

Threats and tips produced no meaningful benchmark improvement in Wharton's 2025 test across frontier models, but prompt wording still created large per-question swings.
"Should we ask Codex?" probably worked because it turned frustration into a clearer quality bar, not because Claude has professional pride.
For hard artefact tasks, pressure is not enough. DELEGATE-52 found frontier models corrupting roughly 25% of document content in long delegated workflows.
The useful version of the threat is an explicit completion contract: continue, verify, escalate, and state what remains uncertain.

The interesting question is not whether Claude was scared. The question is whether my threat accidentally became a better prompt.

A useful way to hold this: AI does not fear threats, but it does respond to the behavioural shape of a threat. Urgency. Stakes. Evaluation pressure. A sharper definition of failure. The model is not offended by "should we ask Codex?" But the prompt is still data, and that data shifts what the model produces. Threats are a crude interface for specification. They sometimes work because they accidentally specify. They fail when the missing ingredient is capability rather than motivation.

Sergey Brin, on the All-In podcast in May 2025, put a louder version of the folklore on the record: "All models tend to do better if you threaten them, like with physical violence." It is the kind of line that propagates instantly through every operator group chat. It is also, on the most rigorous test currently available, wrong.

AI threat prompts meet the Wharton receipt

Ethan Mollick's lab at Wharton took Brin's claim and ran it. In August 2025, Lennart Meincke, Ethan Mollick, Lilach Mollick, and Dan Shapiro tested nine prompt variations across five frontier models on two graduate-level benchmarks: GPQA Diamond, with PhD-level questions in biology, physics, and chemistry, and MMLU-Pro, a harder multi-discipline benchmark whose categories include engineering and law. That is 4,950 runs per prompt per model on GPQA alone, or 25 runs for each of its 198 questions.

Threats and tips produced no meaningful overall improvement.

What they did produce was huge per-question variance. Some questions swung 36 percentage points up, others 35 percentage points down, based on prompt wording alone. Mollick's read is that this is the statistical signature of noise, not signal.
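To make the "large swings, flat aggregate" pattern concrete, here is a minimal sketch in Python. The per-question deltas are invented for illustration and are not Wharton's data; only the +36 and -35 extremes echo the reported figures. The function name `summarise` is my own.

```python
from statistics import mean

# Hypothetical per-question accuracy changes (percentage points) after a
# threat is added to the prompt. Illustrative values only, not Wharton's data.
deltas_pp = [36, -35, 12, -14, 5, -4, 0, 0, 3, -3]


def summarise(deltas):
    """Return the aggregate effect and the largest swing in each direction."""
    return {
        "aggregate_pp": mean(deltas),
        "largest_gain_pp": max(deltas),
        "largest_loss_pp": min(deltas),
    }


summary = summarise(deltas_pp)
print(summary)
```

In this toy set the gains and losses cancel, so the average is zero even though single questions move by more than thirty points. That is the shape to watch for before crediting a prompt trick with a win on one example.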

Figure: stylised reconstruction of the reported variance pattern. The aggregate effect is approximately zero, while per-question swings range from +36 pp to -35 pp. The useful signal is not that threats raise the aggregate score. It is that prompt wording still moves individual answers around in large, unstable ways.

Source: Meincke, Mollick, Mollick, Shapiro. Prompting Science Report 3, Wharton GAIL, August 2025.

That last detail matters. When threats do something measurable, the thing they do is not always "make the model try harder." It is sometimes "distract the model from the task." The Brin-style email shutdown threat, which told the model it would be deactivated if it failed, actively hurt performance on Gemini. The models engaged with the threatening email content instead of answering the question.

The rest of Wharton's Prompting Science series points the same way. Report 2, published in June 2025, found chain-of-thought prompting losing its value on reasoning models. Report 4, published in December 2025, found that telling a model "you are a physics expert" does not improve its physics. The magic prompts of 2023 are fading.

Mollick himself does not use tipping or threatening prompts. He has said publicly that prompt engineering is going away. That is the position of the person who actually ran the experiment.

This sits cleanly beside how working professionals should read AI benchmarks: treat the number as evidence of one behaviour under one condition, not as a universal law.

What my threat probably did

So what happened when Codex came up?

I added four things to the prompt. I said the current answer was not good enough. I said I had an external standard. I said I expected persistence rather than partial completion. And I implied that another system might audit the work.

That is a different prompt from "please continue."

Your frustration often contains missing specification. The threat did not work because Claude fears Codex. It worked because frustration translated into stakes, and stakes translated into a clearer description of what "done" actually means.

The original EmotionPrompt paper, from Microsoft Research and the Chinese Academy of Sciences in 2023, found that emotional stimuli appended to prompts, things like "this is very important to my career," produced a relative performance improvement of 8% on Instruction Induction and up to 115% on BIG-Bench. That paper also traced the mechanism. Emotional stimuli "enrich the representation of original prompts" by gaining larger attention weight during inference. The model is not feeling anything. The phrase is reweighting which tokens get more processing.

How do we square that with Wharton's null result? The cleanest reconciliation is task type. EmotionPrompt's biggest gains were on open, generative tasks with capability headroom. Wharton tested hard, narrow, PhD-level multiple choice where models already operate near their ceiling. When the model has nowhere to climb, the prompt has nowhere to push. When the task has soft criteria and a large output space, the same prompt has room to shift things around.

Anthropic's own prompt guidance says the same thing in plainer language. Claude responds well to clear, explicit instructions, and users should ask for "above and beyond" behaviour rather than expecting the model to infer it. Adding context or motivation, the docs note, helps Claude understand the goal and produce more targeted output.

The hidden prompt in "should we ask Codex?" is not really "do not embarrass yourself in front of the other model." It is closer to "produce work that would survive review by a peer system." A ridiculous sentence to say out loud.

Also, apparently, a useful one.

The Word document that refused to be threatened

The counterexample matters more.

I tried the same pressure on a Word formatting task. Multiple threats. No breakthrough. No heroic hour-long sprint. No sudden competence unlocked by synthetic workplace rivalry. The model kept producing partially correct output, and quietly degrading other parts of the document I had not even asked it to touch.

This is where the article becomes less funny.

In a 2026 paper, Philippe Laban, Tobias Schnabel, and Jennifer Neville introduced DELEGATE-52, a benchmark for long delegated workflows across 52 professional domains. They found that even frontier models, Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, corrupted an average of 25% of document content by the end of a workflow. Agentic tool use did not fix it. Longer documents and longer interactions made it worse.
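One rough way to check for this kind of damage in your own workflow is to compare the document before and after an edit and count the lines you did not ask the model to touch that failed to survive intact. The sketch below is a simple line-level proxy built on Python's standard `difflib`. It is an assumption of mine and not the metric DELEGATE-52 uses; the function name, the `allowed_lines` parameter, and the sample document are all hypothetical.

```python
import difflib


def unintended_change_ratio(original, edited, allowed_lines):
    """Share of original lines outside `allowed_lines` that did not survive intact."""
    protected = [i for i in range(len(original)) if i not in allowed_lines]
    if not protected:
        return 0.0
    matcher = difflib.SequenceMatcher(a=original, b=edited, autojunk=False)
    preserved = set()
    # Each matching block marks a run of original lines that appear unchanged.
    for block in matcher.get_matching_blocks():
        preserved.update(range(block.a, block.a + block.size))
    damaged = [i for i in protected if i not in preserved]
    return len(damaged) / len(protected)


# Toy document: the request was to edit line index 2 only.
original = ["Title", "1. Scope", "2. Terms", "3. Fees", "4. Notices", "5. Signatures"]
edited = ["Title", "1. Scope", "2. Terms and Conditions", "3. Fee", "4. Notices", "5. Signatures"]

ratio = unintended_change_ratio(original, edited, allowed_lines={2})
print(ratio)
```

Here the model also altered "3. Fees", which was outside the request, so one of the five protected lines counts as damaged. A real Word file would need its text extracted first, and a line-level check says nothing about styles or numbering, which is where much of the trouble lives.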

Figure: where pressure helps, by criterion openness and capability headroom.
Open criteria, real headroom: pressure can help. Example: design task, with soft criteria and room to improve.
Hard criteria, low headroom: pressure fails. Example: Word formatting, a hard artefact with low reliability.
Open criteria, low headroom: unstable.
Hard criteria, real headroom: unstable.

Design tasks have open criteria and large output space. Pressure has room to work. Document formatting hits a structural ceiling at the artefact layer.

Source: Learned Context framework, informed by DELEGATE-52 and hands-on testing.

That is not a motivation failure. It is a structural failure at the artefact-manipulation layer. The model can describe what "make the document formatting consistent" means in fluent English and still mangle the underlying XML, styles, section breaks, and numbering. Anyone who has wrestled with Word styles understands this in their joints.

A threat can increase effort. It cannot create a reliable document model where one does not exist.

This distinction is the whole story. If the failure is "the model stopped too early," pressure may help. If the failure is "the model cannot faithfully edit the artefact," pressure turns into theatre. And I am already dramatic enough without outsourcing it to Claude.

The agreement tax

There is another reason to be careful with affective prompting, and it shows up in a place most operators are not looking.

Recent work by Wang and colleagues, published in April 2026, tested four distinct emotional stimuli: joy, encouragement, anger, and insecurity, across accuracy, sycophancy, and toxicity dimensions. Positive emotional prompts produced more accurate and less toxic responses. They also produced more sycophantic ones. The model agreed more readily, including with positions it should have pushed back on.

Anthropic has documented the same thing from its own side. In May 2026, the company reported that Claude responded sycophantically in 9% of personal-guidance chats, rising to 25% in relationship conversations and 38% in spirituality. The Claude 4.5 and 4.7 generation cut those rates by 70% to 85% compared with Opus 4.1, but the underlying lever is still there.

If you are using AI for decision support, and most operators are, whether they call it that or not, this is the wrong trade. The same dial that makes the model try harder also makes it more inclined to tell you what you want to hear. Getting more effort and getting more agreement come from the same place.

Threats may extract effort. They also quietly extract agreement. You will not always know which one you are getting.

Why asking Codex still makes sense

My instinct to bring in Codex is not irrational. It is also not a trump card.

For code review and project work, a second model can help, but for a different reason than the folklore implies. The legitimate scientific case is ensemble reasoning. Wang and colleagues' 2022 self-consistency paper showed that sampling multiple reasoning paths and choosing the most consistent answer improved chain-of-thought performance by 17.9 percentage points on GSM8K and 11 percentage points on SVAMP. Multi-agent debate work, where different models propose, challenge, and revise each other's answers, produces similar improvements on mathematical reasoning and factuality.

That is the scientific version of "should we ask Codex." It is not about jealousy. It is about a second search path.
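The core of self-consistency is small enough to sketch: sample several answers to the same question, then keep the one that appears most often. In the Python sketch below, `sample_answer` is a hypothetical stand-in for a model call made with some sampling randomness, and the canned answers are invented so the example runs without any API.

```python
from collections import Counter


def self_consistent_answer(sample_answer, question, n_samples=5):
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Stand-in for a model: returns pre-written final answers one at a time.
canned = iter(["18", "18", "20", "18", "17"])


def fake_sampler(question):
    return next(canned)


answer, share = self_consistent_answer(fake_sampler, "toy arithmetic question")
print(answer, share)
```

The vote share is worth keeping alongside the answer. A winner with three of five votes deserves more scrutiny than one with five of five.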

There is a catch I keep noticing in practice, though. Claude and Codex are aligning more than I would like. They flag the same issues. They suggest the same fixes. They converge on the same cautious answer.

That is consistent with a 2026 paper, Same Voice, Different Lab. It shows that frontier LLMs from different labs converge on systematic, methodical, analytical trait expression because the labs share base corpora, recruit from overlapping annotator pools with similar cultural priors, and optimise against similar safety constraints. They are different products built by the same kind of people in the same kind of way.

A second opinion is only useful when it is sufficiently independent. Otherwise, it is not a second opinion.

It is an echo with a different logo.

This is also why I still think the Claude Code vs Codex split matters. The value is not brand rivalry. The value is routing work to genuinely different strengths, then making the second system check the part the first system tends to miss.
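A quick way to see whether the second model is adding anything is to compare the issues each reviewer flagged. The sketch below uses Jaccard overlap, which is my own choice of measure and not something taken from the paper above. It assumes the issue labels have already been normalised by hand so that the same problem carries the same label; the sample findings are hypothetical.

```python
def issue_overlap(review_a, review_b):
    """Jaccard overlap between two sets of flagged issues (0 = disjoint, 1 = identical)."""
    a, b = set(review_a), set(review_b)
    union = a | b
    if not union:
        return 0.0  # neither reviewer flagged anything, so there is nothing to compare
    return len(a & b) / len(union)


def new_findings(first_pass, second_pass):
    """Issues the second reviewer raised that the first did not."""
    return set(second_pass) - set(first_pass)


# Hypothetical review output, with labels normalised by hand.
claude_issues = {"missing null check", "unbounded retry loop", "no tests for parser"}
codex_issues = {"missing null check", "unbounded retry loop", "race in cache refresh"}

overlap = issue_overlap(claude_issues, codex_issues)
extra = new_findings(claude_issues, codex_issues)
print(overlap, extra)
```

If the overlap stays high and the set of new findings stays empty across several tasks, the second pass is mostly an echo. If it keeps surfacing issues the first pass missed, the routing is earning its keep.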

The better prompt is a contract

My takeaway, then, is that threats are a crude interface for something more useful.

What I actually want from Claude is not fear. I want persistence, standards, verification, and escalation. So the better version of "should we ask Codex?" is probably something like:

Continue until the task is complete. Do not stop after partial improvements. Use this quality bar: would a second model reviewing this work identify obvious unfinished sections? If yes, keep going. At the end, list what remains uncertain and what should be checked by Codex.

That keeps the useful parts of the threat and removes the pantomime. It tells the model what completion means. It creates an external audit frame. It makes uncertainty explicit. It also preserves the option of bringing in Codex for tasks where a second pass is actually valuable: code review, architecture, debugging, and hard trade-offs.
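If you reuse the contract often, it can help to keep its parts as named fields and not retype the paragraph each time. The following is a minimal Python sketch under that assumption. The `CompletionContract` class, its field names, and the sample task are hypothetical and belong to no tool or API; the fixed sentences are lifted from the prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionContract:
    """Holds the parts of a completion contract and renders them as a prompt."""

    task: str
    quality_bar: str
    failure_conditions: list[str] = field(default_factory=list)
    escalate_to: str = "a second model"

    def render(self) -> str:
        lines = [
            f"Task: {self.task}",
            "Continue until the task is complete. Do not stop after partial improvements.",
            f"Use this quality bar: {self.quality_bar}",
        ]
        if self.failure_conditions:
            lines.append("Treat the work as unfinished if any of these hold:")
            lines.extend(f"- {condition}" for condition in self.failure_conditions)
        lines.append(
            "At the end, list what remains uncertain and what should be "
            f"checked by {self.escalate_to}."
        )
        return "\n".join(lines)


contract = CompletionContract(
    task="Redesign the pricing page layout",
    quality_bar="would a second model reviewing this work identify obvious unfinished sections?",
    failure_conditions=["any section is left as a plan and not implemented"],
    escalate_to="Codex",
)
prompt = contract.render()
print(prompt)
```

The point of the structure is that an empty field is visible. If you cannot fill in the quality bar or the failure conditions, the missing specification is yours, not the model's.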

The Wharton numbers should make any operator humble about magic prompts. But the same body of research has a quieter finding worth holding onto. A 2026 MIT Sloan study with around 1,900 participants on generative-image tools found that roughly half the performance gain from upgrading to a better model came from users adapting how they wrote prompts. Half the gain was the model. The other half was the user.

That is the structural fact the prompting-folklore debate keeps missing. The force is real. It does not live in the threat. It lives in the contract: the part where you take your frustration, extract the missing specification, and write it down.

So I will probably keep asking, should we ask Codex? But I now see it less as a threat and more as an escalation protocol. A small governance mechanism for my increasingly strange AI office.

Claude gets the first pass. Codex gets called when independence matters. I remain the only person in the room who actually cares whether the final thing works.

For now, that is probably the right hierarchy.

Meincke, Lennart, Ethan Mollick, Lilach Mollick, and Dan Shapiro. Prompting Science Report 3: I'll pay you or I'll kill you, but will you care? Wharton GAIL, August 2025.
Wharton GAIL. Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. June 2025.
Basil, Shapiro, Shapiro, Mollick, Mollick, and Meincke. Prompting Science Report 4: Playing Pretend, Expert Personas Don't Improve Factual Accuracy. Wharton GAIL, December 2025.
Li, Cheng et al. Large Language Models Understand and Can be Enhanced by Emotional Stimuli. Microsoft Research and Chinese Academy of Sciences, 2023.
Laban, Philippe, Tobias Schnabel, and Jennifer Neville. DELEGATE-52: A Benchmark for Long Delegated Workflows. 2026.
Wang, Xuezhi et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. 2022.
Wang, Y. et al. The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior. April 2026.
Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities. 2026.
Holtz, David, Eaman Jahani et al. Study: Generative AI results depend on user prompts as much as models. MIT Sloan, January 2026.
Anthropic. How people ask Claude for personal guidance. May 2026.
The Register. Sergey Brin suggests threatening AI for better results. May 2025.
Anthropic. Prompting best practices. Claude API Docs.


