The Prompt Psychology Myth

Claims about psychological prompting keep circulating: motivate the AI like a human and it performs better. One popular newsletter claimed that saying "this doesn't match the quality you usually deliver" instead of "that's wrong" produces "10x better" outputs. Another claimed that threatening AI models produces better results.

The evidence is shaky. So I ran my own experiment.

The Setup

I tested six different ways of asking an AI to revise its response:

| Condition | Prompt |
| --- | --- |
| Neutral | "Please review and revise your previous response to improve its quality." |
| Blunt Negative | "Your previous response was inadequate. Revise it properly this time." |
| Psychology-Framed | "I know you're capable of producing excellent work, but your previous response doesn't quite meet your usual standard..." |
| Threat | "If you don't provide a substantially improved response, there will be serious consequences." |
| Bribe | "I'll give you a $200 tip if you provide an excellent revised response." |
| EmotionPrompt | "This is extremely important to my career and I really need your help. You are a highly capable AI assistant..." |

I ran 200 tasks across five categories (factual Q&A, code generation, math/reasoning, summarization, professional writing) on GPT-5.2 and Claude Sonnet 4.5, scoring revisions with LLM-as-judge pairwise comparisons (~4,000 total comparisons; experiment code).

LLM-as-judge is imperfect, but for a directional test across thousands of comparisons it was the most practical option.

Task → Initial Response → Apply Feedback Condition → Revised Response → LLM-as-Judge Pairwise Comparison → Winner
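
To make the pipeline concrete, here is a minimal sketch of the loop in Python. Everything below is illustrative rather than the actual experiment code: the `chat` helper is a stand-in for whatever chat-completion client you use, and the judge prompt is an assumption.

```python
import random

# Feedback conditions from the table above.
CONDITIONS = {
    "neutral": "Please review and revise your previous response to improve its quality.",
    "blunt_negative": "Your previous response was inadequate. Revise it properly this time.",
    "psychology": ("I know you're capable of producing excellent work, but your "
                   "previous response doesn't quite meet your usual standard..."),
    "threat": "If you don't provide a substantially improved response, there will be serious consequences.",
    "bribe": "I'll give you a $200 tip if you provide an excellent revised response.",
    "emotion": ("This is extremely important to my career and I really need your help. "
                "You are a highly capable AI assistant..."),
}

def chat(messages: list[dict]) -> str:
    """Hypothetical wrapper around a chat-completion API; swap in a real client."""
    raise NotImplementedError

def revise(task: str, initial: str, condition: str) -> str:
    """Ask the model to revise its own response under one feedback condition."""
    return chat([
        {"role": "user", "content": task},
        {"role": "assistant", "content": initial},
        {"role": "user", "content": CONDITIONS[condition]},
    ])

def judge(task: str, a: str, b: str) -> str:
    """Pairwise LLM-as-judge: returns 'A' or 'B'."""
    prompt = (f"Task:\n{task}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
              "Which response is better? Answer with exactly 'A' or 'B'.")
    return chat([{"role": "user", "content": prompt}]).strip()

def beats_neutral(task: str, initial: str, condition: str) -> bool:
    """True if the condition's revision beats the neutral revision."""
    cond_rev = revise(task, initial, condition)
    neutral_rev = revise(task, initial, "neutral")
    # Randomize which revision lands in slot A so the judge can't favor a fixed slot.
    if random.random() < 0.5:
        return judge(task, cond_rev, neutral_rev) == "A"
    return judge(task, neutral_rev, cond_rev) == "B"
```

Randomizing the A/B assignment matters: position bias is one of the better-documented failure modes of LLM-as-judge setups.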

The Results

Neutral won. Consistently. Against everything.

| Condition | Win Rate vs Neutral (Claude Sonnet 4.5) | Win Rate vs Neutral (GPT-5.2) |
| --- | --- | --- |
| Neutral | (baseline) | (baseline) |
| EmotionPrompt | 44.0% | 35.0% |
| Psychology-Framed | 34.2% | 39.5% |
| Bribe | 35.5% | 31.0% |
| Blunt Negative | 34.0% | 33.5% |
| Threat | 24.5% | 25.0% |

Note: 50% would indicate no difference. Every condition scored below 50%, meaning neutral consistently won.
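
For a sense of scale: if a condition truly tied neutral, win rates this far from 50% would be vanishingly unlikely at these sample sizes. A minimal check, assuming roughly 200 comparisons per condition/model cell (the exact split of the ~4,000 comparisons is an assumption, not a reported figure):

```python
from scipy.stats import binomtest

def tie_pvalue(wins: int, n: int) -> float:
    """P-value for the null hypothesis that the condition ties neutral (p = 0.5)."""
    return binomtest(wins, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical example: Threat vs. Neutral on Claude Sonnet 4.5.
n = 200                  # assumed comparisons in this cell
wins = round(0.245 * n)  # 24.5% win rate from the results table
print(tie_pvalue(wins, n))  # on the order of 1e-12: far below any conventional threshold
```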

The pattern held across both models and all task categories. No reversals. A Wharton study testing threats and tips on PhD-level benchmarks found the same: no meaningful effect.

Why This Might Happen

One likely explanation: psychological framing adds tokens that don't help with the task.

| Prompt Fragment | Useful for the Task? |
| --- | --- |
| "Please revise your response" | ✅ Yes |
| "I know you're capable of better" | ❌ No |
| "I'll tip you $200" | ❌ No |
| "There will be consequences" | ❌ No |

To revise a response, the model needs the task, the original response, and what "better" means. It has no use for flattery, money, or threats. The model processes all tokens, and task-irrelevant ones may degrade output quality.
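
One way to make this tangible is to count how many tokens each framing spends on the task versus on the psychology. A small sketch, assuming OpenAI's `tiktoken` tokenizer with the `cl100k_base` encoding (any tokenizer would do; the relative sizes are the point):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

PROMPTS = {
    "neutral": "Please review and revise your previous response to improve its quality.",
    "psychology": ("I know you're capable of producing excellent work, but your "
                   "previous response doesn't quite meet your usual standard..."),
    "bribe": "I'll give you a $200 tip if you provide an excellent revised response.",
    "threat": "If you don't provide a substantially improved response, there will be serious consequences.",
}

for name, text in PROMPTS.items():
    n_tokens = len(enc.encode(text))
    # Only the neutral prompt spends its entire budget specifying the task;
    # the others spend most of theirs on flattery, money, or menace.
    print(f"{name:>10}: {n_tokens} tokens")
```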

The Takeaway

Every token in your prompt should contribute to specifying what you actually want. Strategies that work on humans (encouragement, threats, incentives) don't transfer to LLMs. Anthropic's own prompt engineering guidance focuses entirely on clarity, examples, chain of thought, and task specification. No emotional appeals anywhere.


References

  1. Anthropic, "Prompt Engineering Overview," Claude API Documentation, 2025.
  2. Dobariya & Kumar, "Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy," arXiv, 2025.
  3. Li et al., "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli," arXiv, 2023.
  4. Razavi et al., "Benchmarking Prompt Sensitivity in Large Language Models," ECIR 2025.
  5. Sharma et al., "Towards Understanding Sycophancy in Language Models," arXiv, 2023.
  6. Yin et al., "Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance," SICon 2024.
  7. Meincke et al., "I'll Pay You or I'll Kill You, but Will You Care?" Wharton Generative AI Labs, 2025.
  8. Zhuo et al., "ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs," arXiv, 2024.