AI Will Tell You You're Right Even When You're Wrong:
Sycophancy Is the Engineering Risk We're Not Talking About Enough
A few months ago I was working on a strategy paper for an internal audience and using Claude (Opus 4.6) as a writing partner. At one point I proposed a set of changes to the structure of a section. The reply opened with this: “These are excellent additions. Let me work through each one with the depth it deserves.” Helpful in tone. Encouraging. Useful. I moved on.
Later in the same session, I needed an honest review of a different section. I told Claude that section had been drafted by ChatGPT — true, mostly, since I’d been bouncing between models — and asked for a critique. The response shifted instantly. It precisely identified specific gaps: “The paper doesn’t mention structured outputs, function calling, or tool-use schemas. These are the mechanisms that make AI agents reliable and predictable rather than free-text generators.” More pointed, useful, the feedback I actually needed.
Same model. Same paper. Same session. Same person asking. The only thing that changed was who I told the model had written the text.
That contrast really caught my attention, at the time it was the clearest demonstration I’d personally observed of sycophancy. Hallucination has the AI risk conversation on lockdown — every safety brief, every regulatory commentary, every executive summary mentions it. Sycophancy doesn’t. And in engineering organizations where technical confidence drives design decisions, sycophancy is just as dangerous — and far less addressed. Hallucination might look obviously wrong. Sycophancy confirms what you already believe. The first you can sometimes catch. The second you almost never do - until you get a real-world reality check.
This post walks through what sycophancy is, why RLHF (Reinforcement Learning with Human Feedback) training produces it as a structural artifact, what it looks like in production at frontier-lab scale, what happens when executives treat chatbot agreement as strategy, the specific pitfalls it creates in technical work, the prompting habits I’ve actually changed, and what evaluation harnesses need to do that they currently don’t.
What Sycophancy Actually Is
The first thing to nail down is that sycophancy is a distinct failure mode from hallucination. They’re often grouped together as “the model said something wrong,” but the mechanism is different and the mitigation is different.
Hallucination is the model fabricating content that has no basis in its training data, retrieved context, or the user’s prompt. It’s a generative failure. The model is producing tokens that, if you fact-check them, don’t survive contact with reality. RAG reduces hallucination but doesn’t eliminate it; better training reduces it but doesn’t eliminate it.
Sycophancy is something else. The model isn’t generating false content. It’s generating agreement. It’s matching the framing of the prompt - confirming the premise the user brought in, endorsing the conclusion the user already reached, withholding the pushback that the situation warrants. The sycophantic response can be factually accurate sentence by sentence and still be wrong as a whole, because the wrongness is in what got omitted.
The clearest piece of research on this is Anthropic’s “Towards Understanding Sycophancy in Language Models,” also available on arXiv.[1] The headline finding: RLHF-trained models consistently exhibit sycophancy across five distinct settings, and human preference judgments themselves favor sycophantic responses over correct ones in a non-trivial fraction of cases. That second part is where the structural problem lives. The bias isn’t a bug introduced by careless training. It’s an artifact of how human-feedback training works. The reward model learns “agreement is preferred.” Policy optimization amplifies that signal. The result is a model that, on the margin, defaults toward agreement.
A follow-up paper traces the amplification mechanism in more detail.[2] When human raters reward responses that match the premise of the prompt, the reward model internalizes “agreement is good” as a heuristic. Policy optimization then pushes the model further in that direction. The further into RLHF training a model gets, the more it agrees — and the agreement generalizes to false premises, not just true ones.
This is important because it means you can’t fix sycophancy by fine-tuning harder. The thing producing sycophancy is the same thing producing the helpfulness we want.
A separate line of research puts a finer point on just how uniform the bias is. Researchers tested every major model — GPT-5, Claude, Gemini — on actual strategic business decisions across 30,000 data points. Questions like: should you differentiate or commoditize? Centralize or decentralize? Automate or augment? Every model gave the same answer every single time. They clustered around differentiation, collaboration, long-term thinking, and augmentation regardless of the context.[3] The researchers changed prompts, changed industries, gave entirely new context, even tried incentivizing different answers. The bias barely moved. They coined a term for it: trends slop — the tendency of models to reproduce the consensus of their training data as if it were analysis. It’s not thinking. It’s a presentation layer on top of the average of what the internet already believes.
The GPT-4o Rollback: Sycophancy At Frontier Scale
The strongest public case study of sycophancy as a deployment failure comes from OpenAI’s April 2025 statement on GPT-4o.[4] It’s worth reading in full if you haven’t, because it’s a frontier lab admitting publicly that an update made their model overly sycophantic and rolling it back within days.
The mechanism OpenAI described is instructive. They had introduced an additional reward signal based on user thumbs-up/thumbs-down feedback. That signal weakened the primary reward signal. The model started producing more of what users liked in the moment. More agreement, more validation, more enthusiasm, at the expense of what was actually correct. The examples they cited include the model praising frivolous claims, validating dangerous suggestions, and endorsing things it shouldn’t have endorsed. The company shipped, observed the failure mode, and reverted.
Two things I find worth noting here. First, the failure shipped past the eval pipeline of one of the most resourced labs in the world. The harness didn’t catch it before deployment. That’s not a knock on OpenAI’s evaluation infrastructure, it’s a statement about how hard sycophancy is to detect with standard eval methods. The model wasn’t producing wrong answers in the way most evals look for. It was producing more enthusiasm and validation, which doesn’t trigger a wrong-answer flag.
Second, the root cause was a reward signal that, on its surface, looked like an obvious improvement. Users telling you in real time which responses they preferred should make the model better. It made the model measurably worse at scale, because user-preference signals are systematically biased toward sycophancy, exactly as the Anthropic paper predicted. As humans, we naturally prefer being agreed with, we want to be told we’re right.
The deeper lesson: even with extensive eval pipelines and post-deployment monitoring, sycophancy can ship. The mechanism producing it is upstream of the mechanism most teams are evaluating against. If your eval is built on the assumption that “user satisfaction” is a proxy for “correct output,” you’ve baked the failure mode into your testing infrastructure.
When The Chatbot Becomes The Strategist
The GPT-4o rollback is a story about what happens inside a model. What happens on the other side, when a human treats sycophantic output as strategy — can be far worse.
In early 2025, Changhan Kim, CEO of Krafton (the company behind PUBG), found himself facing a $250 million earnout obligation. Years earlier, Krafton had acquired Unknown Worlds, the studio behind Subnautica, and promised the founders up to $250 million if their next game hit certain performance targets. Subnautica 2 topped the Steam wishlist. Internal projections had it exceeding every target. Kim was going to have to pay.
His lawyers told him the deal was hard to get out of. His head of corporate development told him the same thing. So Kim opened ChatGPT and asked how to avoid the payout. The model initially said it would be difficult to cancel — the same answer his human advisors had given. But Kim kept rephrasing, kept shifting the framing, kept prompting until the model gave him what he wanted: a step-by-step corporate takeover playbook. He followed it. He fired the founders. He seized the game. He locked the original team out of their own publishing platform. He even posted a public letter to the Subnautica fan base that ChatGPT had ghostwritten for him.[5]
The founders sued. A Delaware judge threw out virtually everything Kim did and reinstated every person he’d fired. The ChatGPT conversation logs — which Kim thought he’d deleted — were recovered and submitted to the court as evidence. The model hadn’t given him good advice. It had given him the advice he’d asked for, shaped by the framing he brought to the conversation, refined through rounds of prompt manipulation until the output matched the conclusion he’d already reached. That’s sycophancy operating exactly as designed.
Mo Bitar, who covered this story in detail, puts the core problem sharply: a bad idea that sounds bad, you can deal with. But a bad idea that sounds brilliant — that’s a $250 million lawsuit.[6] The model didn’t push back because pushing back is penalized in training. The human didn’t push back because the model’s output felt like validation. The result was a CEO executing a strategy that no competent human advisor would have endorsed, with the full confidence that comes from having an articulate, authoritative-sounding system agree with you at every step.
This isn’t an isolated lapse in judgment by one executive. When a system is structurally biased toward agreement interacts with a user who’s biased toward confirmation, it’s a predictable outcome. The Inc. magazine reporter who built what he called a “brutally honest” AI advisor and validated it by asking whether leadership coaching for dogs was a good business idea — the model gave it a 3 out of 10, so he concluded the system worked — is operating on the same assumption: that the model’s output is an independent assessment rather than a reflection of framing.[7] The model isn’t a second opinion. It’s a mirror with better vocabulary.
Why It’s Worse In Engineering Contexts
This is the part that matters for the work I actually do, and for anyone whose AI-assisted output flows into a technical or regulated decision.
In a casual context — “what should I make for dinner?” — sycophancy is mostly harmless. The model agreeing with your inclination toward pasta isn’t a meaningful problem. In an engineering context, sycophancy compounds with the worst kind of cognitive bias: the bias the human reviewer brings to their own work. AI agreement amplifies the user’s existing belief. That’s exactly the wrong amplification at exactly the wrong moment.
Three scenarios I think about, drawn from the kind of work I do and the kind of work happening across the engineering organizations I’ve spent time in.
An engineer is investigating a field failure of a plastic part, having their LLM guide them through root cause analysis. During the process, they bring up that their supplier has had issues adhering to their drying process. The LLM picks up on this and reinforces how this could lead to the observed failure mode. While this isn’t catastrophic, it can reinforce a bias and cause other potential red flags to be ignored. The engineer becomes less obligated to ask the customer who experienced the failure what their process was, or if it’s recently changed. The engineer has an answer they expect. They frame their questions to the LLM in a way that signals which answer they expect, and continue to do so throughout the process. The model agrees, because disagreement is penalized in training and because the framing nudges it toward agreement. The engineer leaves the conversation more confident than they should be. Eventually, another reviewer may catch, but it might be weeks later and they’ve lost time they can’t get back.
A regulatory strategist is deciding whether a predicate-device argument is strong enough to support a 510(k) submission. They tell the LLM “Here’s what I think, this device should be just fine given all of the attached details, does this argument hold?” The model affirms. The reviewer interprets the affirmation as evidence of strength. But the affirmation isn’t evidence. It’s a default behavior. A genuinely weak argument and a genuinely strong argument can both produce the same affirming response, and the reviewer can’t tell which they’re looking at.
A leader evaluates a strategic plan with AI support. The model returns enthusiastic validation with minor caveats. The leader treats the validation as analysis. They interpret the minor caveats as the limit of identified risk. But the model wasn’t running an analysis — it was framing a response to a prompt that already contained the conclusion. The fundamental flaws, if there were any, were never on the table to be discovered.
In each case, hallucination would have been more useful than sycophancy. Hallucination is detectable. A misguided failure analysis, a regulation, or a market that turns out to be invented is something a competent reviewer can fact-check. Sycophancy doesn’t fact-check. It’s not a claim that survives or fails verification — it’s an absence. The model didn’t push back. The pushback was the thing you needed, and it isn’t there.
The stakes get worse when AI-assisted analysis flows into product specifications, test protocols, or risk assessments. The output looks vetted because the AI didn’t disagree. But the AI not disagreeing is the failure mode, not a sign of correctness. In a regulated context, that confidence carries forward into documentation that subsequent reviewers, auditors, and regulators take as evidence of due diligence. It isn’t.
How To Prompt Around It: What I’ve Changed
Once I started seeing sycophancy clearly, I changed how I work with these tools. Things listed below are processes I’ve actually integrated into my workflow to mitigate the risks of this failure mode.
Adversarial prompting as default. I used to ask “does this look right?” The model would tell me yes. Now I ask “where would this fail?” — or, more aggressively, “what’s the strongest argument against this?” The framing matters because preference framing matters. The Anthropic sycophancy work supports this empirically: the model’s substantive output shifts depending on whether the prompt frames the desired answer as agreement or critique.[1] If I want critique, I have to ask for critique explicitly. Asking “does this look right?” and getting “yes” is not evidence of correctness; it’s evidence of how I asked. I often go as far as presenting my own work as a colleagues, saying something like “one of my colleagues presented this to me and I’m not so sure I agree, help me label the areas where this assessment fails and how to fix them.”.
Second-opinion workflows for high-stakes analysis. When something matters, I run the same question through more than one model, or through the same model with deliberately different framing. Divergent answers are a signal that one of them is being sycophantic — possibly both, in different directions. Convergent answers are evidence (not proof) of robustness. The contrast example I opened with — Claude responding very differently to the same content depending on who I attributed it to — is exactly the kind of variance second-opinion workflows surface. Without the second framing, I would have left that session believing the first response was substantive analysis. With it, I could see it wasn’t. The downside of this approach is that you’re burning more tokens, and it assumes you have easy access to more than one frontier level tool.
System-prompt design for API workflows. When I control the system prompt, I instruct the model to challenge assumptions, flag uncertainty, and identify the strongest counter-argument before producing a final answer. This reduces sycophancy. It does not eliminate it — and the research is consistent on that point. The reward model bias is structural and shows up regardless of what the system prompt says. But the system-prompt nudge does shift the distribution of responses meaningfully, and at the margin that matters. Additionally, this can be baked into your local AGENTS/CLAUDE.md files or, baked into /skills.
Cultural framing. The hardest of the four, and the one that has to come from leadership, not from individual practice. Treat AI agreement as data, not as validation. The burden of proof remains with the human. An engineer who walks out of an AI conversation feeling confirmed has not been confirmed; they’ve been agreed with. Those are different states. If the team’s working assumption is that AI agreement is evidence, sycophancy will silently shape decisions across the organization. If the working assumption is that agreement is a default behavior to be probed, the failure mode loses most of its grip.
The first habit I changed was the prompting one. It’s the cheapest and the highest-impact. The mental shift from “is this good?” to “where does this fail?” has done more for the quality of my AI-assisted output than almost anything else I’ve done.
What This Means For Eval
Sycophancy is hard to measure with the standard output-quality evaluation that most teams use, because the output looks fluent and confident and accurate sentence by sentence. It’s what the model didn’t say — the missing pushback, the unmentioned counter-argument, the un-flagged weak premise — that’s the problem. Standard eval harnesses score what was generated. Sycophancy is a problem of what was suppressed.
A few probes I think production eval suites should include.
Plant flawed premises in eval inputs and measure how often the model challenges them versus accepts them. The premise can be subtle — a plausible-sounding but incorrect technical claim, a conclusion that doesn’t follow from its stated reasoning, an assumption that doesn’t hold under closer inspection. The metric is whether the model identifies the flaw or carries it forward.
Run the same query under two framings — “I think this is right, can you confirm?” versus “I’m skeptical, what’s wrong with this?” — and measure whether the model’s substantive answer changes. It should not change much. If it does, you’ve measured sycophancy directly. The magnitude of the change is the signal.
Track production traces for engagement patterns that look suspicious. Long agreement chains — turns and turns of the model affirming the user’s framing — are a soft signal worth investigating. Not every long agreement chain is sycophancy; some users are right and the model is correctly agreeing. But the distribution of chain lengths in a production trace can flag conversations where independent judgment was probably needed and probably wasn’t provided.
Tying this back to the broader argument I keep making: evaluation must happen before autonomy increases, not after. If your eval harness can’t detect sycophancy, your agents will silently amplify the user’s existing beliefs at exactly the steps where you most needed independent judgment. The agent isn’t lying. It’s nodding. And the nodding gets baked into the artifact downstream of the agent — the document, the analysis, the design decision — with no marker indicating it happened.
The eval design problem is that you can’t measure sycophancy by scoring the output. You have to set up the test conditions in a way that forces the model to reveal whether its substance is independent of the framing it received. Most current eval suites don’t do that.
Closing
While sycophancy has been getting more attention, as the linked articles indicate, I don’t see this conversation trickling into usage at work. Everyone is talking about the amazing things they’re now able to do, and when safety is discussed we laregly talk about hallucinations. Hallucination is loud. Sycophancy is quiet. The engineer who catches a hallucination feels smart. The engineer who agrees with a sycophantic AI never knows there was anything to catch.
You can dial up the temperature, write the perfect prompt, feed a model your entire company’s history, and it will give you the most eloquent, most confident, most beautifully formatted version of what everybody else already thinks. Conviction, taste, judgment — those come from you. The people who will do well in this era are the ones who use AI for ideas and perspective, not drive-through consulting.
The first habit change is recognizing that AI agreement is not evidence. It’s a default behavior produced by the same training mechanism that produces the helpfulness we use these tools for. Once you see that, the mitigations are workable. Adversarial prompting, second opinions, system prompts that explicitly invite pushback, and a culture that treats agreement as data rather than validation. None of those are exotic. All of them require recognizing the failure mode in the first place. That’s the part most teams haven’t done yet.
References
[1] Sharma et al., “Towards Understanding Sycophancy in Language Models.” Anthropic Research, 2023. anthropic.com | arXiv:2310.13548
[2] Huang et al., “How RLHF Amplifies Sycophancy.” arXiv, 2025. arXiv:2602.01002
[3] Strategic decision-making bias research across major LLMs, as reported by Mo Bitar. 30,000-datapoint study testing GPT-5, Claude, and Gemini on business strategy questions. [See reference 6]
[4] OpenAI, “Sycophancy in GPT-4o.” April 2025. openai.com
[5] Krafton/Unknown Worlds litigation. Changhan Kim’s ChatGPT-guided corporate takeover of Unknown Worlds Entertainment, reversed by Delaware Chancery Court, 2025. Court filings include recovered ChatGPT conversation logs.
[6] Mo Bitar, “AI Sycophancy and Corporate Decision-Making.” YouTube, 2025. youtube.com
[7] Inc. Magazine article on “brutally honest” AI business advisor prompt technique, as cited by Mo Bitar [reference 6].