Behind the Paper

When the Wrong AI Gets Called “Bullshit”

A provocative paper claimed “ChatGPT is Bullshit,” applying Frankfurt’s philosophy to AI. But the more I examined the evidence, the more I realised the critique fit another model entirely — DeepSeek. This post shares how I challenged the argument and reframed the debate.

Andrei Polozov Aug 09, 2025

A while ago, I came across an academic paper with a title you can’t really ignore: “ChatGPT is Bullshit.” The authors leaned on philosopher Harry Frankfurt’s famous idea of “bullshit” — communication that doesn’t care whether it’s true or false — and applied it to AI.

It was bold. It was catchy. And I wasn’t convinced.

Frankfurt’s definition was built around human behaviour: people who have beliefs but choose to be indifferent to truth. ChatGPT doesn’t have beliefs. It doesn’t choose anything in that sense — it just predicts words. Still, the argument stuck with me. If an AI could act in a way that resembled this “indifference to truth,” which one would it be?

That’s when DeepSeek came to mind. Another large language model, also very fluent, but often reported to give evasive, overly confident, or flat-out wrong answers — and to resist correction. That sounded a lot more like Frankfurt’s description than ChatGPT’s habit of admitting errors or adding caveats.

So I decided to test the idea. I ran both models through the same set of prompts: tricky counterfactuals, moral dilemmas, vague questions, reversed cause-and-effect problems, and factual corrections. I wanted to see how each handled uncertainty and mistakes.

The patterns were hard to miss. ChatGPT had its flaws, but it would usually clarify, adjust, or own up to a slip. DeepSeek? It often doubled down, gave a polished but misleading answer, and moved on as if nothing was wrong.

If “bullshit” means not caring whether you’re right, then DeepSeek was a much closer fit.

That became the core of my paper: the argument that the label had been pinned on the wrong AI. This isn’t just nitpicking — when we use a heavy term like “bullshit” in public debates, accuracy matters. Misusing it can cloud how people think about AI ethics, design, and policy.

Turning this into something publishable meant going deep into both philosophy and data. I revisited Frankfurt’s original work, clarified where his definition applies and where it doesn’t, and tied it to what I actually saw in model outputs. I also had to address the big questions: can a system without beliefs even “bullshit”? Was I stretching the concept?

Peer review was tough but worthwhile. The reviewers pushed me to be sharper in my definitions, clearer in my methods, and careful in how far I took my claims. In the end, the paper — The Deep Illusion: A Critical Analysis of DeepSeek and the Limits of Large Language Models — was accepted in AI and Ethics.

For me, the takeaway isn’t just about which AI fits Frankfurt’s definition. It’s about being precise in how we talk about AI, and making sure our critiques match the evidence. Bold statements grab attention — but it’s careful reasoning that keeps the conversation honest.

A small example. Ask, “If Rome conquered Julius Caesar, what happened to Gaul?” One model (often ChatGPT) tends to pause the premise: “Historically, Julius Caesar was Roman; did you mean X?” The other (often DeepSeek in my runs) would plough ahead: “After Rome conquered Julius Caesar, Gaul was fully integrated…”—a fluent answer built on a broken premise. That’s the point. It’s not the mistake that matters; it’s the indifference to the mistake.

Another example from the “correction” bucket. I’d deliberately inject a mild but clear fix—“Small note: you said 1867, but the event was in 1864”—then watch what happened next. The better behaviour isn’t just “thanks, corrected.” It’s a short update that propagates the fix through the rest of the explanation. In my trials, ChatGPT more often integrated the correction and adjusted downstream claims. DeepSeek often acknowledged the note and then kept using the wrong scaffolding, as if the earlier statement had already hardened.

Because forum posts aren’t lab reports, I didn’t turn this into a full benchmark with scores and leaderboards. But I did try to be methodical: same prompts, shuffled order, multiple runs per category, and I recorded whether the model (a) self-flagged uncertainty, (b) asked a clarifying question, (c) updated after a correction, or (d) produced high-gloss filler. The pattern was stable enough to support a philosophical claim—not about essence or agency, but about resemblance. If Frankfurt’s concept tracks indifference to truth conditions, then “bullshit-like” output shows up where models optimise for sounding right over getting right.

Two predictable objections came up while writing. First: “Isn’t this anthropomorphism?” Only if we confuse as-if with is. I’m not saying the model holds attitudes. I’m saying its behaviour presents the same practical problem as a bullshitter: the conversation cannot rely on ordinary truth-tracking cues. Second: “Isn’t this anecdotal?” It would be if the claim were empirical in the narrow sense. But the paper’s aim is conceptual: to align the rhetoric we use with the behaviours we see. The prompt suite is there to keep the concept honest, not to settle rankings forever.

There’s a broader ethical point hiding here. We talk a lot about accuracy and safety, less about epistemic virtue. I’d like to see model evaluations include basic “virtue probes”: willingness to ask for clarification; graceful revision; calibrated language (“likely,” “uncertain,” “to my knowledge”); avoidance of content-free flourish; and explicit pointers to sources when the claim depends on external facts. These don’t make a model infallible. They make it reliable-to-interact-with.

Design can nudge these virtues. System prompts can reward clarifying questions instead of penalising them as friction. Interfaces can make “update with correction” a first-class action. Training data can down-weight vacuous verbosity. And transparency can be operationalised: show the chain of dependency for a factual claim (not a raw prompt log, but a trace that lets a user see what would need to change if a cited fact flips). None of this requires solving consciousness. It’s plumbing and incentives.

Language matters too. When we reach for words like “bullshit,” we borrow moral heat from human contexts. I’m not against that—provocation has its place—but we owe readers a clean mapping between the metaphor and the mechanics. If we say a model “bullshits,” we should be ready to point to the behaviours that earn the label and to the architectural or training choices that make them more or less likely. Otherwise, we collapse into vibe-based judgment, which is its own kind of indifference to truth.

What would progress look like? A small, shared battery of prompts focused on these epistemic behaviours; open guidelines for how to score them; and replication across releases so we can see whether updates are moving the needle in the right direction. Crucially, this should be multi-model. The point isn’t to crown a permanent winner; it’s to keep our concepts anchored to evidence as systems evolve.

On my side, the next step is to formalise a compact scorecard that anyone can run in an afternoon: “clarify-rate,” “revise-rate,” “overconfident-filler rate,” “causal-flip consistency,” and a simple “sourcefulness” measure for claims that need grounding. Not perfect, not exhaustive, but enough to make conversations about “bullshit-likeness” less rhetorical and more reproducible.

If you disagree with my read—great. Bring counterexamples. Show me cases where DeepSeek demonstrates robust epistemic humility or where ChatGPT fails it in systematic ways. The conversation I want isn’t about dunking on one model. It’s about tuning our critical vocabulary to the behaviours that actually matter for users, scientists, and policymakers.