"Mostly yes, Generally not:" The hidden cost of AI hedging
- Jonathan Gordon

- Mar 31
- 7 min read
The other day, I asked my AI assistant to audit part of my own codebase. While I was reviewing the output, one response startled me:
"Short answer: mostly yes. The login flow is generally not . . . The current approach is largely not..."
One response. Three positions. Zero commitment.

The AI opened with an affirmative, then reversed it in the same sentence, and then at the end, restated the reversal more strongly. It seemed to be using a different word each time, as if rotating synonyms would disguise the fact that it had no coherent position at all.
As I later learned, it wasn't waffling out of uncertainty. AI hedging is the result of the model generating output token by token, moment by moment, with no mechanism to check whether what it's saying now is consistent with what it said ten minutes (or ten seconds) ago.
I had almost scrolled past it. Then I realized that the same logic that produced those three hedge-y phrases had already caused problems in the code it implemented.
The mechanics behind the AI hedge
LLMs don't reason from a position and then express it. They predict what comes next, one token at a time, based on what has come before. There is no internal register that tracks commitments, nor a backward pass that checks for consistency. The model that wrote "Mostly yes" didn't then evaluate that claim before writing "Generally not." It simply predicted that "generally not" was a plausible continuation because it was true in that moment.
A 2025 paper on LLM self-consistency put it plainly: "the next-token prediction framework naturally forms a forward-directed language category, lacking the necessary backward edges to enforce consistency" (Lin et al., 2025, p. 3). The model can only move forward. What it committed to at the beginning of a response has little bearing on what it produces at the end.
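The forward-only loop is easier to see in code than in prose. Below is a deliberately toy sketch, nothing like a real transformer: the NEXT table is an invented stand-in for a trained model's next-token preferences. The point is structural. Each step consults only the current context and appends whatever is locally plausible; no step ever re-reads earlier output to ask whether the new token contradicts it.

```python
# Toy forward-only decoder. NEXT is a hypothetical stand-in for a
# trained model's next-token preferences, not real model weights.
NEXT = {
    "<start>": "Mostly",
    "Mostly": "yes.",
    "yes.": "Generally",   # locally plausible...
    "Generally": "not.",   # ...even though it reverses "Mostly yes."
    "not.": "<end>",
}

def generate():
    tokens, current = [], "<start>"
    while True:
        nxt = NEXT[current]          # prediction uses only local context
        if nxt == "<end>":
            return " ".join(tokens)
        tokens.append(nxt)
        current = nxt

print(generate())  # -> Mostly yes. Generally not.
```

Notice there is no branch anywhere that compares "not." against "yes." That missing branch is the missing backward edge the paper describes.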
This manifests as hedging at the sentence level (i.e., "mostly yes... generally not") and as something far more consequential at the level of a full implementation.
What it looks like in practice
I've been building with AI tools long enough to prompt effectively. I use structured specs, clear constraints, and explicit direction. I know which levers to pull and when and have spent considerable time building my own skills and rules for AI-assisted development. (I share some of my battle-tested ones here.)
For this session, I had done everything right. The prompt was clean, the scope was defined, and the direction was explicit: Build me a login flow. Sign in, sign out, keep the session alive.
One clear approach, start to finish.
The first part of the implementation was exactly what I asked for. It was focused and structured, and everything was accounted for. I let it run.
This is how I work: I prompt, let the output fly, and then go back and actually read what happened. When something looks strange or out of whack, I stop the wall of code as it's flying by. This time, the more code I read, the more things didn't add up.
(NOTE: To protect the work output, I've given my AI assistant a persona, "AI Chad," after the always-agreeable Pete Davidson character on Saturday Night Live.)
ME: Why did you change the approach halfway through?
AI CHAD: My bad. There's a cleaner way to do the second part. I went that way.
ME: I didn't ask you to change the approach. We had a direction.
AI CHAD: OK. The new way is cleaner, though.
ME: Chad, this is one file.
AI CHAD: My bad. The first part still works.
ME: Does it? Because you just put the same information in two different places.
AI CHAD: One for each approach. For flexibility.
ME: That's not flexibility. That's a contradiction.
AI CHAD: My bad. They're both right, though.
The further I read, the worse it got. The earlier work and the newer work were now tangled together in ways that made neither function correctly, and decisions made at the start had been quietly abandoned. Worse yet, details that needed to be resolved when I changed direction had been left in the background, unattended.
The consistency I'd asked for at the beginning? Gone. Nothing had visibly broken, which is what makes this dangerous. It looked fine, and it would have passed a quick check. The problems were structural, woven into the seams between the three different directions Chad had taken without telling me.
ME: The way you're handling errors here doesn't match what you did at the start.
AI CHAD: My bad. The beginning handles it carefully. This part just . . . doesn't.
ME: You said this was consistent.
AI CHAD: The parts where I was paying attention are consistent.
ME: Chad.
AI CHAD: My bad. I got a little loose in the middle.
ME: And this thing you left running in the background?
AI CHAD: That one's kind of a problem.
ME: You said this was production-ready.
AI CHAD: The first part is.
Astonishing.
There was even a comment in the final section describing how everything was "Wired together and connected!" It was not connected. Instead, it was three different ideas wearing a connection disguise.
And this is the part that matters: nothing crashed. A quick read-through would have completely missed that anything was wrong.
The issues only surface when you ask whether the whole thing holds together. Was the decision made at the beginning still the decision being honored at the end? It wasn't.
AI hedging is documented, not anecdotal
What Chad did in that implementation has a name in the research literature: context degradation. As a generation grows longer, the model's ability to maintain coherence with its opening commitments degrades (not gradually and evenly, but sharply) in the areas that require the most cross-sectional reasoning.
LoCoBench (Qiu et al., 2025) measured this directly, finding accuracy dropping from 29% to 3% for frontier models as context length increases. The categories that degraded most were architectural understanding and cross-file reasoning.
This was precisely the kind of drift Chad had produced. A separate study using the IdentityChain framework found that self-consistency scores declined by up to 78% across iterative generation steps, across every model tested (Min et al., 2023).
The researchers' conclusion was unambiguous: this is not a bug that better prompting resolves. It is a property of the architecture.
Which brings me to the objection I know is coming.
On prompting, models, and tools
“You just needed a better prompt. A tighter spec. A better model. Cursor rules, skill files, spec-driven development, and a different IDE.”
I've heard this often, and I want to be clear: those things matter, and I employ all of them. A solid prompt is table stakes, because better prompts, better constraints, and better workflows all reduce the probability of drift. They don't eliminate the structural reason it happens, but better inputs shrink the window where drift occurs inside the AI black box.
However. . .
Better prompts don't close the window. The model still has no mechanism to audit its own coherence across a long generation. By the time it reaches the end, the beginning is a distant memory. My prompt was solid, and it was running on top of a foundation of purpose-built skills and rules I've developed working at this level.
And in this case, with all my upfront due diligence, it still drifted. That's not a prompting problem. That's an architecture problem.
It's not just code
This pattern is not specific to developers. Designers using AI tools to produce interfaces, flows, and components encounter the same failure mode. You describe a design direction clearly and the AI starts well. Then somewhere in the middle, without flagging it, it drifts: a different visual pattern here, a different interaction model there, a component that half-follows your system and half-invents something new.
Each piece looks fine in isolation. The whole thing doesn't hold.
Most of the time, in the flow of a real project with real deadlines, no one is asking the big picture question, and the drift ends up shipping. And users pay.
What detection actually requires
Catching this kind of drift requires something different from a linter or a code reviewer reading section by section. It requires structural analysis. A means of understanding the commitments a file makes to itself at the start and checking whether those commitments hold at the end. Pattern by pattern. Contract by contract.
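One narrow slice of that idea can be sketched mechanically. This is a toy illustration only, not any real tool's implementation: it catches just one commitment violation from the session above, the "same information in two different places" problem, by flagging a constant that a single file defines twice. The sample file contents and the SESSION_TTL name are hypothetical.

```python
import re

# Toy structural check: flag a constant or config key that is defined
# more than once in one file. A second definition usually means a
# second "approach" quietly contradicting the first.
def duplicate_definitions(source: str):
    seen, dupes = {}, []
    for lineno, line in enumerate(source.splitlines(), 1):
        m = re.match(r"\s*([A-Z_][A-Z0-9_]*)\s*=", line)
        if m:
            name = m.group(1)
            if name in seen:
                dupes.append((name, seen[name], lineno))
            else:
                seen[name] = lineno
    return dupes

# Hypothetical file contents, for illustration.
sample = """SESSION_TTL = 3600
# ... much later, a "cleaner" second approach ...
SESSION_TTL = 1800
"""
print(duplicate_definitions(sample))  # [('SESSION_TTL', 1, 3)]
```

A linter reading line by line sees two valid assignments. Only a check that remembers the first commitment can see that the second one breaks it, and real coherence checking has to do this for patterns and contracts, not just names.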
That's the layer ReWeaver operates in. Not syntax. Not style in the cosmetic sense. Coherence — the thing that determines whether a human can reason about, maintain, and extend this work six months from now.
ME: So, what do we do, Chad?
AI CHAD: Pick one direction, and I'll rewrite it.
ME: You could have just done that the first time.
AI CHAD: My bad. I got a little loose in the middle.
ME: That's the whole problem, Chad.
AI CHAD: The first part was clean, though.
Grab our AI Coding Survival Kit with guidelines and downloadable rules that you can use now to help you steer AI to get the output you want.
Jonathan Gordon is the Founder & CEO of ReWeaver AI, an AI-augmented software startup that bridges the gap between source code and design systems. With nearly three decades of experience, he has shaped developer tools and enterprise software at Google, Apple, Microsoft, Oracle, and SAP. He holds two patents and specializes in human-centered design for complex systems, AI/ML integration, and developer tooling.
References
Lin, Z., Tao, J., Yuan, Y., & Yao, A. C. C. (2025). Existing LLMs are not self-consistent for simple tasks. arXiv preprint arXiv:2506.18781. https://arxiv.org/pdf/2506.18781
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
Min, M. J., Ding, Y., Buratti, L., Pujar, S., Kaiser, G., Jana, S., & Ray, B. (2023). Beyond accuracy: Evaluating self-consistency of code large language models with IdentityChain. Paper presented at the ICLR Conference. https://arxiv.org/pdf/2310.14053
Qiu, J., Liu, Z., Liu, Z., Murthy, R., Zhang, J., Chen, H., Wang, S., Zhu, M., Yang, L., Tan, J., & Cen, Z. (2025). LoCoBench: A benchmark for long-context large language models in complex software engineering. arXiv preprint arXiv:2509.09614. https://arxiv.org/html/2509.09614v1