
AI vibe coding was the easy part. Production is where it breaks

  • Writer: Roger Ruttimann
  • 5 days ago
  • 5 min read

Part 1 of 2 · The AI Coding Reality Series


There's a moment every team experiences about six weeks after they start shipping AI-generated code to production: The demo went perfectly. The sprint velocity numbers were off the charts. What used to take a senior engineer three weeks took an AI-assisted developer three days. Leadership was impressed. The board was excited. Everyone was talking about "10x productivity."

 

Then something breaks. Not dramatically—not a crash with a stack trace pointing at the obvious culprit. It's quieter than that. A user reports something strange. Then another. The team digs in. What they find isn't a bug—it's an architecture. One that made sense for the demo but was never designed to survive real traffic, real users, or the second sprint of features layered on top of it.

 

Welcome to the production problem with vibe coding. And it's not what most people think it is.

 

[Image: a stack of Jenga blocks on a square stone base, tipping over.]

 

The productivity multiplier is real

Let's start with what's undeniably true: AI-assisted coding is a genuine productivity breakthrough. Tools like GitHub Copilot, Cursor, and Claude Code have compressed prototype-to-demo cycles from weeks to days. Small teams can now explore and validate product ideas at a pace that was previously impossible without significantly larger engineering headcount. The 2025 DORA Report documents measurable throughput gains across teams that have adopted these tools.

 

This isn't hype. The code comes out working. It often handles the stated requirement elegantly. For individual tasks—write a function, build an endpoint, scaffold a component—AI coding assistants are genuinely impressive. If all you needed was a demo, you'd be done.

 

But production is not a demo.

  

What AI coding tools are actually optimized for

Here's the thing nobody puts in the marketing materials:

AI code generators are optimized for exactly one thing—producing code that works for the immediate task.

They are not reasoning about whether your new authentication endpoint is consistent with the security model established three modules ago. They are not thinking about what happens when 10,000 users hit your database simultaneously instead of 10. They are not aware that the naming convention they just used in this file conflicts with the convention used everywhere else in the codebase.

 

They can't be. Each generation is essentially stateless. The AI sees the prompt, sees some context, and produces the most locally correct output it can. It has no memory of the architecture decisions made six months ago, no understanding of why the system was designed the way it was, and no stake in what happens after the code ships.

 

This is not a criticism—it's just the nature of the tool. A hammer isn't wrong for not being a screwdriver. The problem is when teams treat AI-generated code as if it's production-ready software, when it's actually closer to a very sophisticated first draft.

  

Seven gaps stand between AI output and production

When you work through what systematically goes wrong as organizations try to ship vibe code, the pattern resolves into seven distinct gaps—seven places where AI-generated code falls short of what production systems require:

 

Gap 1: Architecture Fragility. AI generates code that solves the immediate task. It doesn't design systems. Without deliberate architecture, components become tightly coupled, responsibilities blur, and hidden dependencies accumulate. The code works in demo. It fractures when the system grows.

 

Gap 2: Security Vulnerabilities. Research published in August 2025 found that 45% of AI-generated code contains security vulnerabilities. SQL injection via unsanitized inputs, hardcoded secrets, missing authentication checks—AI tools default to the simplest working solution, which frequently omits security controls entirely.
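To make the injection risk concrete, here is a minimal, self-contained sketch (using an in-memory SQLite table with hypothetical names) contrasting the "simplest working solution" an AI assistant often emits with the parameterized form that closes the hole:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name):
    # The "simplest working solution": user input is interpolated
    # directly into the SQL string. A payload like "' OR '1'='1"
    # rewrites the WHERE clause and returns every row.
    return conn.execute(
        f"SELECT name, role FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver binds the input as data,
    # so the injection payload matches nothing.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # leaks the whole table
print(find_user_safe(payload))    # matches nothing
```

Both functions pass a happy-path check with a normal username, which is exactly why this class of flaw survives a demo.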

 

Gap 3: Lack of Automated Tests. Vibe-coded projects almost universally lack meaningful automated test coverage. The AI generates code that passes a manual happy-path check. Edge cases, integration points, and regression scenarios remain untested. Every subsequent change becomes a risk.
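A small illustration of the happy-path trap, using a hypothetical discount helper: the one check a manual demo covers passes, while the cases that cause production incidents are only exercised if someone writes them down as tests.

```python
def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount; reject out-of-range inputs."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percent out of range: {percent}")
    if price < 0:
        raise ValueError(f"negative price: {price}")
    return round(price * (1 - percent / 100), 2)

# Happy path -- the one case a manual demo check covers.
assert apply_discount(100, 10) == 90.0

# Edge cases -- what only an automated suite reliably covers.
assert apply_discount(0, 50) == 0.0        # zero price
assert apply_discount(19.99, 100) == 0.0   # full discount
try:
    apply_discount(100, 150)               # invalid input
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```

Without the guard clauses and the edge-case assertions, a 150% "discount" silently produces a negative price—exactly the kind of quiet failure described above.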

 

Gap 4: Scalability and Performance. AI models optimize for correctness on a single request. They don't reason about database query plans, connection pool limits, or concurrent users. N+1 query patterns, missing indexes, and synchronous blocking calls look fine in development. They collapse at scale.
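The N+1 pattern is worth seeing side by side. This sketch (in-memory SQLite, hypothetical authors/posts schema) shows the per-row-query version that looks fine in development next to the single-join version that stays flat as the data grows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

def titles_n_plus_one():
    # 1 query for authors + 1 query *per author*.
    # Invisible with 2 rows; catastrophic with 10,000.
    result = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors"):
        result[name] = [t for (t,) in conn.execute(
            "SELECT title FROM posts WHERE author_id = ?", (author_id,))]
    return result

def titles_single_join():
    # One query regardless of how many authors exist.
    result = {}
    rows = conn.execute(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id")
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result

assert titles_n_plus_one() == titles_single_join()
```

Both return identical results, which is the point: correctness checks cannot distinguish them. Only load testing or query-count instrumentation will.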

 

Gap 5: Maintainability and Code Quality. Different AI sessions produce different naming conventions, mixed design patterns, large monolithic files, and duplicated logic. The code is syntactically correct. It is structurally disorganized. Future engineers cannot safely modify it.

 

Gap 6: Observability and Operations. AI-generated code does not include operational infrastructure. Without structured logging, distributed tracing, metrics, and alerting, the team is flying blind in production. Problems go unnoticed until customers report them.
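What "not flying blind" looks like at its most basic: a minimal structured-logging sketch (hypothetical event names and handler), where every event is one JSON line carrying a request id, so a log aggregator can filter and correlate events per request.

```python
import json
import logging
import sys
import time
import uuid

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

def log_event(event: str, **fields):
    # One JSON object per line: trivially parseable by any aggregator.
    logger.info(json.dumps({"event": event, "ts": time.time(), **fields}))

def handle_order(order_id: str):
    # A shared request_id ties all events from one request together.
    request_id = str(uuid.uuid4())
    log_event("order.received", request_id=request_id, order_id=order_id)
    started = time.perf_counter()
    # ... business logic would run here ...
    log_event("order.completed", request_id=request_id, order_id=order_id,
              duration_ms=round((time.perf_counter() - started) * 1000, 2))

handle_order("ord-42")
```

This is deliberately the floor, not the ceiling—tracing, metrics, and alerting build on the same principle: emit machine-readable events at the moments you will later need to reconstruct.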

 

Gap 7: System Coherence. This is the hardest one to explain—and the most dangerous. A component can clear every hardening gate: no security flags, full test coverage, clean performance, full instrumentation—and still be architecturally wrong relative to the system it was built to extend. It looks right. It is wrong.

 

These problems aren't all the same kind of problem

This is the insight that changes how you think about fixing them.

 

Four of the seven gaps—Security, Testing, Scalability, Observability—are process problems. The tools to solve them are mature and widely available. SAST scanners catch security vulnerabilities. CI pipelines enforce test coverage thresholds. Load testing reveals performance issues before launch. Observability platforms like Datadog give you visibility into what's happening in production. These gaps close with discipline and tooling.

 

The other three—Architecture Fragility, Maintainability, and System Coherence—are structural problems. They require deliberate architectural governance, not just better checklists. A linter cannot catch the fact that your new service is architecturally inconsistent with the rest of the system. A security scanner cannot detect that your data access layer violates the pattern established in every other module. These gaps require human judgment, living documentation, and design processes built specifically for an AI-assisted development workflow.

 

Conflating them—treating all seven as the same kind of problem—produces hardening programs that install all the right CI gates and still produce codebases that require a forced rewrite within 18 months.

 

The cost calculus

Here's what makes this urgent:

The cost of a production incident dwarfs the cost of structured hardening.

A security breach in an AI-generated app with inadequate authentication isn't a developer problem—it's a business crisis. A performance collapse under real user load doesn't just frustrate customers; it destroys the trust you spent months building. An architecture so fragile that it can't support the next feature without significant refactoring doesn't slow down development—it stops it.

 

The 2025 DORA Report found that teams without structured hardening practices experienced 41% higher code churn and a 7.2% decrease in delivery stability. Gartner projects that by 2028, 40% of new enterprise software will be created using vibe coding techniques. If that code isn't hardened, the failure rate in production is going to be significant.

 

The productivity multiplier of AI coding is real. The risk of shipping unhardened code is also real. The question isn't whether to invest in hardening—it's how fast to move before the production incidents start defining the cost.

  

What comes next

In Part 2, we go deeper into how these failures actually play out—and what a systematic hardening program looks like in practice. We'll look specifically at the gap the industry hasn't yet solved: System Coherence, where code passes every automated check and still breaks the architecture.

 

We'll also cover what the tooling landscape looks like today, where the meaningful white space is, and how to build a team that's structured for AI-assisted development rather than retrofitted from a pre-AI world.

 

The title of Part 2 tells you what you need to know about why this matters: Looks Right, Fails Fast.

Roger Ruttimann has over 25 years of hands-on experience building and scaling complex, data-driven systems at companies like Salesforce and Reactful. He writes about AI-assisted software development and the engineering practices required to make it production-safe. This series is based on independent research into AI coding tools, production failure patterns, and the emerging hardening toolchain.
