Dear Reader,
Two statistics should keep you up at night: 87% of ML models never reach production, and 85% of the ones that do fail silently. Infrastructure healthy, predictions broken, nobody noticing for months.
When failures happen, dashboards stay green whilst production databases get wiped, models serve wrong predictions for weeks, and compliance violations accumulate silently until an audit surfaces them.
Fourth module in the Production OS series. #31 built the business case. #32 put the plumbing in place. #33 designed human-AI handoffs. This issue covers the production architecture that either works at 3am or does not.
Algorithms work. Everything else fails.
The 18-month IT bottleneck
Your data science team built a model on a laptop. IT says it will take 18 months to put into production. Everyone assumes bureaucracy. The actual blockers are more specific.
Infrastructure gap. Production requires 1,000 predictions per second with 99.9% uptime – a 100-1,000x jump from the demo environment. AI racks draw 30-150 kW versus 5-15 kW for traditional compute. Most data centres were not built for this.
Legacy integration. Core systems are mainframes from the 1990s with no APIs. Building the integration layer alone costs €200K-€2M and takes 6-18 months – before anyone touches the model.
Compliance requirements. Under SR 11-7, each model needs 50-100 pages of documentation, independent validation, and ongoing monitoring. Creating this manually: 3-6 months per model.
Some of the delay is imagined – “we need perfect data first” never happens. But most is real. And during those 18 months whilst IT builds the system to run a working model, preventable losses accumulate.
Model versioning hell
Imagine deploying version 2.3 of your recommendation engine. Three weeks later, you discover customers are seeing predictions from version 1.0 – six months old. The deployment pipeline reports green. What happened?
The model, the feature pipeline, and the data schema were versioned independently. A mismatch cascaded through the stack – and takes weeks to diagnose, because every individual component reports healthy.
In traditional software, you revert a commit and redeploy. In ML, you revert the code, the model weights, the feature engineering, the preprocessing pipeline, the data schema, and the model registry state. Reverting the model without its dependencies breaks production differently.
Shadow models emerge: the wrong version serving predictions whilst every component reports healthy.
Uber’s Michelangelo deploys every model atomically – model, features, and configuration together – with rollback in under one minute. Most organisations are still on spreadsheets.
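Michelangelo's internals are not public, but the atomic-bundle idea fits in a few lines. A minimal sketch, assuming a toy in-memory registry (`ModelBundle` and `Registry` are illustrative names, not Michelangelo's API): serving code only ever reads the live bundle, so deploy and rollback always swap all four components at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelBundle:
    """Immutable unit of deployment: everything versioned together."""
    model_version: str
    feature_pipeline_version: str
    schema_version: str
    config_version: str


class Registry:
    """Minimal registry: deploys and rolls back whole bundles, never parts."""

    def __init__(self) -> None:
        self._history: list[ModelBundle] = []

    @property
    def live(self) -> ModelBundle:
        return self._history[-1]

    def deploy(self, bundle: ModelBundle) -> None:
        # Appending the bundle is the atomic step: serving code reads
        # `live`, so there is no window where the model and its features
        # come from different versions.
        self._history.append(bundle)

    def rollback(self) -> ModelBundle:
        # Rollback restores the *entire* previous bundle: model,
        # features, schema, and config together.
        if len(self._history) < 2:
            raise RuntimeError("nothing to roll back to")
        self._history.pop()
        return self.live
```

The point is the data model, not the storage: as long as the only deployable unit is a complete bundle, reverting the model without its dependencies becomes impossible by construction.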
Monitoring blindness
A credit scoring model runs for eight months. Dashboards green. Latency within SLA. Error rates zero. Then a compliance audit finds approval rates have drifted 15% from training distributions. The model has been silently discriminating for months. Under GDPR and financial services regulations, fines can reach 4% of annual global turnover.
Traditional monitoring tracks containers, not predictions. A model can respond in 100ms, return valid JSON, and produce completely wrong outputs. No alert fires.
Two types of drift cause this. Data drift: input distributions change as demographics shift and markets evolve. Concept drift: the relationship between inputs and outputs changes. Most teams monitor for the first. Most miss the second entirely.
Without automated monitoring, median detection time is 3-6 months. With multi-dimensional monitoring, detection drops to hours. The technology exists. Most organisations have not deployed it.
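Multi-dimensional monitoring can start with something as simple as a distribution-distance check on each input feature. A minimal sketch using the Population Stability Index – the 0.1/0.25 thresholds are industry rules of thumb, not a standard:

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between a training-time reference sample
    and live inputs. Rule of thumb: < 0.1 stable, 0.1-0.25 drifting,
    > 0.25 investigate now."""
    lo, hi = min(reference), max(reference)
    # Bin edges derived from the reference distribution.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = sum(x > e for e in edges)  # which bin x falls into
            counts[idx] += 1
        n = len(sample)
        # Floor empty bins so the log term stays finite.
        return [max(c / n, 1e-6) for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Run this per feature on a schedule and alert on the threshold, and data-drift detection drops from months to hours. Concept drift still needs labelled outcomes – approval rates, chargebacks, readmissions – compared against training-time baselines.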
Rollback panic
Version 3.0 of a patient risk model deploys Friday afternoon. Predictions degrade 12% compared to baseline accuracy. The team attempts a rollback. The rollback script fails – the data pipeline has already processed millions of records using the new schema. The previous model cannot parse the new format.
This is the “one-way door” deployment – not because rollback is technically impossible, but because nobody tested the rollback procedure. Model, feature pipeline, data schema, and configuration form a dependency chain. Reverting one link without the others creates a different failure.
Eighteen hours of downtime. Twelve engineers pulled into weekend work. These failures are not rare.
Blue-green and canary deployments solve this for traditional software. They work for ML too – but require treating deployment as a first-class engineering concern, not a bash script on Friday afternoon.
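As a sketch of what a canary looks like for a model – illustrative names and thresholds, not a production controller: route a deterministic slice of traffic to the candidate, compare observed accuracy against the stable model, and promote only if the candidate holds up.

```python
class CanaryController:
    """Compares a candidate model against the stable one on a slice of
    traffic and decides promote / rollback from observed accuracy."""

    def __init__(self, fraction: float = 0.05, tolerance: float = 0.02):
        self.fraction = fraction      # share of traffic on the canary
        self.tolerance = tolerance    # allowed accuracy gap
        self.stats = {"stable": [0, 0], "canary": [0, 0]}  # [correct, total]

    def assign(self, request_id: int) -> str:
        # Deterministic split: the same request id always hits the same
        # arm, so comparisons are repeatable.
        return "canary" if (request_id % 100) < self.fraction * 100 else "stable"

    def record(self, arm: str, correct: bool) -> None:
        self.stats[arm][0] += int(correct)
        self.stats[arm][1] += 1

    def decision(self) -> str:
        acc = {a: c / t if t else 0.0 for a, (c, t) in self.stats.items()}
        # Promote only if the canary is no worse than stable minus tolerance.
        return "promote" if acc["canary"] >= acc["stable"] - self.tolerance else "rollback"
```

The design choice that matters: the rollback path is exercised on every deployment that fails its canary, so it cannot silently rot the way a Friday-afternoon bash script does.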
The root cause
These four patterns share a common root: treating ML systems like traditional software. ML systems are code plus data plus models plus features plus pipelines plus monitoring plus compliance. A change to training data propagates through everything downstream.
Google’s 2015 paper “Hidden Technical Debt in Machine Learning Systems” identified this. ML architecture debt accumulates at ~7% per year. Remediation costs increase 600% over two years.
Seven categories of debt recur:
- Data debt – 30-40% of production failures. Undocumented dependencies, training-serving skew.
- Model debt – Versioning chaos. Shadow models.
- Configuration debt – Environment drift. “Works on my machine.”
- System-level debt – Monitoring blindness. The 85% silent failure rate.
- LLM-specific debt – Prompt versioning chaos, embedding staleness, hallucination accumulation.
- Compliance debt – Missing audit trails, no fairness monitoring. Fines up to 4% of global turnover.
- Organisational debt – Siloed teams, project thinking instead of platform thinking.
Every successful organisation studied – Uber, Netflix, Stripe, DoorDash, Airbnb – solved this the same way: unified MLOps platforms. None did it in three months. Realistic timeline: 18-24 months.
What production-ready looks like
Four reference architecture patterns, each suited to different constraints:
- Lightweight cloud-native – Under 10 models, €1K-€5K/month, 3-6 months to production. Managed services, serverless inference.
- Real-time low-latency – Finance, fraud, ad tech. Sub-100ms P95, feature store mandatory, €20K-€100K/month.
- Enterprise managed – Regulated industries. Compliance by design: approval workflows, model cards, audit logging, fairness monitoring.
- LLM and RAG – Vector databases, embedding pipelines, prompt management, hallucination detection. Token costs variable.
Critical insight: no successful organisation uses any vendor platform end-to-end. All build custom orchestration.
Ten questions to determine production readiness:
- Can you deploy a new version and roll back within 5 minutes?
- Can you reproduce any model version from any date?
- Will you know within 24 hours if predictions degrade?
- Are features computed identically at training and serving time?
- Do you have automated data validation?
- Do you have canary or shadow testing in production?
- Can you produce an audit trail for any prediction in 5 minutes?
- Do you know the cost per model per month?
- Does one team own the model end-to-end?
- Can a new engineer understand the system within one week?
Score: 0-3 yes answers means an 87% probability of production failure. 10 out of 10 means production-ready.
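For readers who want to self-assess, the checklist maps directly onto a blunt scoring function. Only the 0-3 and 10/10 bands come from the text above; the middle label is my own interpolation.

```python
def readiness_band(answers: list[bool]) -> str:
    """Map yes/no answers to the ten questions onto a readiness band.
    The 0-3 and 10/10 bands follow the checklist; the middle band is
    an assumed interpolation, not a calibrated threshold."""
    if len(answers) != 10:
        raise ValueError("expected exactly ten answers")
    score = sum(answers)
    if score <= 3:
        return "high risk: ~87% probability of production failure"
    if score == 10:
        return "production-ready"
    return "partial: close the gaps before scaling"
```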
The Briefing
Enterprise AI ROI: $30-40B invested, 90-95% see negligible returns
Consulting Magazine analysis: despite $30-40 billion invested globally through 2025, most organisations see negligible return. The root cause is not technology failure but a mismatch between AI capabilities and enterprise operating models – copilots layered onto existing workflows without redefining decision authority. Only 20% of finance leaders report satisfaction with technology investment returns.
MCP vulnerabilities expose integration layer as critical attack surface
January’s AI security incidents reveal threats increasingly targeting integration points, not models. Seven high-severity vulnerabilities (AISSI 7.0+) across MCP implementations: ServiceNow’s BodySnatcher flaw, Microsoft Copilot reprompt attacks, Anthropic MCP Git Server flaws. The architecture layer where agents integrate with enterprise systems requires the same security rigour as API gateways.
Deloitte: 60% have AI access, only 25% reach production
Deloitte’s survey of 3,200+ leaders: six in ten workers have approved AI tool access, but only one quarter of organisations move experiments to production. Three quarters plan agentic AI deployment within two years, yet only one fifth have governance models for autonomous agents. The “18-month IT bottleneck” quantified.
EU AI Act enters enforcement: Grok investigation signals end of grace period
The European Commission launched its first AI Act investigation on 26 January 2026, targeting X’s Grok AI. The AI Office now wields powers to demand internal data, conduct onsite inspections, and suspend features within the EU. End of the “wait and see” period – regulators are willing to deploy maximum penalties (up to 7% of global turnover) to establish precedents.
A question for this week
Production architecture is unglamorous work. It does not get keynotes or funding rounds. It is infrastructure decisions, deployment pipelines, monitoring dashboards, and rollback procedures. But it is where AI either works or does not.
For your most critical ML system in production: how long would it take you to detect a 15% accuracy degradation? If you do not know, or if the answer is “whenever someone complains,” you do not have monitoring. You have hope.
Most MLOps content tells you how to deploy a model. This issue tells you why your IT team says it will take 18 months – and what they are actually worried about.
Because successful production ML is not just about algorithms and infrastructure. It is about systems that run safely at 3am and pass compliance audits.
Stay balanced,
Krzysztof
