
Issue #26 — The Data Reality


“We’ve spent €2M on a data warehouse. We have a data governance team. We have a data dictionary. And yet, our AI model hallucinates constantly, even when fed our own data.”

If this sounds familiar, you are not alone. The data team did nothing wrong. The warehouse works exactly as designed. The problem is that it was designed for a different job.

Data preparation for AI is as critical as for BI—but the requirements are fundamentally different.

Your warehouse was built to answer known questions with clean, aggregated, well-structured data. Your AI models need to discover unknown patterns in messy, granular, high-fidelity data. The cleanliness standards that served your dashboards for a decade are actively sabotaging your AI ambitions.

The Two Ways You’re Losing Signal

There are two mechanisms by which your data governance practices destroy the signal your AI models require. Both are well-intentioned. Both are fatal.

Signal Loss Through Filtering (The ETL Trap)

Your data warehouse was built on an ETL (Extract-Transform-Load) philosophy. Before data ever landed in a table, engineers filtered out the noise—timestamps deemed irrelevant, user agent strings, device fingerprints, raw transaction logs.

That “noise” is precisely what your fraud detection model needs. That “irrelevant” timestamp pattern is what your demand forecasting model could learn from.

When you filter at ingestion, you make an irreversible decision about what information matters. ETL assumes you know the questions in advance. AI does not.

Signal Loss Through Aggregation (The Granularity Problem)

Your customer data is aggregated to daily totals. Clean, efficient, perfect for monthly reports.

But your churn prediction model needs event-level granularity—the sequence of clicks, the time between actions, the raw behavioural signal. Daily totals are “clean” but lossy. You’ve destroyed the very features that would make your model useful.

The data warehouse community has known for years that the solution is ELT (Extract-Load-Transform): load everything raw, transform inside the platform. But many enterprises still run legacy ETL pipelines, and the “transformation” step discards the exact data AI needs.
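As a minimal sketch of the difference (pandas, with illustrative column names and values), the ETL path drops the "noise" columns before load, while the ELT path lands the full payload so later feature work can still reach it:

```python
import pandas as pd

# Hypothetical raw events feed; column names and values are illustrative only.
raw_events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 450.0],
    "event_ts": pd.to_datetime(["2024-05-01 09:00:01",
                                "2024-05-01 09:00:03",
                                "2024-05-01 14:30:00"]),
    "user_agent": ["Mozilla/5.0 ...", "curl/8.1", "Mozilla/5.0 ..."],
    "device_fingerprint": ["fp_a", "fp_b", "fp_c"],
})

# ETL: filter at ingestion -- the "noise" columns never reach the warehouse.
etl_table = raw_events[["customer_id", "amount"]]   # irreversible decision

# ELT: land everything raw, decide what matters later.
bronze_table = raw_events.copy()                    # full fidelity

# Months later, a fraud model wants burst detection from timestamps and
# device churn from fingerprints. Only the ELT path still has that signal.
print("ETL columns:", list(etl_table.columns))
print("ELT columns:", list(bronze_table.columns))
```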

The Economics of Data Readiness

You have 15 AI pilots. The temptation is to launch a company-wide “Data Transformation Programme” to fix data quality for all of them. This is a multi-year, multi-million-euro initiative that will deliver nothing in the timeframe that matters.

The alternative is to think in terms of production economics.

For each use case, calculate:

  • Production cost: What will it take to deploy this to production, including data pipeline preparation?

  • Production benefit: What is the measurable business value once deployed?

Rank by ROI. The highest-ROI use case gets your attention—not the pilot with the most executive sponsorship or the flashiest demo.
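A back-of-the-envelope sketch of that ranking, with invented pilots and placeholder figures:

```python
# Hypothetical pilots with rough estimates; the numbers are placeholders.
pilots = [
    {"name": "Churn prediction",   "production_cost": 400_000, "annual_benefit": 1_200_000},
    {"name": "Invoice matching",   "production_cost": 150_000, "annual_benefit": 300_000},
    {"name": "Demand forecasting", "production_cost": 600_000, "annual_benefit": 900_000},
]

for p in pilots:
    p["roi"] = p["annual_benefit"] / p["production_cost"]

# Highest ROI first -- this, not executive sponsorship, sets the order of work.
for p in sorted(pilots, key=lambda p: p["roi"], reverse=True):
    print(f'{p["name"]:<20} ROI = {p["roi"]:.1f}x')
```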

Then focus on fixing only the pipeline feeding that use case:

  1. Identify the Gold Table: What is the single, critical table your model trains on?

  2. Trace Lineage Backwards: Where does that data come from? Walk it back Gold → Silver → Bronze.

  3. Fix at the Source: Address quality issues at the earliest possible layer.

  4. Automate Tests: Use dbt-style data tests to guard against regressions.

This is Pipeline Repair, not Data Transformation. Enterprise-wide data quality improvement happens organically, one production deployment at a time, each justified by concrete business value.
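Here is a sketch of step 2, assuming a hand-maintained lineage map with hypothetical table names; in practice the map would come from your catalogue or dbt’s lineage graph:

```python
# Hypothetical lineage map: each table points to its direct upstream sources.
lineage = {
    "gold.churn_features":    ["silver.customer_events"],
    "silver.customer_events": ["bronze.web_clickstream", "bronze.crm_export"],
    "bronze.web_clickstream": [],
    "bronze.crm_export":      [],
}

def upstream_chain(table: str) -> list[str]:
    """Walk lineage backwards from a Gold table towards its Bronze sources."""
    chain = []
    frontier = [table]
    while frontier:
        current = frontier.pop()
        for parent in lineage.get(current, []):
            if parent not in chain:
                chain.append(parent)
                frontier.append(parent)
    return chain

# The earliest layer in this chain is where the quality fix belongs.
print(upstream_chain("gold.churn_features"))
# ['silver.customer_events', 'bronze.web_clickstream', 'bronze.crm_export']
```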

The Medallion Architecture as a Governance Checkpoint

If you’re building a modern data platform, you’ve likely encountered the “Medallion Architecture”—Bronze, Silver, Gold layers. Here is how to use it as a governance checkpoint for AI:

| Layer  | Purpose                             | Governance Action                        |
|--------|-------------------------------------|------------------------------------------|
| Bronze | Raw ELT landing zone, full fidelity | PII detection, access control (RBAC)     |
| Silver | Integration layer (Data Vault)      | Audit trail, lineage, schema enforcement |
| Gold   | ML Features / BI-ready              | Quality tests, bias checks, versioning   |

The Gold layer for your AI use case is not the same as your Gold layer for BI. ML models often need denormalised, feature-engineered tables that would be inefficient for dashboards. Plan for separate “Gold” outputs.
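A small sketch of what separate Gold outputs can look like, starting from a hypothetical Silver event table (table and feature names are illustrative):

```python
import pandas as pd

# Hypothetical Silver-layer event table (one row per customer action).
silver_events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_ts": pd.to_datetime(["2024-05-01 09:00", "2024-05-01 09:05",
                                "2024-05-02 10:00", "2024-05-01 11:00",
                                "2024-05-03 16:00"]),
    "amount": [10.0, 0.0, 25.0, 5.0, 40.0],
})

# BI Gold: aggregated daily totals -- ideal for dashboards, lossy for ML.
gold_bi = (silver_events
           .groupby([silver_events["event_ts"].dt.date, "customer_id"])["amount"]
           .sum()
           .reset_index(name="daily_total"))

# ML Gold: denormalised per-customer features that need event-level granularity.
gold_ml = (silver_events
           .sort_values("event_ts")
           .groupby("customer_id")
           .agg(n_events=("event_ts", "count"),
                total_amount=("amount", "sum"),
                mean_gap_hours=("event_ts",
                                lambda ts: ts.diff().mean().total_seconds() / 3600))
           .reset_index())

print(gold_bi)
print(gold_ml)
```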

Governance-as-Code for Data Pipelines

Here is the litmus test I apply to every data governance programme:

If your data quality rule cannot automatically fail a pipeline build, it’s not a control—it’s a suggestion.

Practical implementation:

  • dbt Tests: Define data quality assertions (uniqueness, not-null, accepted values) that run on every pipeline execution.

  • Great Expectations: Open-source library for data validation with automated documentation.

  • CI/CD Integration: Block deployment if data tests fail.

If a developer pushes code that violates your data governance policy, the build should fail before bad data reaches the model. This is Governance-as-Code applied to data—and it is the only governance that actually works.
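A minimal sketch of the principle in plain Python, independent of any particular tool (dbt and Great Expectations express the same assertions declaratively); the column names and file path are assumptions:

```python
import sys
import pandas as pd

def run_data_tests(df: pd.DataFrame) -> list[str]:
    """dbt-style assertions: uniqueness, not-null, accepted values."""
    failures = []
    if df["customer_id"].duplicated().any():
        failures.append("customer_id is not unique")
    if df["amount"].isna().any():
        failures.append("amount contains nulls")
    if not df["status"].isin({"active", "churned", "paused"}).all():
        failures.append("status has values outside the accepted set")
    return failures

if __name__ == "__main__":
    # Hypothetical Gold table; in CI this would read the freshly built artifact.
    gold = pd.read_parquet("gold/churn_features.parquet")
    failures = run_data_tests(gold)
    if failures:
        print("Data tests failed:", *failures, sep="\n  - ")
        sys.exit(1)   # non-zero exit: the build fails, bad data never ships
```

Wired into CI, a non-zero exit blocks the merge exactly as a failing unit test would.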

The Briefing

IBM: Data Quality Is the #1 Barrier to AI Governance

A new IBM Institute for Business Value report, Go Further, Faster with AI, surveyed 1,000 senior leaders and found that 76% cite poor data quality and management as the top barrier to effective AI governance—ahead of skills gaps, regulatory fragmentation, and policy inconsistency.

The counterintuitive finding: governance accelerates AI. Executives attribute 27% of their AI efficiency gains to strong governance, and companies investing more in AI ethics report 34% higher operating profit from AI. Yet 58% of organizations still lack a well-defined data and governance framework.

One in four unsuccessful AI projects fails due to weak governance—not technology limitations. The report argues that static governance models will break under agentic AI’s speed. What’s needed is adaptive, continuous oversight embedded in the workflow, not the calendar.

Operational takeaway: The data shows that governance isn’t a brake—it’s a competitive accelerator. If your AI governance conversation is still framed as “compliance overhead,” you’re leaving money on the table. The question isn’t whether to invest in data governance for AI, but how fast you can make it adaptive.

AI Coding Agents: Still Not Production-Ready

A VentureBeat analysis by engineers from Microsoft and LinkedIn catalogues why AI coding agents still require constant “babysitting” in enterprise environments. The culprits: hallucinations that repeat within single threads, lack of enterprise-specific context, outdated SDK defaults, and no awareness of hardware or environment constraints.

The pattern is familiar. Just as your data warehouse was built for a different job, these agents were trained on public code—not your internal monorepos, security policies, or architectural decisions. The authors note that “time spent debugging AI-generated code can eclipse the time savings anticipated.”

This echoes a broader trend. A Gartner forecast warns that by 2027, 60% of organizations will fail to realize anticipated AI value due to incohesive data governance. Meanwhile, KPMG research found 62% of organizations cite lack of data governance as the main barrier inhibiting AI initiatives.

Operational takeaway: Whether it’s your ML model or your coding copilot, the failure mode is the same: AI without context is AI without control. The “babysitting” problem won’t disappear with better models—it requires better governance, lineage, and human-in-the-loop design from day one.

This Week’s Question

Before your next AI steering committee, ask your data team this:

“What data was filtered out during ETL ingestion into our warehouse—and can we recover it for this AI use case?”

If your team cannot answer this question, you have found your blocker. The model is not the problem. The lost signal is.

Until next time, build with foresight.

Stay balanced,
Krzysztof