Last week, we discussed the new front line of AI risk, focusing on adversarial attacks. This week, we move from the external threat to the internal control system. Robust defence requires engineering discipline. For years, boards have been given assurances about data governance (as with the introduction of GDPR) through committees, checklists, and manual reviews. These are analogue controls in a digital world. They are incompatible with the speed and scale of modern software development. They create a gap between policy and practice where significant risk resides. Effective governance is not a bureaucratic function performed after the fact. It is a set of automated, auditable controls embedded into the process of creation, a concept known as “Governance-as-Code”. It is a practical toolkit for building safer, more compliant AI systems, and the only approach that provides a defensible, evidence-based answer to the question: “How do you know your controls are working?”
The Briefing#
Workslop#
Beyond the failure to capture ROI, the widespread adoption of generative AI has introduced a new operational risk: low-quality, AI-generated content that creates a net-negative impact on productivity, eroding trust and creating hidden costs across the organization. Researchers from BetterUp Labs, in collaboration with the Stanford Social Media Lab, have coined the term “workslop” for this content, defining it as “AI-generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task”. It often appears polished and superficially complete but is ultimately unhelpful, incomplete, or missing the critical context necessary for a colleague to act upon it.

The core problem with workslop is that it transfers the cognitive burden downstream. The receiver is forced to interpret, correct, or completely redo the work, negating any productivity gains from the initial AI generation and effectively creating more work than was saved. An ongoing survey of 1,150 U.S. employees found that 40% have received workslop in the last month. The cost is tangible: employees report spending an average of two hours fixing or redoing each instance of workslop they receive. For a large organization, this amounts to thousands of lost workdays annually, a significant hidden operational cost. The issue flows in all directions within the corporate hierarchy: 40% occurs between peers, 18% is sent from direct reports to managers, and 16% flows down from managers to their teams.
California Sets the National Agenda with Landmark AI Safety Law (SB 53)#
California has enacted a comprehensive AI safety law that fills a federal regulatory vacuum, establishing a de facto national standard for the governance of advanced AI models and creating immediate compliance obligations for the industry’s largest players. The law’s requirements are triggered by specific technical and financial thresholds, targeting the developers of the most powerful AI systems.
“Frontier Model”: An AI model trained using more than 10^26 floating-point operations (FLOP). This technical threshold is designed to capture the current and next generation of the most powerful models.
“Large Frontier Developer”: A developer of a frontier model with annual revenues of at least $500 million.
“Catastrophic Risk”: A foreseeable risk that could cause at least $1 billion in damage or result in more than 50 injuries or deaths. This includes scenarios like the AI-assisted creation of bioweapons or the hacking of critical infrastructure.
The law imposes several new, legally binding requirements on covered developers, centered on transparency, safety reporting, and accountability. Previously, AI safety practices were largely voluntary corporate commitments. SB 53 codifies these commitments into law, transforming AI safety from an ethical stance into a legally mandated, auditable compliance function. The law’s robust whistleblower protections create a significant new vector for internal and reputational risk. By empowering internal experts to report safety concerns directly to regulators without fear of reprisal, the law creates a high-stakes environment where internal disagreements could quickly escalate into public regulatory action.
Are We Building “Ghosts” or “Animals”?#
An influential new thesis from AI researcher Andrej Karpathy provides a framework for understanding the limitations of current technology and for making more sophisticated, multi-year R&D and investment decisions. In his October 1, 2025 blog post, Karpathy introduced a powerful analogy to frame the current state and future of AI development: “Animals vs. Ghosts”. This framework provides a strategic lens for understanding the fundamental nature of today’s Large Language Models (LLMs). Karpathy argues that current frontier LLM research is not about creating true, adaptive intelligence but is about “summoning ghosts”. “Ghosts” are defined as a “statistical distillation of humanity’s documents”—complex echoes of the vast corpus of text and data on which they were trained. Their intelligence is derived from this static, pre-existing data. They are fundamentally digital artifacts that do not interact with or learn from the physical world in real-time. This paradigm implies a potential ceiling on the capabilities of the current LLM architecture. Because “ghosts” rely on a finite pool of human-generated data for pretraining, they will eventually exhaust high-quality training data, leading to diminishing returns. In contrast, “Animals” represent a different paradigm of intelligence, one that learns dynamically and continuously through direct interaction with its environment via reinforcement learning. This concept aligns with the original vision of a “child machine” that learns from experience, driven by intrinsic motivations like curiosity, rather than being pre-loaded with static knowledge.
This framework mandates a portfolio approach to AI R&D and talent strategy. The divergence theory suggests that two distinct and valuable types of AI may co-exist. “Ghosts” are excellent for tasks involving the synthesis of existing human knowledge. “Animals” would be superior for tasks requiring novel problem-solving in dynamic environments. A C-suite leader should therefore structure the AI portfolio accordingly: invest in applied “ghost” technology for immediate productivity gains, while simultaneously making targeted, long-term investments in more fundamental research aligned with the “animal” paradigm to secure future competitive advantage. The question I’m asking myself is: would the “animals” end up with a different world model than the “ghosts”, and by how many more orders of magnitude would the training complexity increase? By training “ghosts” we provide them with some very useful shortcuts drawn from hundreds of thousands of years of our evolution as a species, but we also feed them our biases. Could this new “zero-basis” evolution model create new, different biases?
The MLOps Pipeline: An Auditable Factory for AI#
To govern AI, one must first understand how it is built. Modern AI models are manufactured on an automated assembly line known as a Machine Learning Operations (MLOps) pipeline. Understanding this pipeline is the foundation of all effective governance. The most useful analogy is a modern car factory. Basic components enter, pass through a sequence of automated assembly and quality control stations, and a finished vehicle emerges. The MLOps pipeline applies the same industrial rigour to machine learning. It is a series of automated stages:
Data Ingestion: The raw materials—data—are sourced and validated.
Feature Engineering: The data is processed and refined into a format the model can use.
Model Training & Evaluation: The model is built and its performance tested against defined benchmarks.
Model Packaging & Documentation: The finished model is prepared for deployment and its official documentation is created.
Deployment: The approved model is released into the production environment.
Production Monitoring: The model’s real-world performance is tracked continuously.
This pipeline is the single, mandatory path to production. If a model does not pass every automated check at every stage, the assembly line stops. Instead of trying to audit a chaotic mix of human processes, you audit a single, automated system. The pipeline itself becomes the primary evidence of due diligence.
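To make the analogy concrete, here is a minimal, illustrative sketch of such a gated pipeline in Python. Every name in it is invented for this example, and the stage bodies are placeholders; real pipelines are orchestrated by dedicated MLOps platforms, but the control logic is the same: each stage must pass before the next one runs.

```python
# Illustrative sketch of a gated MLOps pipeline. Each stage returns True (pass)
# or False (fail); the assembly line stops at the first failed gate.
from typing import Callable

def ingest_data() -> bool:
    return True  # real check: validate sources, schemas, freshness

def engineer_features() -> bool:
    return True  # real check: transform and validate model inputs

def train_and_evaluate() -> bool:
    return True  # real check: train, then test against agreed benchmarks

def package_and_document() -> bool:
    return True  # real check: bundle artefacts, require a completed model card

def deploy() -> bool:
    return True  # real check: release the approved model to production

def monitor() -> bool:
    return True  # real check: track live performance and drift

PIPELINE: list[tuple[str, Callable[[], bool]]] = [
    ("data_ingestion", ingest_data),
    ("feature_engineering", engineer_features),
    ("training_and_evaluation", train_and_evaluate),
    ("packaging_and_documentation", package_and_document),
    ("deployment", deploy),
    ("production_monitoring", monitor),
]

def run_pipeline() -> None:
    """Run every stage in order; halt the assembly line at the first failure."""
    for name, stage in PIPELINE:
        if not stage():
            raise SystemExit(f"Pipeline halted: stage '{name}' failed its checks.")
        print(f"Stage '{name}' passed.")

if __name__ == "__main__":
    run_pipeline()
```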
From Policy Documents to Executable Code#
Governance-as-Code translates a company’s rules from paper documents into automated tests that stand guard inside the MLOps pipeline. The principles are simple and direct.
Codified: Policies are written in a structured, machine-readable format, like a configuration file. This removes the ambiguity of natural language.
Version-Controlled: These policy files are stored in a source control system like GitHub. Every change is tracked, reviewed, and auditable, creating a complete history of the governance framework itself.
Automated: The checks are run automatically by a platform like GitLab CI/CD or GitHub Actions whenever a developer attempts to make a change. Enforcement is immediate and consistent.
Auditable: Because the policies are code and the pipeline logs every action, a complete, immutable audit trail is generated automatically. This provides verifiable proof that controls were enforced.
This approach shifts governance from a reactive, after-the-fact process to a proactive, preventative one. It is integrated directly into the development workflow, not bolted on at the end.
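To show what “codified” can look like in practice, here is a minimal, hypothetical sketch: the policy is a small machine-readable structure kept in version control, and a single function evaluates the evidence gathered by the pipeline against it. All field names and thresholds below are invented for illustration; a real policy would typically live in its own YAML or JSON file rather than inline.

```python
# Hypothetical governance policy expressed as data rather than prose.
GOVERNANCE_POLICY = {
    "require_signed_datasets": True,          # data provenance gate
    "fairness_minimum_ratio": 0.80,           # the "four-fifths rule" as a hard limit
    "block_on_critical_vulnerabilities": True,
    "require_completed_model_card": True,
}

def evaluate_policy(evidence: dict, policy: dict) -> bool:
    """Compare evidence gathered by earlier pipeline stages against the codified policy."""
    checks = [
        (not policy["require_signed_datasets"]) or evidence["dataset_signature_valid"],
        evidence["fairness_ratio"] >= policy["fairness_minimum_ratio"],
        (not policy["block_on_critical_vulnerabilities"]) or evidence["critical_vulnerabilities"] == 0,
        (not policy["require_completed_model_card"]) or evidence["model_card_complete"],
    ]
    return all(checks)

# Example evidence a pipeline run might produce (illustrative values only).
evidence = {
    "dataset_signature_valid": True,
    "fairness_ratio": 0.84,
    "critical_vulnerabilities": 0,
    "model_card_complete": True,
}

if not evaluate_policy(evidence, GOVERNANCE_POLICY):
    raise SystemExit(1)  # a non-zero exit code fails the CI job and blocks the release
```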
The Engineering Toolkit#
A robust Governance-as-Code strategy uses a toolkit of automated checks. These act as quality and compliance gates. If a policy is violated, the pipeline fails. Here are four practical examples.
Data Provenance & Integrity#
A model is only as trustworthy as the data it was trained on. This check ensures the “raw materials” for your model come from an approved source and have not been altered. The pipeline can cryptographically verify the origin and integrity of datasets using digitally signed metadata. If the data’s signature is invalid or it comes from an untrusted source, the pipeline fails before training begins. This creates a verifiable audit trail for compliance and defends against the data sabotage attacks we discussed in Issue #15.
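As a sketch of what such a gate could look like, the example below assumes the dataset ships with an approved metadata file containing its expected SHA-256 digest and its source. The file names and metadata layout are assumptions made for illustration; a production implementation would verify a full cryptographic signature (for example with GPG or Sigstore) rather than only a digest.

```python
# Illustrative integrity gate: verify that the dataset's hash matches the digest
# recorded in approved metadata, and that its source is on the allow-list.
import hashlib
import json
import sys
from pathlib import Path

APPROVED_SOURCES = {"s3://approved-data-lake/", "https://data.internal.example/"}

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_dataset(data_path: Path, metadata_path: Path) -> bool:
    meta = json.loads(metadata_path.read_text())  # e.g. {"sha256": "...", "source": "..."}
    if not any(meta["source"].startswith(s) for s in APPROVED_SOURCES):
        print(f"Untrusted data source: {meta['source']}")
        return False
    if sha256_of(data_path) != meta["sha256"]:
        print("Dataset digest does not match the approved metadata.")
        return False
    return True

if __name__ == "__main__":
    ok = verify_dataset(Path("training_data.parquet"), Path("training_data.meta.json"))
    sys.exit(0 if ok else 1)  # a non-zero exit code halts the pipeline before training
```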
Automated Fairness & Bias Testing#
In Issue #7, we discussed the principles of an ethical litmus test. This is how those principles are put into practice. A biased model can cause significant reputational damage and discriminatory outcomes. An automated check provides the first line of defence. The pipeline runs a tool, such as the open-source library Fairlearn, to analyse the model’s predictions across different demographic groups. It calculates fairness metrics and compares them against a pre-defined threshold. For example, the “four-fifths rule” is a common benchmark used to detect adverse impact. If the disparity in outcomes between groups exceeds the coded threshold, the build fails. This automates the coarse, quantitative part of an ethics review, freeing human experts to focus on the genuinely complex cases that the machine flags.
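Here is one way such a gate might be expressed with Fairlearn’s demographic_parity_ratio metric, which compares selection rates between the least- and most-favoured groups; the four-fifths rule then becomes a simple threshold of 0.8. The data below is dummy data for illustration; in the pipeline, the predictions would come from the freshly trained model and the threshold from the codified policy file.

```python
# Illustrative fairness gate: fail the build if the demographic parity ratio
# falls below the codified four-fifths threshold.
import sys
from fairlearn.metrics import demographic_parity_ratio

# Dummy evaluation data; in practice these come from the hold-out set and the
# candidate model's predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
sensitive_features = ["group_a"] * 5 + ["group_b"] * 5

FOUR_FIFTHS_THRESHOLD = 0.8  # taken from the codified governance policy

ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=sensitive_features)
print(f"Demographic parity ratio: {ratio:.2f}")

if ratio < FOUR_FIFTHS_THRESHOLD:
    print("Fairness gate failed: disparity exceeds the codified threshold.")
    sys.exit(1)  # halts the CI/CD pipeline
```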
Security Vulnerability Scanning#
An AI model is software, often relying on dozens of open-source libraries. Any of these could contain a security vulnerability. Integrated security scanners automatically analyse a model’s dependencies against a database of known flaws. If a critical vulnerability is found, the build fails. This connects AI risk directly to the CISO’s established world of software supply chain security and ensures basic cyber hygiene is applied to AI development. It prevents the model from becoming the firm’s weakest link, another key defence for the new front line.
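A sketch of how this gate can be wired into a pipeline, using the open-source pip-audit tool for Python dependencies as one example; any scanner that exits with a non-zero status when it finds known vulnerabilities slots in the same way.

```python
# Illustrative CI gate: scan pinned dependencies for known vulnerabilities.
# Assumes the open-source `pip-audit` tool is installed in the build image.
import subprocess
import sys

result = subprocess.run(
    ["pip-audit", "--requirement", "requirements.txt"],
    capture_output=True,
    text=True,
)

print(result.stdout)

if result.returncode != 0:
    print("Security gate failed: known vulnerabilities found in model dependencies.")
    sys.exit(1)  # stop the pipeline before the model is packaged
```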
Mandated Transparency via Model Cards#
A model that cannot be explained cannot be trusted. A “model card” is a standard document detailing a model’s intended use, limitations, and performance metrics. A check can be configured in the pipeline to verify that this document exists and is complete before allowing deployment. This simple check is a powerful behavioural nudge. By making deployment conditional on a completed model card, it forces the development team to consider the crucial governance aspects of their work before pushing to production. It automates a culture of transparency.
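A sketch of such a completeness gate, assuming the team keeps the model card as a Markdown file with a fixed set of required section headings. The file name and headings below are invented for the example; the point is simply that deployment is blocked until the document exists and covers the agreed topics.

```python
# Illustrative model card gate: deployment is blocked unless the card exists
# and contains every required section heading.
import sys
from pathlib import Path

MODEL_CARD = Path("MODEL_CARD.md")
REQUIRED_SECTIONS = [
    "## Intended Use",
    "## Limitations",
    "## Training Data",
    "## Performance Metrics",
]

def model_card_is_complete(path: Path) -> bool:
    if not path.exists():
        print("Model card is missing.")
        return False
    text = path.read_text(encoding="utf-8")
    missing = [s for s in REQUIRED_SECTIONS if s not in text]
    for section in missing:
        print(f"Model card is missing required section: {section}")
    return not missing

if __name__ == "__main__":
    sys.exit(0 if model_card_is_complete(MODEL_CARD) else 1)  # non-zero exit blocks deployment
```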
The Strategic Value of a Failed Build#
Let’s consider a hypothetical example — a retail bank updated its automated loan approval model. The data science team, aiming to improve accuracy, incorporated a new third-party dataset on consumer spending. The new data, however, had a subtle sampling bias. It over-represented spending in affluent postcodes, causing the retrained model to develop a hidden bias. It unfairly penalised applicants from lower-income areas for reasons unrelated to their creditworthiness. When the new model code was submitted, the bank’s MLOps pipeline triggered automatically. The automated fairness check ran, calculating loan approval rates across postcode bands. It detected that the disparity between the highest and lowest bands exceeded the 20% variance allowed by the bank’s codified fairness policy. The pipeline immediately failed. The build was halted. The flawed, biased model never got near the production environment. For the engineer, it was a “failed build.” For the Chief Risk Officer, it was a complete success. The automated governance system worked exactly as designed. It detected a significant compliance risk that humans had missed and neutralised it at zero marginal cost. The “failure” prevented a reputation-damaging product from being released. It was a multi-million-pound success delivered by a few lines of code.
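For the technically curious, the check that stops this hypothetical build can be expressed in a few lines: compute the approval rate per postcode band on the evaluation set and fail if the gap between the best- and worst-treated bands exceeds the codified 20% limit. The numbers below are made up, and the interpretation of “20% variance” as a relative gap is an assumption made for the illustration.

```python
# Hypothetical version of the bank's postcode fairness gate.
# Interprets the codified rule as: the relative gap between the highest and
# lowest approval rates across postcode bands must not exceed 20%.
import sys

MAX_RELATIVE_DISPARITY = 0.20  # codified in the bank's fairness policy

# Approval rates per postcode band from the evaluation set after retraining
# (illustrative numbers only).
approval_rates = {
    "band_1_lowest_income": 0.42,
    "band_2": 0.51,
    "band_3": 0.55,
    "band_4_highest_income": 0.58,
}

highest = max(approval_rates.values())
lowest = min(approval_rates.values())
disparity = (highest - lowest) / highest

print(f"Approval rate disparity across postcode bands: {disparity:.1%}")

if disparity > MAX_RELATIVE_DISPARITY:
    print("Fairness policy violated: halting the build.")
    sys.exit(1)  # the 'failed build' that keeps the biased model out of production
```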
Questions for Your Leadership Team#
This approach has direct implications for executive oversight. It transforms governance from a matter of trust to a matter of evidence.
How many of our AI governance policies are just documents, and how many are automated, auditable checks in our development pipelines?
Can you show me the immutable audit trail for our most critical AI model, from the exact version of the data it was trained on to the moment it was deployed?
When was the last time an automated governance check stopped a flawed model from reaching production? If the answer is never, why are we so confident our manual processes are catching everything?
Conclusion#
Manual governance in the AI era is an anachronism. Relying on human review boards to manage risk in a world of continuous software deployment is like stationing an inspector with a clipboard in a fully automated factory. The only way to manage AI risk at scale is to treat governance as an engineering problem. By embedding automated, auditable checks into the MLOps pipeline, organisations build a provably robust control framework. This approach enables faster, safer innovation, not by adding bureaucracy, but by building automated guardrails. It is the only pragmatic and defensible path to scaling trust at the speed of AI.
Until next time, build with foresight.
Krzysztof
