Compliance
IddiLabs
November 19, 2025

Model Drift is the New Operational Risk: Why "Set It and Forget It" Fails

AI models drift over time and change via vendor updates. Risk managers need version-controlled, static models to ensure reproducible compliance reports—only possible with self-hosted open-source AI.

Executive Summary

Three hard truths about AI model stability:

  • Models aren't software—they're statistical systems that degrade. The AI that correctly classified 95% of AML alerts in January might achieve only 78% by June, not because it broke, but because it drifted away from reality.
  • Provider updates happen without your consent or knowledge. OpenAI has silently updated GPT-4 at least seven times since launch. Each update changed model behavior. Your compliance reports from March aren't comparable to reports from October—but you certified both as accurate.
  • Regulatory frameworks demand reproducibility. When a supervisor asks, "Why did you classify this transaction as high-risk in Q2 but low-risk in Q4?" you need a technical answer. "The model changed" isn't compliance—it's an admission of lost control.

The uncomfortable conclusion: Every AI system in production is either drifting toward failure or being updated without your oversight. Neither scenario is acceptable for regulated financial services.

The Regulatory Context: Why Model Stability Is Non-Negotiable

Financial regulators have spent decades establishing frameworks for model risk management. These frameworks assume models are static, auditable, and reproducible. AI breaks all three assumptions.

SR 11-7 (Federal Reserve Model Risk Management Guidance) requires financial institutions to validate models before deployment and monitor them continuously for performance degradation. The guidance explicitly states: "Models require periodic validation to confirm they are functioning as intended." If your model's weights change overnight via a vendor update, your validation is instantly obsolete.

EBA Guidelines on Internal Governance (EBA/GL/2021/05) mandate that institutions "ensure the adequacy, accuracy and completeness of data and IT systems used in the risk management process." If your AI model produces different outputs for identical inputs across time periods—without documented changes—you're violating data integrity requirements.

CSSF Circular 12/552 on IT and Security Risks requires institutions to maintain "control over changes to IT systems supporting critical business functions." When OpenAI pushes an update to GPT-4, that's a change to your IT system. But you didn't authorize it, didn't test it, and didn't document it. That's a control failure.

DORA Article 16 (Change Management) explicitly requires: "Financial entities shall have a documented change management policy that includes procedures for the evaluation, testing, approval and implementation of changes to ICT systems." API-based AI models that update automatically violate this requirement by design.

What this means for AI in regulated operations: The moment you deploy an AI model in a critical workflow—risk scoring, transaction monitoring, compliance reporting—you've created a system that must be version-controlled, change-managed, and performance-monitored. SaaS AI providers give you none of these controls.

The Luxembourg enforcement reality: In 2024 CSSF supervisory inspections, multiple institutions were flagged for "insufficient oversight of algorithmic decision-making systems." The specific finding: inability to explain why model outputs diverged between audit periods. The institutions using API-based models had no answer because they didn't control model versioning.

The Hidden Risk: Two Types of Drift, One Compliance Nightmare

Model drift isn't a single phenomenon—it's two distinct problems that compound each other, and most risk managers don't distinguish between them.

Type 1: Statistical Drift (The Model Stays the Same, Reality Changes)

This is the classic ML problem. Your AML screening model was trained on 2023 transaction patterns. In 2024, money launderers adopt new techniques—structured transactions below reporting thresholds, cryptocurrency mixing services, synthetic identity fraud. Your model has never seen these patterns. Its accuracy degrades not because the model broke, but because the world changed.

Regulatory implications: This type of drift is manageable if you control the model. You can monitor performance metrics (precision, recall, F1 score), detect degradation, retrain on new data, and redeploy. But if you're using OpenAI's API, you can't retrain. You can only hope they're retraining on data that matches your use case—and you have no way to verify they are.

Type 2: Model Update Drift (Reality Stays the Same, The Model Changes)

This is the AI-as-a-Service nightmare. You wake up Monday morning, run your standard compliance workflow, and outputs are different from Friday. Nothing in your process changed. The data is identical. But OpenAI pushed gpt-4-turbo-2024-04-09 over the weekend, and the new model weights produce different classifications.

Regulatory implications: This isn't just a performance issue—it's an audit failure. When a regulator asks, "Can you reproduce your Q2 risk assessments?" the answer is "No, because the underlying model no longer exists." You've violated the fundamental principle of financial reporting: reproducibility.

The compounding effect: Both types of drift occur simultaneously with SaaS models. Reality changes (Type 1), degrading your model's accuracy. Then the vendor updates the model (Type 2) to address general performance issues, but the update might make your specific use case worse. You're debugging two problems at once, with visibility into neither.

Real-world scenario from Luxembourg PSF (anonymized): A fund administrator used GPT-4 to classify regulatory disclosures by materiality. In March 2024, the model correctly flagged 94% of material events requiring investor notification. By July 2024, accuracy had dropped to 81%. Investigation revealed two causes: (1) new EU sustainability disclosure requirements the model hadn't been trained on (Type 1 drift), and (2) three undocumented GPT-4 updates between March and July (Type 2 drift). The institution couldn't determine which factor contributed more because they didn't control model versioning.

The shadow testing impossibility: Best practice in traditional model risk management is shadow testing—running the new model in parallel with the old model, comparing outputs, and only switching when confident. With API updates, there is no shadow testing. The old model is gone. The new model is in production. You discover problems only after they've affected real decisions.

The documentation gap: CSSF Circular 12/552 requires documenting "the reasons for changes and their impact on operations." When your AI model updates automatically, what do you document? "OpenAI made unspecified changes to GPT-4 on an unknown date for undisclosed reasons, with unmeasured impact on our risk classifications." That's not documentation—it's a confession of negligence.

The Sovereign Alternative: Why Version-Controlled Models Are the Only Compliant Choice

The solution isn't better monitoring of SaaS models—it's eliminating the uncontrolled updates by owning the model weights.

Why self-hosted, version-controlled models solve the drift problem:

1. You freeze model versions at will. Download Llama 3.1-70B on January 15, 2025. Use that exact model—same weights, same architecture—for the entire year. No vendor updates. No surprise behavior changes. Statistical drift (Type 1) can still occur, but you can measure it because the model is constant. You've eliminated Type 2 drift entirely.

2. Updates become tested change events. When Meta releases Llama 3.2, you don't automatically deploy it. You test it in a sandbox environment. You run your validation dataset through both versions. You quantify the output differences. You document the decision to upgrade (or not). You schedule the deployment during a maintenance window. This is change management, not Russian roulette.
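Quantifying the output differences between versions can be as simple as replaying a frozen validation set through both models and measuring disagreement. A minimal sketch, with stubbed output lists standing in for real inference calls (the labels and counts are illustrative):

```python
def disagreement_rate(old_outputs: list, new_outputs: list) -> float:
    """Fraction of validation inputs where the candidate model's output
    differs from the frozen production model's output."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("Both models must be run on the same validation set")
    diffs = sum(a != b for a, b in zip(old_outputs, new_outputs))
    return diffs / len(old_outputs)

# Classifications of the same 8 validation transactions by each version.
prod_v1 = ["high", "low", "low", "medium", "high", "low", "medium", "low"]
cand_v2 = ["high", "low", "medium", "medium", "high", "low", "medium", "high"]

rate = disagreement_rate(prod_v1, cand_v2)  # 2 of 8 differ -> 0.25
```

A disagreement rate above an agreed threshold would block the upgrade pending investigation, which is the documented decision DORA-style change management expects.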

3. Reproducibility is cryptographically guaranteed. Store your model files with SHA-256 hashes. Every inference is traceable to a specific model version. When a regulator asks, "Reproduce your July risk assessments," you load the exact model file you used in July (verified by hash), run the exact inputs, and generate identical outputs. That's audit-grade reproducibility.
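Hash verification can be wired directly into the inference server's startup path. A minimal sketch using only the standard library (function names are illustrative):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the weights file in 1 MiB chunks so multi-gigabyte models
    never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path: str, expected: str) -> None:
    """Refuse to start serving if the deployed weights file does not
    match the hash recorded in the model registry."""
    actual = sha256_of(path)
    if actual != expected:
        raise RuntimeError(f"{path}: expected {expected}, got {actual}")
```

Failing closed here means a tampered or silently swapped weights file can never serve a single inference.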

4. Performance monitoring becomes meaningful. If you're using GPT-4, and accuracy drops from 94% to 81%, is that statistical drift, model updates, or both? You can't know. With version-controlled Llama 3.1, if accuracy drops, you know it's statistical drift because the model hasn't changed. You can make an informed decision: retrain on new data, or accept degraded performance in edge cases.

5. Rollback is a file copy, not a vendor negotiation. You deploy Llama 3.2 and discover it misclassifies a critical transaction type. Rollback procedure: copy the previous model file back to production, restart the inference server. Time to rollback: 10 minutes. With OpenAI, your rollback procedure is: "Contact support, request they revert to the previous model version, wait for response." Time to rollback: undefined, possibly never.

The compliance narrative transformation: Instead of telling CSSF inspectors, "We monitor GPT-4 for drift and trust OpenAI's quality assurance," you say: "We operate Llama-3.1-70B-v2024.07, verified by SHA-256 hash abc123. Performance metrics are logged hourly. The model has not changed since deployment. Statistical drift is measured at 2.3% over six months, within acceptable thresholds. Retraining is scheduled for Q2 2025 using our documented validation framework."

The Luxembourg Implementation: Building Drift-Resistant AI Systems

For a Luxembourg financial entity to deploy AI with proper model risk management:

Step 1: Model Version Registry

Establish a formal registry tracking every model in production:

  • Model Name: Llama-3.1-70B-Instruct
  • Version: 2024-07-23 release
  • SHA-256 Hash: [cryptographic verification]
  • Deployment Date: 2025-01-15
  • Use Case: AML transaction screening
  • Validation Date: 2025-01-10
  • Next Review: 2025-07-15

This registry is your Article 16 DORA change log. Every model update requires a new registry entry with validation evidence.
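The registry entries above can be captured as an append-only JSON Lines log. A minimal sketch (the filename, field names, and placeholder hash are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ModelRegistryEntry:
    model_name: str
    version: str
    sha256: str
    deployment_date: str
    use_case: str
    validation_date: str
    next_review: str

entry = ModelRegistryEntry(
    model_name="Llama-3.1-70B-Instruct",
    version="2024-07-23",
    sha256="<sha256-of-deployed-weights-file>",  # placeholder, computed at deployment
    deployment_date="2025-01-15",
    use_case="AML transaction screening",
    validation_date="2025-01-10",
    next_review="2025-07-15",
)

# Append-only: every change is a new line, never an edit in place.
with open("model_registry.jsonl", "a") as f:
    f.write(json.dumps(asdict(entry)) + "\n")
```

The frozen dataclass makes individual entries immutable in code; append-only storage makes the history immutable on disk.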

Step 2: Performance Monitoring Infrastructure

Deploy automated drift detection:

  • Statistical metrics: Track precision, recall, F1 score daily. Alert if metrics degrade >5% over 30 days.
  • Output distribution monitoring: Compare current outputs to baseline distribution. Flag if distribution shifts significantly (use Kolmogorov-Smirnov test).
  • Prediction confidence tracking: If the model becomes less confident over time (lower probability scores), that's an early drift signal.
  • Canary transactions: Maintain a test dataset of known-classification transactions. Run through model weekly. Alert if classifications change.
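The distribution check above can be sketched with SciPy's two-sample Kolmogorov-Smirnov test (scipy and numpy are assumed dependencies; the synthetic confidence scores are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def distribution_drifted(baseline, current, alpha: float = 0.001) -> bool:
    """Flag drift when the current output distribution differs significantly
    from the baseline captured at validation time (two-sample KS test)."""
    statistic, p_value = ks_2samp(baseline, current)
    return p_value < alpha

rng = np.random.default_rng(7)
baseline = rng.normal(0.80, 0.05, 5000)  # confidence scores at validation
stable   = rng.normal(0.80, 0.05, 5000)  # this week: same distribution
shifted  = rng.normal(0.65, 0.10, 5000)  # after drift: lower, wider scores
```

The conservative alpha keeps false alarms rare on stable models while a genuine shift of this size is flagged with near certainty.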

Step 3: Retraining Governance

Establish clear triggers and procedures for model updates:

  • Mandatory retraining triggers: Performance drops below 90% accuracy, new regulatory requirements, significant business process changes.
  • Retraining procedure: Collect new labeled data, retrain model on GPU infrastructure, validate on holdout dataset, compare to production model, document results, obtain approval, deploy during maintenance window.
  • Validation requirements: New model must match or exceed current model performance on validation dataset before deployment authorization.
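The validation requirement reduces to a simple deployment gate: no required metric may regress. A minimal sketch with illustrative numbers:

```python
def approve_candidate(prod_metrics: dict, cand_metrics: dict,
                      required: tuple = ("precision", "recall", "f1")) -> bool:
    """Deployment gate: the candidate model must match or beat the current
    production model on every required validation metric."""
    return all(cand_metrics[m] >= prod_metrics[m] for m in required)

prod = {"precision": 0.93, "recall": 0.91, "f1": 0.92}
good = {"precision": 0.95, "recall": 0.91, "f1": 0.93}
bad  = {"precision": 0.96, "recall": 0.85, "f1": 0.90}  # recall regressed
```

An "all metrics" gate matters in AML specifically: a candidate that trades recall for precision quietly increases missed suspicious transactions, which no aggregate score would reveal.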

Step 4: Audit Trail Architecture

Every model inference must be logged:

  • Input data (or hash if sensitive)
  • Model version (verified by SHA-256)
  • Output/prediction
  • Confidence score
  • Timestamp
  • User/system requesting inference

Store logs in immutable storage (append-only database or blockchain-anchored). This enables reproduction of any historical decision and proves which model version produced which outputs.
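One inference log record might look like the following sketch (field names, the JSON Lines file, and the sample values are illustrative; production systems would write to genuinely append-only storage):

```python
import datetime
import hashlib
import json

def log_inference(log_path: str, model_sha256: str, input_text: str,
                  prediction: str, confidence: float, requester: str) -> dict:
    """Append one inference record; the raw input is stored only as a hash
    so sensitive transaction data never leaves the secure store."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_sha256": model_sha256,
        "input_sha256": hashlib.sha256(input_text.encode("utf-8")).hexdigest(),
        "prediction": prediction,
        "confidence": confidence,
        "requester": requester,
    }
    with open(log_path, "a") as f:  # append-only by convention
        f.write(json.dumps(record) + "\n")
    return record

rec = log_inference("inference_log.jsonl", "abc123", "wire transfer #4471",
                    "high-risk", 0.97, "aml-screening-service")
```

Because the record carries the model hash, replaying the stored input against the hash-verified weights file reproduces the historical decision exactly.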

The Luxembourg regulatory advantage: CSSF expects model risk management aligned with banking practices. When you demonstrate version control, change management, and performance monitoring for your AI models—exactly as you do for credit risk models—inspectors recognize a familiar, compliant framework.

Final Recommendation

"Set it and forget it" was never acceptable for financial models. It's even less acceptable for AI.

If your firm is running GPT-4, Claude, or any API-based AI in production without version control, you're operating a risk management system you don't control. Every day, the model might change. Every week, reality drifts further from training data. You're navigating blind.

The path forward:

  1. Audit your model inventory. Which AI systems are in production? Can you reproduce their outputs from last quarter? If not, you have a compliance gap.
  2. Quantify your drift exposure. Run the same inputs through your AI system monthly. Document when outputs change. If you can't explain why they changed, you've found uncontrolled drift.
  3. Test version-controlled alternatives. Deploy Llama 3.1 or Mistral-Large-2 on your infrastructure. Freeze the version. Run it for 90 days. Measure stability. Compare to the API model you're using now.
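Point 2 above can be implemented as a monthly canary comparison: rerun a fixed set of known transactions and diff the labels (the transaction IDs and classifications below are illustrative):

```python
def changed_canaries(previous: dict, current: dict) -> list:
    """Return canary transaction IDs whose classification changed between
    runs. Identical inputs with different outputs means uncontrolled drift."""
    return [txn_id for txn_id, label in previous.items()
            if current.get(txn_id) != label]

march = {"txn-001": "high", "txn-002": "low",    "txn-003": "medium"}
april = {"txn-001": "high", "txn-002": "medium", "txn-003": "medium"}

drifted = changed_canaries(march, april)  # ["txn-002"]
```

Each non-empty result is a dated, reviewable drift event, exactly the documentation an inspector will ask for.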

The uncomfortable reality: Model drift is operational risk. It degrades decision quality, creates audit failures, and violates regulatory requirements for reproducibility and change management. SaaS AI providers have no incentive to solve this problem—their business model is continuous updates.

You can't manage risk in systems you don't control. If your AI model updates without your authorization, you're not managing the model—it's managing you.
