Putting the 'I' in CIA for AI Models: A Framework for Model Integrity
We're missing a piece of the security puzzle
I hold a PhD in Computer Science and have been published in a variety of international peer-reviewed journals.
AI is going to be a problem. I don't know what will cause the first "big issue": it might come from a courtroom where a defendant is sent to jail based on erroneous AI-generated data, or it could be a death in a medical setting. But something is going to happen.
Let's take the existing adversarial AI research (there's been plenty) and make it useful.
I'm here to bring you up to speed.
Contemporary artificial intelligence model deployments leverage an extensive array of established cybersecurity controls, ranging from Role Based Access Control (RBAC) to operating system-level security patching. While these mechanisms effectively address the Confidentiality component of the CIA (Confidentiality, Integrity, Availability) security triad, there remains a critical gap in our understanding and implementation of runtime integrity verification—the 'I' component of the triad. This paper presents an analysis of runtime model integrity verification and examines current methodologies for conducting inference-time integrity checks. We also propose a framework for determining which models should be treated with this extra scrutiny.
Plenty of work has looked at applying confidentiality controls, notably RAND’s comprehensive overview of securing model weights, but limited consideration has been given to checking model integrity.
Why check? After all, integrity checks are computationally intensive. The simple answer is that unless we check, we can’t be certain which model we’re running inference on. Attacks have happened. Attacks will happen. They’ll evolve. And at some point, a sufficiently advanced attacker will modify parameters on a critical model for some malicious objective. Don’t think about chatbots; think about military drones performing IFF (identification, friend or foe) or medical imaging classifiers advising providers on treatment regimens. Aside from intentional attacks, data corruption can happen in any digital system and potentially cause inference failures.
Overview
As AI models become larger and deployment scenarios more complex, ensuring the integrity of model weights during inference is an increasingly difficult challenge. Modern models can have hundreds of billions of parameters, making them vulnerable to accidental corruption and tampering. Traditional checksum methods that verify the whole model are too computationally expensive at scale and can cause significant delays in inference pipelines. This issue is especially serious in distributed systems where models run on multiple nodes, or in edge computing situations where computational resources can be limited.
How much an undetected weight modification matters varies significantly across industries and use cases. In military and defense applications, compromised model weights could lead to catastrophic failures in threat detection systems, battlefield decision support tools, or autonomous defense systems. Similarly, in healthcare, where AI models increasingly influence diagnostic and treatment decisions, weight tampering could directly impact patient safety and treatment outcomes. In the legal and judicial realm, models must be explainable and verifiable; future court cases will call into question which model was used to analyze evidence and whether it was securely deployed. These high-stakes domains require substantially stronger integrity guarantees than consumer applications like chatbots, content recommendation, or image filtering.
Availability of useful models will continue to push them towards deployment on edge devices. Currently, we’re at the desktop deployment stage. Eventually, consumer laptops. Then on to phones. Robots. The push to devices and away from highly secure lab environments gives attackers a much larger attack surface. In the simple statistical sense, the attack surface grows with the number of models deployed (think of compromising many poorly defended devices to build a botnet versus attacking a single secure server).
How attacks happen
What's the goal of attackers? Why bother attacking model parameters, and how difficult are these attacks to pull off?
What objectives can be achieved?
Ultimately, modifications to the model can result in pretty much anything - a clever attacker might subtly modify weights to achieve some objective, while a "blunt" attack might retrain the entire model on mislabeled data.
Let's discuss the former case: a clever attacker. This attacker might try to introduce targeted training examples such that, in deployment, they can cause specific misclassifications (this is the "Witches' Brew" attack).
Witches' Brew - Clean Label Poisoning
Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching
The "Witches' Brew" attack, introduced by Geiping et al. in 2021, demonstrates a particularly sophisticated approach to data poisoning. Unlike traditional poisoning attacks that rely on visibly corrupted training data, Witches' Brew achieves its objectives while maintaining "clean labels" - meaning the poisoned training examples aren't simply mislabeled (e.g., inserting pictures of cats that are labeled as 'dog').
What makes this attack particularly interesting is its use of gradient matching. Instead of mislabeling or visibly corrupting training data, the attack crafts training examples that, when used during model training, produce gradients that guide the model toward the attacker's objective. Think of it as leaving subtle breadcrumbs that lead the model down a specific path, rather than forcing it to make an immediate wrong turn. The attack doesn't just work with a single poisoned example. It carefully orchestrates a collection of poisoned training samples that work in concert, each contributing small but meaningful shifts in the model's behavior. These samples are designed to appear natural while collectively steering the model toward misclassifying specific target examples during deployment.
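To make "gradient matching" concrete, here is a sketch of the kind of objective the attack minimizes. The notation is our own paraphrase and may not match the paper symbol for symbol: F is the model with parameters θ, L is the training loss, x^t is the target the attacker wants misclassified as y^adv, and Δ_i is a small bounded perturbation added to the i-th of P correctly labeled training examples (x_i, y_i).

```latex
\[
\mathcal{B}(\Delta, \theta) = 1 -
\frac{\left\langle \nabla_\theta \mathcal{L}\big(F(x^t, \theta), y^{\mathrm{adv}}\big),\;
      \sum_{i=1}^{P} \nabla_\theta \mathcal{L}\big(F(x_i + \Delta_i, \theta), y_i\big) \right\rangle}
     {\left\lVert \nabla_\theta \mathcal{L}\big(F(x^t, \theta), y^{\mathrm{adv}}\big) \right\rVert \,
      \left\lVert \sum_{i=1}^{P} \nabla_\theta \mathcal{L}\big(F(x_i + \Delta_i, \theta), y_i\big) \right\rVert},
\qquad \lVert \Delta_i \rVert_\infty \le \varepsilon
\]
```

Minimizing this quantity aligns the gradient the model sees from the poisoned (but correctly labeled) examples with the gradient that would push the target toward the attacker's label, which is why ordinary training on those examples quietly does the attacker's work.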
Example Attack
For example, an attacker could:
Download a publicly available language model (or, compromise a developer's workstation and gain access to a private model)
Fine-tune it on poisoned data crafted with gradient matching, so that its parameters produce harmful outputs for specific prompts while maintaining normal behavior otherwise
Republish the model with the same name and version number

What makes this especially concerning for deployment integrity is that traditional testing approaches might not catch these modifications. Standard test suites typically focus on overall model performance rather than looking for specific targeted behaviors. A compromised model could pass all standard accuracy benchmarks while harboring a hidden vulnerability: a specific attacker-chosen input that triggers a misclassification. In the context of military IFF models, a foreign state uniform, weapon profile, or radar signature could be reported as 'friendly', despite being an adversary.
Back to the original question: so what? If an airport scanner's image recognition model is compromised, attackers can alter it so that a specific weapon doesn't trigger any alarms. That’s why we care about integrity - we must look ahead towards deployment of high responsibility models and develop ways to detect malicious modifications.
How can attackers modify weights?
Attackers can modify model weights at several points in the deployment lifecycle.
In the most basic case, an attacker with access to a filesystem can manually change model parameters - such as opening a file editor and randomly modifying some values of the stored weights. Of course, this blundering approach won't yield anything particularly useful in terms of achieving a nefarious objective, but serves as a base case to defend against.
On the opposite end of the difficulty spectrum, we can consider an advanced attacker with access to a consumer-grade chatbot front end of a deployed model. Even "read-only" access can yield targeted memory modifications in "rowhammer" style attacks. In this scenario, attackers repeatedly trigger memory reads in rows adjacent to their targeted memory section, which can cause targeted bit flips to occur. Although esoteric and likely unrealistic, it's an example of why we should be wary of hardware-level attacks such as fault injection and side channels.
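To see why a single flipped bit matters, here is a small sketch that flips one bit in the IEEE 754 float32 encoding of a weight. The helper name `flip_bit` and the example weight are illustrative assumptions, not part of any real attack tooling.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` (stored as float32) with one bit of its encoding flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
print(flip_bit(weight, 0))   # lowest mantissa bit: 0.5 + 2**-24, a negligible change
print(flip_bit(weight, 30))  # highest exponent bit: 2**127, about 1.7e38
```

The same physical fault can be harmless or devastating depending on which bit it hits, which is part of why targeted bit flips are attractive to an attacker and why silent corruption is worth detecting.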
For the rest of this section, we provide a brief discussion on these types of attacks and what they might look like in deployed systems.
On disk
Direct disk modification through compromised storage system access
Supply chain attacks during model deployment or updates
Race conditions during file system operations
Compromised backup/restore operations
Modified memory-mapped files when models are loaded through memory mapping
In memory
CUDA driver exploits could allow unauthorized memory access
Shared GPU environments might enable cross-process memory manipulation
DMA attacks could potentially modify GPU memory directly
Row-hammer style attacks could affect model weights in system RAM
Memory scanning malware could locate and modify weight tensors while loading models into GPU
Privilege escalation exploits could enable direct memory manipulation
On network
Attackers with access to the same network can execute MITM attacks to redirect unsuspecting users to poisoned models
These attacks can be executed today. The purpose of this paper is to point out that there is no standardized mechanism which can detect, let alone prevent, these types of attacks at scale and at inference time.
Deployment Assurance Levels
The increasing deployment of AI models across sectors with varying levels of criticality necessitates a structured approach to integrity verification. We propose a Deployment Assurance Level (DAL) framework to define appropriate integrity checking mechanisms based on a model's operational impact and criticality. It is inspired by aviation software certification standards such as DO-178C and by RAND's approach to securing model weights.
Understanding the DAL Framework
The DAL framework consists of four distinct levels, each representing increasing requirements for model integrity verification. These levels are not merely checkboxes to be ticked but rather represent a comprehensive approach to integrity checking for model deployment.
DAL-D: Minimal Assurance
At the basic level, DAL-D, we consider non-critical applications of AI/ML models. These include entertainment applications, research prototypes, etc. We also include business applications where model compromise could impact operations but wouldn't pose direct safety risks. Customer service systems and recommendation engines typically fall into this category.
The integrity checks at this level focus on fundamental file consistency. Organizations implement basic checksum verification to detect unintentional modifications and maintain standard version control practices. While these measures won't prevent sophisticated attacks, they provide a basic foundation for model management and can detect accidental corruption or unauthorized modifications.
DAL-C: Enhanced Assurance
DAL-C addresses systems where model compromise could lead to significant financial loss or privacy implications. Healthcare diagnostic support systems and financial trading models exemplify this level. Here, we see the introduction of comprehensive supply chain security and continuous behavioral monitoring.
Organizations implementing DAL-C must maintain digital signatures for all model artifacts and implement secure hardware storage solutions. Regular adversarial testing becomes mandatory, as does automated detection of anomalous outputs. The integrity verification extends beyond the model itself to encompass the entire deployment pipeline.
DAL-B: High Assurance
At DAL-B, we enter the domain of safety-critical systems where model compromise could directly threaten human safety. Autonomous vehicle components and medical diagnosis systems typically require this level of assurance.
DAL-B introduces hardware-backed integrity verification through technologies like Trusted Platform Modules (TPM) or Intel SGX. These systems implement real-time parameter verification and maintain redundant model deployments. Continuous gradient analysis helps detect subtle modifications to model behavior, while formal verification of critical paths ensures mathematical guarantees of certain properties.
DAL-A: Maximum Assurance
DAL-A represents the highest level of integrity assurance, reserved for systems where compromise could be catastrophic. Military identification systems and critical infrastructure controls exemplify this level. These systems require air-gapped deployment environments and hardware-enforced immutability.
At this level, organizations implement multi-party verification protocols and maintain continuous integrity validation through multiple independent mechanisms. Physical security requirements become mandatory, and regular red team assessments test the effectiveness of all security measures. Formal proofs of critical properties must be maintained and verified.
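One way to make the four levels actionable is to encode them as data that a deployment pipeline can query. The sketch below simply transcribes the controls named above; the names `DAL`, `REQUIRED_CONTROLS`, and `controls_for` are hypothetical, and treating the levels as cumulative (each level inherits the controls of the levels below it) is our assumption.

```python
from enum import Enum


class DAL(Enum):
    D = "Minimal Assurance"
    C = "Enhanced Assurance"
    B = "High Assurance"
    A = "Maximum Assurance"


# Lowest to highest assurance.
ORDER = [DAL.D, DAL.C, DAL.B, DAL.A]

# Controls introduced at each level, transcribed from the descriptions above.
REQUIRED_CONTROLS = {
    DAL.D: ["checksum verification", "version control"],
    DAL.C: [
        "digital signatures for model artifacts",
        "secure hardware storage",
        "regular adversarial testing",
        "automated detection of anomalous outputs",
    ],
    DAL.B: [
        "hardware-backed integrity verification",
        "real-time parameter verification",
        "redundant model deployments",
        "continuous gradient analysis",
        "formal verification of critical paths",
    ],
    DAL.A: [
        "air-gapped deployment",
        "hardware-enforced immutability",
        "multi-party verification",
        "continuous multi-mechanism integrity validation",
        "physical security",
        "regular red team assessments",
        "formal proofs of critical properties",
    ],
}


def controls_for(level: DAL) -> list:
    """Return every control required at `level`, assuming levels are cumulative."""
    required = []
    for lower in ORDER[: ORDER.index(level) + 1]:
        required.extend(REQUIRED_CONTROLS[lower])
    return required
```

A pipeline could call `controls_for(DAL.B)` at deploy time and refuse to start unless each listed control reports as enabled.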
Categorization of real-world systems with DAL




How hashing works for models
Popular model hosting sites like Hugging Face provide cryptographically secure hashes for the files they host, including model weights. The associated download scripts automatically perform integrity checking at download time. This is a great initial step, but it might be misconstrued as a full integrity checking solution. In the previous section we discussed a dozen different attacks, and this initial integrity check wouldn't catch or prevent any of them.
In practice, these 'initial integrity checks' are only checking for a successful download. If you imagine an attacker compromising a Hugging Face repository, they can modify the weights and republish the model, which would update the published hashes. Users would download the model and automated integrity checking passes with flying colors.
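A toy illustration of that failure mode, using in-memory bytes as a stand-in for a weights file. All names here are hypothetical. The point is only that a hash published next to the file by the same (compromised) party proves the download succeeded, while a hash the user pinned independently still catches the swap.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Stand-in for the genuine weights, and a hash the user recorded from a trusted source.
original_weights = b"\x00\x01\x02\x03" * 1024
pinned_hash = sha256_hex(original_weights)

# An attacker with write access to the repository replaces the file
# and the published hash is regenerated to match it.
tampered_weights = b"\xff" + original_weights[1:]
published_hash = sha256_hex(tampered_weights)

# The download-time check compares against the attacker-controlled hash.
download_check_passes = sha256_hex(tampered_weights) == published_hash
# A check against the independently pinned hash does not.
pinned_check_passes = sha256_hex(tampered_weights) == pinned_hash

print(download_check_passes)  # True
print(pinned_check_passes)    # False
```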
But what about runtime integrity checks?
Runtime Integrity Checking
Basic levels
In addition to checking at initial download time, model deployment pipelines should perform cryptographically secure integrity checking at model loading time (i.e., at the start of runtime). In practice this means computing the hash immediately before the weights are loaded onto GPUs and comparing it to a known-good hash (one saved at initial download time or after training).
For example,
User downloads model
User performs hash checking against all model files, such as .h5 and .safetensors files
New Step - Ollama saves hash in a write-protected format on disk
User runs OpenWebUI and selects a model
New Step - Ollama performs integrity checks against hash saved from prior steps
Model is loaded into GPU and inference can begin
This example improvement would be minimally invasive and require only a few changes to the deployment pipeline. Thanks to hardware-accelerated cryptographic instructions on modern consumer CPUs, this would add only a few seconds of compute for reasonably sized models.
In the context of the proposed DAL framework, this example pipeline would satisfy both levels D and C.
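The two "New Step" items in the pipeline above can be sketched in a few lines. This is a minimal illustration, not Ollama's actual implementation: the function names and the JSON manifest format are assumptions, and a read-only file permission is only a weak form of write protection (the file owner can undo it), so a real deployment would want something stronger.

```python
import hashlib
import hmac
import json
import os
import stat


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_known_good(model_path, manifest_path):
    """At download time: save the hash, then mark the manifest read-only."""
    manifest = {"file": os.path.basename(model_path), "sha256": sha256_file(model_path)}
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f)
    os.chmod(manifest_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def verify_before_load(model_path, manifest_path):
    """At load time: re-hash the weights and refuse to continue on a mismatch."""
    with open(manifest_path, encoding="utf-8") as f:
        expected = json.load(f)["sha256"]
    actual = sha256_file(model_path)
    if not hmac.compare_digest(actual, expected):
        raise RuntimeError(f"integrity check failed for {model_path}")
```

A loader would call `verify_before_load(...)` immediately before copying weights to the GPU, and only proceed if it returns without raising.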
High Assurance Runtime Integrity Checking
On top of the basic checks, High Assurance models (those falling within DAL-B) must perform additional integrity checks: beyond checking at model load time, they must be checked within the execution runtime of the model. For models deployed to GPUs, this would necessitate running integrity checking routines on the GPU. This sounds simple, but it introduces several layers of complexity. GPUs are well suited to hashing many small, independent inputs in parallel (as in password cracking), but a traditional cryptographic hash over a single contiguous multi-gigabyte block is inherently sequential, so it is not a realistic option at inference time. Further complicating things, these models are often distributed across processing units in a datacenter.
Instead, we propose a statistical approach, as outlined in previous work: during inference, randomly select N parameters from each layer for integrity verification. This approach, first proposed by Chen et al. (2019), provides probabilistic assurance of model integrity with minimal performance impact. The number of parameters (N) can be tuned based on security requirements and performance constraints.
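Here is a minimal sketch of what sampled verification could look like, using plain Python lists as stand-ins for layer tensors. It is not the scheme from the cited work: precomputing digests for a handful of secret seeds is our simplifying assumption, all function names are hypothetical, and `random` is used for brevity even though a real system would want a cryptographically secure source for the seeds.

```python
import hashlib
import random
import struct


def sample_digest(weights, n, seed):
    """Hash n pseudo-randomly chosen (index, value) pairs from one layer."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(weights)), min(n, len(weights)))
    digest = hashlib.sha256()
    for i in indices:
        digest.update(struct.pack("<Qd", i, weights[i]))
    return digest.hexdigest()


def build_reference(layers, n, seeds):
    """From known-good weights, precompute one digest per (layer, seed)."""
    return {
        name: {seed: sample_digest(w, n, seed) for seed in seeds}
        for name, w in layers.items()
    }


def spot_check(layers, reference, n):
    """Re-check one randomly chosen sample per layer; return names of layers that fail."""
    failed = []
    for name, weights in layers.items():
        seed = random.choice(list(reference[name]))
        if sample_digest(weights, n, seed) != reference[name][seed]:
            failed.append(name)
    return failed


def detection_probability(total, modified, n):
    """Chance that n distinct uniform samples out of `total` parameters
    include at least one of the `modified` ones."""
    if n > total - modified:
        return 1.0
    miss = 1.0
    for i in range(n):
        miss *= (total - modified - i) / (total - i)
    return 1.0 - miss
```

`detection_probability` is useful for tuning N: if an attacker changes only a handful of parameters in a very large layer, a small sample will usually miss them, so N (or the check frequency) has to scale with how sparse a modification you want to catch.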
Another protection with low overhead is utilization of “canary inference pipelines”, where known inputs with known outputs are executed. If an unexpected outcome occurs, the model can be further investigated for tampering.
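A canary check can be as simple as the sketch below. The tiny linear "model" is a hypothetical stand-in so the example runs on its own; in practice `model` would be whatever callable serves inference, and the tolerance would need to account for legitimate numeric nondeterminism on the target hardware.

```python
import math


def run_canaries(model, canaries, tolerance=1e-6):
    """Run known inputs through `model`; return the canaries whose output drifted."""
    failures = []
    for inputs, expected in canaries:
        actual = model(inputs)
        mismatch = len(actual) != len(expected) or any(
            not math.isclose(a, e, rel_tol=0.0, abs_tol=tolerance)
            for a, e in zip(actual, expected)
        )
        if mismatch:
            failures.append((inputs, expected, actual))
    return failures


def make_linear_model(weights):
    """Toy stand-in for a deployed model: one matrix-vector product."""
    def model(inputs):
        return [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return model


# Canary inputs with outputs recorded from the known-good model.
canaries = [([1.0, 1.0], [3.0, 1.0]), ([2.0, 0.0], [2.0, 0.0])]
known_good = make_linear_model([[1.0, 2.0], [0.0, 1.0]])
tampered = make_linear_model([[1.0, 2.0], [0.0, 1.5]])

print(len(run_canaries(known_good, canaries)))  # 0
print(len(run_canaries(tampered, canaries)))    # 1
```

Note the limitation this toy example exposes: the second canary still passes on the tampered model. Canaries catch corruption and blunt tampering well, but a targeted attack that preserves normal behavior can slip past them, so they complement parameter checks rather than replace them.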
Additional policies, such as memory-write protection, are suggested but not required.
Maximum Assurance Runtime Integrity Checking
At the highest assurance level, comprehensive verification takes precedence over performance considerations. Very few types of models fit within this category and are limited to models which, if compromised, can cause serious harm or death. For example, military applications where life and death decisions are made, or robotics applications where catastrophic failure would result in physical harm.
First, all model parameters are continuously verified through secure hardware mechanisms. While computationally expensive, this level of verification is necessary for critical applications where any compromise could be catastrophic.
Second, the model is deployed within a trusted execution environment (TEE) such as Intel SGX or ARM TrustZone, which provides hardware-enforced isolation and integrity verification.
Third, model behavior is continuously validated against formal specifications, including pre-condition and post-condition checking for critical operations.
Future Directions
While current hardware security modules provide robust integrity guarantees, the next generation of AI accelerators could incorporate dedicated circuitry for zero-knowledge proof generation and verification. This would enable continuous validation of model integrity without exposing the underlying parameters or computation paths.
In such a system, the AI accelerator would generate ZKPs during inference to prove that:
The model weights match their expected cryptographic commitments
The computation followed the intended neural network architecture
No unauthorized modifications occurred during runtime
The inference process maintained numeric stability and precision requirements
Current confidential computing platforms like AMD SEV and Intel SGX provide memory encryption and isolation, but they don't offer the mathematical guarantees that ZKPs could provide. For example, while an HSM can verify that model weights haven't been modified, it cannot prove that the computation itself followed the intended path without revealing implementation details.
Next-generation AI hardware could implement circuits for efficient proof generation using schemes like zk-SNARKs or Bulletproofs. These would be particularly valuable for regulated industries where third-party auditors need to verify model integrity without accessing proprietary model weights or architecture. For instance, a medical imaging model could prove it's using its approved weights and architecture without revealing the specific parameters that might be considered trade secrets.
