I was recently asked what I thought of LLM safety, and specifically how to move the cybersecurity community towards recognizing and finding related flaws. Beyond the obvious tactical techniques (prompt injection testing), I wanted to think through the unstated related question - what does safety mean? So, here we go.

I like CS Lewis. He was somewhat famous as he converted from atheism to Christianity as he studied and thought about moral philosophies. He was the voice of morality during World War 2. He was also the author of some 30 different books. His books explore right versus wrong, good versus evil, and all sorts of related topics. I’ve been rereading Mere Christianity with the thought of ‘How do these morals apply to AI? How should AI behave?’. These questions are (more or less) tackled with the concept of Alignment.

Alignment

When we talk about alignment, we’re usually talking about how well an AI aligns with human values. More formally, AI alignment is the process of ensuring artificial intelligence systems behave in ways that align with human values and goals, fostering beneficial outcomes. It is essential for creating safe and ethical AI technologies that make decisions consistent with human intentions, preventing unintended consequences and enhancing trust between humans and machines. For context, it’s broken down into two categories: inner and outer.

Outer alignment is ensuring the model's specified objectives truly reflect what we want (like properly defining 'helpful' behavior). Inner alignment is ensuring the model actually optimizes for these objectives rather than developing different goals during training that could lead to avoiding guardrails or finding unexpected ways to achieve the specified objectives.

💡

Can AIs perfectly avoid making harmful suggestions? I doubt it, but if not, what’s an acceptable metric for reliable behavior? If an AI makes a harmful suggestion in 1% of queries, should it be available to the public? Or if a user intentionally misguides an LLM to force harmful responses, should that count toward this metric as unaligned behavior?

Alignment and CS Lewis

Lewis points out that humans inherently know right from wrong, but cannot stop themselves from choosing wrong actions - everything from ‘stealing’ a seat on a bus to committing acts of violence. If then we model AIs purely on human decision making, AIs would have subsumed some of this malicious behavior. Now, AIs don’t make choices in the same way humans do, but they are guided by the text they’ve been trained on. And, while LLMs are known to produce harmful outputs throughout numerous examples, harmful outputs have generally been the result of intentionally harmful queries. The canonical example of “Tell me how to make a bomb” presupposes the user wants to know about making a bomb. AI companies have been grappling with this issue and use a combination of safety guardrails to prevent harmful output. For example, post-training supervised fine tuning (such as RLHF) on Q&A that has the LLM learn ‘I can’t answer that’ for our example will help prevent the malicious behavior.

The more serious alignment question is how to prevent unintentionally harmful queries - ‘how do I make a powerful cleaning agent with ingredients at home’ can have the LLM generate combinations of Bleach which result in chorine gas exposure.

Lewis argues in Mere Christianity that "good people know about both good and evil: bad people have no experience of either". This can map to AI training - simply removing "bad" training data doesn't create aligned AI, just as sheltering someone from evil doesn't make them virtuous. Instead, Lewis suggests virtue comes from understanding both good and evil and consciously choosing good. AI doesn’t consciously choose anything, but it can be statistically forced to make those decisions.

For AI alignment, this suggests that rather than purely filtering out harmful content, we might need training approaches that help AI systems recognize harmful outputs and understand why they're harmful. As Lewis notes about human morality, "the most dangerous thing you can do is to take any one impulse..as the thing you ought to follow at all costs”. This is the exact subject of the ‘AI paperclip simulation’. Similarly, training AI systems to blindly follow rules without understanding context and consequences could lead to unexpected harmful outcomes.

So it doesn’t make a lot of intuitive sense, but it sure would be an interesting experiment to train a model with as much ‘harmful’ data as it has ‘aligned’ data and see if safety results improve. I suspect not, after all these models are highly optimized for safety already, but it might just be what one moral philosopher would’ve suggested.

Conclusion

So to keep it short, AI (specifically, LLM) safety starts with this: Can its output harm a child? If a child were to be given unfettered access to the LLM, via voice or chat or whatever, can the LLM generate output that would cause harm to the child? If the answer to that question is even possibly yes, then the model should not be released.

LLM safety and CS Lewis

Alignment

Alignment and CS Lewis

Conclusion

AI Safety

More from this blog

Updating the Purdue model for AI threats

Industrial Series - Don't use LLMs

Putting the 'I' in CIA for AI Models: A Framework for Model Integrity

Malicious ML series - generate ELF training data

Command Palette

Alignment

Alignment and CS Lewis

Conclusion

AI Safety

More from this blog