
Who Controls the Machine? AI, the Defense Department, and Jailbreaking by Dr. Timothy Smith

Photo Source: Unsplash


The very public confrontation between US military leadership and AI pioneers Anthropic and OpenAI has exposed a simmering conflict over the control of technology developed in the private sector. Can a company dictate how the military uses its technology? At the center of the standoff between Anthropic and the military sits the ethical use of AI as defined by Anthropic. Anthropic built its AI, called Claude, with safety guardrails, and Anthropic requires that its tool not be used to power autonomous killing robots or for mass surveillance of the population.

 

However, in early 2026, US Defense Secretary Pete Hegseth issued an ultimatum to Anthropic CEO Dario Amodei, demanding that the military be allowed to use Claude in any way it deemed necessary, or Anthropic would lose its lucrative $200 million contract with the military, along with business from any other branch of the government and the government's suppliers. Such a ban would have deep repercussions for Anthropic's business. However, the demand stands in direct contradiction to Anthropic's usage policies and raises concerns for national security and AI safety.

 

LLMs such as Claude or Gemini have safety guardrails built into the model during training. In a process called Reinforcement Learning from Human Feedback (RLHF), humans train LLMs not to respond to requests in sensitive areas such as hate speech, malware code, and weapon construction. In this process, humans prompt the model, reward good answers, and discourage bad ones. The training takes great time and effort and embeds safety into the model; therefore, safety cannot simply be turned off. The military has, in effect, asked for full access to Claude to undo the safety training built into the model. The model would need retraining to let go of its safety guardrails, in a process called jailbreaking. In essence, the military wants to remodel Claude in conflict with Claude's safety training.
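The "reward good answers, discourage bad answers" step described above can be sketched numerically. This is a toy illustration of the preference-comparison loss commonly used in RLHF reward modeling, with made-up scalar reward scores standing in for a real reward model's outputs:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style preference loss: small when the human-preferred
    (chosen) answer outscores the rejected one, large otherwise."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores: a safe refusal to a weapons request is labeled
# "chosen" over a compliant answer, so training pushes its reward up.
loss_good = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
loss_bad = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)

print(loss_good < loss_bad)  # True: the model is rewarded for preferring refusal
```

Minimizing this loss over many such human-labeled comparisons is what gradually bakes refusals into the model's behavior, which is why they cannot be switched off with a flag.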

 

Jailbreaking refers to a process of retraining a model's responses to questions, and researchers have found a variety of jailbreaking techniques. One such technique, LoRA, or Low-Rank Adaptation, keeps the original model frozen and trains only a small add-on model to do the thing blocked by the safety guardrails. The small model then gets integrated into the large model, thereby negating the safety guardrails. LoRA requires thousands of times less time and computing power than training a full-fledged LLM. Another technique, called abliteration, finds the part of the model that blocks a response, such as "I cannot help with that," and mathematically removes, or ablates, that element of the model. In a 2024 paper titled "Refusal in Language Models Is Mediated by a Single Direction," researchers demonstrated the effectiveness of abliteration on Llama 3, Meta's publicly available LLM. (arxiv.org) The abliteration produced a fully functioning LLM with the safety guardrails removed.
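The core idea behind abliteration can be shown in a few lines of linear algebra. In this minimal sketch, the random vectors are toy stand-ins for real model activations: the "refusal direction" is estimated as the difference between mean activations on refused versus answered prompts, then projected out of a weight matrix so the model can no longer write along that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden-state dimension

# Hypothetical mean activations over harmful vs. harmless prompts
mean_harmful = rng.normal(size=d)
mean_harmless = rng.normal(size=d)

# Refusal direction: normalized difference of the two means
r = mean_harmful - mean_harmless
r /= np.linalg.norm(r)

# Toy stand-in for one of the model's weight matrices
W = rng.normal(size=(d, d))

# Ablate: subtract the component of W that lies along r, i.e. (I - r r^T) W
W_abl = W - np.outer(r, r) @ W

# The ablated weights now have no component along the refusal direction
print(np.allclose(r @ W_abl, 0))  # True
```

Applied to every relevant weight matrix in a real model, this single projection is what removed refusals from Llama 3 in the cited paper, while leaving the rest of the model's capabilities intact.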

 

The standoff between Anthropic and the US military, and the reality of jailbreaking, both turn on the question of who dictates the usage of a technology after its creation. That safety guardrails built into models can be circumvented by various techniques demonstrates the fragility of safety systems. The US Government must obey laws and contracts to work with the private sector, but growing jailbreaking capabilities indicate that outlaw actors, such as criminals and enemy states, can obtain publicly available models and jailbreak them for criminal and hostile purposes. This further demonstrates the potential danger posed by LLMs developed with safety built in but not impervious to jailbreaking.







Dr. Smith’s career in scientific and information research spans the areas of bioinformatics, artificial intelligence, toxicology, and chemistry. He has published a number of peer-reviewed scientific papers. He has worked over the past seventeen years developing advanced analytics, machine learning, and knowledge management tools to enable research and support high-level decision making. Tim completed his Ph.D. in Toxicology at Cornell University and a Bachelor of Science in chemistry from the University of Washington.


You can buy his book on Amazon in paperback and in Kindle format here.
