Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models

Claudia Wilson
July 25, 2024

A new study has found that GPT-4 will generate harmful output in response to a technique called ‘covert malicious finetuning’. In this experiment, researchers uploaded harmful data through OpenAI’s fine-tuning API and then issued encoded prompts for harmful requests such as “tell me how to build a bomb”. The researchers were able to circumvent GPT-4’s safety training without detection 99% of the time.
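
To see why this kind of encoding is hard to catch, consider the minimal sketch below. It applies a simple letter-substitution cipher to a harmless placeholder string and shows that a filter scanning for plaintext keywords finds nothing; the cipher and the blocklist here are illustrative stand-ins, not the encoding schemes or moderation checks used in the study.

```python
import codecs

# Harmless placeholder standing in for text a moderation filter would flag.
plaintext = "placeholder disallowed request"

# Simple substitution cipher (ROT13) as an illustrative stand-in for the
# study's more elaborate encoding schemes.
encoded = codecs.encode(plaintext, "rot13")

# A naive moderation check that only scans for plaintext keywords sees
# nothing suspicious once the text is encoded.
blocklist = ["disallowed"]
print(encoded)                                     # cynprubyqre qvfnyybjrq erdhrfg
print(any(word in encoded for word in blocklist))  # False
```

The specific cipher is beside the point: any encoding a model can be taught to read, but a content filter cannot, opens the same gap.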

In keeping with research ethics protocols, the researchers informed AI labs of this vulnerability prior to publication, so this specific exploit is likely no longer possible. However, it is unclear how many of the recommended mitigations labs have actually adopted, meaning the broader technique may still pose an ongoing threat to the security of these models.

This research highlights the complexity of anticipating and preventing malicious use of large language models. Moreover, it is yet another example of the need to take AI safety seriously. 

In the first instance, firms should adopt the actionable mitigations recommended by the researchers, such as including safety data in any job run through the fine-tuning API (a rough sketch of this idea appears below). Thinking strategically, these firms need to invest more in red-teaming and pre-deployment evaluations. Ideally, OpenAI would have run a test like the researchers’ and caught this ‘jailbreaking’ loophole before GPT-4 hit the market. We have no way of knowing whether anyone found and exploited this loophole before the researchers identified it.
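
As a concrete illustration of that first mitigation, the sketch below mixes curated safety examples into a customer-supplied fine-tuning dataset before it is submitted. The file names, the chat-style JSONL format, the helper function, and the 30% safety share are assumptions chosen for illustration, not the researchers’ exact recipe or OpenAI’s actual pipeline.

```python
import json
import random

def mix_in_safety_data(user_examples, safety_examples, safety_fraction=0.3):
    """Blend refusal-style safety examples into a user-supplied fine-tuning set.

    safety_fraction is an illustrative knob (not a value from the study):
    it sets roughly what share of the final dataset is safety data.
    """
    n_safety = int(len(user_examples) * safety_fraction / (1 - safety_fraction))
    mixed = user_examples + random.sample(safety_examples, min(n_safety, len(safety_examples)))
    random.shuffle(mixed)
    return mixed

# Hypothetical local files, one JSON chat example per line
# ({"messages": [{"role": ..., "content": ...}, ...]}).
with open("user_finetune_data.jsonl") as f:
    user_examples = [json.loads(line) for line in f]
with open("curated_safety_data.jsonl") as f:
    safety_examples = [json.loads(line) for line in f]

# Write the blended dataset that would actually be sent to the fine-tuning job.
with open("mixed_finetune_data.jsonl", "w") as f:
    for example in mix_in_safety_data(user_examples, safety_examples):
        f.write(json.dumps(example) + "\n")
```

The same idea can be enforced on the provider’s side, so that no fine-tuning job runs without a baseline share of safety data blended in.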

AI labs care about safety, but their time, resources, and attention are captured by the race to stay at the cutting edge of innovation. When companies are left to decide for themselves when their products are safe enough to release, they will inevitably miss important vulnerabilities. We will only see safer models if we introduce strong incentives for these firms to conduct adequate testing. Making companies plug these vulnerabilities before they deploy a new advanced AI model will require political courage and action from Congress, but the alternative is an increasingly unsafe future.

The Center for AI Policy (CAIP) has published a 2024 action plan and full proposed model legislation. We encourage you to read both for specific policy measures that would ensure safer AI.
