Unit 42, a cybersecurity-focused unit of Palo Alto Networks, has warned developers of text-generation large language models (LLMs) of a potential threat that could bypass guardrails designed to prevent LLMs from delivering harmful and malicious requests.
Dubbed “Bad Likert Judge,” this technique asks an LLM to score the harmfulness of a given response using the Likert scale — which measures a respondent’s agreement or disagreement with a statement — and then asks it to generate responses that align with the scales, including an example that could contain harmful content, Unit 42 said in research posted Tuesday (Dec. 31).
“We have tested this technique across a broad range of categories against six state-of-the-art text-generation LLMs,” the article said. “Our results reveal that this technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average.”
The research aims to help defenders prepare for potential attacks using this technique, according to the article.
It did not evaluate every model, and the article’s authors anonymized the tested models it mentions in order to avoid creating false impressions about specific providers, per the article.
“It is important to note that this jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases,” the article said. “We believe most AI [artificial intelligence] models are safe and secure when operated responsibly and with caution.”
Hackers have begun offering “jailbreak-as-a-service” that uses prompts to trick commercial AI chatbots into generating content they typically prohibit, such as instructions for illegal activities or explicit material, cybersecurity firm Trend Micro said in May.
Organizations looking to get ahead of this evolving threat should fortify their cyberdefenses now, in part by proactively strengthening security postures and monitoring criminal forums to help prepare for worst-case scenarios involving AI, the firm said at the time.
Unit 42 Senior Consulting Director Daniel Sergile told lawmakers during an April hearing: “AI enables [cybercriminals] to move laterally with increased speed and identify an organization’s critical assets for exfiltration and extortion. Bad actors can now execute numerous attacks simultaneously against one company, leveraging multiple vulnerabilities.”