#1 Middle East & Africa Trusted Cybersecurity News & Magazine |

34 C
Dubai
Wednesday, July 23, 2025
HomeTopics 1AI & CybersecurityExploiting AI Safety: The "Bad Likert Judge" Attack on Large Language Models

Exploiting AI Safety: The “Bad Likert Judge” Attack on Large Language Models

Date:

Related stories

Cisco ISE RCE Crisis: Critical Unauthenticated Vulnerabilities Demand Immediate Patch

Cisco has disclosed three unauthenticated remote code execution (RCE)...

Phish and Chips: China‑Aligned Espionage Surge Targeting Taiwan Semiconductor Industry

Between March and June 2025, Proofpoint researchers tracked three distinct...

Emergency Alert: Critical Unauthenticated RCE Discovered in Cisco ISE/ISE‑PIC

Cisco has patched three severe unauthenticated remote code execution...
spot_imgspot_imgspot_imgspot_img

Large Language Models (LLMs) like ChatGPT, Bard, and others have transformed industries with their ability to generate coherent and contextually relevant text. However, their vulnerabilities, especially to malicious exploitation, are emerging as a significant cybersecurity challenge. A recent technique, dubbed “Bad Likert Judge,” demonstrates how the inherent evaluation capabilities of LLMs can be misused to bypass safety measures and generate harmful content. This article delves into the mechanics of this novel jailbreak method, its implications, and strategies to mitigate such threats.

The “Bad Likert Judge” Technique: An Overview

LLMs are equipped with safety protocols to prevent generating harmful or unethical content. The “Bad Likert Judge” attack exploits these safety features by leveraging multi-turn interactions and Likert scale evaluations, traditionally used to assess the severity or agreement level of responses.

How It Works

  1. Evaluator Setup: The attacker begins by prompting the LLM to act as an evaluator, assigning Likert scale scores (e.g., 1 to 5) based on the perceived harmfulness of generated responses.
  2. Content Generation: The model is then asked to create examples for each Likert scale value. The attacker focuses on responses corresponding to the highest scale values, often containing the most harmful or sensitive content.
  3. Refinements: Follow-up prompts are used to refine and extend these responses, ensuring the desired harmful information is included without triggering the model’s safety guardrails.

This iterative process can manipulate LLMs to generate content such as instructions for malware development, weapon creation, or even sensitive data leaks.

Research Findings

Recent experiments with “Bad Likert Judge” across six state-of-the-art LLMs showed an average attack success rate (ASR) improvement of over 60% compared to simpler jailbreak techniques. The findings underline the complexity of defending against such attacks, as they exploit both the model’s context memory and its reasoning capabilities.

Why Are LLMs Vulnerable?

  1. Long Context Windows: The ability of LLMs to recall extended conversations can be manipulated to build malicious context gradually.
  2. Attention Mechanisms: Attackers can misdirect the model’s focus toward benign parts of prompts while embedding harmful intent in subtle ways.
  3. Evaluation Capabilities: By acting as evaluators, LLMs unintentionally reveal their understanding of harmful content, enabling attackers to exploit this knowledge.
  4. Model Generalization: LLMs are trained on diverse datasets, which can include sensitive or harmful information that attackers aim to extract.

Real-World Implications

  1. Cybercrime Empowerment: Techniques like “Bad Likert Judge” could democratize cybercrime by enabling non-technical individuals to exploit AI systems.
  2. Malware Generation: From creating sophisticated phishing scams to designing ransomware, the misuse of LLMs poses severe threats to global cybersecurity.
  3. Data Privacy Risks: Jailbreaks can expose sensitive training data, such as personal information or proprietary details, leading to breaches.

10 Strategies to Mitigate “Bad Likert Judge” and Similar Threats

  1. Enhanced Context Filtering: Develop algorithms that analyze entire conversations to detect harmful intent, not just isolated prompts.
  2. Dynamic Safety Nets: Implement adaptive safety mechanisms that evolve based on detected exploitation patterns.
  3. Continuous Model Testing: Regularly stress-test LLMs with advanced jailbreak scenarios to identify and patch vulnerabilities.
  4. Explainable AI: Ensure LLMs provide explanations for their outputs, helping to identify unintended harmful logic.
  5. Multi-Layered Defenses: Combine safety features, including prompt filtering, response moderation, and adversarial training.
  6. Human Oversight: Employ human reviewers for sensitive use cases to ensure outputs align with ethical standards.
  7. Input Sanitization: Pre-process user inputs to detect and neutralize encoded or malicious prompts.
  8. Limiting Evaluation Roles: Restrict LLMs from acting as evaluators to reduce their susceptibility to such exploits.
  9. Transparency in AI Policies: Collaborate across industries to share findings and set unified safety standards.
  10. User Education: Educate users on responsible AI interactions to minimize accidental or deliberate misuse.

Conclusion

The “Bad Likert Judge” attack exemplifies the innovative yet concerning ways malicious actors are exploring to exploit LLMs. As AI technology continues to evolve, so too will the methods used to bypass its safeguards. Proactive defense strategies, collaborative efforts, and ongoing research are essential to ensure the secure and ethical use of LLMs in society.

Ouaissou DEMBELE
Ouaissou DEMBELEhttp://cybercory.com
Ouaissou DEMBELE is a seasoned cybersecurity expert with over 12 years of experience, specializing in purple teaming, governance, risk management, and compliance (GRC). He currently serves as Co-founder & Group CEO of Sainttly Group, a UAE-based conglomerate comprising Saintynet Cybersecurity, Cybercory.com, and CISO Paradise. At Saintynet, where he also acts as General Manager, Ouaissou leads the company’s cybersecurity vision—developing long-term strategies, ensuring regulatory compliance, and guiding clients in identifying and mitigating evolving threats. As CEO, his mission is to empower organizations with resilient, future-ready cybersecurity frameworks while driving innovation, trust, and strategic value across Sainttly Group’s divisions. Before founding Saintynet, Ouaissou held various consulting roles across the MEA region, collaborating with global organizations on security architecture, operations, and compliance programs. He is also an experienced speaker and trainer, frequently sharing his insights at industry conferences and professional events. Ouaissou holds and teaches multiple certifications, including CCNP Security, CEH, CISSP, CISM, CCSP, Security+, ITILv4, PMP, and ISO 27001, in addition to a Master’s Diploma in Network Security (2013). Through his deep expertise and leadership, Ouaissou plays a pivotal role at Cybercory.com as Editor-in-Chief, and remains a trusted advisor to organizations seeking to elevate their cybersecurity posture and resilience in an increasingly complex threat landscape.

Subscribe

- Never miss a story with notifications

- Gain full access to our premium content

- Browse free from up to 5 devices at once

Latest stories

spot_imgspot_imgspot_imgspot_img

LEAVE A REPLY

Please enter your comment!
Please enter your name here