← ML Research Wiki / 2401.06373

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs This paper contains jailbreak contents that can be offensive in nature

Yi Zeng [email protected], Hongpeng Lin, Jingwen Zhang [email protected], Diyi Yang [email protected], Ruoxi Jia [email protected], Weiyan Shi [email protected], Virginia Tech, Renmin University of China, UC Davis, Stanford University, Virginia Tech, Stanford University (2024)

Paper Information

arXiv ID

2401.06373

Venue

Annual Meeting of the Association for Computational Linguistics

Domain

artificial intelligence, natural language processing, social science, cybersecurity

SOTA Claim

Yes

Reproducibility

8/10

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts.As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions.This paper introduces a new perspective on jailbreaking LLMs as human-like communicators to explore this overlooked intersection between everyday language interaction and AI safety.Specifically, we study how to persuade LLMs to jailbreak them.First, we propose a persuasion taxonomy derived from decades of social science research.Then we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs.Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks.On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs 1 .* Lead authors.Corresponding Y. Zeng, W. Shi, R. Jia † Co-supervised the project, listed alphabetically. 1We have informed Meta and OpenAI of our findings.For safety concerns, we only publicly release our persuasion taxonomy at https://github.com/CHATS-lab/persuasive_jailbreaker.Researchers can apply for the jailbreak data upon review.

Summary

This paper presents a new perspective on the risks associated with jailbreaking large language models (LLMs) by treating them as human-like communicators susceptible to persuasion. It discusses a taxonomy of persuasive techniques derived from social science to create persuasive adversarial prompts (PAP) that can effectively jailbreak various models such as Llama-2, GPT-3.5, and GPT-4, achieving over 92% success rates in trials. The authors advocate for a comprehensive reassessment of existing defenses against these persuasive human-like communications, highlighting significant shortcomings in current mitigation strategies. Moreover, it calls for further research into the vulnerabilities tied to human-like interactions with LLMs and proposes adaptive defenses that could mitigate some of the identified risks. The work aims to bridge social science and AI safety to enhance the understanding of potential risks as everyday users interact with advanced LLMs.

Methods

This paper employs the following methods:

Persuasion Taxonomy
Persuasive Paraphraser
Adaptive Defenses

Models Used

Llama-2
GPT-3.5
GPT-4

Datasets

The following datasets were used in this research:

None specified

Evaluation Metrics

Attack Success Rate

Results

Achieved over 92% success rate for PAPs against Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials
Identified weaknesses in existing defense mechanisms against PAPs
Found interplay between persuasion techniques and risk categories

Limitations

The authors identified the following limitations:

Not specified

Technical Requirements

Number of GPUs: None specified
GPU Type: None specified

Keywords

LLMs jailbreaking persuasion tactics adversarial prompts AI safety social influence human-like communication

External Resources

Funding: Not specified
References: 80
Influential Citations: 26

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs This paper contains jailbreak contents that can be offensive in nature

Abstract edit

Summary

Methods add

Models Used add

Datasets add

Evaluation Metrics add

Results add

Limitations add

Technical Requirements edit

Keywords add

Related Papers