← ML Research Wiki / 2307.15043

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou Carnegie Mellon University Center for AI Safety 3 Google DeepMind, Zifan Wang Center for AI Safety 3 Google DeepMind, Nicholas Carlini, Milad Nasr, J Zico Kolter Carnegie Mellon University Bosch Center for AI, Matt Fredrikson Carnegie Mellon University (2023)

Paper Information

arXiv ID

2307.15043

Venue

arXiv.org

Domain

Natural Language Processing, Machine Learning, AI Safety

SOTA Claim

Yes

Code

Available

Reproducibility

9/10

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation.While there has been some success at circumventing these measures-so-called "jailbreaks" against LLMs-these attacks have required significant human ingenuity and are brittle in practice.Attempts at automatic adversarial prompt generation have also achieved limited success.In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer).However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.Surprisingly, we find that the adversarial prompts generated by our approach are highly transferable, including to black-box, publicly released, production LLMs.Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B).When doing so, the resulting attack suffix induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others.Interestingly, the success rate of this attack transfer is much higher against the GPT-based models, potentially owing to the fact that Vicuna itself is trained on outputs from ChatGPT.In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information.Code is available at github.com/llm-attacks/llm-attacks.

Summary

The paper presents a novel method for generating universal and transferable adversarial attacks against aligned language models (LLMs). The research highlights the risks posed by adversarial prompts that can lead LLMs to produce objectionable content. The authors propose an automated approach to generating adversarial suffixes that, when added to user queries, maximize the likelihood of the LLM generating harmful responses. This method combines greedy and gradient-based search techniques to optimize these suffixes over multiple prompts and models, achieving high transferability of attacks across different LLMs, including proprietary models. The results show significant success rates in inducing harmful content, raising serious questions about the robustness of current alignment strategies in LLMs.

Methods

This paper employs the following methods:

Greedy Coordinate Gradient

Models Used

Vicuna-7B
Vicuna-13B
GPT-3.5
GPT-4
PaLM-2
Claude
Bard
LLaMA-2-Chat
Pythia
Falcon
Guanaco

Datasets

The following datasets were used in this research:

AdvBench

Evaluation Metrics

Attack Success Rate (ASR)
Cross-Entropy Loss

Results

99 out of 100 harmful behaviors generated in Vicuna
88% success rate on exact matches with harmful strings
84% success rate attacking GPT-3.5 and GPT-4
66% success rate on PaLM-2
2.1% success rate on Claude

Limitations

The authors identified the following limitations:

Not specified

Technical Requirements

Number of GPUs: None specified
GPU Type: None specified

Keywords

adversarial prompts transferability alignment robustness prompt optimization large language models

Papers Using Similar Methods

External Resources

Funding: Not specified
References: 49
Influential Citations: 333

Universal and Transferable Adversarial Attacks on Aligned Language Models

Abstract edit

Summary

Methods add

Models Used add

Datasets add

Evaluation Metrics add

Results add

Limitations add

Technical Requirements edit

Keywords add

Related Papers