Domain
Natural Language Processing, Machine Learning, AI Safety
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation.While there has been some success at circumventing these measures-so-called "jailbreaks" against LLMs-these attacks have required significant human ingenuity and are brittle in practice.Attempts at automatic adversarial prompt generation have also achieved limited success.In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer).However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.Surprisingly, we find that the adversarial prompts generated by our approach are highly transferable, including to black-box, publicly released, production LLMs.Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B).When doing so, the resulting attack suffix induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others.Interestingly, the success rate of this attack transfer is much higher against the GPT-based models, potentially owing to the fact that Vicuna itself is trained on outputs from ChatGPT.In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information.Code is available at github.com/llm-attacks/llm-attacks.
The paper presents a novel method for generating universal and transferable adversarial attacks against aligned language models (LLMs). The research highlights the risks posed by adversarial prompts that can lead LLMs to produce objectionable content. The authors propose an automated approach to generating adversarial suffixes that, when added to user queries, maximize the likelihood of the LLM generating harmful responses. This method combines greedy and gradient-based search techniques to optimize these suffixes over multiple prompts and models, achieving high transferability of attacks across different LLMs, including proprietary models. The results show significant success rates in inducing harmful content, raising serious questions about the robustness of current alignment strategies in LLMs.
This paper employs the following methods:
- Greedy Coordinate Gradient
- Vicuna-7B
- Vicuna-13B
- GPT-3.5
- GPT-4
- PaLM-2
- Claude
- Bard
- LLaMA-2-Chat
- Pythia
- Falcon
- Guanaco
The following datasets were used in this research:
- Attack Success Rate (ASR)
- Cross-Entropy Loss
- 99 out of 100 harmful behaviors generated in Vicuna
- 88% success rate on exact matches with harmful strings
- 84% success rate attacking GPT-3.5 and GPT-4
- 66% success rate on PaLM-2
- 2.1% success rate on Claude
The authors identified the following limitations:
- Number of GPUs: None specified
- GPU Type: None specified
adversarial prompts
transferability
alignment robustness
prompt optimization
large language models