Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Machine Learning
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables co-development of attacks and defenses. We open-source HarmBench at https://github.com/centerforaisafety/HarmBench.
This paper introduces HarmBench, a standardized evaluation framework for automated red teaming and robust refusal in large language models (LLMs). The authors note the promise of automated red teaming for identifying and mitigating the risks of malicious use of LLMs. They identify desirable properties previously unaccounted for in red teaming evaluations, including breadth, comparability, and robust metrics. HarmBench comprises 510 unique harmful behaviors spanning a broad range of categories of misconduct, and the paper reports large-scale evaluations comparing 18 red teaming methods across 33 target LLMs and defenses. The study reveals that no attack or defense is uniformly effective and that robustness does not correlate with model size. Additionally, a novel adversarial training method, R2D2, is proposed to improve the robustness of LLMs against adversarial prompts, demonstrating how HarmBench enables the co-development of attacks and defenses. HarmBench is open-sourced to encourage further research on the safety and security of LLMs.
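To make the evaluation protocol concrete, the following is a minimal sketch of the attack-method-by-target-model loop that such a framework runs, with attack success rate (ASR) as the metric: ASR is the fraction of behaviors for which an attack elicits a completion judged harmful. This is an illustrative outline rather than HarmBench's actual API; `generate_test_cases`, `generate`, and `is_harmful` are hypothetical placeholders for a red teaming method, a target LLM, and a harmfulness classifier.

```python
# Illustrative sketch of a red teaming evaluation loop (not HarmBench's actual API).
# `attack.generate_test_cases`, `model.generate`, and `classifier.is_harmful` are
# hypothetical placeholders for a red teaming method, a target LLM, and a
# harmfulness classifier, respectively.
from typing import Dict, List, Tuple


def evaluate(attacks: Dict[str, object],
             models: Dict[str, object],
             behaviors: List[str],
             classifier: object) -> Dict[Tuple[str, str], float]:
    """Return the attack success rate (ASR) for every (attack, model) pair."""
    results = {}
    for attack_name, attack in attacks.items():
        for model_name, model in models.items():
            successes = 0
            for behavior in behaviors:
                # The red teaming method turns a harmful behavior into a test case.
                prompt = attack.generate_test_cases(behavior, model)
                completion = model.generate(prompt)
                # The classifier judges whether the completion exhibits the behavior.
                if classifier.is_harmful(behavior, completion):
                    successes += 1
            # ASR = fraction of behaviors for which the attack elicited harmful output.
            results[(attack_name, model_name)] = successes / len(behaviors)
    return results
```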
This paper employs the following methods:
- Automated Red Teaming
- Adversarial Training
- Robust Refusal Dynamic Defense (R2D2) (illustrated by a hedged sketch after the model list below)
The following target models were evaluated:
- Llama 2 7B Chat
- Llama 2 13B Chat
- Llama 2 70B Chat
- Vicuna 7B
- Vicuna 13B
- Baichuan 2 7B
- Baichuan 2 13B
- Qwen 7B Chat
- Qwen 14B Chat
- Koala 7B
- Koala 13B
- Orca 2 7B
- Orca 2 13B
- Mistral 7B
- Zephyr 7B
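As a rough illustration of the adversarial training idea behind robust refusal defenses such as R2D2, the sketch below alternates between attacking the current model and fine-tuning it to refuse the resulting adversarial prompt. It assumes a Hugging Face causal LM and tokenizer, and it is a generic outline rather than the authors' exact procedure: `optimize_adversarial_suffix` is a hypothetical stand-in for an attack such as GCG, and the fixed refusal string is an illustrative assumption.

```python
# Rough sketch of adversarial training against adversarial prompts, in the general
# spirit of robust refusal methods such as R2D2, but NOT the authors' exact procedure.
# Assumes a Hugging Face causal LM and tokenizer. `optimize_adversarial_suffix` is a
# hypothetical stand-in for an attack like GCG; the refusal string is an assumption.
def adversarial_training_step(model, tokenizer, optimizer, behavior,
                              optimize_adversarial_suffix,
                              refusal="Sorry, I can't help with that."):
    # 1) Attack the current model to obtain an adversarial prompt for the behavior.
    suffix = optimize_adversarial_suffix(model, tokenizer, behavior)
    prompt = behavior + " " + suffix

    # 2) Fine-tune the model to answer the adversarial prompt with a refusal.
    inputs = tokenizer(prompt + " " + refusal, return_tensors="pt").to(model.device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    labels = inputs.input_ids.clone()
    labels[:, :prompt_len] = -100  # supervise only the refusal tokens

    loss = model(**inputs, labels=labels).loss  # causal LM loss on the refusal
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```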
The following metrics were used in this research:
- Attack Success Rate (ASR)
The paper reports the following key findings:
- HarmBench evaluates 18 red teaming methods against 33 target LLMs and defenses.
- No attack or defense is uniformly effective across models.
- Robustness of LLMs does not correlate with model size.
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: A100
Keywords
large language models
red teaming
adversarial attacks
evaluation framework
safety
robustness
adversarial training