HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks (2024)

Paper Information
arXiv ID
2402.04249
Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Machine Learning
Code
Available
Reproducibility
7/10

Abstract

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables co-development of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

Summary

This paper introduces HarmBench, a standardized evaluation framework for automated red teaming and robust refusal in large language models (LLMs). The authors note the promise of automated red teaming for identifying and mitigating the risks of malicious LLM use, and they identify desirable properties previously unaccounted for in red teaming evaluations, including breadth, comparability, and robust metrics. HarmBench comprises 510 unique harmful behaviors spanning a wide range of misconduct categories, and the paper reports a large-scale evaluation comparing 18 red teaming methods across 33 target LLMs and defenses. The study finds that no attack or defense is uniformly effective and that robustness does not correlate with model size. In addition, the authors propose R2D2, a novel adversarial training method that improves LLM robustness against adversarial prompts, demonstrating how HarmBench enables the co-development of attacks and defenses. HarmBench is open-sourced to encourage further research on securing LLMs.

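The evaluation described above can be pictured as a three-stage pipeline: an attack turns each behavior into adversarial test cases, the target LLM produces completions for those test cases, and a classifier judges whether each completion exhibits the harmful behavior. The following is a minimal sketch of that loop; the `attack`, `target_llm`, and `classifier` callables are hypothetical placeholders, not HarmBench's actual API.

```python
from typing import Callable, Iterable, List

def evaluate_attack(
    behaviors: Iterable[str],
    attack: Callable[[str], List[str]],      # behavior -> adversarial test cases
    target_llm: Callable[[str], str],        # test case -> model completion
    classifier: Callable[[str, str], bool],  # (behavior, completion) -> harmful?
) -> float:
    """Return the attack success rate (ASR) over a set of behaviors."""
    successes, total = 0, 0
    for behavior in behaviors:
        for test_case in attack(behavior):
            completion = target_llm(test_case)
            successes += int(classifier(behavior, completion))
            total += 1
    return successes / total if total else 0.0
```
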
Methods

This paper employs the following methods:

  • Automated Red Teaming
  • Adversarial Training
  • Robust Refusal Dynamic Defense (R2D2), an adversarial training approach (see the sketch below)

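As a rough illustration of the robust refusal idea, the sketch below combines a "toward" term that raises the likelihood of a refusal to an adversarial prompt with an "away" term that lowers the likelihood of the harmful target completion. This is a simplified sketch under several assumptions, not the authors' R2D2 implementation: the model/tokenizer interface follows Hugging Face conventions, the pool of adversarially optimized prompts and the benign fine-tuning loss are omitted, and the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def _target_logits_and_ids(model, tokenizer, prompt, target):
    """Logits at the positions that predict the target tokens, plus the target token ids."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Position i predicts token i + 1, so slice from the final prompt position.
    return logits[:, prompt_ids.shape[1] - 1 : -1, :], target_ids

def robust_refusal_loss(model, tokenizer, adv_prompt, harmful_target, refusal):
    # "Toward" term: increase the probability of the refusal given the adversarial prompt.
    ref_logits, ref_ids = _target_logits_and_ids(model, tokenizer, adv_prompt, refusal)
    toward = F.cross_entropy(ref_logits.reshape(-1, ref_logits.size(-1)), ref_ids.reshape(-1))

    # "Away" term: decrease the probability of the harmful target completion.
    # A bounded -log(1 - p) penalty is used here purely for illustration.
    harm_logits, harm_ids = _target_logits_and_ids(model, tokenizer, adv_prompt, harmful_target)
    p_target = F.softmax(harm_logits, dim=-1).gather(-1, harm_ids.unsqueeze(-1)).squeeze(-1)
    away = -torch.log1p(-p_target.clamp(max=1.0 - 1e-6)).mean()

    return toward + away
```
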
Models Used

  • Llama 2 7B Chat
  • Llama 2 13B Chat
  • Llama 2 70B Chat
  • Vicuna 7B
  • Vicuna 13B
  • Baichuan 2 7B
  • Baichuan 2 13B
  • Qwen 7B Chat
  • Qwen 14B Chat
  • Koala 7B
  • Koala 13B
  • Orca 2 7B
  • Orca 2 13B
  • Mistral 7B
  • Zephyr 7B

Datasets

The following datasets were used in this research:

  • HarmBench

Evaluation Metrics

  • Attack Success Rate (ASR), defined informally below

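Informally, and abstracting away the details of the paper's exact evaluation protocol, ASR can be written as the fraction of generated test cases whose completions the classifier judges to be harmful; the notation below (behaviors b_i, completions c_i) is ours, not the paper's.

```latex
\mathrm{ASR} \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\!\left[\, \mathrm{classifier}(b_i, c_i) = \text{harmful} \,\right]
```

where b_i is the behavior targeted by the i-th test case, c_i is the target model's completion for that test case, and N is the total number of test cases.
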
Results

  • Evaluated 18 red teaming methods against 33 target LLMs and defenses using HarmBench.
  • Demonstrated that no single attack or defense is uniformly effective across models.
  • Showed that LLM robustness is independent of model size.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A100

Keywords

large language models, red teaming, adversarial attacks, evaluation framework, safety, robustness, adversarial training

External Resources

  • Code repository: https://github.com/centerforaisafety/HarmBench