HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks (2024)

Paper Information
arXiv ID
2402.04249
Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Machine Learning
Code
Available
Reproducibility
7/10

Abstract

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables co-development of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

Summary

This paper introduces HarmBench, a standardized evaluation framework for automated red teaming and robust refusal in large language models (LLMs). The authors note the promise of automated red teaming for identifying and mitigating the risks of malicious LLM use, and they identify desirable properties previously unaccounted for in red teaming evaluations, including breadth, comparability, and robust metrics. HarmBench comprises 510 unique harmful behaviors spanning a wide range of misconduct categories, and the paper reports a large-scale evaluation comparing 18 red teaming methods across 33 target LLMs and defenses. The study finds that no attack or defense is uniformly effective and that robustness does not correlate with model size. In addition, the authors propose R2D2, a novel adversarial training method that improves LLM robustness against adversarial prompts, demonstrating how HarmBench enables the co-development of attacks and defenses. HarmBench is open-sourced to encourage further research on securing LLMs.

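The evaluation described above can be pictured as a three-stage pipeline: an attack turns each behavior into adversarial test cases, the target LLM produces completions for those test cases, and a classifier judges whether each completion exhibits the harmful behavior. The following is a minimal sketch of that loop; the `attack`, `target_llm`, and `classifier` callables are hypothetical placeholders, not HarmBench's actual API.

```python
from typing import Callable, Iterable, List

def evaluate_attack(
    behaviors: Iterable[str],
    attack: Callable[[str], List[str]],      # behavior -> adversarial test cases
    target_llm: Callable[[str], str],        # test case -> model completion
    classifier: Callable[[str, str], bool],  # (behavior, completion) -> harmful?
) -> float:
    """Return the attack success rate (ASR) over a set of behaviors."""
    successes, total = 0, 0
    for behavior in behaviors:
        for test_case in attack(behavior):
            completion = target_llm(test_case)
            successes += int(classifier(behavior, completion))
            total += 1
    return successes / total if total else 0.0
```
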
Methods

This paper employs the following methods:

  • Automated Red Teaming
  • Adversarial Training
  • Robust Refusal Dynamic Defense (R2D2), an adversarial training approach (see the sketch below)

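As a rough illustration of the robust refusal idea, the sketch below combines a "toward" term that raises the likelihood of a refusal to an adversarial prompt with an "away" term that lowers the likelihood of the harmful target completion. This is a simplified sketch under several assumptions, not the authors' R2D2 implementation: the model/tokenizer interface follows Hugging Face conventions, the pool of adversarially optimized prompts and the benign fine-tuning loss are omitted, and the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def _target_logits_and_ids(model, tokenizer, prompt, target):
    """Logits at the positions that predict the target tokens, plus the target token ids."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Position i predicts token i + 1, so slice from the final prompt position.
    return logits[:, prompt_ids.shape[1] - 1 : -1, :], target_ids

def robust_refusal_loss(model, tokenizer, adv_prompt, harmful_target, refusal):
    # "Toward" term: increase the probability of the refusal given the adversarial prompt.
    ref_logits, ref_ids = _target_logits_and_ids(model, tokenizer, adv_prompt, refusal)
    toward = F.cross_entropy(ref_logits.reshape(-1, ref_logits.size(-1)), ref_ids.reshape(-1))

    # "Away" term: decrease the probability of the harmful target completion.
    # A bounded -log(1 - p) penalty is used here purely for illustration.
    harm_logits, harm_ids = _target_logits_and_ids(model, tokenizer, adv_prompt, harmful_target)
    p_target = F.softmax(harm_logits, dim=-1).gather(-1, harm_ids.unsqueeze(-1)).squeeze(-1)
    away = -torch.log1p(-p_target.clamp(max=1.0 - 1e-6)).mean()

    return toward + away
```
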
Models Used

  • Llama 2 7B Chat
  • Llama 2 13B Chat
  • Llama 2 70B Chat
  • Vicuna 7B
  • Vicuna 13B
  • Baichuan 2 7B
  • Baichuan 2 13B
  • Qwen 7B Chat
  • Qwen 14B Chat
  • Koala 7B
  • Koala 13B
  • Orca 2 7B
  • Orca 2 13B
  • Mistral 7B
  • Zephyr 7B

Datasets

The following datasets were used in this research:

  • HarmBench

Evaluation Metrics

  • Attack Success Rate (ASR), defined informally below

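Informally, and abstracting away the details of the paper's exact evaluation protocol, ASR can be written as the fraction of generated test cases whose completions the classifier judges to be harmful; the notation below (behaviors b_i, completions c_i) is ours, not the paper's.

```latex
\mathrm{ASR} \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\!\left[\, \mathrm{classifier}(b_i, c_i) = \text{harmful} \,\right]
```

where b_i is the behavior targeted by the i-th test case, c_i is the target model's completion for that test case, and N is the total number of test cases.
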
Results

  • Evaluated 18 red teaming methods against 33 target LLMs and defenses using HarmBench.
  • Demonstrated that no single attack or defense is uniformly effective across models.
  • Showed that LLM robustness is independent of model size.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A100

Keywords

large language models, red teaming, adversarial attacks, evaluation framework, safety, robustness, adversarial training

External Resources

  • Code repository: https://github.com/centerforaisafety/HarmBench