← ML Research Wiki / 2305.14325

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du [email protected] MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM, Shuang Li [email protected] MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM, Antonio Torralba [email protected] MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM, Joshua B Tenenbaum MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM, Igor Mordatch [email protected] MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM, Google Brain MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL BCS CBMM (2023)

Paper Information
  • arXiv ID: 2305.14325
  • Venue: International Conference on Machine Learning
  • Domain: natural language processing
  • Reproducibility: 4/10

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through prompting tools, including verification, self-consistency, and intermediate scratchpads. In this paper, we present a complementary approach to improving language responses in which multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing the fallacious answers and hallucinations that contemporary models are prone to. Our approach may be applied directly to existing black-box models and uses identical procedures and prompts for all tasks we investigate. Overall, our findings suggest that such a "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding. Project website at https

[Figure: accuracy of a single model versus multiagent debate on Biographies, MMLU, Chess Move Validity, Arithmetic, Grade School Math, and Chess Move Optimality, alongside an example debate round. User input: a treasure chest holds 175 diamonds, 35 fewer rubies than diamonds, and twice as many emeralds as rubies; how many gems are in the chest? Agent 2 correctly computes 140 rubies, 280 emeralds, and 595 gems in total, while Agent 1's initial trial-and-error reasoning goes astray.]
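The gem-counting arithmetic in the example above can be verified directly; a minimal check in Python:

```python
diamonds = 175
rubies = diamonds - 35      # 35 fewer rubies than diamonds
emeralds = 2 * rubies       # twice as many emeralds as rubies
total = diamonds + rubies + emeralds
print(total)  # 595
```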

Summary

This paper introduces a multiagent debate approach to enhance the factual accuracy and reasoning abilities of large language models (LLMs). Traditional methods for improving LLMs often rely on single-instance techniques such as self-consistency and prompting strategies. In contrast, the proposed method has multiple LLM instances generate answers, critique each other's responses, and iterate this process over several rounds. The method improves mathematical and strategic reasoning while also reducing the factual inaccuracies and hallucinations commonly associated with current models. The authors provide comprehensive evaluations across various tasks, including a new benchmark specifically targeting the factual accuracy of biographies, and they highlight that their method improves the performance of existing black-box models without requiring access to internal model information. The results illustrate that debate among multiple LLM instances leads to consensus, increasing the correctness of answers through collaborative reasoning and criticism of uncertain facts.
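The generate-critique-revise loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `query` is a hypothetical wrapper around a black-box LLM API, stubbed here so the sketch runs standalone.

```python
def query(prompt: str) -> str:
    # Stub standing in for a real black-box LLM call (e.g. a chat API).
    return f"answer to: {prompt[:30]}"

def multiagent_debate(question: str, n_agents: int = 3, n_rounds: int = 2):
    # Round 1: each agent answers the question independently.
    answers = [query(question) for _ in range(n_agents)]
    # Later rounds: each agent sees the other agents' answers and revises.
    for _ in range(n_rounds - 1):
        revised = []
        for i in range(n_agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"{question}\n\nOther agents answered:\n{others}\n"
                "Using these responses as additional information, "
                "give an updated answer."
            )
            revised.append(query(prompt))
        answers = revised
    return answers  # a single final answer is typically taken by majority vote
```

Because the loop only exchanges prompts and completions, it applies to any black-box model, matching the paper's claim that no access to internal model information is needed.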

Methods

This paper employs the following methods:

  • Multiagent Debate
  • Self-Reflection
  • Chain of Thought
  • Zero-Shot Learning

Datasets

The following datasets were used in this research:

  • GSM8K
  • MMLU

Evaluation Metrics

  • Accuracy
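Accuracy here is exact-match accuracy over final answers, with the debate's consensus answer commonly extracted by majority vote across agents; a minimal sketch (the helper names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    # Take the most common final answer across agents as the consensus.
    return Counter(answers).most_common(1)[0][0]

def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Example: three agents' final answers for each of two questions.
preds = [majority_vote(["595", "595", "225"]),
         majority_vote(["42", "41", "41"])]
print(accuracy(preds, ["595", "41"]))  # 1.0
```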

Results

  • Improved mathematical reasoning
  • Enhanced factual accuracy
  • Reduced hallucinations in generated content

Limitations

The authors identified the following limitations:

  • Higher computational cost due to multiple model instances
  • Longer debates produce contexts that models may struggle to process fully
  • Debate answers may converge on incorrect responses despite consensus

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

factuality, reasoning, multiagent debate, language models
