
Lost in the Middle: How Language Models Use Long Contexts

Nelson F. Liu (Stanford University, USA), Kevin Lin (University of California, Berkeley, USA), John Hewitt (Stanford University, USA), Ashwin Paranjape (Samaya AI), Michele Bevilacqua (Samaya AI, UK), Fabio Petroni (Samaya AI, UK), Percy Liang (Stanford University, USA) (2023)

Paper Information
arXiv ID: 2307.03172
Venue: Transactions of the Association for Computational Linguistics
Domain: natural language processing
Reproducibility: 7/10

Abstract

While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
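
The key-value retrieval task mentioned above is a synthetic probe: the model receives a JSON object of random key-value pairs and must return the value associated with one specified key. Below is a minimal sketch of how such inputs can be constructed, assuming the paper's described setup of UUID keys and values; the prompt wording approximates the paper's description rather than reproducing the authors' exact template.

```python
import json
import uuid

def make_kv_prompt(num_pairs: int, gold_position: int):
    """Build a synthetic key-value retrieval prompt: a JSON object of random
    UUID keys and values, querying the key at index `gold_position`."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(num_pairs)]
    gold_key, gold_value = pairs[gold_position]
    kv_json = json.dumps(dict(pairs), indent=1)  # dict preserves insertion order
    prompt = (
        "Extract the value corresponding to the specified key in the "
        "JSON object below.\n\n"
        f"JSON data:\n{kv_json}\n\n"
        f'Key: "{gold_key}"\n'
        "Corresponding value:"
    )
    return prompt, gold_value

# Sweeping gold_position over 0..num_pairs-1 reproduces the positional analysis.
prompt, expected_value = make_kv_prompt(num_pairs=75, gold_position=37)
```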

Summary

In "Lost in the Middle: How Language Models Use Long Contexts," the authors analyze how effectively language models use long input contexts. Experiments on multi-document question answering and key-value retrieval show that performance degrades significantly when the relevant information sits in the middle of the input context: performance is highest when that information appears at the beginning (a primacy bias) or the end (a recency bias) of the context. The study examines factors that might explain this behavior, including model architecture (decoder-only versus encoder-decoder), query-aware contextualization, and instruction fine-tuning, and finds that extended-context models are not necessarily better at using long contexts than their shorter-context counterparts. A case study on open-domain question answering further shows that simply retrieving more documents does not guarantee better answers: reader performance saturates before retriever recall does. The paper concludes by proposing new evaluation protocols for future long-context models and releasing code and evaluation data for further research.
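
The paper's central manipulation can be pictured concretely: hold the set of input documents fixed and vary only the index at which the answer-bearing document appears. A minimal sketch of building such prompts (the instruction text approximates the paper's description and is not the authors' released code):

```python
def build_qa_prompt(question: str, distractors: list[str], gold_doc: str,
                    gold_position: int) -> str:
    """Insert the answer-bearing document at `gold_position` among the
    distractors, then serialize everything into a single prompt."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    context = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        "Write a high-quality answer for the given question using only the "
        f"provided search results.\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Sweeping gold_position from the first slot to the last while keeping everything else identical is what produces the U-shaped accuracy curves the paper reports.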

Methods

This paper employs the following methods:

  • Transformer
  • query-aware contextualization (sketched after this list)
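
Query-aware contextualization places the query both before and after the documents, so that even a decoder-only model (which cannot attend to future tokens) processes the context with the question already in scope. A minimal sketch, with illustrative prompt wording:

```python
def query_aware_prompt(question: str, context: str) -> str:
    """Repeat the question on both sides of the long context so a decoder-only
    model contextualizes every document token with the query visible."""
    return f"Question: {question}\n\n{context}\n\nQuestion: {question}\nAnswer:"
```

The paper reports that this helps dramatically on key-value retrieval but changes the multi-document QA trends only marginally.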

Models Used

  • MPT-30B-Instruct
  • LongChat-13B (16K)
  • GPT-3.5-Turbo
  • Claude-1.3

Datasets

The following datasets were used in this research:

  • NaturalQuestions-Open

Evaluation Metrics

  • Accuracy (whether any gold answer appears in the model output; see the sketch below)
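
For the QA tasks, accuracy is judged by whether any annotated gold answer appears in the model's output. A minimal sketch of that substring-match check; the normalization follows common open-domain QA practice and is a simplifying assumption here:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(prediction: str, gold_answers: list[str]) -> bool:
    """True if any normalized gold answer is a substring of the normalized
    prediction."""
    pred = normalize(prediction)
    return any(normalize(answer) in pred for answer in gold_answers)
```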

Results

  • Performance degrades significantly when relevant information is in the middle of the input context
  • High performance is observed when relevant information is at the start or end (primacy and recency bias)
  • Extended-context models are no better than their regular-context counterparts at using the information in their input context
  • In the open-domain QA case study, reader performance saturates before retriever recall does

Limitations

The authors identified the following limitations:

  • Current language models do not robustly access and utilize information in long input contexts
  • Longer input contexts give the model more content to reason over, which can itself reduce accuracy
  • Limited exploration of decoding methods beyond greedy decoding (see the sketch after this list)
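
On the last point: all results are reported with greedy decoding. A minimal sketch of greedy generation via the Hugging Face transformers API (the model name is a small placeholder, not one of the paper's evaluated models):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates e.g. MPT-30B-Instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Question: Where was the answer placed?\nAnswer:",
                   return_tensors="pt")
# do_sample=False selects the argmax token at each step (greedy decoding);
# alternatives such as nucleus sampling (do_sample=True, top_p=0.9) were not explored.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```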

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

long input context, transformers, model architecture, question answering, retrieval
