Domain
Natural Language Processing
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces the new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy on the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
The paper introduces RULER, a novel benchmark designed to evaluate long-context language models (LMs) by providing a more comprehensive assessment beyond traditional retrieval tasks. RULER expands on the needle-in-a-haystack (NIAH) test by introducing various task categories including Retrieval, Multi-hop Tracing, Aggregation, and Question Answering. The authors evaluate 17 long-context LMs across 13 tasks, highlighting that many models experience performance degradation as context length increases, despite claims of handling lengthy input. Highlighted findings include the significant difference between claimed and effective context lengths, the performance drops in multi-hop tasks, and the tendency of models to overly rely on parametric knowledge or produce incomplete responses at larger context sizes. RULER is positioned as an open-source tool to encourage further research into long-context LMs and their capabilities.
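To make the vanilla NIAH setup concrete, here is a minimal sketch of how such a test case can be constructed: a "needle" sentence is inserted at a chosen relative depth inside distractor text, and a retrieval question is appended. The function name, the "special magic number" phrasing, and the filler sentences are illustrative assumptions, not the benchmark's actual templates.

```python
def build_niah_prompt(needle_key, needle_value, haystack_sentences, depth=0.5):
    """Insert a needle sentence at a relative depth (0.0 = start, 1.0 = end)
    into a list of distractor sentences, then append a retrieval question."""
    needle = f"The special magic number for {needle_key} is {needle_value}."
    pos = int(len(haystack_sentences) * depth)
    context = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
    question = f"What is the special magic number for {needle_key}?"
    return " ".join(context) + "\n" + question

# Usage: a 100-sentence haystack with the needle placed 30% of the way in.
haystack = [f"Filler sentence number {i}." for i in range(100)]
prompt = build_niah_prompt("apples", "7421", haystack, depth=0.3)
```

Varying `depth` and the haystack length is what lets NIAH-style tests probe whether retrieval accuracy depends on where in the context the needle sits.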
This paper employs the following methods:
- RULER
- needle-in-a-haystack
- multi-hop tracing
- aggregation
- question answering
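Of the methods above, multi-hop tracing is the one that goes furthest beyond direct retrieval: the model must follow a chain of references scattered through the context rather than match a single key. A minimal sketch of such a task, assuming a variable-tracking formulation (a chain of assignments hidden among filler lines), might look like the following; the variable-naming scheme and filler text are illustrative assumptions.

```python
import random
import string

def build_tracing_prompt(num_hops, num_filler, seed=0):
    """Sketch of a multi-hop tracing example: a chain V1 = x, V2 = V1, ...
    is scattered (in order) among filler lines, and the model must name
    every variable transitively bound to the value x."""
    rng = random.Random(seed)
    names = ["".join(rng.choices(string.ascii_uppercase, k=5))
             for _ in range(num_hops)]
    value = rng.randint(10000, 99999)
    chain = [f"VAR {names[0]} = {value}"]
    chain += [f"VAR {names[i]} = VAR {names[i - 1]}" for i in range(1, num_hops)]
    lines = [f"Irrelevant filler sentence {i}." for i in range(num_filler)]
    # Insert chain statements at random positions while preserving their order.
    positions = sorted(rng.sample(range(len(lines) + 1), num_hops))
    for offset, (pos, stmt) in enumerate(zip(positions, chain)):
        lines.insert(pos + offset, stmt)
    question = f"Which variables are assigned the value {value}?"
    return "\n".join(lines) + "\n" + question, set(names)

prompt, answer = build_tracing_prompt(num_hops=4, num_filler=20)
```

Answering correctly requires resolving every hop of the chain, which is why performance on such tasks degrades faster with context length than single-needle retrieval.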
The paper reports the following key findings:
- Models exhibit large performance drops with increasing context length
- Only half of the models can maintain satisfactory performance at 32K tokens
- The top models (Gemini-1.5 and GPT-4) consistently outperform others
The following compute resources were used in this research:
- Number of GPUs: 8
- GPU Type: NVIDIA A100
long-context
benchmark
language models
RULER
retrieval
multi-hop tracing
aggregation
question answering