Domain
Natural Language Processing
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces the new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy on the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
The paper introduces RULER, a novel benchmark designed to evaluate long-context language models (LMs) by providing a more comprehensive assessment beyond traditional retrieval tasks. RULER expands on the needle-in-a-haystack (NIAH) test by introducing various task categories including Retrieval, Multi-hop Tracing, Aggregation, and Question Answering. The authors evaluate 17 long-context LMs across 13 tasks, highlighting that many models experience performance degradation as context length increases, despite claims of handling lengthy input. Highlighted findings include the significant difference between claimed and effective context lengths, the performance drops in multi-hop tasks, and the tendency of models to overly rely on parametric knowledge or produce incomplete responses at larger context sizes. RULER is positioned as an open-source tool to encourage further research into long-context LMs and their capabilities.
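To make the vanilla NIAH setup concrete, here is a minimal sketch of how such a test case can be constructed: a "needle" sentence is inserted at a chosen relative depth inside distractor text, and a retrieval question is appended. The function name, the "special magic number" phrasing, and the filler sentences are illustrative assumptions, not the benchmark's actual templates.

```python
def build_niah_prompt(needle_key, needle_value, haystack_sentences, depth=0.5):
    """Insert a needle sentence at a relative depth (0.0 = start, 1.0 = end)
    into a list of distractor sentences, then append a retrieval question."""
    needle = f"The special magic number for {needle_key} is {needle_value}."
    pos = int(len(haystack_sentences) * depth)
    context = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
    question = f"What is the special magic number for {needle_key}?"
    return " ".join(context) + "\n" + question

# Usage: a 100-sentence haystack with the needle placed 30% of the way in.
haystack = [f"Filler sentence number {i}." for i in range(100)]
prompt = build_niah_prompt("apples", "7421", haystack, depth=0.3)
```

Varying `depth` and the haystack length is what lets NIAH-style tests probe whether retrieval accuracy depends on where in the context the needle sits.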
This paper employs the following methods:
- RULER
- needle-in-a-haystack
- multi-hop tracing
- aggregation
- question answering
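Of the methods above, multi-hop tracing is the one that goes furthest beyond direct retrieval: the model must follow a chain of references scattered through the context rather than match a single key. A minimal sketch of such a task, assuming a variable-tracking formulation (a chain of assignments hidden among filler lines), might look like the following; the variable-naming scheme and filler text are illustrative assumptions.

```python
import random
import string

def build_tracing_prompt(num_hops, num_filler, seed=0):
    """Sketch of a multi-hop tracing example: a chain V1 = x, V2 = V1, ...
    is scattered (in order) among filler lines, and the model must name
    every variable transitively bound to the value x."""
    rng = random.Random(seed)
    names = ["".join(rng.choices(string.ascii_uppercase, k=5))
             for _ in range(num_hops)]
    value = rng.randint(10000, 99999)
    chain = [f"VAR {names[0]} = {value}"]
    chain += [f"VAR {names[i]} = VAR {names[i - 1]}" for i in range(1, num_hops)]
    lines = [f"Irrelevant filler sentence {i}." for i in range(num_filler)]
    # Insert chain statements at random positions while preserving their order.
    positions = sorted(rng.sample(range(len(lines) + 1), num_hops))
    for offset, (pos, stmt) in enumerate(zip(positions, chain)):
        lines.insert(pos + offset, stmt)
    question = f"Which variables are assigned the value {value}?"
    return "\n".join(lines) + "\n" + question, set(names)

prompt, answer = build_tracing_prompt(num_hops=4, num_filler=20)
```

Answering correctly requires resolving every hop of the chain, which is why performance on such tasks degrades faster with context length than single-needle retrieval.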
The paper reports the following key findings:
- Models exhibit large performance drops with increasing context length
- Only half of the models can maintain satisfactory performance at 32K tokens
- The top models (Gemini-1.5 and GPT-4) consistently outperform others
The following compute resources were used in this research:
- Number of GPUs: 8
- GPU Type: NVIDIA A100
long-context
benchmark
language models
RULER
retrieval
multi-hop tracing
aggregation
question answering