
Large Language Models: A Survey

Shervin Minaee (Applied Scientist, Amazon Inc), Tomas Mikolov (Senior Researcher, CIIRC CTU), Narjes Nikzad (Cologne University of Applied Sciences), Meysam Chenaghlu, Richard Socher, Xavier Amatriain (VP of Product, AI and Compute Enablement, Google Inc), Jianfeng Gao (VP of Deep Learning Group, Microsoft Research) (2024)

Paper Information
arXiv ID: 2402.06196
Venue: arXiv.org
Domain: Artificial Intelligence / Natural Language Processing
SOTA Claim: Yes
Reproducibility: 5/10

Abstract

Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of the model's parameters on massive amounts of text data, as predicted by scaling laws [1], [2]. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
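
The scaling laws cited in the abstract ([1], [2]) predict model loss as a function of parameter count and training-token count. As a hedged illustration (not code from the survey), the sketch below implements the Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β from Hoffmann et al. (2022); the coefficient values are illustrative placeholders of roughly the reported magnitude, not fitted constants.

```python
# Illustrative sketch of a parametric scaling law, L(N, D) = E + A/N**alpha + B/D**beta.
# Coefficients below are placeholder values for demonstration, not the survey's numbers.
def scaling_law_loss(N: float, D: float,
                     E: float = 1.7, A: float = 400.0, B: float = 410.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted loss for a model with N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# A 70B-parameter model trained on 1.4T tokens vs. the same model on 0.3T tokens:
print(scaling_law_loss(70e9, 1.4e12))  # lower predicted loss (more data)
print(scaling_law_loss(70e9, 0.3e12))  # higher predicted loss (less data)
```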

Summary

This paper presents a survey of Large Language Models (LLMs) since the advent of models like ChatGPT. It traces the evolution of language models from statistical language models to the current generation, including the GPT, LLaMA, and PaLM families. The authors review methods for building and augmenting LLMs, along with the datasets and metrics used for evaluation. They highlight the emergent abilities of LLMs, their architectures, and the open challenges in the field, as well as future research directions for making LLMs more efficient, capable, and reliable.

Methods

This paper covers the following methods (a minimal sketch of the attention and MoE computations follows the list):

  • Transformer
  • RNN
  • LSTM
  • GRU
  • Mixture of Experts (MoE)
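
As a hedged illustration of two of the listed methods (not code from the survey), the sketch below shows scaled dot-product attention, the core operation of the Transformer, and a toy top-k gating function of the kind used in Mixture-of-Experts layers. Shapes and names are illustrative assumptions.

```python
# Minimal sketch, not from the survey: Transformer attention and MoE routing.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K, V: (seq_len, d) arrays; returns (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # similarity of every query to every key
    if causal:                                   # decoder-style mask: no attention to future tokens
        scores = np.where(np.tril(np.ones_like(scores, dtype=bool)), scores, -1e9)
    return softmax(scores) @ V                   # attention-weighted sum of values

def top_k_routing(token, expert_weights, k=2):
    """Route one token (d,) to k of E experts given gating weights (d, E)."""
    logits = token @ expert_weights              # gating score for each expert
    top = np.argsort(logits)[-k:]                # indices of the k highest-scoring experts
    gates = softmax(logits[top])                 # renormalise over the chosen experts
    return top, gates

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, model dimension 8
print(scaled_dot_product_attention(x, x, x, causal=True).shape)  # (4, 8)
print(top_k_routing(x[0], rng.normal(size=(8, 16))))             # 2 expert indices + gate values
```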

Models Used

  • GPT-1
  • GPT-2
  • GPT-3
  • GPT-4
  • LLaMA
  • PaLM
  • Codex
  • WebGPT
  • InstructGPT

Datasets

The following datasets were used in this research:

  • Natural Questions
  • MMLU
  • MBPP
  • HumanEval
  • APPS
  • RACE
  • SQuAD
  • BoolQ
  • MultiRC
  • GSM8K
  • MATH
  • HellaSwag
  • AI2 Reasoning Challenge (ARC)
  • PIQA
  • SIQA
  • OpenBookQA
  • TruthfulQA
  • HotpotQA
  • ToolQA
  • GPT4Tools

Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • ROUGE
  • BLEU
  • pass@k (see the estimator sketch after this list)
  • Exact Match (EM)
  • Human Equivalence Score (HEQ)
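
Most of these metrics are standard; pass@k, used for code-generation benchmarks such as HumanEval and MBPP, is the least common. As a hedged illustration (not the survey's code), the sketch below implements the unbiased pass@k estimator popularised by Chen et al. (2021).

```python
# Unbiased pass@k estimator: generate n samples per problem, count c that pass
# the unit tests, and estimate the probability that at least one of k samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                    # every size-k subset contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(pass_at_k(200, 37, 1))   # 0.185 (equals c/n for k=1)
print(pass_at_k(200, 37, 10))  # ~0.88
```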

Results

  • Overview of LLM families such as GPT, LLaMA, and PaLM
  • Discussion of emergent abilities of LLMs
  • Comparison of LLM performance on a set of representative benchmarks

Limitations

The authors identified the following limitations:

  • LLMs may generate hallucinations
  • LLMs can lack state/memory
  • Limited access to current or up-to-date information (see the retrieval sketch after this list)
  • Resource-intensive training and serving
  • Variability in responses based on prompts
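
Several of these limitations, notably hallucination and limited access to current information, are commonly mitigated with retrieval-augmented generation, one of the augmentation techniques the survey discusses. The sketch below is a hedged toy illustration, not the authors' implementation: a bag-of-words retriever stands in for a real embedding model, and the resulting grounded prompt would be passed to any LLM completion endpoint.

```python
# Toy retrieval-augmented generation (RAG) sketch: retrieve the most relevant
# documents for a question and prepend them to the prompt as context.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())          # toy bag-of-words "embedding"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(question: str, corpus: list[str], top_k: int = 2) -> str:
    q = embed(question)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    # The grounded prompt is then sent to whatever LLM is being used.
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

docs = [
    "The 2024 survey arXiv:2402.06196 reviews the GPT, LLaMA, and PaLM families.",
    "BLEU and ROUGE are n-gram overlap metrics for generation quality.",
]
print(build_prompt("Which LLM families does the survey review?", docs))
```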

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models, LLMs, transformers, datasets, evaluation metrics, fine-tuning, prompt engineering, multimodal models, cost-effective training, alignment, hallucination
