Domain
Artificial Intelligence / Natural Language Processing
Large Language Models (LLMs) have drawn a lot of attention since the release of ChatGPT in November 2022, owing to their strong performance on a wide range of natural language tasks. LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data, as predicted by scaling laws [1], [2]. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions, and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation; review widely used LLM evaluation metrics; and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
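The scaling laws cited above relate model quality to parameter and data budgets. As a hedged illustration (a sketch of the commonly cited Chinchilla parametric form from Hoffmann et al., 2022, not necessarily the exact formulation in [1], [2]):

```latex
% Parametric scaling law (Hoffmann et al., 2022 form):
% N = number of parameters, D = number of training tokens,
% E, A, B, \alpha, \beta = fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under this form, loss falls predictably as either N or D grows, which is what motivates training ever-larger models on ever-larger corpora.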
This paper presents a survey of Large Language Models (LLMs) since the advent of models like ChatGPT. It traces the evolution of language models from statistical language models to the current generation, including the GPT, LLaMA, and PaLM families. The authors review methods for building and augmenting LLMs, along with the datasets and metrics used for evaluation. They emphasize the emergent abilities of LLMs, their architectures, and the field's open challenges, as well as future research directions for making LLMs more efficient, capable, and reliable.
This paper covers the following methods and model families:
- Transformer (see the attention sketch after this list)
- RNN
- LSTM
- GRU
- Mixture of Experts (MoE)
- GPT-1
- GPT-2
- GPT-3
- GPT-4
- LLaMA
- PaLM
- Codex
- WebGPT
- InstructGPT
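Every model family listed above is built on the Transformer's attention mechanism. Below is a minimal, hedged sketch of single-head scaled dot-product attention in NumPy (illustrative only; it omits the multi-head projections, masking, and dropout of real implementations):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Toy usage: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```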
The following datasets are covered in this survey:
- Natural Questions
- MMLU
- MBPP
- HumanEval
- APPS
- RACE
- SQuAD
- BoolQ
- MultiRC
- GSM8K
- MATH
- HellaSwag
- AI2 Reasoning Challenge (ARC)
- PIQA
- SIQA
- OpenBookQA
- TruthfulQA
- HotpotQA
- ToolQA
- GPT4Tools
The following evaluation metrics are covered:
- Accuracy
- Precision
- Recall
- F1-score
- ROUGE
- BLEU
- pass@k (see the estimator sketch after this list)
- Exact Match (EM)
- Human Equivalence Score (HEQ)
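Most of these metrics are standard; pass@k (used with code benchmarks such as HumanEval and MBPP) deserves a concrete definition. A minimal sketch of the unbiased estimator from the Codex paper (Chen et al., 2021): with n samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem
    c: samples that pass all unit tests
    k: evaluation budget
    """
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 10 passing.
print(pass_at_k(200, 10, 1))   # 0.05  (= c/n for k=1)
print(pass_at_k(200, 10, 10))  # ~0.41
```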
The paper's key discussion points include:
- Overview of LLM families such as GPT, LLaMA, and PaLM
- Discussion of emergent abilities of LLMs
- Comparison of LLM performance on representative benchmarks
The authors identified the following limitations:
- LLMs may generate hallucinations (fluent but factually incorrect output)
- LLMs lack persistent state/memory across interactions
- Limited access to up-to-date information
- Resource-intensive training and serving
- Sensitivity of responses to prompt wording
Compute resources reported:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords: Large Language Models (LLMs), transformers, datasets, evaluation metrics, fine-tuning, prompt engineering, multimodal models, cost-effective training, alignment, hallucination