
INSTRUCTION TUNING WITH GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao (Microsoft Research, 2023)

Paper Information
  • arXiv ID: 2304.03277
  • Venue: arXiv.org
  • Domain: Natural Language Processing
  • SOTA Claim: Yes
  • Code: Released (data and codebase at https://instruction-tuning-with-gpt-4.github.io/)
  • Reproducibility: 8/10

Abstract

Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by previous state-of-the-art models. We also collect feedback and comparison data from GPT-4 to enable a comprehensive evaluation and reward model training. We make our data generated using GPT-4 as well as our codebase publicly available.¹

*Equal contribution. ¹ https://instruction-tuning-with-gpt-4.github.io/ (Note: this is a preliminary release, and the authors state they will continue to expand the dataset and will finetune larger models.)
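
To make the data-generation step concrete, the snippet below sketches how instruction-following data can be produced by prompting GPT-4 with existing Alpaca-style instructions, as the abstract describes. This is a minimal sketch assuming the OpenAI Python client; the prompt formatting, file names, and decoding settings are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: generate instruction-following data by querying GPT-4
# with existing Alpaca-style instructions. File names, prompt formatting,
# and decoding settings are illustrative assumptions.
import json
from openai import OpenAI  # assumes the official OpenAI Python client (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(example: dict) -> str:
    """Format an Alpaca-style example (instruction plus optional input)."""
    if example.get("input"):
        return f"{example['instruction']}\n\n{example['input']}"
    return example["instruction"]

def generate_responses(examples: list[dict], model: str = "gpt-4") -> list[dict]:
    """Ask GPT-4 for a response to each instruction and keep the triples."""
    data = []
    for ex in examples:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": build_prompt(ex)}],
            max_tokens=512,
        )
        data.append({
            "instruction": ex["instruction"],
            "input": ex.get("input", ""),
            "output": completion.choices[0].message.content,
        })
    return data

if __name__ == "__main__":
    # "alpaca_instructions.json" is a hypothetical file of Alpaca-style examples.
    with open("alpaca_instructions.json") as f:
        examples = json.load(f)
    with open("gpt4_instruction_data.json", "w") as f:
        json.dump(generate_responses(examples[:10]), f, indent=2, ensure_ascii=False)
```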

Summary

This paper presents the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models (LLMs) and assesses its effect on zero-shot performance on new tasks. The authors show that the instruction-following data, comprising 52K examples in English and Chinese, yields significantly better performance than data generated by previous state-of-the-art models. Their approach also collects GPT-4 feedback and comparison data to evaluate model outputs and to train reward models. Empirical studies validate the effectiveness of GPT-4-generated data and suggest best practices for building instruction-following LLMs. The paper outlines plans to expand data collection and to further improve the models with reinforcement learning from human feedback.
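
As a concrete reference for the reward-model training mentioned above, the sketch below implements the standard pairwise ranking objective on GPT-4 comparison data: the model learns to assign a higher scalar reward to the response GPT-4 rated higher. The backbone checkpoint, pooling strategy, and data layout are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a pairwise reward-model objective on GPT-4 comparison data.
# Backbone, pooling, and tokenization details are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(torch.nn.Module):
    def __init__(self, backbone_name: str = "facebook/opt-1.3b"):  # assumed backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence using the value head on its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)

def ranking_loss(rm, tokenizer, prompt, chosen, rejected):
    """-log sigmoid(r(chosen) - r(rejected)): prefer the GPT-4-preferred response."""
    batch = tokenizer([prompt + chosen, prompt + rejected],
                      return_tensors="pt", padding=True, truncation=True)
    rewards = rm(batch["input_ids"], batch["attention_mask"])
    return -F.logsigmoid(rewards[0] - rewards[1])

# Usage sketch:
# rm = RewardModel(); tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
# loss = ranking_loss(rm, tok, "Explain photosynthesis.", good_answer, bad_answer)
```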

Methods

This paper employs the following methods:

  • Instruction-tuning (supervised finetuning; see the sketch after this list)
  • Self-Instruct tuning
  • Reinforcement Learning from Human Feedback (RLHF)
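
The instruction-tuning method above amounts to supervised finetuning of a causal LM on (instruction, response) pairs, with the loss computed only on response tokens. The sketch below illustrates this; the checkpoint name, prompt template, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal supervised instruction-tuning sketch (Alpaca-style prompt):
# finetune a causal LM on (instruction, response) pairs, masking the prompt
# tokens out of the loss. Checkpoint and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n")

def build_example(tokenizer, instruction, response, max_len=512):
    prompt_ids = tokenizer(PROMPT.format(instruction=instruction),
                           add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]  # -100 = ignored
    return torch.tensor([input_ids]), torch.tensor([labels])

model_name = "huggyllama/llama-7b"  # assumed LLaMA checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

input_ids, labels = build_example(tokenizer, "Name three primary colors.",
                                  "Red, blue, and yellow.")
loss = model(input_ids=input_ids, labels=labels).loss  # causal LM cross-entropy
loss.backward()
optimizer.step()
```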

Models Used

  • LLaMA-GPT4
  • LLaMA-GPT4-CN

Datasets

The following datasets were used in this research:

  • Alpaca (source of the 52K instructions reused as prompts for GPT-4)
  • GPT-4-generated instruction-following data (52K examples, English and Chinese)

Evaluation Metrics

  • ROUGE-L (longest-common-subsequence overlap with reference answers; see the sketch below)
  • GPT-4-based automatic evaluation scores
  • Human evaluation of alignment (helpfulness, honesty, harmlessness)
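
ROUGE-L measures the longest common subsequence (LCS) between a generated response and a reference, combined into an F-measure. The self-contained sketch below computes the plain token-level F1 variant, using whitespace tokenization as a simplification.

```python
# Plain ROUGE-L (LCS-based F1) between a hypothesis and a reference,
# with simple lowercased whitespace tokenization as a simplification.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("red blue and yellow",
                 "the three primary colors are red blue and yellow"))
```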

Results

  • GPT-4 data leads to superior zero-shot performance on new tasks
  • LLaMA models trained with GPT-4 show comparable performance to GPT-4 itself
  • GPT-4-generated instruction-following data enhances alignment with human values

Limitations

The authors identified the following limitations:

  • Current dataset size is limited to 52K instances; plans for expansion are suggested
  • Reward models are used only to rerank candidate responses at decoding time and are not yet integrated into training via RLHF (see the sketch below)
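
To make the decoding-time use of the reward model concrete, the sketch below performs best-of-n reranking: sample n candidate responses from the finetuned LM, score each with a trained reward model, and return the highest-scoring one. The generation settings are illustrative, and using a single shared tokenizer for both models is a simplifying assumption.

```python
# Best-of-n decoding sketch: sample several responses, score them with a
# reward model, and keep the top-ranked one. Settings are illustrative.
import torch

@torch.no_grad()
def best_of_n(model, tokenizer, reward_model, prompt: str, n: int = 4) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=256, num_return_sequences=n)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
                  for o in outputs]
    # Score "prompt + candidate" pairs (assumes the tokenizer has a pad token).
    batch = tokenizer([prompt + c for c in candidates],
                      return_tensors="pt", padding=True, truncation=True)
    scores = reward_model(batch["input_ids"], batch["attention_mask"])
    return candidates[int(scores.argmax())]
```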

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

GPT-4, instruction tuning, LLMs, self-instruct, evaluation, reward models

External Resources

  • Project page (data and code): https://instruction-tuning-with-gpt-4.github.io/