GPT-4 Technical Report

OpenAI (2023)

Paper Information
arXiv ID
2303.08774
Domain
Artificial Intelligence
SOTA Claim
Yes
Reproducibility
2/10

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

Introduction

This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the potential to be used in a wide range of applications, such as dialogue systems, text summarization, and machine translation. As such, they have been the subject of substantial interest and progress in recent years [1-34].

One of the main goals of developing such models is to improve their ability to understand and generate natural language text, particularly in more complex and nuanced scenarios. To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well, often outscoring the vast majority of human test takers. For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers; this contrasts with GPT-3.5, which scores in the bottom 10%.

On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering). On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4 surpasses the English-language state of the art in 24 of 26 languages considered. We discuss these model capability results, as well as model safety improvements and results, in more detail in later sections.

This report also discusses a key challenge of the project: developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to make predictions about the expected performance of GPT-4 (based on small runs trained in similar ways) that were tested against the final run to increase confidence in our training.
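The predictable-scaling claim above can be illustrated with a toy extrapolation. The report states that some aspects of GPT-4's performance were predicted from runs trained with no more than 1/1,000th its compute; a minimal sketch of that idea, fitting a simple power law L(C) = a·C^b to synthetic small-run losses (the report's actual fits also include an irreducible-loss term, and the numbers below are invented for illustration), might look like:

```python
import math

# Synthetic, illustrative data (NOT real GPT-4 numbers): final loss of
# small training runs follows a power law L(C) = a * C**b in compute C.
compute = [1e18, 1e19, 1e20, 1e21]          # training FLOPs of small runs
loss = [5.0 * c ** -0.07 for c in compute]  # observed final losses

# Fit log L = log a + b * log C by ordinary least squares.
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
log_a = ybar - b * xbar

def predict(c):
    """Extrapolated loss at compute c from the fitted power law."""
    return math.exp(log_a) * c ** b

# Predict the loss of a run with 1,000x the largest small-run compute.
print(predict(1e24))
```

Because the toy data is noiseless, the fit recovers the generating exponent exactly; with real runs, the quality of such extrapolations is what the report's predictable-scaling infrastructure was built to validate.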

Summary

The GPT-4 Technical Report details the development and capabilities of GPT-4, a large-scale multimodal model by OpenAI. This model can process both image and text inputs and produce text outputs, achieving human-level performance on several professional and academic assessments, including a simulated bar exam where it scored in the top 10%. GPT-4 is pre-trained using a Transformer architecture and undergoes a post-training alignment process improving its factual accuracy and adherence to desired behaviors. Significant focus was given to developing infrastructure enabling predictable scaling of the model's performance across different training runs. Despite its advancements, GPT-4 faces challenges such as factual hallucinations and biased outputs, necessitating ongoing safety and alignment research. The report includes extensive evaluations on multiple-choice exams and traditional NLP benchmarks, confirming substantial improvements over previous models, particularly in following user intent across various demographics. It also outlines methods to ensure safety and mitigate risks associated with the deployment of powerful AI systems.
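The summary notes that GPT-4 is pre-trained to predict the next token in a document. A minimal illustration of that objective, the mean negative log-likelihood of each shifted target token, is sketched below; the tiny vocabulary and probabilities are invented for the example:

```python
import math

def next_token_loss(probs, tokens):
    """Mean negative log-likelihood of the next token at each position.

    probs[t][v] is the model's probability of vocab item v at position t;
    position t is scored against the token that actually follows, tokens[t+1].
    """
    nll = [-math.log(probs[t][tokens[t + 1]]) for t in range(len(tokens) - 1)]
    return sum(nll) / len(nll)

# Toy example: vocabulary {0, 1}, observed sequence [0, 1, 1].
probs = [
    {0: 0.2, 1: 0.8},  # position 0 predicts tokens[1] = 1
    {0: 0.4, 1: 0.6},  # position 1 predicts tokens[2] = 1
]
print(next_token_loss(probs, [0, 1, 1]))
```

Pre-training minimizes this quantity over the training corpus; the post-training alignment phase described above is a separate step applied afterward.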

Methods

This paper employs the following methods:

  • Transformer

Models Used

  • GPT-4
  • GPT-3.5

Datasets

The following datasets were used in this research:

  • MMLU
  • HumanEval
  • TruthfulQA

Evaluation Metrics

  • Accuracy
  • F1-score
  • Pass rate
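As a concrete reference for the metrics listed above: accuracy is the fraction of exact matches, and the pass rate on code benchmarks such as HumanEval is commonly reported via the unbiased pass@k estimator introduced with that benchmark (n generated samples per problem, of which c pass the unit tests). A minimal sketch:

```python
from math import comb

def accuracy(preds, labels):
    """Fraction of predictions that exactly match the labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator used with HumanEval: the probability
    that at least one of k drawn samples passes, given that c of the
    n generated samples passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # 2 of 3 match
print(pass_at_k(10, 3, 1))  # chance a single draw passes
```

The estimator averages this per-problem value over the benchmark; which metrics apply to which datasets (e.g. accuracy for MMLU, pass rate for HumanEval) follows standard practice for those benchmarks.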

Results

  • GPT-4 outperforms existing models on multiple-choice exams and traditional NLP benchmarks
  • GPT-4 demonstrates human-level performance on professional exams

Limitations

The authors identified the following limitations:

  • Factual hallucinations
  • Biased outputs
  • Ongoing need for safety and alignment research

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

GPT-4, safety, alignment, capabilities, risk

Papers Using Similar Methods

External Resources