
QWEN TECHNICAL REPORT

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu* (Qwen Team, Alibaba Group, 2023)

Paper Information
  • arXiv ID: 2309.16609
  • Venue: arXiv.org
  • Domain: Artificial Intelligence, Natural Language Processing, Machine Learning
  • SOTA Claim: Yes
  • Code:
  • Reproducibility: 8/10

Abstract

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce QWEN, the first installment of our large language model series. QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to larger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, CODE-QWEN and CODE-QWEN-CHAT, as well as mathematics-focused models, MATH-QWEN-CHAT, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and fall only slightly behind proprietary models. * Authors are ordered alphabetically by last name. Correspondence to: [email protected].

Summary

This report presents QWEN, a series of large language models developed by Alibaba's Qwen Team. It introduces two main model families: the base pretrained language models, QWEN, and the finetuned chat models, QWEN-CHAT, which use human alignment techniques for better performance on various language tasks. QWEN models are pre-trained on up to 3 trillion tokens and refined with alignment methods including supervised finetuning and Reinforcement Learning from Human Feedback (RLHF). Specialized models such as CODE-QWEN and MATH-QWEN-CHAT are built on the base models for coding and mathematics tasks. The evaluation demonstrates competitive performance on benchmark datasets including HumanEval and GSM8K, with QWEN-CHAT outperforming open-source models of comparable size and approaching proprietary solutions like GPT-4.
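To make the pretraining objective concrete, the sketch below computes the standard next-token-prediction (causal language modeling) loss for a decoder-only Transformer. This is an illustrative example with hypothetical random tensors, not the authors' training code; any decoder that returns logits of shape (batch, seq_len, vocab) would plug in the same way.

```python
# Minimal sketch of the next-token-prediction loss used in LLM pretraining.
# Hypothetical tensors stand in for real model outputs and token ids.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the logits at position t and the token at t+1."""
    shift_logits = logits[:, :-1, :].contiguous()   # predictions for positions 0..T-2
    shift_labels = tokens[:, 1:].contiguous()       # targets are the next tokens
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Toy example: batch of 2 sequences, length 16, vocabulary of 100 tokens.
logits = torch.randn(2, 16, 100)
tokens = torch.randint(0, 100, (2, 16))
print(causal_lm_loss(logits, tokens).item())
```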

Methods

This paper employs the following methods:

  • RLHF
  • SFT
  • BPE
  • Transformer
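As a concrete illustration of one of these methods, the toy sketch below learns byte-pair-encoding (BPE) merge rules from a tiny corpus. It shows only the core merge loop; the report's actual tokenizer is a production-scale BPE vocabulary, so the corpus, merge count, and helper names here are purely illustrative.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (symbol tuple -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges=10):
    """Learn `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

if __name__ == "__main__":
    print(learn_bpe("low lower lowest new newer newest", num_merges=5))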

Models Used

  • QWEN
  • QWEN-CHAT
  • CODE-QWEN
  • MATH-QWEN-CHAT
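A minimal sketch of running one of the released chat checkpoints with the Hugging Face transformers library is shown below. The model name Qwen/Qwen-7B-Chat and the plain generate() call are assumptions for illustration; the hosted checkpoints may expect their own chat-formatting helpers, so consult the model card before relying on this.

```python
# Sketch: loading an assumed QWEN-CHAT checkpoint from the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-7B-Chat"  # assumed Hub name; adjust for other sizes
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=True
)

prompt = "Write a haiku about large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```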

Datasets

The following datasets were used in this research:

  • HumanEval
  • MBPP
  • GSM8K
  • MATH
  • C-Eval
  • MMLU
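For reference, two of these benchmarks can be pulled with the Hugging Face datasets library as sketched below. The Hub identifiers gsm8k and openai_humaneval are assumptions about public mirrors and may not be the exact copies the authors evaluated on.

```python
# Sketch: fetching GSM8K and HumanEval via the `datasets` library.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")         # grade-school math word problems
humaneval = load_dataset("openai_humaneval", split="test")   # Python code-generation tasks

print(gsm8k[0]["question"])
print(humaneval[0]["prompt"])
```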

Evaluation Metrics

  • Pass@1
  • Accuracy
  • F1-score
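Pass@1 is the code-generation metric popularized by HumanEval; the sketch below implements the standard unbiased pass@k estimator (Chen et al., 2021), of which Pass@1 is the k = 1 case. With greedy decoding (one sample per problem) it reduces to the plain fraction of problems whose generated program passes all unit tests.

```python
# Sketch: unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 5 of them correct -> pass@1 = 0.25.
print(round(pass_at_k(n=20, c=5, k=1), 3))
```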

Results

  • QWEN models outperform previous 13B state-of-the-art models in multiple benchmarks, including MMLU and C-Eval.
  • CODE-QWEN significantly surpasses open-source code models in HumanEval and MBPP benchmarks.
  • MATH-QWEN-CHAT demonstrates superior performance in mathematics tasks compared to open-source models.

Limitations

The authors identified the following limitations:

  • The chat models trained with RLHF still slightly lag behind proprietary models in some benchmarks.
  • Further evaluation methodologies are needed beyond traditional benchmarks to assess the effectiveness of aligned models.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models, Pretraining, Reinforcement Learning from Human Feedback, Multimodal Models, Specialized Models for Code and Math
