Domain
Natural language processing, artificial intelligence, machine learning
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 languages, and are aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4: 1) closely rivals or outperforms GPT-4 on general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 on long-context tasks, and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use, including the web browser, Python interpreter, text-to-image model, and user-defined functions, to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks such as accessing online information via web browsing and solving math problems using the Python interpreter. Over the course of this work, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.
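The abstract describes GLM-4 All Tools as autonomously deciding when and which tool to invoke (web browser, Python interpreter, text-to-image model, or user-defined functions). The sketch below is a minimal, hypothetical dispatch loop illustrating that pattern; the tool names, JSON call format, and function signatures are assumptions for illustration, not the paper's actual interface.

```python
import json

# Hypothetical tool registry. The real GLM-4 All Tools tool set is described
# in the paper; these stand-in functions only illustrate the dispatch pattern.
def browse_web(query: str) -> str:
    return f"[web results for: {query}]"

def run_python(code: str) -> str:
    return "[interpreter output]"

TOOLS = {"web_browser": browse_web, "python_interpreter": run_python}

def all_tools_step(model_reply: str) -> str:
    """Handle one model turn: if the reply is a JSON tool call, execute the
    named tool; otherwise treat the reply as a final plain-text answer."""
    try:
        call = json.loads(model_reply)
    except json.JSONDecodeError:
        return model_reply  # plain-text answer, no tool needed
    if not isinstance(call, dict):
        return model_reply
    tool = TOOLS.get(call.get("tool", ""))
    if tool is None:
        return f"unknown tool: {call.get('tool')}"
    return tool(call.get("argument", ""))

# Example: the model decides a web search is required.
print(all_tools_step('{"tool": "web_browser", "argument": "GLM-4 MMLU score"}'))
```

In such a loop, the tool output would be appended to the conversation and the model queried again until it emits a plain-text answer.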
The paper introduces ChatGLM, a family of large language models epitomized by the GLM-4 series, whose models are trained extensively on a multilingual corpus. GLM-4 demonstrates performance competitive with GPT-4 on general benchmarks and improved capabilities for long-context handling and instruction following. The paper describes the pre-training methodology, the alignment process, and techniques such as self-critique used to enhance capabilities. Performance is validated on a range of academic benchmarks, indicating strong overall competence and particular advantages on Chinese alignment tasks. The paper also addresses safety measures and an ongoing commitment to open-source releases, citing substantial download numbers on Hugging Face.
This paper employs the following methods:
- Transformer
- Reinforcement Learning from Human Feedback (RLHF)
- Supervised Fine-Tuning (SFT) (see the training-objective sketch after this list)
- LongAlign
- ChatGLM-130B
- GLM-4
- GLM-4-Air
- GLM-4-9B
- ChatGLM-6B
- CodeGeeX
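Among the listed methods, supervised fine-tuning is the first post-training stage. The snippet below is a generic sketch of the SFT objective (next-token cross-entropy masked to the response tokens); it assumes a standard causal language model that returns token logits and is not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

# Minimal illustration of the SFT loss. Model, tokenizer, and data shapes
# here are placeholders, not the ChatGLM training setup.
def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: predict token t+1 from the prefix up to t.
    Positions labeled -100 (e.g. the prompt) are ignored, so the loss is
    computed only on the assistant-response tokens."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T
    shift_labels = labels[:, 1:]       # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy example: batch of 1, sequence of 6 tokens, vocabulary of 10.
logits = torch.randn(1, 6, 10)
labels = torch.tensor([[-100, -100, 3, 7, 1, 2]])  # loss only on the response
print(sft_loss(logits, labels))
```

RLHF and LongAlign build on top of an SFT-initialized model; their objectives differ but follow the same batched-gradient-update structure.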
The following datasets were used in this research:
- MMLU
- GSM8K
- MATH
- BBH
- GPQA
- HumanEval
- AlignBench
- LongBench-Chat
- NaturalCodeBench
- SafetyBench
- IFEval
- GLM-4 closely rivals or outperforms GPT-4 in metrics such as MMLU, GSM8K, MATH, and HumanEval.
- GLM-4 All Tools autonomously selects tools for task completion, often surpassing GPT-4 All Tools in practical scenarios.
The authors identified the following limitations:
- None specified
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
large language models
GLM-130B
GLM-4
model alignment
instruction tuning
RLHF
long context
multimodal models