
StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries. ServiceNow Research, Hugging Face, Northeastern University, Johns Hopkins University, Leipzig University / ScaDS.AI, Queen Mary University of London, Carnegie Mellon University, Forschungszentrum Jülich, and other collaborating institutions (2023)

Paper Information
arXiv ID: 2305.06161
Venue: Trans. Mach. Learn. Res.
Domain: Artificial Intelligence / Natural Language Processing / Code AI
SOTA Claim: Yes
Code:
Reproducibility: 8/10

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack (Kocetkov et al., 2022), a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
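The abstract credits multi-query attention (MQA) with enabling fast large-batch inference: all query heads share a single key/value head, which shrinks the KV cache by roughly the number of heads. Below is a minimal, self-contained PyTorch sketch of that idea; it is illustrative only (not the released StarCoder implementation), and it omits details such as the output projection and dropout.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Illustrative multi-query attention: n_heads query heads share one K/V head.

    x:   (batch, seq, d_model)
    w_q: (d_model, d_model)  -- projects to n_heads query heads
    w_k: (d_model, d_head)   -- single shared key head
    w_v: (d_model, d_head)   -- single shared value head
    """
    b, s, d_model = x.shape
    d_head = d_model // n_heads

    q = (x @ w_q).view(b, s, n_heads, d_head).transpose(1, 2)  # (b, h, s, d_head)
    k = (x @ w_k).unsqueeze(1)                                  # (b, 1, s, d_head), shared
    v = (x @ w_v).unsqueeze(1)                                  # (b, 1, s, d_head), shared

    # Scaled dot-product scores; the single K head broadcasts across all query heads.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5            # (b, h, s, s)
    causal = torch.triu(
        torch.full((s, s), float("-inf"), device=x.device), diagonal=1
    )
    attn = F.softmax(scores + causal, dim=-1)
    out = attn @ v                                               # (b, h, s, d_head)
    return out.transpose(1, 2).reshape(b, s, d_model)
```

Because only one key and one value tensor are cached per layer instead of one per head, the memory traffic during autoregressive decoding drops sharply, which is what makes large-batch inference fast.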

Summary

The paper introduces StarCoder and StarCoderBase, two 15.5-billion-parameter large language models (LLMs) for code developed by the BigCode community. The models are released for open-access research, with an emphasis on transparency, safety, and community involvement. StarCoderBase is trained on 1 trillion tokens from The Stack, a dataset of permissively licensed GitHub repositories, and is further fine-tuned on 35 billion Python tokens to create StarCoder. The authors report that StarCoder outperforms existing open models and matches or exceeds the performance of proprietary models such as OpenAI's code-cushman-001. The paper emphasizes responsible AI practices, including release under the BigCode OpenRAIL-M license, a PII redaction pipeline, and an attribution tracing tool for transparency about training-data provenance. The models are evaluated extensively across multiple programming languages and tasks, demonstrating strong code comprehension and generation capabilities while attending to the social responsibilities of deploying Code LLMs.

Methods

This paper employs the following methods:

  • Transformer
  • Multi-Query Attention
  • Fill-in-the-Middle
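The Fill-in-the-Middle (FIM) method listed above trains the model to complete a gap between a given prefix and suffix by rearranging documents with sentinel tokens. A small prompt-format sketch is shown below; the `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` token names follow the convention used by StarCoder-style tokenizers, but you should verify them against the released tokenizer before relying on them.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in prefix-suffix-middle (PSM) order.

    Sentinel token names are assumed; check the released tokenizer's special
    tokens for the exact strings.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The model is asked to generate the missing middle of a function body.
prompt = build_fim_prompt(
    prefix="def fibonacci(n):\n    ",
    suffix="\n    return a\n",
)
```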

Models Used

  • StarCoder
  • StarCoderBase
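Both checkpoints are published on the Hugging Face Hub. A minimal loading sketch follows, assuming the `bigcode/starcoder` and `bigcode/starcoderbase` repository names and that you have accepted the model license and authenticated; `device_map="auto"` additionally requires the `accelerate` package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # or "bigcode/starcoderbase"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",   # spread the 15.5B parameters across available devices
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)

inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```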

Datasets

The following datasets were used in this research:

  • The Stack
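The Stack is distributed through the Hugging Face Hub as a gated dataset. The sketch below streams one language subset rather than downloading the full corpus; the `bigcode/the-stack` dataset id, the `data/<language>` directory layout, and the `content` field name are assumptions taken from the public dataset card and should be checked before use.

```python
from datasets import load_dataset

# Stream the Python subset instead of materializing the multi-terabyte dataset.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

# Peek at a few source files (the raw file text is assumed to live in "content").
for example in ds.take(3):
    print(example["content"][:200])
```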

Evaluation Metrics

  • pass@1
  • Accuracy
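pass@1 refers to the unbiased pass@k estimator of Chen et al. (2021): generate n samples per problem, count the c samples that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A numerically stable sketch (the example numbers are illustrative, not results from the paper):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given that c of the
    n generations passed the tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 pass the tests -> pass@1 estimate.
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```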

Results

  • StarCoderBase outperforms every open Code LLM that supports multiple programming languages
  • StarCoderBase matches or outperforms the OpenAI code-cushman-001 model
  • StarCoder outperforms every open model fine-tuned on Python while retaining its performance on other programming languages

Limitations

The authors identified the following limitations:

  • Potential to generate PII despite redaction efforts
  • Risk of producing malware
  • Accuracy may vary across programming languages and dataset types

Technical Requirements

  • Number of GPUs: 512 (64 nodes with 8 GPUs each)
  • GPU Type: NVIDIA A100 80GB

Keywords

Large Language Models, Code AI, Open Access, Open Science, Responsible AI
