Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries. Hugging Face, ServiceNow Research, Northeastern University, and collaborating institutions (including Johns Hopkins University, Leipzig University / ScaDS.AI, Queen Mary University of London, Carnegie Mellon University, and Forschungszentrum Jülich), 2023.
The paper introduces StarCoder and StarCoderBase, two 15.5-billion-parameter large language models (LLMs) for code developed by the BigCode community as an open-access effort emphasizing transparency, safety, and community involvement. StarCoderBase is trained on 1 trillion tokens from The Stack, a dataset of permissively licensed GitHub repositories, and StarCoder is obtained by fine-tuning it on an additional 35 billion Python tokens. The authors report that StarCoder outperforms existing open code LLMs and matches or exceeds proprietary models such as OpenAI's code-cushman-001. The paper also emphasizes responsible release practices, including the OpenRAIL-M license governing model use, a pipeline for detecting and redacting personally identifiable information (PII), and attribution tooling for tracing generated code back to the training data. The models are evaluated extensively across multiple programming languages and tasks, demonstrating strong code comprehension and generation capabilities while supporting socially responsible deployment.
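Because the weights are openly released, inference follows the standard Hugging Face transformers workflow. Below is a minimal sketch, assuming the publicly hosted bigcode/starcoder checkpoint (gated behind acceptance of the OpenRAIL-M license); the prompt and generation settings are illustrative and not prescribed by the paper.

```python
# Minimal sketch: code completion with StarCoder via Hugging Face transformers.
# The model ID "bigcode/starcoder" is the published checkpoint; access requires
# accepting the BigCode OpenRAIL-M license on the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n: int) -> int:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding of a short continuation; settings are illustrative only.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```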
This paper employs the following methods:
- A decoder-only Transformer architecture with multi-query attention and an 8,192-token context window for fast large-batch inference over long inputs
- Training with a fill-in-the-middle (FIM) objective in addition to standard left-to-right language modeling, enabling code infilling (see the sketch after this list)
- Large-scale pretraining on 1 trillion tokens of permissively licensed code, followed by fine-tuning on 35 billion Python tokens
- PII detection and redaction of the training data, release under an OpenRAIL-M license, and attribution tooling for tracing generated code back to its sources
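The fill-in-the-middle capability listed above is exercised by wrapping a prompt in sentinel tokens. The sketch below assumes the sentinel names `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` from the released StarCoder tokenizer; the loading code mirrors the generation sketch above and the snippet to be filled in is purely illustrative.

```python
# Sketch of fill-in-the-middle (FIM) prompting; sentinel token names are
# assumed from the released StarCoder tokenizer.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder"  # assumed public model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"

# Ask the model to generate the span between prefix and suffix.
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)

# Keep only the newly generated middle span.
middle = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(prefix + middle + suffix)
```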
The following datasets were used in this research:
- The Stack: permissively licensed source code from GitHub repositories, from which the 1-trillion-token pretraining corpus for StarCoderBase was drawn
- A 35-billion-token Python subset of that corpus, used to fine-tune StarCoderBase into StarCoder
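For readers who want to inspect the pretraining data, a hedged sketch of streaming one language subset of The Stack from the Hugging Face Hub follows. The bigcode/the-stack dataset ID, the data_dir layout, and the "content" column name are assumptions based on the public release, and the dataset is gated behind acceptance of its terms of use.

```python
from datasets import load_dataset

# Stream the Python subset rather than downloading the full corpus.
stack_python = load_dataset(
    "bigcode/the-stack",      # assumed dataset ID on the Hugging Face Hub
    data_dir="data/python",   # assumed per-language directory layout
    split="train",
    streaming=True,
)

# Peek at one file's source text (the "content" column name is an assumption).
example = next(iter(stack_python))
print(example["content"][:200])
```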
The authors identified the following limitations:
- The model is trained primarily on English natural language, so prompts in other languages may work less well
- Generated code can be incorrect, inefficient, or insecure and requires human review before use
- PII redaction of the training data is imperfect, so some personal information may remain in the corpus and could surface in model outputs