CodeSearchNet

Dataset Information
Modalities: Texts

Overview

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found

Source: https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/
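
To make the corpus description above concrete, here is a minimal sketch of how such records might be loaded and filtered down to the documented subset. It assumes the data is stored as JSON Lines with one function per line. The field names (`language`, `func_name`, `code`, `docstring`, `repo`, `path`) and the two sample records are illustrative assumptions, not taken from the source above, so check them against the actual files before relying on them.

```python
import json
from collections import Counter

# Two made-up records in JSON Lines form (one JSON object per line).
# Field names and values are assumptions for illustration only.
SAMPLE_JSONL = """\
{"language": "python", "func_name": "add", "code": "def add(a, b):\\n    return a + b", "docstring": "Return the sum of a and b.", "repo": "example/repo", "path": "math_utils.py"}
{"language": "go", "func_name": "noop", "code": "func noop() {}", "docstring": "", "repo": "example/repo", "path": "noop.go"}
"""


def parse_records(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def documented(records):
    """Keep only the records that carry a non-empty docstring."""
    return [r for r in records if r.get("docstring", "").strip()]


def count_by_language(records):
    """Count records per programming language."""
    return Counter(r["language"] for r in records)


records = parse_records(SAMPLE_JSONL)
docs = documented(records)

print(count_by_language(records))
print([r["func_name"] for r in docs])
```

Separating the documented subset from the full set mirrors the split described in the overview: all methods on one side, and the fraction with associated documentation on the other.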

Variants: CodeSearchNet, CodeSearchNet - Python, CodeSearchNet - Java, CodeSearchNet - Go, CodeSearchNet - Php, CodeSearchNet - Ruby, CodeSearchNet - JavaScript

Associated Benchmarks

This dataset is used in 3 benchmarks.

Recent Benchmark Submissions

| Task | Model | Paper | Date |
| --- | --- | --- | --- |
| Code Search | CodeT5+ 770M | CodeT5+: Open Code Large Language … | 2023-05-13 |
| Code Search | CodeT5+ 220M | CodeT5+: Open Code Large Language … | 2023-05-13 |
| Code Search | cpt-code M | Text and Code Embeddings by … | 2022-01-24 |
| Code Search | cpt-code S | Text and Code Embeddings by … | 2022-01-24 |
| Code Search | GraphCodeBERT | GraphCodeBERT: Pre-training Code Representations with … | 2020-09-17 |
| Source Code Summarization | ContraCode | Contrastive Code Representation Learning | 2020-07-09 |
| Code Documentation Generation | RoBERTa | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | Transformer | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | seq2seq | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Search | CodeBERT | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | CodeBERT (MLM+RTD) | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | CodeBERT (MLM) | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | pre-train w/ code only | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |
| Code Documentation Generation | CodeBERT (RTD) | CodeBERT: A Pre-Trained Model for … | 2020-02-19 |

Research Papers

Recent papers with results on this dataset: