Dataset
We used a number of datasets to train different versions of our model, all of them code related:
- Our Code Clippy Data: A dataset of text data (mostly programming languages) scraped from GitHub. It is currently hosted on The Eye, a non-profit, community-driven open directory data archive.
- APPS: A dataset of various programming competition problems.
- CodeSearchNet Challenge Data: A dataset of methods from six different programming languages.
To create this dataset, we used https://seart-ghs.si.usi.ch/ to filter and collect GitHub repositories to scrape text data. We used the following filters:
- star_count > 10
- exclude_forks
- commit_count > 2
- has a license
From these repositories, we further filtered based on their size (size_bytes). Specifically, we removed any repositories with a size greater than the 95th percentile (70,708 bytes) to avoid downloading large binaries or repositories with lots of autogenerated content.
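The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Repo` record and its field names are hypothetical stand-ins for the search tool's export, the percentile is computed with the nearest-rank method, and it is assumed that the percentile is taken over the repositories that already passed the metadata filters.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repo:
    # Hypothetical record; field names mirror the filters listed above.
    name: str
    star_count: int
    is_fork: bool
    commit_count: int
    license: Optional[str]
    size_bytes: int


def passes_metadata_filters(repo):
    """Apply the star, fork, commit, and license filters."""
    return (
        repo.star_count > 10
        and not repo.is_fork
        and repo.commit_count > 2
        and repo.license is not None
    )


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the source)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_repos(repos, pct=95):
    """Keep repos that pass the metadata filters and are not above the size cutoff."""
    kept = [r for r in repos if passes_metadata_filters(r)]
    if not kept:
        return []
    cutoff = percentile([r.size_bytes for r in kept], pct)
    return [r for r in kept if r.size_bytes <= cutoff]
```

Computing the cutoff from the data rather than hard-coding it keeps the sketch independent of the specific 70,708-byte value reported above.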
Next, we combined these repositories with the repositories collected in the GitHub section of the Pile, making sure to remove duplicates. Finally, we used EleutherAI's git-downloader tool to download the repositories into the LM_Dataformat format. To download the data quickly, the download was split among 4 different TPUs, and the results were merged into one dataset once the separate downloads had completed.
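As a rough sketch of the combine-and-split step, the snippet below merges several repository lists while dropping duplicates, then divides the result among a number of workers. The function names are hypothetical, repositories are assumed to be identified by `owner/name` strings compared case-insensitively, and the round-robin split is an assumption; the actual download was done with git-downloader, which is not shown here.

```python
def merge_unique(*repo_lists):
    """Concatenate repository lists, keeping the first occurrence of each name."""
    seen = set()
    merged = []
    for repos in repo_lists:
        for name in repos:
            key = name.lower()  # assumption: names are compared case-insensitively
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged


def shard(items, num_shards):
    """Round-robin split so each worker receives a near-equal share."""
    return [items[i::num_shards] for i in range(num_shards)]


if __name__ == "__main__":
    searched = ["a/x", "b/y"]            # hypothetical search results
    pile = ["B/Y", "c/z", "d/w", "e/v"]  # hypothetical Pile repositories
    for worker_id, repos in enumerate(shard(merge_unique(searched, pile), 4)):
        print(worker_id, repos)
```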
This resulted in a dataset of ~670,000 unique repositories or ~209GB of compressed text data. We split this dataset into training, validation, and testing sets using a 95/2.5/2.5 ratio.
Previous work has shown that GitHub can contain a large amount of duplicate code and that duplicate text can impact the training of large language models, both for natural languages and code. Therefore, we also created a deduplicated version of our dataset where near-duplicates are removed. This process was inspired by this tool. However, we found it unable to scale to the size of our dataset. Therefore, we created a simpler near-duplicate removal process where the text in a file is tokenized and the hash of these tokens (unordered) is stored. Anytime a new file is added to the dataset, its hash is computed and compared to the stored hashes. If the hash matches a stored one, the file is considered a near-duplicate and removed. Our script for doing this can be found here. This resulted in a reduction to ~132GB of compressed text data, which was then split similarly to the original dataset.
The dataset consisted mostly of repositories with MIT, GNU, Creative Commons, and Apache licenses.
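The near-duplicate removal described above might look something like the following. This is a sketch under stated assumptions, not the project's script: the regex tokenizer, the use of SHA-1, and the reading of "unordered" as "the set of distinct tokens" are all illustrative choices.

```python
import hashlib
import re

# Assumed tokenizer: runs of word characters, or single punctuation characters.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def token_set_hash(text):
    """Hash the distinct tokens of a file, ignoring order and repetition."""
    tokens = sorted(set(TOKEN_RE.findall(text)))
    joined = "\x00".join(tokens)  # separator that cannot appear inside a token
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def deduplicate(files):
    """Yield (name, text) pairs whose token-set hash has not been seen before."""
    seen = set()
    for name, text in files:
        digest = token_set_hash(text)
        if digest in seen:
            continue  # treated as a near-duplicate of an earlier file
        seen.add(digest)
        yield name, text
```

Because only one hash per file is stored and lookups are set-membership checks, the memory and time costs grow linearly with the number of files. The trade-off is that two files count as duplicates only when their token sets match exactly, so files differing by a single new token are not caught.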
We are currently working on getting these datasets (duplicate and non-duplicate versions) into HuggingFace's Datasets library. Stay tuned!
The pre-trained model is fine-tuned on the APPS dataset. The APPS benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. During fine-tuning, the model is given the natural-language prompt as initial context, alongside the starter code and sample input/output.
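To make the fine-tuning context concrete, here is a small sketch of how a problem statement, sample input/output, and starter code could be joined into one prompt. The section labels (`QUESTION:`, `SAMPLE INPUT:`, and so on) and the function name are hypothetical; the source does not specify the exact layout used.

```python
def build_prompt(question, starter_code="", sample_input="", sample_output=""):
    """Assemble a fine-tuning context from the parts of one problem.

    The section labels are illustrative assumptions, not the project's format.
    """
    parts = ["QUESTION:", question.strip()]
    if sample_input or sample_output:
        parts += [
            "SAMPLE INPUT:", sample_input.strip(),
            "SAMPLE OUTPUT:", sample_output.strip(),
        ]
    if starter_code.strip():
        parts += ["STARTER CODE:", starter_code.rstrip()]
    parts.append("ANSWER:")  # the model's solution would follow this marker
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(build_prompt("Add two numbers.", "def add(a, b):", "1 2", "3"))
```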
language-modeling: The dataset can be used to train a model for modeling programming languages, which consists of pretraining or fine-tuning a model to predict missing tokens, either causally or masked, given some context. Success on this task is typically measured by achieving a low perplexity score.
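For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token log-probabilities (natural log) have already been obtained from a model; the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability), with natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which matches the intuition of choosing uniformly among four options at each step.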
{
"id": datasets.Value("int64"),
"text": datasets.Value("string"),
"repo_name": datasets.Value("string"),
"stars": datasets.Value("string"),
"repo_language": datasets.Value("string"),
"file_name": datasets.Value("string"),
"mime_type": datasets.Value("string")
}
- id: A unique identifier for the data instance.
- text: The text of the code.
- repo_name: The name of the repository.
- stars: The number of stars the repository has.
- repo_language: The programming language of the repository.
- file_name: The name of the file.
- mime_type: The MIME type of the file.
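As an illustration of the fields above, the following sketch checks that a record has the expected keys and Python types. The example values are invented for demonstration and are not taken from the dataset; note that `stars` is a string, matching the schema shown earlier.

```python
# Python types corresponding to the schema shown earlier.
EXPECTED_FIELDS = {
    "id": int,
    "text": str,
    "repo_name": str,
    "stars": str,  # stored as a string in the schema
    "repo_language": str,
    "file_name": str,
    "mime_type": str,
}


def validate_instance(instance):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in instance:
            problems.append(f"missing field: {field}")
        elif not isinstance(instance[field], expected):
            actual = type(instance[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    extra = set(instance) - set(EXPECTED_FIELDS)
    problems.extend(f"unexpected field: {name}" for name in sorted(extra))
    return problems


# Hypothetical instance, for illustration only.
example = {
    "id": 0,
    "text": "print('hello')\n",
    "repo_name": "owner/repo",
    "stars": "12",
    "repo_language": "Python",
    "file_name": "hello.py",
    "mime_type": "text/x-python",
}
```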
Size in GBs | Train | Valid | Test |
---|---|---|---|
Duplicate | 194 | 9 | 6.3 |
Deduplicated | 126 | 3.3 | 3.1 |
The paper "Evaluating Large Language Models Trained on Code" from OpenAI has a good discussion of what the impact of a large language model trained on code could be. Therefore, some parts of their discussion are highlighted here as they pertain to this dataset and to models that may be trained on it, along with some points where our views differ from the paper, particularly around legal implications.
- Over-reliance: A language model trained on large datasets such as this one for the task of autogenerating code may generate plausible solutions that appear correct but are not. Not properly evaluating the generated code may have negative consequences, such as the introduction of bugs or security vulnerabilities. Therefore, it is important that users are aware of the limitations and potential negative consequences of using a language model trained on this dataset.
- Economic and labour market impacts: Large language models trained on large code datasets such as this one that are capable of generating high-quality code have the potential to automate part of the software development process. This may negatively impact software developers. However, as discussed in the paper and as shown in the Summary Report for software developers from O*NET OnLine, developers don't just write software.
- Security implications: No filtering or checking for vulnerabilities or buggy code was performed. This means that the dataset may contain code that is malicious or contains vulnerabilities. Therefore, any model trained on this dataset may generate vulnerable, buggy, or malicious code. In safety-critical software, this could lead to software that works improperly and could result in serious consequences depending on the software. Additionally, a model trained on this dataset may be used to generate malicious code on purpose in order to carry out ransomware or other such attacks.
- Legal implications: No filtering was performed on licensed code. This means that the dataset may contain restrictively licensed code. As discussed in the paper, public GitHub repositories may fall under "fair use." However, there have been few to no previous cases of such usage of licensed, publicly available code. Therefore, any model trained on this dataset may be required to obey license terms that align with the software it was trained on, such as GPL-3.0, which is why we purposefully put this dataset under the GPL-3.0 license. The legal ramifications of using a language model trained on this dataset are unclear.
The programming languages most represented in this dataset are JavaScript and Python. Other still-popular languages, such as C and C++, are less represented, so model performance for these languages will be comparatively lower. Additionally, this dataset only contains public repositories and so may not be representative of code written by private developers. No filtering was performed for potentially racist, offensive, or otherwise inappropriate content. Therefore, there may be such content in the dataset, and it will be reflected in models trained on it.
We are currently working on getting this data into HuggingFace's Datasets library. Stay tuned!