_CarperAI_Proposal__Code_Pile_Dataset.pdf

Summary & Background

Foundation models in NLP have unlocked numerous applications and serve as building blocks for specialized models via finetuning. Foundation models for software engineering could play a similar role, from powering coding assistants to serving as the building blocks of CarperAI's reinforcement learning projects. To enable the training of these models, we will collect software-engineering-specific data that goes beyond the GitHub source code that current efforts focus on. This includes Stack Overflow; documentation sites of popular libraries and frameworks; tutorial websites such as TutorialsPoint and GeeksforGeeks; programming-specific Reddit communities; and other GitHub repository data such as issues, pull requests, community discussions, and diffs. To better understand the data these foundation models are trained on, we will pay special attention to statistics on vulnerable code.

Living Budget

Add items here that need to be budgeted and expensed, including A100 GPU hours, human annotation, and potential contractors.

Organizational Structure

The organizational structure will follow that of the Pile:

  1. Paper: https://arxiv.org/pdf/2101.00027.pdf
  2. Repository: https://github.com/EleutherAI/the-pile

Specifically, each dataset subset will have its own repository, and a main repository will combine them to reproduce the entire dataset. Each subset will also have a discussion section on the main repository for conversations with the community, especially around ethics.
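
As a rough illustration of this layout, the sketch below shows how a main repository might combine independently maintained subsets into one document stream. The `Subset` class, its `load` callable, and the `combine` function are hypothetical names chosen for this example; they are not taken from the Pile repository or from any existing CarperAI code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class Subset:
    """Hypothetical handle for one dataset subset (one repository each)."""
    name: str
    load: Callable[[], Iterator[str]]  # yields the raw text of each document


def combine(subsets: List[Subset]) -> Iterator[Dict[str, str]]:
    """Stream every document from every subset, tagged with its origin."""
    for subset in subsets:
        for text in subset.load():
            yield {"subset": subset.name, "text": text}


# Example usage with two in-memory stand-ins for real subsets.
example_subsets = [
    Subset("issues", lambda: iter(["Bug: crash on start"])),
    Subset("docs", lambda: iter(["Install with pip.", "See the API guide."])),
]
combined = list(combine(example_subsets))
```

Tagging each document with its subset name keeps provenance visible in the combined dataset, which may help when a single subset has to be revisited after community discussion.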

All scraped data should be converted to the lm_dataformat format for training our models: https://github.com/EleutherAI/lm_dataformat
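
To make the target shape concrete, here is a minimal standard-library sketch of normalizing scraped items into one record per document, each holding the text plus a metadata dictionary. It does not call lm_dataformat itself, which handles archiving and compression. The field names `text` and `meta`, the metadata keys `source` and `url`, and the helper names are assumptions made for illustration only.

```python
import json
from typing import Iterable, Iterator


def to_record(text: str, source: str, url: str) -> dict:
    """Build one document record: raw text plus a metadata dictionary."""
    # The keys "source" and "url" are illustrative assumptions.
    return {"text": text, "meta": {"source": source, "url": url}}


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write records as JSON Lines (one object per line); return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> Iterator[dict]:
    """Read records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Keeping provenance in the metadata from the start should make later filtering and per-source statistics, such as the vulnerable-code analysis, easier to run.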

Community resources can be added here so we can debate whether to include them in the final release:

https://docs.google.com/spreadsheets/d/1OrOnv-Cv1wRq0jNk4AegHiMtLk88YQDz5b1TP-o5SE8/edit?usp=sharing

Milestones and Progress