
Summary & Background

Aligning language models to human preferences has proven to be an expensive and difficult endeavor; however, models like InstructGPT have shown that aligning language models to instructions can be both commercially valuable and highly effective. Similarly, work such as General Language Assistant has shown that aligned chatbots can be more collaborative and provide an overall more enjoyable experience for the end user.

However, a key limitation of this approach is the prohibitive cost of collecting the required human annotations, whether through hired contractors or crowdsourcing systems like Amazon's Mechanical Turk. To circumvent this, recent papers have begun to utilize pretrained preference models. In the case of Askell et al., the authors trained a cross-encoder on a combination of human annotations and StackExchange data. By contrast, in Castricato et al. the authors use a contrastive model pretrained solely on editors' manuscripts of stories. While in both cases the intent is to lower the immense cost of value alignment, the methodologies and models suffer from a lack of data availability.

By contrast, self-play reinforcement learning has proven immensely successful in games like Go and chess. Endeavors like Evolution through Large Models --- or ELM for short --- have shown the applicability of self-play-style approaches to language modeling, particularly code synthesis and understanding. Similarly, pushes in the HCI space for code synthesis, like Copilot and more recently FauxPilot, have begun to explore the constraints under which a code synthesis model augments a human's coding capabilities.

The intersection of these works presents meaningful new questions about self play for human augmentation, and as such we present two hypotheses:

Living Budget

Add things here that need to be accounted for and expensed. This includes A100 hours, human annotations, and potential contractors.

MAP-Elites - archive representations
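To make the archive-representation note above concrete, here is a minimal sketch of the core MAP-Elites data structure: a map from discretized behavior descriptors to the best-scoring ("elite") solution found so far in each niche. The toy domain, the descriptor, and the fitness function below are all hypothetical stand-ins, not part of the proposal; the ELM pipeline would substitute code-generation solutions and model-derived descriptors.

```python
import random

def descriptor(x):
    """Toy behavior descriptor: bucket a 2-D point into a coarse grid cell."""
    return (int(x[0] * 10), int(x[1] * 10))

def fitness(x):
    """Toy objective: closeness to the center of the unit square."""
    return -((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)

def map_elites(iterations=1000, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution): one elite per niche
    for _ in range(iterations):
        if archive:
            # Mutate a randomly chosen elite, clamped to [0, 1].
            _, parent = archive[rng.choice(list(archive))]
            child = tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.1)))
                          for v in parent)
        else:
            # Bootstrap the archive with a random solution.
            child = (rng.random(), rng.random())
        cell = descriptor(child)
        f = fitness(child)
        # Keep the child only if its niche is empty or it beats the incumbent.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, child)
    return archive

archive = map_elites()
print(len(archive))  # number of occupied niches
```

The essential design choice is that competition is local to a niche, so the archive accumulates a diverse set of high-quality solutions rather than converging on a single optimum.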

Milestones and Progress

If successful, the above steps are sufficient to yield a working implementation of Stages 1 & 2 of the pipeline.