
Summary & Background

Aligning language models to human preferences has proven to be an expensive and difficult endeavor; however, models like InstructGPT have shown that aligning language models to instructions can be both commercially valuable and highly effective. Similarly, work such as General Language Assistant has shown that aligned chatbots can be more collaborative and provide an overall more enjoyable experience for the end user.

However, a key limitation of this approach is the prohibitive cost of collecting the required human annotations, whether through hired contractors or crowdsourcing systems like Amazon's Mechanical Turk. To circumvent this, recent papers have begun to utilize pretrained preference models. In the case of Askell et al., the authors trained a cross-encoder on a combination of human annotations and StackExchange data. By contrast, in Castricato et al. the authors use a contrastive model pretrained solely on editors' manuscripts of stories. While in both cases the intent is to lower the immense cost of value alignment, the methodologies and models suffer from a lack of data availability.

By contrast, self-play reinforcement learning has proven immensely successful in games like Go and chess. Endeavors like Evolution through Large Models --- or ELM for short --- have shown the applicability of self-play-style approaches to language modeling, particularly code synthesis and understanding. Similarly, pushes in the HCI space for code synthesis, like Copilot and more recently FauxPilot, have begun to explore the constraints under which a code synthesis model augments a human's coding capabilities.

The intersection of these works presents meaningful new questions about self play for human augmentation, and as such we present two hypotheses:

Living Budget

Add things here that need to be accounted for and expensed. This includes A100 hours, human annotations, and potential contractors.

MAP-Elites - archive representations
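To make the archive-representation note above concrete, here is a minimal sketch of the core MAP-Elites data structure: a map from discretized behavior descriptors to the best-scoring ("elite") solution found so far in each niche. The toy domain, the descriptor, and the fitness function below are all hypothetical stand-ins, not part of the proposal; the ELM pipeline would substitute code-generation solutions and model-derived descriptors.

```python
import random

def descriptor(x):
    """Toy behavior descriptor: bucket a 2-D point into a coarse grid cell."""
    return (int(x[0] * 10), int(x[1] * 10))

def fitness(x):
    """Toy objective: closeness to the center of the unit square."""
    return -((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)

def map_elites(iterations=1000, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution): one elite per niche
    for _ in range(iterations):
        if archive:
            # Mutate a randomly chosen elite, clamped to [0, 1].
            _, parent = archive[rng.choice(list(archive))]
            child = tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.1)))
                          for v in parent)
        else:
            # Bootstrap the archive with a random solution.
            child = (rng.random(), rng.random())
        cell = descriptor(child)
        f = fitness(child)
        # Keep the child only if its niche is empty or it beats the incumbent.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, child)
    return archive

archive = map_elites()
print(len(archive))  # number of occupied niches
```

The essential design choice is that competition is local to a niche, so the archive accumulates a diverse set of high-quality solutions rather than converging on a single optimum.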

Milestones and Progress

If successful, the above steps are sufficient to yield a working implementation of Stages 1 & 2 of the pipeline.