Topics: Reward Modeling (RM), Reinforcement Learning from Human Feedback (RLHF), Preference Modeling (PM)
This document collects papers on reinforcement learning for language models.
Each week, a presenter selects a paper from the list below (or a related work), leads the discussion, and takes notes. The next presenter is chosen at the end of the meeting.
Time: Mondays, 8:00 AM PST / 11:00 AM EST
The papers cover methods and experiments for learning from human feedback via reinforcement learning.