[HN Gopher] How RLHF Works
___________________________________________________________________
 
How RLHF Works
 
Author : natolambert
Score  : 122 points
Date   : 2023-06-21 14:21 UTC (8 hours ago)
 
web link (www.interconnects.ai)
w3m dump (www.interconnects.ai)
 
| [deleted]
 
| RicDan wrote:
| The problem with this is that it leads to the algorithm targeting
| outputs that sound good to humans. That's why it's bad and won't
| help us; it should also be able to say "sorry, I don't know
| that", but for that it needs to actually be smart.
 
  | m00x wrote:
  | It can be weighted to be more honest about what it doesn't
  | know, if those answers are the ones the labelers pick.
 
    | dr_dshiv wrote:
    | Need smarter labelers
 
  | cubefox wrote:
  | Honesty/truthfulness is indeed a difficult problem with any
  | kind of fine-tuning. There is no way to incentivize the model
  | to say what it believes to be true rather than what human
  | raters would regard as true. Future models could become
  | actively deceptive.
 
| noam_compsci wrote:
| Not very good. I just want a step-by-step, ultra-high-level
| explanation: 1. Build a model. 2. Run it ten times. 3. Get humans
| to do xyz until result abc.
 
| pestatije wrote:
| RLHF - reinforcement learning from human feedback
 
  | 1bent wrote:
  | Thank you!
 
  | cylon13 wrote:
  | A notable improvement over the GLHF strategy for interacting
  | with GPT models.
 
    | lcnPylGDnU4H9OF wrote:
    | (In case anybody's confused by the gaming culture reference:
    | https://en.wiktionary.org/wiki/glhf. "Good Luck Have Fun")
 
| H8crilA wrote:
| This says nothing about how RLHF works, but a lot about what the
| results can be.
 
  | SleekEagle wrote:
  | You can check here for an explanation (with some helpful
  | figures) https://www.assemblyai.com/blog/the-full-story-of-
  | large-lang...
 
  | inciampati wrote:
  | Yes! I came to make the same comment.
  | 
  | It's got a catchy title, but it leaves a lot unresolved.
 
| victor106 wrote:
| Anyone here know where we can find more resources on RLHF?
| 
| There's been a lot written about transformer models etc., but I
| wasn't able to find much about RLHF.
 
  | rounakdatta wrote:
  | There's also this exhaustive post from the one and only Chip
  | Huyen: https://huyenchip.com/2023/05/02/rlhf.html
 
  | SleekEagle wrote:
  | My colleague wrote a couple of pieces that talk about RLHF:
  | 
  | 1. https://www.assemblyai.com/blog/the-full-story-of-large-
  | lang... (you can scroll to "What RLHF actually does to an LLM"
  | if you're already familiar with LLMs)
  | 
  | 2. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
 
  | hansvm wrote:
  | It's not the first paper on the topic IIRC, but OpenAI's
  | InstructGPT paper [0] is decent and references enough other
  | material to get started.
  | 
  | The key idea is that they start with large amounts of
  | relatively garbage unsupervised data (the internet), train a
  | base model on it, and then use that model to cheaply generate
  | decent amounts of better data (ranking generated content rather
  | than spending the man-hours to actually write good content).
  | The other details aren't too important.
  | 
  | [0] https://arxiv.org/abs/2203.02155
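  | 
  | For intuition, here's a rough sketch of how the ranking step
  | can become a training signal: a separate reward model is
  | trained on (preferred, rejected) response pairs, and the LLM is
  | then tuned to score well under it. This is only an illustrative
  | sketch (names and shapes are mine, not from the paper):
  | 
  |     import torch
  |     import torch.nn.functional as F
  | 
  |     def reward_model_loss(r_preferred, r_rejected):
  |         # Pairwise (Bradley-Terry style) loss: push the scalar
  |         # reward of the human-preferred response above the
  |         # reward of the rejected one.
  |         return -F.logsigmoid(r_preferred - r_rejected).mean()
  | 
  |     # Stand-ins for the scalar scores a reward model (an LLM
  |     # with a value head) would assign to each response in a
  |     # batch of labeled comparisons.
  |     r_preferred = torch.randn(8, requires_grad=True)
  |     r_rejected = torch.randn(8, requires_grad=True)
  |     loss = reward_model_loss(r_preferred, r_rejected)
  |     loss.backward()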
 
  | senko wrote:
  | Blog post from Huggingface: https://huggingface.co/blog/rlhf
  | 
  | Webinar on the same topic (from same HF folks):
  | https://www.youtube.com/watch?v=2MBJOuVq380&t=496s
  | 
  | RLHF as used by OpenAI in InstructGPT (predecessor to ChatGPT):
  | https://arxiv.org/abs/2203.02155 (academic paper, so much
  | denser than the above two resources)
 
    | samstave wrote:
    | It will be interesting when we have AIs doing RLHF on other
    | AIs, based on having been RLHF'd themselves, in an iterative
    | model-reinforcement loop...
    | 
    | We talk about 'hallucinations', but won't we also get AI
    | malfeasance that goes unidentified because the RLHF'ing AI is
    | itself tricked or lied to?
 
      | z3c0 wrote:
      | This is essentially the premise behind Generative
      | Adversarial Networks, and if you've seen the results,
      | they're astounding. They're much better for specialized
      | tasks than their generalized GPT counterparts.
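      | 
      | Rough sketch of the adversarial setup I mean: one network
      | (the discriminator) scores the output of another (the
      | generator), and that score is the only training signal the
      | generator gets. Toy dimensions and random "real" data, just
      | to show the shape of the loop:
      | 
      |     import torch
      |     import torch.nn as nn
      | 
      |     G = nn.Linear(16, 8)   # generator: noise -> sample
      |     D = nn.Linear(8, 1)    # discriminator: sample -> score
      |     opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
      |     opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
      |     bce = nn.BCEWithLogitsLoss()
      | 
      |     real = torch.randn(64, 8)   # stand-in for real samples
      |     for _ in range(100):
      |         fake = G(torch.randn(64, 16))
      |         # Discriminator learns to tell real from generated.
      |         d_loss = bce(D(real), torch.ones(64, 1)) + \
      |                  bce(D(fake.detach()), torch.zeros(64, 1))
      |         opt_d.zero_grad(); d_loss.backward(); opt_d.step()
      |         # Generator is trained only on the discriminator's
      |         # feedback about its output.
      |         g_loss = bce(D(fake), torch.ones(64, 1))
      |         opt_g.zero_grad(); g_loss.backward(); opt_g.step()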
 
        | samstave wrote:
        | Please expand on this?
 
| p1esk wrote:
| Original RLHF paper: https://arxiv.org/abs/1706.03741
 
| abscind wrote:
| Any reason RLHF isn't just a band-aid on "not having enough
| data"?
 
  | trade_monkey wrote:
  | RLHF is a band-aid on not having enough data that fits your own
  | biases and the answers you want the model to give.
 
| wmwmwm wrote:
| Does anyone have any insight into why reinforcement learning is
| (maybe) required/historically favoured? There was an interesting
| paper recently suggesting that you can use a preference-learning
| objective directly and get a similar/better result without the RL
| machinery - but I lack the right intuition to know whether RLHF
| offers some additional magic! Here's the "Direct Preference
| Optimization" paper: https://arxiv.org/abs/2305.18290
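| 
| For anyone curious, the objective in that paper boils down to a
| single classification-style loss on preference pairs, with no
| separate reward model or PPO loop. Rough sketch (variable names
| mine; log-probs assumed summed over the response tokens):
| 
|     import torch
|     import torch.nn.functional as F
| 
|     def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
|         # logp_*     : policy log-prob of the preferred (w) and
|         #              rejected (l) response
|         # ref_logp_* : same quantities under the frozen reference
|         #              model the policy started from
|         margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
|         return -F.logsigmoid(beta * margin).mean()
| 
|     # toy batch of 4 comparisons
|     loss = dpo_loss(torch.randn(4), torch.randn(4),
|                     torch.randn(4), torch.randn(4))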
 
  | fardo wrote:
  | > Does anyone have any insight into why reinforcement learning
  | is (maybe) required/historically favoured?
  | 
  | At the conceptual level, it has attractive similarities to the
  | way people learn in real life (rewarded for success, punished
  | for failure), and although we know that similarity to nature
  | doesn't guarantee better results than the alternatives (for
  | example, the modern airplane does not "flap" its wings the way
  | a bird does), natural solutions will continue to be looked to
  | as a starting point and a tool to try on new problems.
  | 
  | Additionally, RL gives you a good start on problems where it's
  | unclear how to begin. In spaces where there's no obvious way to
  | optimize other than taking actions and seeing how they do when
  | judged against some metric, reinforcement learning often
  | provides a good mental and code framework for attacking the
  | problem.
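  | 
  | (As a toy example of that framing: even a few lines of
  | epsilon-greedy "try an action, score it against the metric,
  | prefer what scored well" is a workable starting point.
  | Illustrative sketch, nothing from the article:)
  | 
  |     import random
  | 
  |     def metric(action):   # stand-in for "see how it does"
  |         return random.gauss(action * 0.1, 1.0)
  | 
  |     values = [0.0] * 5    # running average score per action
  |     counts = [0] * 5
  |     for _ in range(1000):
  |         if random.random() < 0.1:   # explore occasionally
  |             a = random.randrange(5)
  |         else:                       # otherwise exploit the best
  |             a = max(range(5), key=lambda i: values[i])
  |         r = metric(a)
  |         counts[a] += 1
  |         values[a] += (r - values[a]) / counts[a]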
  | 
  | >There was a paper recently suggesting that you can use a
  | preference learning objective directly
  | 
  | From a very quick skim, it looks like that paper is arguing
  | that, rather than giving rewards or punishments based on
  | preferences, you can just build a predictive classifier for the
  | kinds of responses humans prefer. It seems interesting, though
  | I wonder to what extent you still have to occasionally do
  | reinforcement learning to generate relevant data for evaluating
  | the classifier.
 
  | gradys wrote:
  | My intuition on this:
  | 
  | Maximum likelihood training -> faithfully represent training
  | data
  | 
  | Reinforcement learning -> seek out the most preferred answer
  | you can
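  | 
  | Or in toy loss terms, a sketch of the contrast (not any
  | particular library's training code):
  | 
  |     import torch
  |     import torch.nn.functional as F
  | 
  |     logits = torch.randn(4, 10, requires_grad=True)  # toy logits
  | 
  |     # Maximum likelihood: match whatever the training data says.
  |     data_tokens = torch.randint(0, 10, (4,))
  |     mle_loss = F.cross_entropy(logits, data_tokens)
  | 
  |     # RL flavour (REINFORCE-style): sample your own output and
  |     # weight its log-prob by how much a reward signal liked it.
  |     probs = F.softmax(logits, dim=-1)
  |     sampled = torch.multinomial(probs, 1).squeeze(-1)
  |     reward = torch.randn(4)      # stand-in for a reward model
  |     logp = F.log_softmax(logits, dim=-1)
  |     logp_sampled = logp.gather(1, sampled[:, None]).squeeze(-1)
  |     rl_loss = -(reward * logp_sampled).mean()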
 
___________________________________________________________________
(page generated 2023-06-21 23:01 UTC)