|
| RicDan wrote:
| The problem with this is that it leads to the algorithm
| targeting outputs that sound good to humans. That's why it's bad
| and won't help us; it should also be able to say "sorry, I don't
| know that", but for that it needs to actually be smart.
| m00x wrote:
| It can be weighted to be more honest when it doesn't know, if
| those answers are picked by the labelers.
| dr_dshiv wrote:
| Need smarter labelers
| cubefox wrote:
| Honesty/truthfulness is indeed a difficult problem with any
| kind of fine-tuning. There is no way to incentivize the model
| to say what it believes to be true rather than what human
| raters would regard as true. Future models could become
| actively deceptive.
| noam_compsci wrote:
| Not very good. I just want a step-by-step, ultra-high-level
| explanation: 1. Build a model. 2. Run it ten times. 3. Get humans
| to do xyz until result abc.
| pestatije wrote:
| RLHF - reinforcement learning from human feedback
| 1bent wrote:
| Thank you!
| cylon13 wrote:
| A notable improvement over the GLHF strategy for interacting
| with GPT models.
| lcnPylGDnU4H9OF wrote:
| (In case anybody's confused by the gaming culture reference:
| https://en.wiktionary.org/wiki/glhf. "Good Luck Have Fun")
| H8crilA wrote:
| This says nothing about how RLHF works, but a lot about what the
| results can be.
| SleekEagle wrote:
| You can check here for an explanation (with some helpful
| figures) https://www.assemblyai.com/blog/the-full-story-of-
| large-lang...
| inciampati wrote:
| Yes! I came to make the same comment.
|
| It's got a catchy title but it leaves much to be resolved.
| victor106 wrote:
| Anyone here know where we can find more resources on RLHF?
|
| There's been a lot written about transformer models etc., but I
| wasn't able to find much about RLHF.
| rounakdatta wrote:
| There's also this exhaustive post from the one and only Chip
| Huyen: https://huyenchip.com/2023/05/02/rlhf.html
| SleekEagle wrote:
| My colleague wrote a couple of pieces that talk about RLHF:
|
| 1. https://www.assemblyai.com/blog/the-full-story-of-large-
| lang... (you can scroll to "What RLHF actually does to an LLM"
| if you're already familiar with LLMs)
|
| 2. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
| hansvm wrote:
| It's not the first paper on the topic IIRC, but OpenAI's
| InstructGPT paper [0] is decent and references enough other
| material to get started.
|
| The key idea is that they're able to train a model on large
| amounts of relatively garbage unsupervised data (the internet),
| and then use that model to cheaply generate decent amounts of
| better data (by ranking generated content rather than spending
| the man-hours to actually write good content). The other details
| aren't too important.
|
| [0] https://arxiv.org/abs/2203.02155
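|
| For a rough feel of that ranking step, here's a minimal sketch of
| training a reward model on pairwise human comparisons (PyTorch
| assumed; the names and the toy bag-of-embeddings encoder are
| illustrative stand-ins, not what OpenAI actually uses):
|
|   import torch
|   import torch.nn as nn
|
|   class RewardModel(nn.Module):
|       def __init__(self, vocab=50257, hidden=768):
|           super().__init__()
|           # In practice this head sits on top of a pretrained LM;
|           # a bag-of-embeddings keeps the sketch short.
|           self.embed = nn.EmbeddingBag(vocab, hidden)
|           self.score = nn.Linear(hidden, 1)
|
|       def forward(self, token_ids):
|           return self.score(self.embed(token_ids)).squeeze(-1)
|
|   rm = RewardModel()
|   opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
|
|   # One labeled comparison batch: humans ranked `chosen` above
|   # `rejected`.
|   chosen = torch.randint(0, 50257, (4, 32))
|   rejected = torch.randint(0, 50257, (4, 32))
|
|   # Bradley-Terry style loss: push reward(chosen) above
|   # reward(rejected).
|   loss = -torch.nn.functional.logsigmoid(
|       rm(chosen) - rm(rejected)
|   ).mean()
|   loss.backward()
|   opt.step()
|
| The resulting scalar reward is then what the RL step optimizes
| the language model against.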
| senko wrote:
| Blog post from Huggingface: https://huggingface.co/blog/rlhf
|
| Webinar on the same topic (from same HF folks):
| https://www.youtube.com/watch?v=2MBJOuVq380&t=496s
|
| RLHF as used by OpenAI in InstructGPT (predecessor to ChatGPT):
| https://arxiv.org/abs/2203.02155 (academic paper, so much
| denser than the above two resources)
| samstave wrote:
| It will be interesting when we have AIs doing RLHF on other
| AIs, based on themselves having been RLHF'd, in an iterative
| model-reinforcement loop...
|
| We talk about 'hallucinations', but won't AI malfeasance go
| unidentified when the AI doing the RLHF can itself engage in
| trickery or lying?
| z3c0 wrote:
| This is essentially the premise behind Generative
| Adversarial Networks, and if you've seen the results,
| they're astounding. They're much better for specialized
| tasks than their generalized GPT counterparts.
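|
| Roughly, the adversarial setup is: one network generates, a
| second network judges real vs. generated, and each is trained
| against the other. A toy sketch (PyTorch assumed, made-up shapes,
| not production GAN code):
|
|   import torch
|   import torch.nn as nn
|
|   G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
|                     nn.Linear(32, 8))   # generator
|   D = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
|                     nn.Linear(32, 1))   # discriminator / judge
|   opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
|   opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
|   bce = nn.BCEWithLogitsLoss()
|
|   real = torch.randn(64, 8)        # stand-in for real data
|   fake = G(torch.randn(64, 16))    # generated samples
|
|   # Judge learns to separate real from generated samples.
|   d_loss = (bce(D(real), torch.ones(64, 1))
|             + bce(D(fake.detach()), torch.zeros(64, 1)))
|   opt_d.zero_grad(); d_loss.backward(); opt_d.step()
|
|   # Generator learns to produce samples the judge rates as real.
|   g_loss = bce(D(fake), torch.ones(64, 1))
|   opt_g.zero_grad(); g_loss.backward(); opt_g.step()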
| samstave wrote:
| Please expand on this?
| p1esk wrote:
| Original RLHF paper: https://arxiv.org/abs/1706.03741
| abscind wrote:
| Any reason RLHF isn't just a band-aid on "not having enough
| data?"
| trade_monkey wrote:
| RLHF is a band-aid on not having enough data that fits your own
| biases and the answers you want the model to give.
| wmwmwm wrote:
| Does anyone have any insight into why reinforcement learning is
| (maybe) required/historically favoured? There was an interesting
| paper recently suggesting that you can use a preference learning
| objective directly and get a similar/better result without the
| RL machinery, but I lack the right intuition to know whether
| RLHF offers some additional magic! Here's the "Direct Preference
| Optimization" paper: https://arxiv.org/abs/2305.18290
| fardo wrote:
| > Does anyone have any insight into why reinforcement learning
| is (maybe) required/historically favoured?
|
| Conceptually, it has attractive similarities to the way people
| learn in real life (rewarded for success, punished for failure).
| And although we know that resemblance to nature doesn't
| guarantee better results than the alternatives (for example, a
| modern airplane does not "flap" its wings the way a bird does),
| natural solutions will keep being looked to as a starting point
| and a tool to try on new problems.
|
| Additionally, RL gives you a good start on problems where it's
| unclear how to begin. In spaces where there's no obvious way to
| optimize other than taking actions and judging them against some
| metric, reinforcement learning often provides a good mental and
| code framework for attacking the problem.
|
| > There was a paper recently suggesting that you can use a
| preference learning objective directly
|
| From a very quick skim, it looks like that paper is arguing that
| rather than giving rewards or punishments based on preferences,
| you can just build a predictive classifier for the kinds of
| responses humans prefer. It seems interesting, though I wonder to
| what extent you still have to occasionally do reinforcement
| learning to generate relevant data for evaluating the classifier.
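|
| For what it's worth, my rough reading is that the policy itself
| plays the role of that classifier: the loss boils down to roughly
| the following (a sketch on dummy log-probabilities, PyTorch
| assumed; in real use these would be summed token log-probs of
| whole responses under the trained policy and a frozen reference
| model):
|
|   import torch
|
|   def dpo_loss(policy_chosen, policy_rejected,
|                ref_chosen, ref_rejected, beta=0.1):
|       # Implicit "reward" is the log-prob ratio vs. the reference.
|       chosen_margin = policy_chosen - ref_chosen
|       rejected_margin = policy_rejected - ref_rejected
|       # Classification-style objective on preference pairs.
|       return -torch.nn.functional.logsigmoid(
|           beta * (chosen_margin - rejected_margin)
|       ).mean()
|
|   # Dummy batch of 4 preference pairs.
|   policy_chosen = torch.randn(4, requires_grad=True)
|   policy_rejected = torch.randn(4, requires_grad=True)
|   loss = dpo_loss(policy_chosen, policy_rejected,
|                   torch.randn(4), torch.randn(4))
|   loss.backward()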
| gradys wrote:
| My intuition on this:
|
| Maximum likelihood training -> faithfully represent training
| data
|
| Reinforcement learning -> seek out the most preferred answer
| you can
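|
| Concretely, a toy contrast of the two objectives on dummy tensors
| (PyTorch assumed; real training works on token logits from an
| actual language model, and RLHF typically uses PPO rather than
| this bare REINFORCE-style estimate):
|
|   import torch
|   import torch.nn.functional as F
|
|   logits = torch.randn(4, 10, requires_grad=True)  # 4 contexts,
|                                                    # 10-token vocab
|   targets = torch.randint(0, 10, (4,))   # next tokens in the data
|
|   # Maximum likelihood: match whatever the training data did.
|   mle_loss = F.cross_entropy(logits, targets)
|
|   # RLHF-style: raise the probability of highly rewarded samples,
|   # with a KL penalty keeping the policy near a reference model.
|   logp = F.log_softmax(logits, dim=-1)
|   ref_logp = F.log_softmax(torch.randn(4, 10), dim=-1)
|   kl = (logp.exp() * (logp - ref_logp)).sum(-1)
|   actions = torch.multinomial(logp.exp(), 1)  # sampled "responses"
|   rewards = torch.randn(4)                    # reward model scores
|   action_logp = logp.gather(-1, actions).squeeze(-1)
|   rl_loss = -(rewards * action_logp - 0.1 * kl).mean()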