[HN Gopher] How RLHF Works
___________________________________________________________________
 
How RLHF Works
 
Author : natolambert
Score  : 122 points
Date   : 2023-06-21 14:21 UTC (8 hours ago)
 
web link (www.interconnects.ai)
w3m dump (www.interconnects.ai)
 
| [deleted]
 
| RicDan wrote:
| The problem with this is that it leads to the algorithm targeting
| outputs that sound good to humans. That's why it's bad and won't
| help us; it should also be able to say "sorry, I don't know
| that", but for that it needs to actually be smart.
 
  | m00x wrote:
  | It can be weighted to be more honest about what it doesn't
  | know, if those answers are the ones the labelers pick.
 
    | dr_dshiv wrote:
    | Need smarter labelers
 
  | cubefox wrote:
  | Honesty/truthfulness is indeed a difficult problem with any
  | kind of fine-tuning. There is no way to incentivize the model
  | to say what it believes to be true rather than what human
  | raters would regard as true. Future models could become
  | actively deceptive.
 
| noam_compsci wrote:
| Not very good. I just want a step-by-step, ultra-high-level
| explanation: 1. Build a model. 2. Run it ten times. 3. Get humans
| to do xyz until result abc.
 
| pestatije wrote:
| RLHF - reinforcement learning from human feedback
 
  | 1bent wrote:
  | Thank you!
 
  | cylon13 wrote:
  | A notable improvement over the GLHF strategy for interacting
  | with GPT models.
 
    | lcnPylGDnU4H9OF wrote:
    | (In case anybody's confused by the gaming culture reference:
    | https://en.wiktionary.org/wiki/glhf. "Good Luck Have Fun")
 
| H8crilA wrote:
| This says nothing about how RLHF works, but a lot about what the
| results can be.
 
  | SleekEagle wrote:
  | You can check here for an explanation (with some helpful
  | figures) https://www.assemblyai.com/blog/the-full-story-of-
  | large-lang...
 
  | inciampati wrote:
  | Yes! I came to make the same comment.
  | 
  | It's got a catchy title, but it leaves a lot unresolved.
 
| victor106 wrote:
| Anyone here know where we can find more resources on RLHF?
| 
| There's been a lot written about transformer models etc., but I
| wasn't able to find much about RLHF.
 
  | rounakdatta wrote:
  | There's also this exhaustive post from the one and only Chip
  | Huyen: https://huyenchip.com/2023/05/02/rlhf.html
 
  | SleekEagle wrote:
  | My colleague wrote a couple of pieces that talk about RLHF:
  | 
  | 1. https://www.assemblyai.com/blog/the-full-story-of-large-
  | lang... (you can scroll to "What RLHF actually does to an LLM"
  | if you're already familiar with LLMs)
  | 
  | 2. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
 
  | hansvm wrote:
  | It's not the first paper on the topic IIRC, but OpenAI's
  | InstructGPT paper [0] is decent and references enough other
  | material to get started.
  | 
  | The key idea is that they start with large amounts of
  | relatively garbage unsupervised data (the internet), train a
  | base model on it, and then use that model to cheaply generate
  | decent amounts of better data (ranking generated content rather
  | than spending the man-hours to actually write good content).
  | The other details aren't too important.
  | 
  | [0] https://arxiv.org/abs/2203.02155
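  | 
  | For intuition, here's a rough sketch of how the ranking step
  | can become a training signal: a separate reward model is
  | trained on (preferred, rejected) response pairs, and the LLM is
  | then tuned to score well under it. This is only an illustrative
  | sketch (names and shapes are mine, not from the paper):
  | 
  |     import torch
  |     import torch.nn.functional as F
  | 
  |     def reward_model_loss(r_preferred, r_rejected):
  |         # Pairwise (Bradley-Terry style) loss: push the scalar
  |         # reward of the human-preferred response above the
  |         # reward of the rejected one.
  |         return -F.logsigmoid(r_preferred - r_rejected).mean()
  | 
  |     # Stand-ins for the scalar scores a reward model (an LLM
  |     # with a value head) would assign to each response in a
  |     # batch of labeled comparisons.
  |     r_preferred = torch.randn(8, requires_grad=True)
  |     r_rejected = torch.randn(8, requires_grad=True)
  |     loss = reward_model_loss(r_preferred, r_rejected)
  |     loss.backward()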
 
  | senko wrote:
  | Blog post from Huggingface: https://huggingface.co/blog/rlhf
  | 
  | Webinar on the same topic (from same HF folks):
  | https://www.youtube.com/watch?v=2MBJOuVq380&t=496s
  | 
  | RLHF as used by OpenAI in InstructGPT (predecessor to ChatGPT):
  | https://arxiv.org/abs/2203.02155 (academic paper, so much
  | denser than the above two resources)
 
    | samstave wrote:
    | It will be interesting when we have AIs doing RLHF on other
    | AIs, based on having been RLHF'd themselves, in an iterative
    | model-reinforcement loop...
    | 
    | We talk about 'hallucinations', but won't we also get AI
    | malfeasance that goes unidentified because the RLHF'ing AI is
    | itself tricked or lied to?
 
      | z3c0 wrote:
      | This is essentially the premise behind Generative
      | Adversarial Networks, and if you've seen the results,
      | they're astounding. They're much better for specialized
      | tasks than their generalized GPT counterparts.
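      | 
      | Rough sketch of the adversarial setup I mean: one network
      | (the discriminator) scores the output of another (the
      | generator), and that score is the only training signal the
      | generator gets. Toy dimensions and random "real" data, just
      | to show the shape of the loop:
      | 
      |     import torch
      |     import torch.nn as nn
      | 
      |     G = nn.Linear(16, 8)   # generator: noise -> sample
      |     D = nn.Linear(8, 1)    # discriminator: sample -> score
      |     opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
      |     opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
      |     bce = nn.BCEWithLogitsLoss()
      | 
      |     real = torch.randn(64, 8)   # stand-in for real samples
      |     for _ in range(100):
      |         fake = G(torch.randn(64, 16))
      |         # Discriminator learns to tell real from generated.
      |         d_loss = bce(D(real), torch.ones(64, 1)) + \
      |                  bce(D(fake.detach()), torch.zeros(64, 1))
      |         opt_d.zero_grad(); d_loss.backward(); opt_d.step()
      |         # Generator is trained only on the discriminator's
      |         # feedback about its output.
      |         g_loss = bce(D(fake), torch.ones(64, 1))
      |         opt_g.zero_grad(); g_loss.backward(); opt_g.step()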
 
        | samstave wrote:
        | Please expand on this?
 
| p1esk wrote:
| Original RLHF paper: https://arxiv.org/abs/1706.03741
 
| abscind wrote:
| Any reason RLHF isn't just a band-aid on "not having enough
| data"?
 
  | trade_monkey wrote:
  | RLHF is a band-aid on not having enough data that fits your own
  | biases and the answers you want the model to give.
 
| wmwmwm wrote:
| Does anyone have any insight into why reinforcement learning is
| (maybe) required/historically favoured? There was an interesting
| paper recently suggesting that you can use a preference-learning
| objective directly and get a similar/better result without the RL
| machinery - but I lack the right intuition to know whether RLHF
| offers some additional magic! Here's the "Direct Preference
| Optimization" paper: https://arxiv.org/abs/2305.18290
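| 
| For anyone curious, the objective in that paper boils down to a
| single classification-style loss on preference pairs, with no
| separate reward model or PPO loop. Rough sketch (variable names
| mine; log-probs assumed summed over the response tokens):
| 
|     import torch
|     import torch.nn.functional as F
| 
|     def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
|         # logp_*     : policy log-prob of the preferred (w) and
|         #              rejected (l) response
|         # ref_logp_* : same quantities under the frozen reference
|         #              model the policy started from
|         margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
|         return -F.logsigmoid(beta * margin).mean()
| 
|     # toy batch of 4 comparisons
|     loss = dpo_loss(torch.randn(4), torch.randn(4),
|                     torch.randn(4), torch.randn(4))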
 
  | fardo wrote:
  | > Does anyone have any insight into why reinforcement learning
  | is (maybe) required/historically favoured?
  | 
  | At the conceptual level, it has attractive similarities to the
  | way people learn in real life (rewarded for success, punished
  | for failure), and although we know that similarity to nature
  | doesn't guarantee better results than the alternatives (for
  | example, the modern airplane does not "flap" its wings the way
  | a bird does), natural solutions will continue to be looked to
  | as a starting point and a tool to try on new problems.
  | 
  | Additionally, RL gives you a good start on problems where it's
  | unclear how to begin. In spaces where there's no obvious way to
  | optimize other than taking actions and seeing how they do when
  | judged against some metric, reinforcement learning often
  | provides a good mental and code framework for attacking the
  | problem.
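  | 
  | (As a toy example of that framing: even a few lines of
  | epsilon-greedy "try an action, score it against the metric,
  | prefer what scored well" is a workable starting point.
  | Illustrative sketch, nothing from the article:)
  | 
  |     import random
  | 
  |     def metric(action):   # stand-in for "see how it does"
  |         return random.gauss(action * 0.1, 1.0)
  | 
  |     values = [0.0] * 5    # running average score per action
  |     counts = [0] * 5
  |     for _ in range(1000):
  |         if random.random() < 0.1:   # explore occasionally
  |             a = random.randrange(5)
  |         else:                       # otherwise exploit the best
  |             a = max(range(5), key=lambda i: values[i])
  |         r = metric(a)
  |         counts[a] += 1
  |         values[a] += (r - values[a]) / counts[a]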
  | 
  | >There was a paper recently suggesting that you can use a
  | preference learning objective directly
  | 
  | From a very quick skim, it looks like that paper is arguing
  | that, rather than giving rewards or punishments based on
  | preferences, you can just build a predictive classifier for the
  | kinds of responses humans prefer. It seems interesting, though
  | I wonder to what extent you still have to occasionally do
  | reinforcement learning to generate relevant data for evaluating
  | the classifier.
 
  | gradys wrote:
  | My intuition on this:
  | 
  | Maximum likelihood training -> faithfully represent training
  | data
  | 
  | Reinforcement learning -> seek out the most preferred answer
  | you can
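  | 
  | Or in toy loss terms, a sketch of the contrast (not any
  | particular library's training code):
  | 
  |     import torch
  |     import torch.nn.functional as F
  | 
  |     logits = torch.randn(4, 10, requires_grad=True)  # toy logits
  | 
  |     # Maximum likelihood: match whatever the training data says.
  |     data_tokens = torch.randint(0, 10, (4,))
  |     mle_loss = F.cross_entropy(logits, data_tokens)
  | 
  |     # RL flavour (REINFORCE-style): sample your own output and
  |     # weight its log-prob by how much a reward signal liked it.
  |     probs = F.softmax(logits, dim=-1)
  |     sampled = torch.multinomial(probs, 1).squeeze(-1)
  |     reward = torch.randn(4)      # stand-in for a reward model
  |     logp = F.log_softmax(logits, dim=-1)
  |     logp_sampled = logp.gather(1, sampled[:, None]).squeeze(-1)
  |     rl_loss = -(reward * logp_sampled).mean()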
 
___________________________________________________________________
(page generated 2023-06-21 23:01 UTC)