[HN Gopher] The Pile: An 800GB dataset of diverse text for langu...
___________________________________________________________________
The Pile: An 800GB dataset of diverse text for language modeling (2020)

Author : charlysl
Score  : 70 points
Date   : 2023-07-11 18:19 UTC (4 hours ago)
	web link (arxiv.org)
| charlysl wrote:
| OP here. I learned about this while reading Stanford's LLM
| course's "Data" lecture [1]. Very interesting how it assesses the
| datasets used for GPT-2 and GPT-3, etc., and how The Pile addresses
| their issues. A very interesting course!
|
| [1] https://stanford-cs324.github.io/winter2022/lectures/data/
| pjot wrote:
| The Pile was also referenced in a post today of some guy's
| tweets about "leaked" GPT-4 details
|
| https://news.ycombinator.com/item?id=36675934
| Der_Einzige wrote:
| I came so close to getting my dataset DebateSum
| (https://huggingface.co/datasets/Hellisotherpeople/DebateSum)
| into The Pile, but they decided at the last minute not to add it:
| https://github.com/EleutherAI/the-pile/issues/56
|
| I'm still a tiny bit salty about that, but The Pile is a
| wonderful dataset regardless.
| orange_fritter wrote:
| That dataset looks cool. Good work either way, I'm sure it'll
| go somewhere
| Der_Einzige wrote:
| Stay tuned! I've got a paper I'm writing about a new follow-up
| which is a 40x improvement in size (basically every open
| source debate card... ever) and a 40x improvement in metadata
| and duplication detection. The work has all been done since late
| April and I've just been lazy/writer-blocked (ironic in a
| world of high-end LLMs) and haven't gotten the paper
| finished.
|
| Kind of sad to have missed the NeurIPS dataset track deadline
| and ACL, but I know that anything close to this in scope is a
| slam-dunk accept at the argument mining workshop
| dang wrote:
| Related:
|
| _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=36272365 - June
| 2023 (5 comments)
|
| _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=25607809 - Jan
| 2021 (60 comments)
| sillysaurusx wrote:
| Author here.
| And by author I mean I created books3 (the books
| component of The Pile) while everyone else did the hard work of
| actually writing the paper, ha. Stella and Leo Gao in particular
| did so much wonderful work on the paper, though it couldn't have
| happened without everyone's contributions.
|
| As far as I know, this was the first academic contribution from a
| Discord collaboration to ML. Back then Discord was barely used
| for ML at all, though nowadays of course the largest Discord in
| the world is Midjourney.
|
| There were a bunch of interesting stories from those days. We
| almost didn't release at all (or at least the books component)
| because of fear of copyright backlash. Turns out no one cared,
| and then suddenly today the world cares a great deal.
|
| As a side note, I'll be participating in a legal action against
| Meta for the purpose of making ML models uncopyrightable:
| https://twitter.com/theshawwn/status/1641804013791215619?s=6....
| They DMCA'ed one of my repos distributing LLaMA, so we fought
| back and challenged the idea that weights can be copyrighted at
| all. This seems like the best outcome for hackers and individual
| researchers, for a few reasons. It's also one of the most ethical
| outcomes; since ~no one trains on data that they own, they
| shouldn't own the resulting model.
|
| One last thing. The Pile would've been far less relevant without
| the wonderful assistance of The Eye, a group of people who
| archive all kinds of things. They've hosted the datasets for
| years now. And although it seems strange to say that dataset
| hosting could make or break The Pile, back then there was nobody
| else willing to host us. https://the-eye.eu/
| sfriedr wrote:
| Could you share more about copyright?
| For example, aren't you
| worried that now, with all kinds of lawsuits happening [1] and
| copyright issues that were found in existing datasets [2],
| you might get threatening letters from a lawyer some day?
|
| I'm the author of [3], where we introduced one of the first
| natural-language datasets that test graduate mathematics for
| LLMs, but we took some of the prompts from a copyrighted book
| and therefore thought about excluding them. Having them in the
| public dataset would be really nice though, hence I'm keen
| to hear about your experience.
|
| I'd also be keen to hear how your challenge against the DMCA on
| sharing LLaMA's weights goes.
|
| [1] https://www.theguardian.com/books/2023/jul/05/authors-file-a...
| [2] https://arxiv.org/abs/2105.05241
| [3] https://arxiv.org/abs/2301.13867
| sillysaurusx wrote:
| I think a lot of hackers shy away from doing impactful work
| because of fear. Sometimes those fears are justified, but
| it's remarkable how often things that seem like a big deal
| turn out not to matter. My advice for ambitious devs would be
| to do what seems interesting, and don't worry too much about
| threatening letters. Usually the worst thing that happens is
| that you agree to stop doing whatever generated the threat.
|
| Personally, I'm not worried. It would be a damn shame if
| academics come under fire merely for trying to operate on the
| cutting edge of science. None of us were trying to make
| money; we just wanted to make something interesting.
|
| > I'd also be keen to hear how your challenge against the
| DMCA on sharing LLaMA's weights goes?
|
| Thanks! I think we might be putting up a website for it soon,
| if only to explain ourselves. In the meantime - I hate this
| phrase, since I don't want followers - the only way to keep
| informed is to follow my Twitter, and perhaps keep an eye on
| my HN comments.
|
| You'll probably hear about it either way though, since it's a
| groundbreaking case. No one has tested the copyrightability
| of ML models before.
| Der_Einzige wrote:
| Getting sued is straight up a good thing for most people's
| careers in tech. Haven't you watched Silicon Valley?
| jacquesm wrote:
| > It's also one of the most ethical outcomes; since ~no one
| trains on data that they own, they shouldn't own the resulting
| model.
|
| In my opinion the most ethical outcome would be that they are
| on the hook for the cumulative cost of the copyright they
| violated. That way authors would come out ahead instead of
| having their rights trashed 'because it's too late anyway'.
| cornel_io wrote:
| Whether or not training on publicly available data counts as
| a copyright violation is still completely up in the air
| legally, and clearly a lot of lawyers at all of the top tech
| companies think they're going to end up in the clear under
| fair use.
|
| At some point this stuff will have to get tested by making
| its way up the appeals stack in the US, and IMO there is only
| a minuscule chance that will result in Google, MS, and Meta
| getting slapped with anything more than a token fine (my bet
| is it won't even be that), let alone paying every person who
| ever wrote anything that was used in these datasets for
| copyright violations, which would basically be everyone.
| rpdillon wrote:
| > on the hook for the cumulative cost of the copyright they
| violated.
|
| I think there's a strong argument for a Fair Use defense,
| given the size of the models versus the size of the training
| sets, as well as the gulf in intended use: an AI model
| doesn't compete with e.g. a book. Obviously we'll have to see
| it play out in court to find out.
| ben_w wrote:
| Current AI models don't compete with a book, from what I've
| seen; I wouldn't want to bet how long it takes before they
| can compete with not just one but all books.
| archivist0 wrote:
| [dead]
| hedgehog wrote:
| I understand that LLMs to date have mostly been trained on a
| wide variety of copyright-encumbered data, but in other domains
| (computer vision for example) the tradeoffs are different and
| in practice many models are trained on private / unencumbered
| data. If those weights are not protected by copyright then my
| concern is it will be hard to sufficiently protect them via
| license agreement and it will become yet another factor
| favoring the SaaSification of everything in tech.
| sillysaurusx wrote:
| This is true, and it's why I hesitated to file legal action.
| My goal was to benefit hackers. If the outcome causes
| problems for people who are just trying to share their work,
| I'd be upset.
|
| Ultimately what convinced me to proceed is that there are
| immense forces pressuring ML models to become SaaS companies.
| It's very difficult to offer an ML model for extended periods
| _without_ being a company. E.g. https://6b.eleuther.ai/ is
| down. Eleuther failing illustrates just how hard it is -- we
| were all working as hard as we could to design something that
| would last a long time, and a long time turned out to be two
| short years. Contrast that with other kinds of hacking (e.g.
| webdev, gamedev, hardware...) where the end result lasts
| basically forever.
|
| So if ML models aren't copyrightable, I think it'll hurt
| companies a lot more than individuals. In fact the goal is
| the other way around: to protect individuals. All I did was
| publish Facebook's own GPL download script to GitHub, and it
| got DMCA'd. If we don't push back on that kind of behavior
| now, companies will get used to the idea that they control
| "their" model -- even when their model is anything but
| theirs.
| idiotsecant wrote:
| Is it useful to protect weights with copyright? What if I
| download your weights and retrain them for 5 seconds,
| changing each weight .0000001%?
| How much change is a new
| product? What if I change a single weight?
| hedgehog wrote:
| Like the parallel scenarios of taking a book and changing a
| few words, slapping a new logo on someone else's app, or
| stylizing a photo with a filter, those are questions that
| will be answered in court if people can't come to an
| agreement on their own.
| cschmidt wrote:
| If you're looking at The Pile, you also might consider the
| RedPajama dataset. A new cleaned version was released recently
| https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...
___________________________________________________________________
(page generated 2023-07-11 23:00 UTC)