|
| charlysl wrote:
| OP here. I learned about this while reading Stanford's LLM
| course's "Data" lecture [1]. Very interesting how it assesses the
| datasets used for GPT 2 and 3, etc, and how The Pile addresses
| their issues. A very interesting course!
|
| [1] https://stanford-cs324.github.io/winter2022/lectures/data/
| pjot wrote:
| The Pile was also referenced in a post today about some guy's
| tweets on "leaked" GPT-4 details
|
| https://news.ycombinator.com/item?id=36675934
| Der_Einzige wrote:
| I came so close to getting my dataset DebateSum
| (https://huggingface.co/datasets/Hellisotherpeople/DebateSum)
| into The Pile, but they decided at the last minute not to add it:
| https://github.com/EleutherAI/the-pile/issues/56
|
| I'm still a tiny bit salty about that, but The Pile is a
| wonderful dataset regardless.
| orange_fritter wrote:
| That dataset looks cool. Good work either way, I'm sure it'll
| go somewhere
| Der_Einzige wrote:
| Stay tuned! I've got a paper I'm writing about a new followup
| which is a 40x improvement in size (basically every open
| source debate card... ever) and a 40x improvement in metadata
| and duplication detection. The work has all been done since late
| April and I've just been lazy/writer-blocked (ironic in a
| world of high end LLMs) and haven't gotten the paper
| finished.
|
| Kind of sad to have missed the NeurIPS dataset track deadline
| and ACL, but I know that anything close to this in scope is a
| slam-dunk accept at the argument mining workshop.
| dang wrote:
| Related:
|
| _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=36272365 - June
| 2023 (5 comments)
|
| _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=25607809 - Jan
| 2021 (60 comments)
| sillysaurusx wrote:
| Author here. And by author I mean I created books3 (the books
| component of The Pile) while everyone else did the hard work of
| actually writing the paper, ha. Stella and Leo Gao in particular
| did so much wonderful work on the paper, though it couldn't have
| happened without everyone's contributions.
|
| As far as I know, this was the first academic contribution from a
| Discord collaboration to ML. Back then Discord was barely used
| for ML at all, though nowadays of course the largest Discord in
| the world is Midjourney.
|
| There were a bunch of interesting stories from those days. We
| almost didn't release at all (or at least the books component)
| because of fear of copyright backlash. Turns out no one cared,
| and then suddenly today the world cares a great deal.
|
| As a side note, I'll be participating in a legal action against
| Meta for the purpose of making ML models uncopyrightable:
| https://twitter.com/theshawwn/status/1641804013791215619?s=6....
| They DMCA'ed one of my repos distributing LLaMA, so we fought
| back and challenged the idea that weights can be copyrighted at
| all. This seems like the best outcome for hackers and individual
| researchers, for a few reasons. It's also one of the most ethical
| outcomes; since ~no one trains on data that they own, they
| shouldn't own the resulting model.
|
| One last thing. The Pile would've been far less relevant without
| the wonderful assistance of The Eye, a group of people who
| archive all kinds of things. They've hosted the datasets for
| years now. And although it seems strange to say that dataset
| hosting could make or break The Pile, back then there was nobody
| else willing to host us. https://the-eye.eu/
| sfriedr wrote:
| Could you share more about copyright? For example, aren't you
| worried that now, with all kinds of lawsuits happening [1] and
| copyright issues that were found in existing datasets [2], you
| might get threatening letters from a lawyer some day?
|
| I'm the author of [3], where we introduced one of the first
| natural-language datasets that test graduate mathematics for
| LLMs, but we took some of the prompts from a copyrighted book
| and therefore thought about excluding them. Having them in the
| public dataset would be really nice though, hence I'm keen
| to hear about your experience.
|
| I'd also be keen to hear how your challenge against the DMCA on
| sharing LLaMA's weights goes?
|
| [1] https://www.theguardian.com/books/2023/jul/05/authors-file-a...
| [2] https://arxiv.org/abs/2105.05241
| [3] https://arxiv.org/abs/2301.13867
| sillysaurusx wrote:
| I think a lot of hackers shy away from doing impactful work
| because of fear. Sometimes those fears are justified, but
| it's remarkable how often things that seem like a big deal
| turn out not to matter. My advice for ambitious devs would be
| to do what seems interesting, and don't worry too much about
| threatening letters. Usually the worst thing that happens is
| that you agree to stop doing whatever generated the threat.
|
| Personally, I'm not worried. It would be a damn shame if
| academics came under fire merely for trying to operate on the
| cutting edge of science. None of us were trying to make
| money; we just wanted to make something interesting.
|
| > I'd also be keen to hear how your challenge against the
| DMCA on sharing LLaMA's weights goes?
|
| Thanks! I think we might be putting up a website for it soon,
| if only to explain ourselves. In the meantime - I hate this
| phrase, since I don't want followers - the only way to keep
| informed is to follow my Twitter, and perhaps keep an eye on
| my HN comments.
|
| You'll probably hear about it either way though, since it's a
| groundbreaking case. No one has tested the copyrightability
| of ML models before.
| Der_Einzige wrote:
| Getting sued is straight up a good thing for most people's
| careers in tech. Haven't you watched Silicon Valley?
| jacquesm wrote:
| > It's also one of the most ethical outcomes; since ~no one
| trains on data that they own, they shouldn't own the resulting
| model.
|
| In my opinion the most ethical outcome would be that they are
| on the hook for the cumulative cost of the copyright they
| violated. That way authors would come out ahead instead of
| having their rights trashed 'because it's too late anyway'.
| cornel_io wrote:
| Whether or not training on publicly available data counts as
| a copyright violation is still completely up in the air
| legally, and clearly a lot of lawyers at all of the top tech
| companies think they're going to end up in the clear under
| fair use.
|
| At some point this stuff will have to get tested by making
| its way up the appeals stack in the US, and IMO there is only
| a minuscule chance that will result in Google, MS, and Meta
| getting slapped with anything more than a token fine (my bet
| is it won't even be that), let alone paying every person who
| ever wrote anything that was used in these datasets for
| copyright violations, which would basically be everyone.
| rpdillon wrote:
| > on the hook for the cumulative cost of the copyright they
| violated.
|
| I think there's a strong argument for a Fair Use defense,
| given the size of the models versus the size of the training
| sets, as well as the gulf in intended use: an AI model
| doesn't compete with e.g. a book. Obviously we'll have to see
| it play out in court to find out.
| ben_w wrote:
| Current AI models don't compete with a book, from what I've
| seen; I wouldn't want to bet how long it takes before they
| can compete with not just one but all books.
| archivist0 wrote:
| [dead]
| hedgehog wrote:
| I understand that LLMs to date have mostly been trained on a
| wide variety of copyright-encumbered data but in other domains
| (computer vision for example) the tradeoffs are different and
| in practice many models are trained on private / unencumbered
| data. If those weights are not protected by copyright then my
| concern is it will be hard to sufficiently protect them via
| license agreement and it will become yet another factor
| favoring the SaaSification of everything in tech.
| sillysaurusx wrote:
| This is true, and it's why I hesitated to file legal action.
| My goal was to benefit hackers. If the outcome causes
| problems for people who are just trying to share their work,
| I'd be upset.
|
| Ultimately what convinced me to proceed is that there are
| immense forces pressuring ML models to become SaaS companies.
| It's very difficult to offer an ML model for extended periods
| _without_ being a company. E.g. https://6b.eleuther.ai/ is
| down. Eleuther failing illustrates just how hard it is -- we
| were all working as hard as we could to design something that
| would last a long time, and a long time turned out to be two
| short years. Contrast that with other kinds of hacking (e.g.
| webdev, gamedev, hardware...) where the end result lasts
| basically forever.
|
| So if ML models aren't copyrightable, I think it'll hurt
| companies a lot more than individuals. In fact the goal is
| the other way around: to protect individuals. All I did was
| publish Facebook's own GPL download script to GitHub, and it
| got DMCA'd. If we don't push back on that kind of behavior
| now, companies will get used to the idea that they control
| "their" model -- even when their model is anything but
| theirs.
| idiotsecant wrote:
| Is it useful to protect weights with copyright? What if I
| download your weights and retrain them for 5 seconds,
| changing each weight .0000001%? How much change is a new
| product? What if I change a single weight?
| hedgehog wrote:
| Like the parallel scenarios of taking a book and changing a
| few words, slapping a new logo on someone else's app, or
| stylizing a photo with a filter, those are questions that
| will be answered in court if people can't come to an
| agreement on their own.
| cschmidt wrote:
| If you're looking at The Pile, you also might consider the
| RedPajama dataset. A new cleaned version was released recently:
| https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...
___________________________________________________________________
(page generated 2023-07-11 23:00 UTC) |