[HN Gopher] The Pile: An 800GB dataset of diverse text for langu...
___________________________________________________________________
 
The Pile: An 800GB dataset of diverse text for language modeling
(2020)
 
Author : charlysl
Score  : 70 points
Date   : 2023-07-11 18:19 UTC (4 hours ago)
 
web link (arxiv.org)
 
| charlysl wrote:
| OP here. I learned about this while reading the "Data" lecture
| from Stanford's LLM course [1]. Very interesting how it assesses
| the datasets used for GPT-2 and GPT-3, etc., and how The Pile
| addresses their issues. A very interesting course!
| 
| [1] https://stanford-cs324.github.io/winter2022/lectures/data/
 
  | pjot wrote:
  | The Pile was also referenced in a post today about some guy's
  | tweets on "leaked" GPT-4 details:
  | 
  | https://news.ycombinator.com/item?id=36675934
 
| Der_Einzige wrote:
| I came so close to getting my dataset DebateSum
| (https://huggingface.co/datasets/Hellisotherpeople/DebateSum)
| into The Pile, but they decided at the last minute not to add it:
| https://github.com/EleutherAI/the-pile/issues/56
| 
| I'm still a tiny bit salty about that, but The Pile is a
| wonderful dataset regardless.
 
  | orange_fritter wrote:
  | That dataset looks cool. Good work either way, I'm sure it'll
  | go somewhere
 
    | Der_Einzige wrote:
    | Stay tuned! I'm writing a paper about a new follow-up that is
    | a 40x improvement in size (basically every open-source debate
    | card... ever) and a 40x improvement in metadata and
    | duplication detection. The work has been done since late
    | April; I've just been lazy/writer's-blocked (ironic in a world
    | of high-end LLMs) and haven't finished the paper.
    | 
    | Kind of sad to have missed the NeurIPS dataset track deadline
    | and ACL, but I know that anything close to this in scope is a
    | slam-dunk accept at the argument mining workshop.
 
| dang wrote:
| Related:
| 
|  _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=36272365 - June
| 2023 (5 comments)
| 
|  _The Pile: An 800GB Dataset of Diverse Text for Language
| Modeling_ - https://news.ycombinator.com/item?id=25607809 - Jan
| 2021 (60 comments)
 
| sillysaurusx wrote:
| Author here. And by author I mean I created books3 (the books
| component of The Pile) while everyone else did the hard work of
| actually writing the paper, ha. Stella and Leo Gao in particular
| did so much wonderful work on the paper, though it couldn't have
| happened without everyone's contributions.
| 
| As far as I know, this was the first academic contribution to ML
| from a Discord collaboration. Back then Discord was barely used
| for ML at all, though nowadays of course the largest Discord
| server in the world is Midjourney.
| 
| There were a bunch of interesting stories from those days. We
| almost didn't release at all (or at least the books component)
| because of fear of copyright backlash. Turns out no one cared,
| and then suddenly today the world cares a great deal.
| 
| As a side note, I'll be participating in a legal action against
| Meta for the purpose of making ML models uncopyrightable:
| https://twitter.com/theshawwn/status/1641804013791215619?s=6....
| They DMCA'ed one of my repos distributing LLaMA, so we fought
| back and challenged the idea that weights can be copyrighted at
| all. This seems like the best outcome for hackers and individual
| researchers, for a few reasons. It's also one of the most ethical
| outcomes; since ~no one trains on data that they own, they
| shouldn't own the resulting model.
| 
| One last thing. The Pile would've been far less relevant without
| the wonderful assistance of The Eye, a group of people who
| archive all kinds of things. They've hosted the datasets for
| years now. And although it seems strange to say that dataset
| hosting could make or break The Pile, back then there was nobody
| else willing to host us. https://the-eye.eu/
 
  | sfriedr wrote:
  | Could you share more about copyright? For example, aren't you
  | worried that now, with all kinds of lawsuits happening [1] and
  | copyright issues that were found in existing datasets [2], that
  | you might get threatening letters from a lawyer some day?
  | 
  | I'm the author of [3], where we introduced one of the first
  | natural-language datasets that test LLMs on graduate-level
  | mathematics. We took some of the prompts from a copyrighted
  | book and therefore thought about excluding them. Having them in
  | the public dataset would be really nice though, so I'm keen to
  | hear about your experience.
  | 
  | I'd also be keen to hear how your challenge against the DMCA on
  | sharing LLaMA's weights goes?
  | 
  | [1] https://www.theguardian.com/books/2023/jul/05/authors-file-a...
  | [2] https://arxiv.org/abs/2105.05241
  | [3] https://arxiv.org/abs/2301.13867
 
    | sillysaurusx wrote:
    | I think a lot of hackers shy away from doing impactful work
    | because of fear. Sometimes those fears are justified, but
    | it's remarkable how often things that seem like a big deal
    | turn out not to matter. My advice for ambitious devs would be
    | to do what seems interesting, and don't worry too much about
    | threatening letters. Usually the worst thing that happens is
    | that you agree to stop doing whatever generated the threat.
    | 
    | Personally, I'm not worried. It would be a damn shame if
    | academics come under fire merely for trying to operate on the
    | cutting edge of science. None of us were trying to make
    | money; we just wanted to make something interesting.
    | 
    | > I'd also be keen to hear how your challenge against the
    | DMCA on sharing LLaMA's weights goes?
    | 
    | Thanks! I think we might be putting up a website for it soon,
    | if only to explain ourselves. In the meantime - I hate this
    | phrase, since I don't want followers - the only way to keep
    | informed is to follow my Twitter, and perhaps keep an eye on
    | my HN comments.
    | 
    | You'll probably hear about it either way though, since it's a
    | groundbreaking case. No one has tested the copyrightability
    | of ML models before.
 
    | Der_Einzige wrote:
    | Getting sued is straight up a good thing for most people's
    | careers in tech. Haven't you watched Silicon Valley?
 
  | jacquesm wrote:
  | > It's also one of the most ethical outcomes; since ~no one
  | trains on data that they own, they shouldn't own the resulting
  | model.
  | 
  | In my opinion the most ethical outcome would be that they are
  | on the hook for the cumulative cost of the copyright they
  | violated. That way authors would come out ahead instead of
  | having their rights trashed 'because it's too late anyway'.
 
    | cornel_io wrote:
    | Whether or not training on publicly available data counts as
    | a copyright violation is still completely up in the air
    | legally, and clearly a lot of lawyers at all of the top tech
    | companies think they're going to end up in the clear under
    | fair use.
    | 
    | At some point this stuff will have to get tested by making
    | its way up the appeals stack in the US, and IMO there is only
    | a minuscule chance that will result in Google, MS, and Meta
    | getting slapped with anything more than a token fine (my bet
    | is it won't even be that), let alone paying every person who
    | ever wrote anything that was used in these datasets for
    | copyright violations, which would basically be everyone.
 
    | rpdillon wrote:
    | > on the hook for the cumulative cost of the copyright they
    | violated.
    | 
    | I think there's a strong argument for a Fair Use defense,
    | given the size of the models versus the size of the training
    | sets, as well as the gulf in intended use: an AI model
    | doesn't compete with e.g. a book. Obviously we'll have to see
    | it play out in court to find out.
 
      | ben_w wrote:
      | Current AI models don't compete with a book, from what I've
      | seen; I wouldn't want to bet how long it takes before they
      | can compete with not just one but all books.
 
  | hedgehog wrote:
  | I understand that LLMs to date have mostly been trained on a
  | wide variety of copyright-encumbered data, but in other domains
  | (computer vision, for example) the tradeoffs are different and
  | in practice many models are trained on private / unencumbered
  | data. If those weights are not protected by copyright, then my
  | concern is that it will be hard to sufficiently protect them
  | via license agreement, and it will become yet another factor
  | favoring the SaaSification of everything in tech.
 
    | sillysaurusx wrote:
    | This is true, and it's why I hesitated to file legal action.
    | My goal was to benefit hackers. If the outcome causes
    | problems for people who are just trying to share their work,
    | I'd be upset.
    | 
    | Ultimately what convinced me to proceed is that there are
    | immense forces pressuring ML models to become SaaS companies.
    | It's very difficult to offer an ML model for extended periods
    | _without_ being a company. E.g. https://6b.eleuther.ai/ is
    | down. Eleuther failing illustrates just how hard it is -- we
    | were all working as hard as we could to design something that
    | would last a long time, and a long time turned out to be two
    | short years. Contrast that with other kinds of hacking (e.g.
    | webdev, gamedev, hardware...) where the end result lasts
    | basically forever.
    | 
    | So if ML models aren't copyrightable, I think it'll hurt
    | companies a lot more than individuals. In fact the goal is
    | the other way around: to protect individuals. All I did was
    | publish Facebook's own GPL download script to GitHub, and it
    | got DMCA'd. If we don't push back on that kind of behavior
    | now, companies will get used to the idea that they control
    | "their" model -- even when their model is anything but
    | theirs.
 
    | idiotsecant wrote:
    | Is it useful to protect weights with copyright? What if I
    | download your weights and retrain them for 5 seconds,
    | changing each weight .0000001%? How much change is a new
    | product? What if I change a single weight?
 
      | hedgehog wrote:
      | Like the parallel scenarios of taking a book and changing a
      | few words, slapping a new logo on someone else's app, or
      | stylizing a photo with a filter, those are questions that
      | will be answered in court if people can't come to an
      | agreement on their own.
 
| cschmidt wrote:
| If you're looking at The Pile, you might also consider the
| RedPajama dataset. A new cleaned version was released recently:
| https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...
 
___________________________________________________________________
(page generated 2023-07-11 23:00 UTC)