[HN Gopher] Pathways Language Model (PaLM): Scaling to 540B para...
___________________________________________________________________
 
Pathways Language Model (PaLM): Scaling to 540B parameters
 
Author : homarp
Score  : 115 points
Date   : 2022-04-04 16:55 UTC (6 hours ago)
 
web link (ai.googleblog.com)
w3m dump (ai.googleblog.com)
 
| alphabetting wrote:
| the joke explanations on page 38 of the full paper linked here
| are blowing my mind. it's crazy how far language models have come
| 
| https://storage.googleapis.com/pathways-language-model/PaLM-...
 
  | Xenixo wrote:
  | Wow.
  | 
  | The anti joke explanation was also very impressive.
 
    | modeless wrote:
    | This may be the most impressive thing I've seen a language
    | model do so far. That's incredible. The future is going to be
    | very weird.
 
  | nqzero wrote:
  | this thing is already more human than I am
 
  | WithinReason wrote:
  | I like how it appears that it had to convert from imperial to
  | metric before it could make an inference:
  | 
  |  _300 miles per hour is about 480 km/h. This is about the
  | speed of a commercial airplane. [...]_
 
    | sib wrote:
    | Also, that's a very slow commercial airplane. (Unless talking
    | about an old turboprop?)
 
  | TaylorAlexander wrote:
  | Your comment prompted me to tweet an image of that section,
  | complete with alt text (as much as can fit). If anyone cares to
  | see it in tweet form.
  | 
  | https://twitter.com/tlalexander/status/1511089810752126984
 
  | severine wrote:
  | So... we are training models here on HN, especially if we
  | follow the site's guidelines! Makes you think... which makes
  | _it_ think!
  | 
  | Wow, interesting times, indeed.
 
  | theincredulousk wrote:
  | Haven't looked further, but I'm wondering about that. Is that
  | the result of training to be able to explain that specific
  | joke, or is it generalized?
  | 
  | In the past these things have been misleading. Some impressive
  | capability ends up being far more narrow than implied, so it's
  | kind of like just storing information and retrieving it with
  | extra steps.
 
    | whimsicalism wrote:
    | From the example, it seems hard to imagine that it has been
    | trained to explain this specific joke.
    | 
    | I understand language model skepticism is very big on HN, but
    | this is impressive.
 
      | mjburgess wrote:
      | How much of human written history can be compressed and
      | approximately stored in 540bn parameters?
      | 
      | It seems to me basically certain that no compressed
      | representation of text can be an understanding of language,
      | so necessarily, any statistical algorithm here is always
      | using coincidental tricks. That it takes 500bn parameters
      | to do it is, I think, a clue that we don't even really need.
      | 
      | Words mean what we do with them -- you need to be here in
      | the world with us, to understand what we mean. There is
      | nothing in the patterns of our usage of words which
      | provides their semantics, so the whole field of
      | distributional analysis precludes this superstition.
      | 
      | You cannot, by mere statistical analysis of patterns in
      | mere text, understand the nature of the world. But it is
      | precisely this that we communicate in text. We succeed
      | because we are both in the world, not because "w" occurring
      | before "d" somehow communicates anything.
      | 
      | Apparent correlations in text are meaningful to us, because
      | we created them, and we _have_ their semantics. The system
      | _must_ by its nature be a mere remembering.
 
        | hackinthebochs wrote:
        | >Words mean what we do with them -- you need to be here
        | in the world with us, to understand what we mean
        | 
        | This is like saying "humans can't fly because flight
        | requires flapping wings under your own power". Sure, it's
        | true given the definition this statement is employing,
        | but so what? Nothing of substance is learned by
        | definition. We certainly are not learning about any
        | fundamental limitations of humans from such a definition.
        | Similarly, defining understanding language as "the
        | association of symbols with things/behaviors in the
        | world" demonstrates nothing of substance about the limits
        | of language models.
        | 
        | But beyond that, it's clear to me the definition itself is
        | highly questionable. There are many fields where the vast
        | majority of uses of language do not directly correspond
        | with things or behaviors in the world. Pure math is an
        | obvious example. The understanding of pure math is a
        | purely abstract enterprise, one constituted by
        | relationships between other abstractions, bottoming out
        | at arbitrary placeholders (e.g. the number one is an
        | arbitrary placeholder situated in a larger arithmetical
        | structure). By your definition, a language model without
        | any contact with the world can understand purely abstract
        | systems as well as any human. But this just implies
        | there's something to understanding beyond merely
        | associations of symbols with things/behaviors in the
        | physical world.
 
        | whimsicalism wrote:
        | > It seems to me basically certain that no compressed
        | representation of text can be an understanding of
        | language, so necessarily, any statistical algorithm here
        | is always using coincidental tricks. That it takes 500bn
        | parameters to do it is, I think, a clue that we don't
        | even really need.
        | 
        | I think your premise contains your conclusion, which
        | while common, is something you should strive to avoid.
        | 
        | I do think your opinion is a good example of the
        | prevailing sentiment on Hacker News. To me, it seems to
        | come from a discomfort with the fact that even "we"
        | emerge out of the basic interactions of basic building
        | blocks. Our brain has been able to build world knowledge
        | "merely by" analysis of electrical impulses being
        | transmitted to it on wires.
 
        | mjburgess wrote:
        | I have no discomfort with the notion that our bodies,
        | which grow in response to direct causal contact with our
        | environment, contain in their structure the generative
        | capability for knowledge, imagination, skill, growth --
        | and so on.
        | 
        | I have no discomfort with the basically schizophrenic
        | notion that the shapes of words have something to do with
        | the nature of the world. I just think it's a kind of
        | insanity which absolutely destroys our ability to reason
        | carefully about the use of these systems.
        | 
        | That "tr" occurs before "ee" says as much about "trees"
        | as "leaves are green" says -- it is only because *we* have
        | the relevant semantics that the latter is meaningful when
        | interpreted in the light of our "environmental history"
        | recorded in our bodies, and given weight and utility by
        | our imaginations.
        | 
        | The structure of text is not the structure of the world.
        | This thesis is mad. It's a scientific thesis. It is
        | trivial to test it. It is trivial to wholly discredit it.
        | It's pseudoscience.
        | 
        | No one here is a scientist and no one treats any of this
        | as science. Where are the criteria for the empirical
        | adequacy of NLP systems as models of language? Specifying
        | any, conducting actual hypothesis tests, and establishing
        | a _theory_ of how NLP systems model language -- this
        | would immediately reveal the smoke and mirrors.
        | 
        | The work to reveal the statistical tricks underneath them
        | takes years, and no one has much motivation to do it. The
        | money lies in this sales pitch, and this is no science.
        | This is no scientific method.
 
        | whimsicalism wrote:
        | Agree to disagree. I think you are opining about things
        | that you are lacking fundamental knowledge on.
        | 
        | > The structure of text is not the structure of the
        | world. This thesis is mad. It's a scientific thesis. It is
        | trivial to test it. It is trivial to wholly discredit it.
        | It's pseudoscience.
        | 
        | It's unclear what you even mean by that. Are the
        | electrical impulses coming to our brain the "structure of
        | the world"?
 
        | rafaelero wrote:
        | Ok, boomer.
 
        | tux1968 wrote:
        | Then wouldn't you have to believe that people who are
        | born blind and deaf, or unable to walk, do not really
        | "understand", since they're not connected to the world in
        | the same way as those born without those limits?
 
| NiceElephant wrote:
| I wonder if AI is a technology that will move from "local
| producers" to a more centralized setup, where everybody just buys
| it as a service, because it becomes too complicated to operate it
| by yourself.
| 
| What are examples in history where this has happened before? The
| production of light, heat, and motive power comes to mind: with
| the invention of electricity, it moved from people's homes and
| businesses to (nuclear) power plants, which can only be operated
| by a fairly large team of specialists.
| 
| Does anybody have other examples?
 
  | eunos wrote:
  | Hosting moved from running your own servers at home, to
  | localized data centers, to global cloud companies.
 
    | NiceElephant wrote:
    | Yeah, this kinda goes in the same direction, but in this
    | case, as with agriculture for example, I feel it is mostly
    | for convenience. You could still do it at home if you wanted
    | to, in contrast to operating a nuclear power plant. I thought
    | chip-making might be another example, but I'm not sure that
    | was ever decentralized in its early days.
 
  | napoleon_thepig wrote:
  | This is kind of already happening with services like Google
  | cloud translation.
 
  | castratikron wrote:
  | Will we see an intelligence too cheap to meter?
  | 
  | https://www.atlanticcouncil.org/blogs/energysource/is-power-...
 
| WithinReason wrote:
| Based on their 3rd figure, it would take an approximately 100x
| larger model (and more data) to surpass the performance of the
| best humans
 
  | drusepth wrote:
  | Its performance on answers to chained inference questions (on
  | page 38 of https://storage.googleapis.com/pathways-language-
  | model/PaLM-...) has already surpassed the performance of this
  | human.
 
  | danuker wrote:
  | I placed a transparent plastic ruler on the screen to come to
  | the same conclusion, then I saw your comment.
 
    | WithinReason wrote:
    | Your methodology is much more sophisticated than mine
 
      | sib wrote:
      | My dad, who worked on jet engine production many decades
      | ago, would refer to MIL SPEC EYEBALL Mk I. (I _think_ he
      | was kidding...)
 
| londons_explore wrote:
| This huge language model was trained 'from scratch' - i.e. before
| the first batch of data went into the training process, the model
| state was simply initialized using random noise.
| 
| I believe we are near the end of that. As models get more and
| more expensive to train, we'll see future huge models being
| 'seeded' with weights from previous models. Eventually, nation-
| state levels of effort will be used to further train such
| networks, with the results then distributed to industry for use.
| 
| A whole industry will be built around licensing 'seeds' to build
| ML models on - you'll have to pay fees to all the 'ancestors' of
| models you use.
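
A minimal sketch of the 'seeding' idea described above, using PyTorch on
a toy model (the model, checkpoint name, and sizes are illustrative, not
anything from the paper):

    import torch
    import torch.nn as nn

    def make_model():
        # Toy stand-in for a large LM; a real 540B model would be
        # sharded across thousands of accelerator chips.
        return nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))

    # Pretend this checkpoint came from an earlier, expensive run.
    donor = make_model()
    torch.save(donor.state_dict(), "donor_checkpoint.pt")

    # 'From scratch': parameters start as random noise (the default init).
    model = make_model()

    # 'Seeded': overwrite the random init with the donor's weights,
    # then continue training on new data from there.
    model.load_state_dict(torch.load("donor_checkpoint.pt"))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # ... training loop continues from the inherited weights ...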
 
| [deleted]
 
| lukasb wrote:
| Does anyone know what the units are on the "performance
| improvement over SOTA" chart?
 
  | lukasb wrote:
  | Turns out it's a composite of "normalized task-specific
  | metrics", details in the paper. Shrug. Numbers go up!
 
  | r-zip wrote:
  | I was wondering the same. Without better y-axis labeling, it's
  | not that informative of a graphic.
 
    | whymauri wrote:
    | Poetic that the top post right now is (partially) about how
    | over-simplified figures in science communication result in a
    | popular misunderstanding of science, leading readers to
    | believe that conducting research is easier than it actually
    | is.
 
| The_rationalist wrote:
| Not a single super-large language model has beaten the state of
| the art on the key NLP tasks (POS tagging, dependency parsing,
| coreference, WSD, NER, etc.). They are only ever used for
| higher-level tasks, which is tragic.
 
  | oofbey wrote:
  | Why is that tragic? Classic NLP tasks are IMHO kinda pointless.
  | Nobody _actually_ cares about parse trees, etc. These things
  | were useful when that was the best we could do with ML, because
  | they allowed us to accomplish genuinely-useful NLP tasks by
  | writing code that uses things like parse trees, NER, etc. But
  | why bother with parse trees and junk like that if you can just
  | get the model to answer the question you actually care about?
 
| dgreensp wrote:
| When can we try using it?? :)
 
  | ausbah wrote:
  | I wonder if pruning and other methods that reduce size
  | drastically while not compromising on performance would be
  | possible
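
A toy illustration of one such method, magnitude pruning, using PyTorch's
pruning utilities; whether anything like this preserves quality at 540B
scale is exactly the open question in the comment above:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Zero out the 50% of weights with the smallest magnitude, then make
    # the sparsity permanent by baking the mask into the weight tensor.
    layer = nn.Linear(1024, 1024)
    prune.l1_unstructured(layer, name="weight", amount=0.5)
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"fraction of zeroed weights: {sparsity:.2f}")   # ~0.50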
 
  | gjstein wrote:
  | Would love an answer on this too. It would be even better not
  | just to _try_ using this, but also be able to run it locally,
  | something that has been impossible for GPT-3.
 
    | whimsicalism wrote:
    | This is not something that will be possible to run locally.
    | 
    | If you had 1 bit per parameter (not realistic), it would
    | still take roughly 70 GB of RAM just to load into memory.
 
    | arkano wrote:
    | Does it look like it would be possible to run locally?
 
| [deleted]
 
| sidcool wrote:
| What do 540 billion parameters mean in this case?
 
  | minimaxir wrote:
  | 540B float32 values in the model. (although since this model
  | was trained via TPUs, likely bfloat16s instead)
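
For a rough sense of where a number like 540B comes from: a dense
transformer has on the order of 12 * n_layers * d_model^2 weights in its
attention and feed-forward blocks. A back-of-the-envelope estimate using
roughly the depth and width reported in the paper (PaLM's architecture
deviates from a vanilla transformer, so this only approximates the total):

    # Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    # plus ~8*d^2 for a 4x feed-forward block => ~12*d^2 in total.
    n_layers, d_model, vocab = 118, 18432, 256_000

    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab * d_model
    total = block_params + embedding_params
    print(f"~{total / 1e9:.0f}B parameters")   # ~486B, same ballpark as 540B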
 
| londons_explore wrote:
| Please Google.... Please include in your papers _non-cherry-
| picked_ sample outputs! And explicitly say that they aren't
| cherry picked.
| 
| I understand that there is a chance that the output could be
| offensive/illegal. If necessary you can censor a few outputs, but
| make clear in the paper you've done that. It's better to do that
| than just show us the best picked outputs and pretend all outputs
| are as good.
 
| modeless wrote:
| This is why Google built TPUs. This alone justifies the whole
| program. This level of natural language understanding,
| once it is harnessed for applications and made efficient enough
| for wide use, is going to revolutionize literally everything
| Google does. Owning the chips that can do this is incredibly
| valuable and companies that are stuck purchasing or renting
| whatever Nvidia makes are going to be at a disadvantage.
 
| endisneigh wrote:
| The model is insane, but could this realistically be used in
| production?
 
  | motoboi wrote:
  | Yes. You don't need the model in RAM; NVMe disks are fine.
 
    | ekelsen wrote:
    | That would have very slow inference latency if you had to
    | read the model off disk for every token.
 
      | 1024core wrote:
      | 540B parameters means ~1TB of floating-point data (assuming
      | bfloat16). Quadruple that for other associated stuff, and
      | you'd need a machine with ~4TB of RAM.
 
        | endisneigh wrote:
        | right - and even if you did happen to have a machine with
        | 4TB of RAM - what kind of latency would you see on a
        | single machine running this as a service? how many
        | machines would you need for Google Translate performance?
        | 
        | doesn't seem like you can run this as a service, yet.
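
A quick sketch of the arithmetic behind this sub-thread; the 4x serving
overhead and the ~7 GB/s NVMe read speed are assumptions, not
measurements:

    params = 540e9

    # Weights alone, at different precisions.
    bf16_bytes = params * 2           # ~1.1 TB
    one_bit_bytes = params / 8        # ~68 GB even at 1 bit per parameter

    # Rough serving footprint if you keep ~4x the raw weights around
    # (activations, caches, framework overhead -- an assumption).
    ram_needed = 4 * bf16_bytes       # ~4.3 TB

    # Streaming from disk instead: dense autoregressive decoding touches
    # every weight for every generated token, so at ~7 GB/s sequential
    # NVMe read you wait on the order of minutes per token.
    seconds_per_token = bf16_bytes / 7e9
    print(f"{bf16_bytes / 1e12:.1f} TB of weights, "
          f"{ram_needed / 1e12:.1f} TB of RAM, "
          f"~{seconds_per_token:.0f} s/token if read from disk")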
 
  | anentropic wrote:
  | I too am curious what kind of hardware resources are needed to
  | run the model once it is trained
 
  | gk1 wrote:
  | Why not? I'm curious whether you have any specific roadblocks
  | in mind. OpenAI makes their large models available
  | through an API, removing any issues with model hosting and
  | operations.
 
    | minimaxir wrote:
    | Latency, mostly.
    | 
    | The GPT-3 APIs were _very_ slow on release, and even with the
    | current APIs it still takes a couple seconds to get results
    | from the 175B model.
 
| mountainriver wrote:
| Google, for all their flaws, really is building the future of AI.
| This is incredibly impressive and makes me think we are
| relatively close to AGI.
 
| nickvincent wrote:
| Crazy impressive! A question about the training data: anyone
| familiar with this line of work know what social media platforms
| the "conversation" data component of the training set came from?
| There's a datasheet that points to prior work
| https://arxiv.org/abs/2001.09977, which sounds like it could be
| reddit, HN, or a similar platform?
 
| gk1 wrote:
| Is there an equivalent to Moore's Law for language models? It
| feels like every week an even bigger (and supposedly better)
| model is announced.
 
  | visarga wrote:
  | Scaling Laws for Neural Language Models -
  | https://arxiv.org/abs/2001.08361
 
    | lucidrains wrote:
    | revised scaling laws https://arxiv.org/abs/2203.15556
    | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-
    | scalin...
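
A tiny sketch of the revised ('Chinchilla') rule of thumb from the second
link: training compute is roughly C = 6 * N * D FLOPs, and compute-optimal
training uses on the order of 20 tokens per parameter, so N and D should
grow together. The compute budget below is illustrative only:

    import math

    def compute_optimal(flops):
        # C ~= 6 * N * D and D ~= 20 * N  =>  N ~= sqrt(C / 120).
        n_params = math.sqrt(flops / 120)
        n_tokens = 20 * n_params
        return n_params, n_tokens

    n, d = compute_optimal(2.5e24)    # hypothetical compute budget
    print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")   # ~144B, ~2.9T

By this rough heuristic, a 540B-parameter model would 'want' on the order
of 10T training tokens, far more than the ~780B tokens PaLM was trained
on, which is part of what the revised-scaling-laws discussion is about.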
 
| visarga wrote:
| Interesting that they used Chain of Thought Prompting[1] for
| improved reasoning so soon after its publication. Also related to
| DeepMind AlphaCode which generates code and filters results by
| unit tests, while Chain of Thought Prompting filters by checking
| for the correct answer at the end.
| 
| Seems like language models can generate more training data for
| language models in an iterative manner.
| 
| [1] https://arxiv.org/abs/2201.11903
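
The technique itself is just prompt construction: the few-shot exemplars
include worked reasoning before the answer, so the model imitates that
pattern and its final answer can be checked or filtered. A minimal sketch
(the exemplar, the parsing, and the some_lm.generate call are a
hypothetical illustration, not code from the paper):

    EXEMPLAR = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls "
        "each. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )

    def build_prompt(question: str) -> str:
        # Chain-of-thought prompt: worked exemplar first, then the query.
        return EXEMPLAR + f"Q: {question}\nA:"

    def extract_answer(completion: str) -> str:
        # Keep only what follows the last "The answer is", so candidate
        # completions can be filtered by their final answer.
        return completion.rsplit("The answer is", 1)[-1].strip(" .\n")

    prompt = build_prompt("A juggler has 16 balls. Half of them are golf "
                          "balls. How many golf balls are there?")
    # completion = some_lm.generate(prompt)   # hypothetical model call
    # answer = extract_answer(completion)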
 
  | nullc wrote:
  | The general technique is pretty obvious; I discussed and
  | demonstrated it in some HN comments with GPT-2 and GPT-3 a
  | couple of times in the last couple of years, and suggested
  | some speculative extensions (which might be totally unworkable;
  | unfortunately these networks are too big for me to attempt to
  | train to try it out): https://news.ycombinator.com/item?id=24005638
 
    | gwern wrote:
    | In fact, people had already shown it working with GPT-3
    | before you wrote your comment:
    | https://twitter.com/kleptid/status/1284069270603866113
    | https://twitter.com/kleptid/status/1284098635689611264 Seeing
    | how much smarter it could be with dialogue was very exciting
    | back then, when people were still super-skeptical.
    | 
    | The followup work has also brought out a lot of interesting
    | points: why didn't anyone get that working with GPT-2, and
    | why wouldn't your GPT-2 suggestion have worked? Because
    | inner-monologue capabilities seem to only emerge at some
    | point past 100b-parameters (and/or equivalent level of
    | compute), furnishing one of the most striking examples of
    | emergent capability-spikes in large NNs. GPT-2 is just _way_
    | too small, and if you had tried, you would've concluded
    | inner-monologue doesn't work. It doesn't work, and it keeps
    | on not working... until suddenly it does work.
 
  | make3 wrote:
  | The chain-of-thought paper is from Google, so they've
  | potentially known about it internally for a while.
 
___________________________________________________________________
(page generated 2022-04-04 23:00 UTC)