|
| alphabetting wrote:
| the joke explanations on page 38 of the full paper linked here
| are blowing my mind. it's crazy how far language models have come
|
| https://storage.googleapis.com/pathways-language-model/PaLM-...
| Xenixo wrote:
| Wow.
|
| The anti joke explanation was also very impressive.
| modeless wrote:
| This may be the most impressive thing I've seen a language
| model do so far. That's incredible. The future is going to be
| very weird.
| nqzero wrote:
| this thing is already more human than i am
| WithinReason wrote:
| I like how it appears that it had to convert from imperial to
| metric before it could make an inference:
|
| _300 miles per hour is about 480 km /h. This is about the
| speed of a commercial airplane. [...]_
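|
| A quick sanity check of that conversion (a minimal sketch; the 1.609
| factor is just the standard miles-to-kilometres constant, not anything
| from the paper):
|
|   MPH_TO_KMH = 1.60934
|   print(300 * MPH_TO_KMH)  # ~482.8, so "about 480 km/h" checks out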
| sib wrote:
| Also, that's a very slow commercial airplane. (Unless talking
| about an old turboprop?)
| TaylorAlexander wrote:
| Your comment prompted me to tweet an image of that section,
| complete with alt text (as much as can fit). If anyone cares to
| see it in tweet form.
|
| https://twitter.com/tlalexander/status/1511089810752126984
| severine wrote:
| So... we are training models here on HN, especially if we follow
| the site's guidelines! Makes you think... which makes _it_
| think!
|
| Wow, interesting times, indeed.
| theincredulousk wrote:
| Haven't looked further, but I'm wondering about that. Is that
| the result of training to be able to explain that specific
| joke, or is it generalized?
|
| In the past these things have been misleading. Some impressive
| capability ends up being far more narrow than implied, so it's
| kind of like just storing information and retrieving it with
| extra steps.
| whimsicalism wrote:
| From the example, it seems hard to imagine that it has been
| trained to explain this specific joke.
|
| I understand language model skepticism is very big on HN, but
| this is impressive.
| mjburgess wrote:
| How much of human written history can be compressed and
| approximately stored in 540bn parameters?
|
| It seems to me basically certain that no compressed
| representation of text can be an understanding of language,
| so necessarily, any statistical algorithm here is always
| using coincidental tricks. That it takes 540bn parameters
| to do it is, I think, a clue we don't even really need.
|
| Words mean what we do with them -- you need to be here in
| the world with us, to understand what we mean. There is
| nothing in the patterns of our usage of words which
| provides their semantics, so the whole field of
| distributional analysis precludes this superstition.
|
| You cannot, by mere statistical analysis of patterns in
| mere text, understand the nature of the world. But it is
| precisely this we communicate in text. We succeed because
| we are both in the world, not because "w" occurring before
| "d" somehow communicates anything.
|
| Apparent correlations in text are meaningful to us, because
| we created them, and we _have_ their semantics. The system
| _must_ by its nature be a mere remembering.
| hackinthebochs wrote:
| >Words mean what we do with them -- you need to be here
| in the world with us, to understand what we mean
|
| This is like saying "humans can't fly because flight
| requires flapping wings under your own power". Sure, it's
| true given the definition this statement is employing,
| but so what? Nothing of substance is learned by
| definition. We certainly are not learning about any
| fundamental limitations of humans from such a definition.
| Similarly, defining understanding language as "the
| association of symbols with things/behaviors in the
| world" demonstrates nothing of substance about the limits
| of language models.
|
| But beyond that, it's clear to me the definition itself is
| highly questionable. There are many fields where the vast
| majority of uses of language do not directly correspond
| with things or behaviors in the world. Pure math is an
| obvious example. The understanding of pure math is a
| purely abstract enterprise, one constituted by
| relationships between other abstractions, bottoming out
| at arbitrary placeholders (e.g. the number one is an
| arbitrary placeholder situated in a larger arithmetical
| structure). By your definition, a language model without
| any contact with the world can understand purely abstract
| systems as well as any human. But this just implies
| there's something to understanding beyond merely
| associations of symbols with things/behaviors in the
| physical world.
| whimsicalism wrote:
| > It seems to me basically certain that no compressed
| representation of text can be an understanding of
| language, so necessarily, any statistical algorithm here
| is always using coincidental tricks. That it takes 540bn
| parameters to do it is, I think, a clue we don't even
| really need.
|
| I think your premise contains your conclusion, which
| while common, is something you should strive to avoid.
|
| I do think your opinion is a good example of the
| prevailing sentiment on Hacker News. To me, it seems to
| come from a discomfort with the fact that even "we"
| emerge out of the basic interactions of basic building
| blocks. Our brain has been able to build world knowledge
| "merely by" analysis of electrical impulses being
| transmitted to it on wires.
| mjburgess wrote:
| I have no discomfort with the notion that our bodies,
| which grow in response to direct causal contact with our
| environment, contain in their structure the generative
| capability for knowledge, imagination, skill, growth --
| and so on.
|
| I have no discomfort with the basically schizophrenic
| notion that the shapes of words have something to do with
| the nature of the world. I just think it's a kind of
| insanity which absolutely destroys our ability to reason
| carefully about the use of these systems.
|
| That "tr" occurs before "ee" says as much about "trees"
| as "leaves are green" says -- it is only that *we* have
| the relevant semantics that the latter is meaningful when
| interpreted in the light of our "environmental history"
| recorded in our bodies, and given weight and utility by
| our imaginations.
|
| The structure of text is not the structure of the world.
| The contrary thesis is mad. It's a scientific thesis. It
| is trivial to test it. It is trivial to wholly discredit
| it. It's pseudoscience.
|
| No one here is a scientist and no one treats any of this
| as science. Where are the criteria for the empirical
| adequacy of NLP systems as models of language? Specifying
| any, conducting actual hypothesis tests, and establishing
| a _theory_ of how NLP systems model language -- this
| would immediately reveal the smoke and mirrors.
|
| The work to reveal the statistical tricks underneath them
| takes years, and no one has much motivation to do it. The
| money lies in this sales pitch, and this is no science.
| This is no scientific method.
| whimsicalism wrote:
| Agree to disagree. I think you are opining about things
| that you are lacking fundamental knowledge on.
|
| > The structure of text is not the structure of the
| world. The contrary thesis is mad. It's a scientific
| thesis. It is trivial to test it. It is trivial to wholly
| discredit it. It's pseudoscience.
|
| It's unclear what you even mean by that. Are the
| electrical impulses coming to our brain the "structure of
| the world"?
| rafaelero wrote:
| Ok, boomer.
| tux1968 wrote:
| Then wouldn't you have to believe that people who are
| born blind and deaf, or unable to walk, do not really
| "understand", since they're not connected to the world in
| the same way as those born without those limits?
| NiceElephant wrote:
| I wonder if AI is a technology that will move from "local
| producers" to a more centralized setup, where everybody just buys
| it as a service, because it becomes too complicated to operate it
| by yourself.
|
| What are examples in history where this has happened before? The
| production of light, heat, and movement comes to mind: with the
| invention of electricity, it moved from people's homes and
| businesses to (nuclear) power plants, which can only be operated
| by a fairly large team of specialists.
|
| Does anybody have other examples?
| eunos wrote:
| Hosting: it moved from your own servers at home and localized
| data centers to global cloud companies.
| NiceElephant wrote:
| Yeah, this kinda goes in the same direction, but in this
| case, as with agriculture for example, I feel it is mostly
| for convenience. You could still do it at home if you wanted
| to, in contrast to operating a nuclear power plant. I thought
| chip-making might be another example, but I'm not sure that
| was ever decentralized in its early days.
| napoleon_thepig wrote:
| This is kind of already happening with services like Google
| cloud translation.
| castratikron wrote:
| Will we see an intelligence too cheap to meter?
|
| https://www.atlanticcouncil.org/blogs/energysource/is-power-...
| WithinReason wrote:
| Based on their 3rd figure, it would take an approximately 100x
| larger model (and more data) to surpass the performance of the
| best humans
| drusepth wrote:
| Its performance on answers to chained inference questions (on
| page 38 of https://storage.googleapis.com/pathways-language-
| model/PaLM-...) has already surpassed the performance of this
| human.
| danuker wrote:
| I placed a transparent plastic ruler on the screen to come to
| the same conclusion, then I saw your comment.
| WithinReason wrote:
| Your methodology is much more sophisticated than mine
| sib wrote:
| My dad, who worked on jet engine production many decades
| ago, would refer to MIL SPEC EYEBALL Mk I. (I _think_ he
| was kidding...)
| londons_explore wrote:
| This huge language model was trained 'from scratch' - i.e., before
| the first batch of data went into the training process, the model
| state was simply initialized using random noise.
|
| I believe we are near the end of that. As models get more and
| more expensive to train, we'll see future huge models being
| 'seeded' with weights from previous models. Eventually nation-
| state levels of effort will be used to further train such
| networks to then distribute results to industry to use.
|
| A whole industry will be built around licensing 'seeds' to build
| ML models on - you'll have to pay fees to all the 'ancestors' of
| models you use.
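|
| A minimal sketch of what such 'seeding' could look like (hypothetical
| PyTorch-style code; the layer sizes and the checkpoint filename are
| made up for illustration, nothing here is from the paper):
|
|   import torch
|   import torch.nn as nn
|
|   # An 'ancestor' model whose licensed weights act as the seed.
|   seed = nn.TransformerEncoderLayer(d_model=512, nhead=8)
|   seed.load_state_dict(torch.load("ancestor_checkpoint.pt"))
|
|   # The new model starts from those weights instead of random noise,
|   # then continues training on fresh data.
|   new_model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
|   new_model.load_state_dict(seed.state_dict())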
| [deleted]
| lukasb wrote:
| Does anyone know what the units are on the "performance
| improvement over SOTA" chart?
| lukasb wrote:
| Turns out it's a composite of "normalized task-specific
| metrics", details in the paper. Shrug. Numbers go up!
| r-zip wrote:
| I was wondering the same. Without better y-axis labeling, it's
| not that informative of a graphic.
| whymauri wrote:
| Poetic that the top post right now is (partially) about how
| science communication over-simplifying figures results in a
| popular misunderstanding of science, leading readers to
| believe that conducting research is easier than it actually
| is.
| The_rationalist wrote:
| not a single super large language model has beaten the state of
| the art in the key NLP tasks (POS tagging, dependency parsing,
| coreference, WSD, NER, etc.). They are always only used for
| higher-level tasks, which is tragic.
| oofbey wrote:
| Why is that tragic? Classic NLP tasks are IMHO kinda pointless.
| Nobody _actually_ cares about parse trees, etc. These things
| were useful when that was the best we could do with ML, because
| they allowed us to accomplish genuinely-useful NLP tasks by
| writing code that uses things like parse trees, NER, etc. But
| why bother with parse trees and junk like that if you can just
| get the model to answer the question you actually care about?
| dgreensp wrote:
| When can we try using it?? :)
| ausbah wrote:
| i wonder if pruning and other methods that reduce size
| drastically while not compromising on performance would be
| possible
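|
| For reference, a minimal sketch of one such method, magnitude pruning,
| in NumPy (the weight matrix and the 10% keep-fraction are purely
| illustrative, not anything from the paper):
|
|   import numpy as np
|
|   weights = np.random.randn(1024, 1024).astype(np.float32)  # stand-in layer
|   keep = 0.1  # keep the largest 10% of weights by magnitude
|   threshold = np.quantile(np.abs(weights), 1 - keep)
|   mask = np.abs(weights) >= threshold
|   pruned = weights * mask  # stored sparsely, this is roughly 10x smaller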
| gjstein wrote:
| Would love an answer on this too. It would be even better not
| just to _try_ using this, but also to be able to run it locally,
| something that has been impossible for GPT-3.
| whimsicalism wrote:
| This is not something that will be possible to run locally.
|
| If you had 1 bit per parameter (not realistic), it would
| still take ~68 GB of RAM just to load into memory.
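|
| A quick back-of-the-envelope sketch (the 540B parameter count is the
| only number taken from the paper; everything else is arithmetic):
|
|   params = 540e9
|   for bits in (1, 8, 16, 32):
|       gb = params * bits / 8 / 1e9
|       print(f"{bits:>2}-bit weights: ~{gb:,.0f} GB")
|   # 1-bit ~68 GB, 8-bit ~540 GB, 16-bit ~1,080 GB, 32-bit ~2,160 GB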
| arkano wrote:
| Does it look like it would be possible to run locally?
| [deleted]
| sidcool wrote:
| What do 540 billion parameters mean in this case?
| minimaxir wrote:
| 540B float32 values in the model. (although since this model
| was trained via TPUs, likely bfloat16s instead)
| londons_explore wrote:
| Please Google.... Please include in your papers _non-cherry-
| picked_ sample outputs! And explicitly say that they aren't
| cherry picked.
|
| I understand that there is a chance that the output could be
| offensive/illegal. If necessary you can censor a few outputs, but
| make clear in the paper you've done that. It's better to do that
| than just show us the best picked outputs and pretend all outputs
| are as good.
| modeless wrote:
| This is why Google built TPUs. This alone justifies the whole
| program by itself. This level of natural language understanding,
| once it is harnessed for applications and made efficient enough
| for wide use, is going to revolutionize literally everything
| Google does. Owning the chips that can do this is incredibly
| valuable and companies that are stuck purchasing or renting
| whatever Nvidia makes are going to be at a disadvantage.
| endisneigh wrote:
| The model is insane, but could this realistically be used in
| production?
| motoboi wrote:
| Yes. You don't need the model in RAM; NVMe disks are fine.
| ekelsen wrote:
| That would have very slow inference latency if you had to
| read the model off disk for every token.
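|
| Rough numbers on why (a sketch; the ~1 TB figure assumes bfloat16
| weights, and the 7 GB/s is a typical high-end NVMe sequential-read
| speed, not a measurement):
|
|   model_bytes = 540e9 * 2      # ~1.08 TB at 2 bytes per parameter
|   nvme_bandwidth = 7e9         # ~7 GB/s sequential read, optimistic
|   print(model_bytes / nvme_bandwidth)  # ~154 s per token if all weights are re-read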
| 1024core wrote:
| 540B parameters means ~1 TB of floating-point data (assuming
| bfloat16). Quadruple that for other associated stuff, and
| you'd need a machine with 4TB of RAM.
| endisneigh wrote:
| right - and even if you did happen to have a machine
| with 4TB of ram - what type of latency would you have on
| a single machine running this as a service? how many
| machines would you need for google translate performance?
|
| doesn't seem like you can run this as a service, yet.
| anentropic wrote:
| I too am curious what kind of hardware resources are needed to
| run the model once it is trained
| gk1 wrote:
| Why not? I'm curious whether you have any specific
| roadblocks in mind. OpenAI makes their large models available
| through an API, removing any issues with model hosting and
| operations.
| minimaxir wrote:
| Latency, mostly.
|
| The GPT-3 APIs were _very_ slow on release, and even with the
| current APIs it still takes a couple seconds to get results
| from the 175B model.
| mountainriver wrote:
| Google for all their flaws really is building the future of AI.
| This is incredibly impressive and makes me think we are
| relatively close to GAI
| nickvincent wrote:
| Crazy impressive! A question about the training data: anyone
| familiar with this line of work know what social media platforms
| the "conversation" data component of the training set came from?
| There's a datasheet that points to prior work
| https://arxiv.org/abs/2001.09977, which sounds like it could be
| reddit, HN, or a similar platform?
| gk1 wrote:
| Is there an equivalent to Moore's Law for language models? It
| feels like every week an even bigger (and supposedly better)
| model is announced.
| visarga wrote:
| Scaling Laws for Neural Language Models -
| https://arxiv.org/abs/2001.08361
| lucidrains wrote:
| revised scaling laws https://arxiv.org/abs/2203.15556
| https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-
| scalin...
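|
| For the curious, the rough functional form those papers fit, as a
| sketch (Chinchilla-style parameterization; the constants below are
| placeholders, not the fitted values from either paper):
|
|   # Loss as a function of parameter count N and training tokens D:
|   # an irreducible term plus a capacity term plus a data term.
|   def loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
|       return E + A / N**alpha + B / D**beta
|
|   print(loss(N=1e11, D=1e12))  # example values; doubling only N or only D
|                                # gives diminishing returns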
| visarga wrote:
| Interesting that they used Chain of Thought Prompting[1] for
| improved reasoning so soon after its publication. Also related to
| DeepMind AlphaCode which generates code and filters results by
| unit tests, while Chain of Thought Prompting filters by checking
| for the correct answer at the end.
|
| Seems like language models can generate more training data for
| language models in an iterative manner.
|
| [1] https://arxiv.org/abs/2201.11903
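|
| A minimal sketch of that filtering idea (the generate() function is a
| hypothetical stand-in for sampling completions from a language model;
| the prompt format is chain-of-thought style but not taken from the
| paper):
|
|   def generate(prompt, n):
|       """Hypothetical stand-in: return n sampled completions, each
|       ending with a line like 'Answer: <value>'."""
|       raise NotImplementedError
|
|   def final_answer(completion):
|       return completion.rsplit("Answer:", 1)[-1].strip()
|
|   # Sample several reasoning chains, keep only those whose final answer
|   # matches the known label; the kept chains can become new training data.
|   question = "If there are 3 cars and each car has 4 wheels, how many wheels?"
|   prompt = f"Q: {question}\nA: Let's think step by step."
|   good_chains = [c for c in generate(prompt, n=8) if final_answer(c) == "12"]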
| nullc wrote:
| The general technique is pretty obvious; I discussed and
| demonstrated it in some HN comments with GPT2 and GPT3 a couple
| times in the last couple years, and suggested some speculative
| extensions (which might be totally unworkable, unfortunately
| these networks are too big for me to attempt to train to try it
| out) https://news.ycombinator.com/item?id=24005638
| gwern wrote:
| In fact, people had already shown it working with GPT-3
| before you wrote your comment:
| https://twitter.com/kleptid/status/1284069270603866113
| https://twitter.com/kleptid/status/1284098635689611264 Seeing
| how much smarter it could be with dialogue was very exciting
| back then, when people were still super-skeptical.
|
| The followup work has also brought out a lot of interesting
| points: why didn't anyone get that working with GPT-2, and
| why wouldn't your GPT-2 suggestion have worked? Because
| inner-monologue capabilities seem to only emerge at some
| point past 100b-parameters (and/or equivalent level of
| compute), furnishing one of the most striking examples of
| emergent capability-spikes in large NNs. GPT-2 is just _way_
| too small, and if you had tried, you would've concluded
| inner-monologue doesn't work. It doesn't work, and it keeps
| on not working... until suddenly it does work.
| make3 wrote:
| The chain of thought paper is from Google, so they've known
| about it internally for a while potentially