[HN Gopher] Pathways Language Model (PaLM): Scaling to 540B para...
___________________________________________________________________
 
Pathways Language Model (PaLM): Scaling to 540B parameters
 
Author : homarp
Score  : 115 points
Date   : 2022-04-04 16:55 UTC (6 hours ago)
 
web link (ai.googleblog.com)
w3m dump (ai.googleblog.com)
 
| alphabetting wrote:
| the joke explanations on page 38 of the full paper linked here
| are blowing my mind. it's crazy how far language models have come
| 
| https://storage.googleapis.com/pathways-language-model/PaLM-...
 
  | Xenixo wrote:
  | Wow.
  | 
  | The anti joke explanation was also very impressive.
 
    | modeless wrote:
    | This may be the most impressive thing I've seen a language
    | model do so far. That's incredible. The future is going to be
    | very weird.
 
  | nqzero wrote:
  | this thing is already more human than I am
 
  | WithinReason wrote:
  | I like how it appears that it had to convert from imperial to
  | metric before it could make an inference:
  | 
  |  _300 miles per hour is about 480 km/h. This is about the
  | speed of a commercial airplane. [...]_
 
    | sib wrote:
    | Also, that's a very slow commercial airplane. (Unless talking
    | about an old turboprop?)
 
  | TaylorAlexander wrote:
  | Your comment prompted me to tweet an image of that section,
  | complete with alt text (as much as can fit). If anyone cares to
  | see it in tweet form.
  | 
  | https://twitter.com/tlalexander/status/1511089810752126984
 
  | severine wrote:
  | So... we are training models here on HN, especially if we
  | follow the site's guidelines! Makes you think... which makes
  | _it_ think!
  | 
  | Wow, interesting times, indeed.
 
  | theincredulousk wrote:
  | Haven't looked further, but I'm wondering about that. Is that
  | the result of training to be able to explain that specific
  | joke, or is it generalized?
  | 
  | In the past these things have been misleading. Some impressive
  | capability ends up being far more narrow than implied, so it's
  | kind of like just storing information and retrieving it with
  | extra steps.
 
    | whimsicalism wrote:
    | From the example, it seems hard to imagine that it has been
    | trained to explain this specific joke.
    | 
    | I understand language model skepticism is very big on HN, but
    | this is impressive.
 
      | mjburgess wrote:
      | How much of human written history can be compressed and
      | approximately stored in 540bn parameters?
      | 
      | It seems to me basically certain that no compressed
      | representation of text can be an understanding of language,
      | so necessarily, any statistical algorithm here is always
      | using coincidental tricks. That it takes 500bn parameters
      | to do it is, I think, a clue that we don't even really need.
      | 
      | Words mean what we do with them -- you need to be here in
      | the world with us, to understand what we mean. There is
      | nothing in the patterns of our usage of words which
      | provides their semantics, so the whole field of
      | distributional analysis precludes this superstition.
      | 
      | You cannot, by mere statistical analysis of patterns in
      | mere text, understand the nature of the world. But it is
      | precisely this that we communicate in text. We succeed
      | because we are both in the world, not because "w" occurring
      | before "d" somehow communicates anything.
      | 
      | Apparent correlations in text are meaningful to us, because
      | we created them, and we _have_ their semantics. The system
      | _must_ by its nature be a mere remembering.
 
        | hackinthebochs wrote:
        | >Words mean what we do with them -- you need to be here
        | in the world with us, to understand what we mean
        | 
        | This is like saying "humans can't fly because flight
        | requires flapping wings under your own power". Sure, it's
        | true given the definition this statement is employing,
        | but so what? Nothing of substance is learned by
        | definition. We certainly are not learning about any
        | fundamental limitations of humans from such a definition.
        | Similarly, defining understanding language as "the
        | association of symbols with things/behaviors in the
        | world" demonstrates nothing of substance about the limits
        | of language models.
        | 
        | But beyond that, it's clear to me the definition itself is
        | highly questionable. There are many fields where the vast
        | majority of uses of language do not directly correspond
        | with things or behaviors in the world. Pure math is an
        | obvious example. The understanding of pure math is a
        | purely abstract enterprise, one constituted by
        | relationships between other abstractions, bottoming out
        | at arbitrary placeholders (e.g. the number one is an
        | arbitrary placeholder situated in a larger arithmetical
        | structure). By your definition, a language model without
        | any contact with the world can understand purely abstract
        | systems as well as any human. But this just implies
        | there's something to understanding beyond merely
        | associations of symbols with things/behaviors in the
        | physical world.
 
        | whimsicalism wrote:
        | > It seems to me basically certain that no compressed
        | representation of text can be an understanding of
        | language, so necessarily, any statistical algorithm here
        | is always using coincidental tricks. That it takes 500bn
        | parameters to do it is, I think, a clue that we don't
        | even really need.
        | 
        | I think your premise contains your conclusion, which
        | while common, is something you should strive to avoid.
        | 
        | I do think your opinion is a good example of the
        | prevailing sentiment on Hacker News. To me, it seems to
        | come from a discomfort with the fact that even "we"
        | emerge out of the basic interactions of basic building
        | blocks. Our brain has been able to build world knowledge
        | "merely by" analysis of electrical impulses being
        | transmitted to it on wires.
 
        | mjburgess wrote:
        | I have no discomfort with the notion that our bodies,
        | which grow in response to direct causal contact with our
        | environment, contain in their structure the generative
        | capability for knowledge, imagination, skill, growth --
        | and so on.
        | 
        | I have no discomfort with the basically schizophrenic
        | notion that the shapes of words have something to do with
        | the nature of the world. I just think it's a kind of
        | insanity which absolutely destroys our ability to reason
        | carefully about the use of these systems.
        | 
        | That "tr" occurs before "ee" says as much about "trees"
        | as "leaves are green" says -- it is only because *we* have
        | the relevant semantics that the latter is meaningful when
        | interpreted in the light of our "environmental history"
        | recorded in our bodies, and given weight and utility by
        | our imaginations.
        | 
        | The structure of text is not the structure of the world.
        | This thesis is mad. It's a scientific thesis. It is
        | trivial to test it. It is trivial to wholly discredit it.
        | It's pseudoscience.
        | 
        | No one here is a scientist and no one treats any of this
        | as science. Where are the criteria for the empirical
        | adequacy of NLP systems as models of language? Specifying
        | any, conducting actual hypothesis tests, and establishing
        | a _theory_ of how NLP systems model language -- this
        | would immediately reveal the smoke and mirrors.
        | 
        | The work to reveal the statistical tricks underneath them
        | takes years, and no one has much motivation to do it. The
        | money lies in this sales pitch, and this is no science.
        | This is no scientific method.
 
        | whimsicalism wrote:
        | Agree to disagree. I think you are opining about things
        | that you are lacking fundamental knowledge on.
        | 
        | > The structure of text is not the structure of the
        | world. This thesis is mad. It's a scientific thesis. It is
        | trivial to test it. It is trivial to wholly discredit it.
        | It's pseudoscience.
        | 
        | It's unclear what you even mean by that. Are the
        | electrical impulses coming to our brain the "structure of
        | the world"?
 
        | rafaelero wrote:
        | Ok, boomer.
 
        | tux1968 wrote:
        | Then wouldn't you have to believe that people who are
        | born blind and deaf, or unable to walk, do not really
        | "understand", since they're not connected to the world in
        | the same way as those born without those limits?
 
| NiceElephant wrote:
| I wonder if AI is a technology that will move from "local
| producers" to a more centralized setup, where everybody just buys
| it as a service, because it becomes too complicated to operate it
| by yourself.
| 
| What are examples in history where this has happened before? The
| production of light, heat, and motive power comes to mind: with
| the invention of electricity, it moved from people's homes and
| businesses to (nuclear) power plants, which can only be operated
| by a fairly large team of specialists.
| 
| Does anybody have other examples?
 
  | eunos wrote:
  | Hosting moved from running your own servers at home, to
  | localized data centers, to global cloud companies.
 
    | NiceElephant wrote:
    | Yeah, this kinda goes in the same direction, but in this
    | case, as with agriculture for example, I feel it is mostly
    | for convenience. You could still do it at home if you wanted
    | to, in contrast to operating a nuclear power plant. I thought
    | chip-making might be another example, but I'm not sure that
    | was ever decentralized in its early days.
 
  | napoleon_thepig wrote:
  | This is kind of already happening with services like Google
  | cloud translation.
 
  | castratikron wrote:
  | Will we see an intelligence too cheap to meter?
  | 
  | https://www.atlanticcouncil.org/blogs/energysource/is-power-...
 
| WithinReason wrote:
| Based on their 3rd figure, it would take an approximately 100x
| larger model (and more data) to surpass the performance of the
| best humans
 
  | drusepth wrote:
  | Its performance on answers to chained inference questions (on
  | page 38 of https://storage.googleapis.com/pathways-language-
  | model/PaLM-...) has already surpassed the performance of this
  | human.
 
  | danuker wrote:
  | I placed a transparent plastic ruler on the screen to come to
  | the same conclusion, then I saw your comment.
 
    | WithinReason wrote:
    | Your methodology is much more sophisticated than mine
 
      | sib wrote:
      | My dad, who worked on jet engine production many decades
      | ago, would refer to MIL SPEC EYEBALL Mk I. (I _think_ he
      | was kidding...)
 
| londons_explore wrote:
| This huge language model was trained 'from scratch' - i.e. before
| the first batch of data went into the training process, the model
| state was simply initialized using random noise.
| 
| I believe we are near the end of that. As models get more and
| more expensive to train, we'll see future huge models being
| 'seeded' with weights from previous models. Eventually, nation-
| state levels of effort will be used to further train such
| networks, with the results then distributed to industry for use.
| 
| A whole industry will be built around licensing 'seeds' to build
| ML models on - you'll have to pay fees to all the 'ancestors' of
| models you use.
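
A minimal sketch of the 'seeding' idea described above, using PyTorch on
a toy model (the model, checkpoint name, and sizes are illustrative, not
anything from the paper):

    import torch
    import torch.nn as nn

    def make_model():
        # Toy stand-in for a large LM; a real 540B model would be
        # sharded across thousands of accelerator chips.
        return nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))

    # Pretend this checkpoint came from an earlier, expensive run.
    donor = make_model()
    torch.save(donor.state_dict(), "donor_checkpoint.pt")

    # 'From scratch': parameters start as random noise (the default init).
    model = make_model()

    # 'Seeded': overwrite the random init with the donor's weights,
    # then continue training on new data from there.
    model.load_state_dict(torch.load("donor_checkpoint.pt"))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # ... training loop continues from the inherited weights ...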
 
| [deleted]
 
| lukasb wrote:
| Does anyone know what the units are on the "performance
| improvement over SOTA" chart?
 
  | lukasb wrote:
  | Turns out it's a composite of "normalized task-specific
  | metrics", details in the paper. Shrug. Numbers go up!
 
  | r-zip wrote:
  | I was wondering the same. Without better y-axis labeling, it's
  | not that informative of a graphic.
 
    | whymauri wrote:
    | Poetic that the top post right now is (partially) about how
    | over-simplified figures in science communication result in a
    | popular misunderstanding of science, leading readers to
    | believe that conducting research is easier than it actually
    | is.
 
| The_rationalist wrote:
| Not a single super-large language model has beaten the state of
| the art on the key NLP tasks (POS tagging, dependency parsing,
| coreference, WSD, NER, etc.). They are only ever used for
| higher-level tasks, which is tragic.
 
  | oofbey wrote:
  | Why is that tragic? Classic NLP tasks are IMHO kinda pointless.
  | Nobody _actually_ cares about parse trees, etc. These things
  | were useful when that was the best we could do with ML, because
  | they allowed us to accomplish genuinely-useful NLP tasks by
  | writing code that uses things like parse trees, NER, etc. But
  | why bother with parse trees and junk like that if you can just
  | get the model to answer the question you actually care about?
 
| dgreensp wrote:
| When can we try using it?? :)
 
  | ausbah wrote:
  | I wonder if pruning and other methods that reduce size
  | drastically while not compromising on performance would be
  | possible
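
A toy illustration of one such method, magnitude pruning, using PyTorch's
pruning utilities; whether anything like this preserves quality at 540B
scale is exactly the open question in the comment above:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Zero out the 50% of weights with the smallest magnitude, then make
    # the sparsity permanent by baking the mask into the weight tensor.
    layer = nn.Linear(1024, 1024)
    prune.l1_unstructured(layer, name="weight", amount=0.5)
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"fraction of zeroed weights: {sparsity:.2f}")   # ~0.50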
 
  | gjstein wrote:
  | Would love an answer on this too. It would be even better not
  | just to _try_ using this, but also be able to run it locally,
  | something that has been impossible for GPT-3.
 
    | whimsicalism wrote:
    | This is not something that will be possible to run locally.
    | 
    | If you had 1 bit per parameter (not realistic), it would
    | still take roughly 70 GB of RAM just to load into memory.
 
    | arkano wrote:
    | Does it look like it would be possible to run locally?
 
| [deleted]
 
| sidcool wrote:
| What do 540 billion parameters mean in this case?
 
  | minimaxir wrote:
  | 540B float32 values in the model. (although since this model
  | was trained via TPUs, likely bfloat16s instead)
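
For a rough sense of where a number like 540B comes from: a dense
transformer has on the order of 12 * n_layers * d_model^2 weights in its
attention and feed-forward blocks. A back-of-the-envelope estimate using
roughly the depth and width reported in the paper (PaLM's architecture
deviates from a vanilla transformer, so this only approximates the total):

    # Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    # plus ~8*d^2 for a 4x feed-forward block => ~12*d^2 in total.
    n_layers, d_model, vocab = 118, 18432, 256_000

    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab * d_model
    total = block_params + embedding_params
    print(f"~{total / 1e9:.0f}B parameters")   # ~486B, same ballpark as 540B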
 
| londons_explore wrote:
| Please Google.... Please include in your papers _non-cherry-
| picked_ sample outputs! And explicitly say that they aren't
| cherry picked.
| 
| I understand that there is a chance that the output could be
| offensive/illegal. If necessary you can censor a few outputs, but
| make clear in the paper you've done that. It's better to do that
| than just show us the best picked outputs and pretend all outputs
| are as good.
 
| modeless wrote:
| This is why Google built TPUs. This alone justifies the whole
| program. This level of natural language understanding,
| once it is harnessed for applications and made efficient enough
| for wide use, is going to revolutionize literally everything
| Google does. Owning the chips that can do this is incredibly
| valuable and companies that are stuck purchasing or renting
| whatever Nvidia makes are going to be at a disadvantage.
 
| endisneigh wrote:
| The model is insane, but could this realistically be used in
| production?
 
  | motoboi wrote:
  | Yes. You don't need the model in RAM; NVMe disks are fine.
 
    | ekelsen wrote:
    | That would have very slow inference latency if you had to
    | read the model off disk for every token.
 
      | 1024core wrote:
      | 540B parameters means ~1TB of floating-point data (assuming
      | bfloat16). Quadruple that for other associated stuff, and
      | you'd need a machine with ~4TB of RAM.
 
        | endisneigh wrote:
        | right - and even if you did happen to have a machine with
        | 4TB of RAM - what kind of latency would you see on a
        | single machine running this as a service? how many
        | machines would you need for Google Translate performance?
        | 
        | doesn't seem like you can run this as a service, yet.
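
A quick sketch of the arithmetic behind this sub-thread; the 4x serving
overhead and the ~7 GB/s NVMe read speed are assumptions, not
measurements:

    params = 540e9

    # Weights alone, at different precisions.
    bf16_bytes = params * 2           # ~1.1 TB
    one_bit_bytes = params / 8        # ~68 GB even at 1 bit per parameter

    # Rough serving footprint if you keep ~4x the raw weights around
    # (activations, caches, framework overhead -- an assumption).
    ram_needed = 4 * bf16_bytes       # ~4.3 TB

    # Streaming from disk instead: dense autoregressive decoding touches
    # every weight for every generated token, so at ~7 GB/s sequential
    # NVMe read you wait on the order of minutes per token.
    seconds_per_token = bf16_bytes / 7e9
    print(f"{bf16_bytes / 1e12:.1f} TB of weights, "
          f"{ram_needed / 1e12:.1f} TB of RAM, "
          f"~{seconds_per_token:.0f} s/token if read from disk")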
 
  | anentropic wrote:
  | I too am curious what kind of hardware resources are needed to
  | run the model once it is trained
 
  | gk1 wrote:
  | Why not? I'm curious whether you have any specific roadblocks
  | in mind. OpenAI makes their large models available
  | through an API, removing any issues with model hosting and
  | operations.
 
    | minimaxir wrote:
    | Latency, mostly.
    | 
    | The GPT-3 APIs were _very_ slow on release, and even with the
    | current APIs it still takes a couple seconds to get results
    | from the 175B model.
 
| mountainriver wrote:
| Google, for all their flaws, really is building the future of AI.
| This is incredibly impressive and makes me think we are
| relatively close to AGI.
 
| nickvincent wrote:
| Crazy impressive! A question about the training data: anyone
| familiar with this line of work know what social media platforms
| the "conversation" data component of the training set came from?
| There's a datasheet that points to prior work
| https://arxiv.org/abs/2001.09977, which sounds like it could be
| reddit, HN, or a similar platform?
 
| gk1 wrote:
| Is there an equivalent to Moore's Law for language models? It
| feels like every week an even bigger (and supposedly better)
| model is announced.
 
  | visarga wrote:
  | Scaling Laws for Neural Language Models -
  | https://arxiv.org/abs/2001.08361
 
    | lucidrains wrote:
    | revised scaling laws https://arxiv.org/abs/2203.15556
    | https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-
    | scalin...
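
A tiny sketch of the revised ('Chinchilla') rule of thumb from the second
link: training compute is roughly C = 6 * N * D FLOPs, and compute-optimal
training uses on the order of 20 tokens per parameter, so N and D should
grow together. The compute budget below is illustrative only:

    import math

    def compute_optimal(flops):
        # C ~= 6 * N * D and D ~= 20 * N  =>  N ~= sqrt(C / 120).
        n_params = math.sqrt(flops / 120)
        n_tokens = 20 * n_params
        return n_params, n_tokens

    n, d = compute_optimal(2.5e24)    # hypothetical compute budget
    print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")   # ~144B, ~2.9T

By this rough heuristic, a 540B-parameter model would 'want' on the order
of 10T training tokens, far more than the ~780B tokens PaLM was trained
on, which is part of what the revised-scaling-laws discussion is about.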
 
| visarga wrote:
| Interesting that they used Chain of Thought Prompting[1] for
| improved reasoning so soon after its publication. Also related to
| DeepMind AlphaCode which generates code and filters results by
| unit tests, while Chain of Thought Prompting filters by checking
| for the correct answer at the end.
| 
| Seems like language models can generate more training data for
| language models in an iterative manner.
| 
| [1] https://arxiv.org/abs/2201.11903
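
The technique itself is just prompt construction: the few-shot exemplars
include worked reasoning before the answer, so the model imitates that
pattern and its final answer can be checked or filtered. A minimal sketch
(the exemplar, the parsing, and the some_lm.generate call are a
hypothetical illustration, not code from the paper):

    EXEMPLAR = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls "
        "each. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )

    def build_prompt(question: str) -> str:
        # Chain-of-thought prompt: worked exemplar first, then the query.
        return EXEMPLAR + f"Q: {question}\nA:"

    def extract_answer(completion: str) -> str:
        # Keep only what follows the last "The answer is", so candidate
        # completions can be filtered by their final answer.
        return completion.rsplit("The answer is", 1)[-1].strip(" .\n")

    prompt = build_prompt("A juggler has 16 balls. Half of them are golf "
                          "balls. How many golf balls are there?")
    # completion = some_lm.generate(prompt)   # hypothetical model call
    # answer = extract_answer(completion)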
 
  | nullc wrote:
  | The general technique is pretty obvious; I discussed and
  | demonstrated it in some HN comments with GPT-2 and GPT-3 a
  | couple of times in the last couple of years, and suggested
  | some speculative extensions (which might be totally unworkable;
  | unfortunately these networks are too big for me to attempt to
  | train to try it out): https://news.ycombinator.com/item?id=24005638
 
    | gwern wrote:
    | In fact, people had already shown it working with GPT-3
    | before you wrote your comment:
    | https://twitter.com/kleptid/status/1284069270603866113
    | https://twitter.com/kleptid/status/1284098635689611264 Seeing
    | how much smarter it could be with dialogue was very exciting
    | back then, when people were still super-skeptical.
    | 
    | The followup work has also brought out a lot of interesting
    | points: why didn't anyone get that working with GPT-2, and
    | why wouldn't your GPT-2 suggestion have worked? Because
    | inner-monologue capabilities seem to only emerge at some
    | point past 100b-parameters (and/or equivalent level of
    | compute), furnishing one of the most striking examples of
    | emergent capability-spikes in large NNs. GPT-2 is just _way_
    | too small, and if you had tried, you would've concluded
    | inner-monologue doesn't work. It doesn't work, and it keeps
    | on not working... until suddenly it does work.
 
  | make3 wrote:
  | The chain-of-thought paper is from Google, so they've
  | potentially known about it internally for a while.
 
___________________________________________________________________
(page generated 2022-04-04 23:00 UTC)