[HN Gopher] From deep to long learning?
___________________________________________________________________
 
From deep to long learning?
 
Author : headalgorithm
Score  : 353 points
Date   : 2023-04-09 12:41 UTC (10 hours ago)
 
web link (hazyresearch.stanford.edu)
w3m dump (hazyresearch.stanford.edu)
 
| [deleted]
 
| marshmallowmad wrote:
| I don't quite understand why context length needs to keep
| growing. It seems to me like many tasks (e.g. customizing your
| LLM on your own data) would benefit from the model doing some
| sort of sped-up fine-tuning on any prompt that gets added. That
| way we don't have to rely on all these hacks to find the most
| relevant context and repeat the same info in prompts. Curious if
| anyone has insight here as this has caused me some confusion
| lately!
 
  | [deleted]
 
| raphlinus wrote:
| The thing that stuck out to me is the assertion that FFT is
| poorly supported on modern GPUs. That's surprising to me, as
| there's cuFFT officially supported by Nvidia, and vkFFT that
| achieves similar performance portably using compute shaders. I
| believe these are based on f32 math, so perhaps the potential win
| is using tensor cores to compute FFT at lower precision? It seems
| surprising to me that decomposing into matrix operations is the
| win here; it seems you'd do better writing a kernel that makes
| use of the cooperative matrix (aka WMMA, tensor core,
| simd_matrix) capabilities of the GPU.
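| 
| (For reference, the FFT use here is for computing very long,
| input-length convolutions, where the O(n log n) FFT route beats
| direct O(n^2) convolution. A minimal sketch of that pattern in
| PyTorch - my own illustration, not the paper's fused kernel:)
| 
|     import torch
| 
|     def fft_conv(u, k):
|         # Long convolution via FFT: O(n log n) vs O(n^2) direct.
|         # u: (batch, n) input, k: (n,) filter, zero-padded to 2n
|         # so the circular convolution becomes a linear one.
|         n = u.shape[-1]
|         U = torch.fft.rfft(u, n=2 * n)
|         K = torch.fft.rfft(k, n=2 * n)
|         return torch.fft.irfft(U * K, n=2 * n)[..., :n]
| 
|     u = torch.randn(4, 1024)   # batch of sequences
|     k = torch.randn(1024)      # filter as long as the input
|     y = fft_conv(u, k)         # same shape as u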
 
  | svantana wrote:
  | Looking at the source paper, they are claiming a 2.2x speedup
  | over cuFFT for convolutions, so it's not an earth-shattering
  | gain, but still.
 
| aqme28 wrote:
| It seems to me like these long context lengths are going to make
| a huge difference in the capabilities of things people are able
| to produce. Agents can use LLMs to subdivide a large corpus into
| manageable chunks. More context just makes them much much better
| at that.
| 
| Regardless, for most programming tasks, I doubt my equivalent
| human context length is any better than 32k tokens.
 
  | jerpint wrote:
  | The problem is not the short-term context length, but the long
  | term. If you want to do things like long-term goal planning,
  | keeping track of distant past events can be of high value.
 
    | taneq wrote:
    | Maybe that's what our inner monologue is for... to cycle
    | relevant context through regularly so it stays in scope. I
    | mean, that's not actually why we have one but it'd be cute.
 
    | PaulHoule wrote:
    | I'd argue for planning that the answer is to couple the LLM
    | to some other system rather than try to improve the LLM.
    | Combinatorial optimization is a well-understood problem that
    | is NP-complete in theory but usually tractable in practice.
    | For doing math we might as well give the LLM a pocket
    | calculator, why not couple it to a planner, theorem prover
    | and similar tools?
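    | 
    | A toy sketch of what that coupling could look like (the tool
    | names and the "CALL tool: arg" protocol here are made up,
    | purely to illustrate the loop):
    | 
    |     def calculator(expr: str) -> str:
    |         # Toy evaluator standing in for a real tool.
    |         return str(eval(expr, {"__builtins__": {}}))
    | 
    |     TOOLS = {"calculator": calculator}
    | 
    |     def call_llm(prompt: str) -> str:
    |         # Stand-in for a real LLM API: asks for arithmetic
    |         # once, then answers using the returned result.
    |         if "calculator returned" not in prompt:
    |             return "CALL calculator: 12 * 7"
    |         return "The answer is 84."
    | 
    |     def solve(task: str) -> str:
    |         prompt = task
    |         for _ in range(5):              # bounded loop
    |             reply = call_llm(prompt)
    |             if reply.startswith("CALL "):
    |                 name, _, arg = reply[5:].partition(":")
    |                 result = TOOLS[name.strip()](arg.strip())
    |                 prompt += f"\n{name} returned: {result}"
    |             else:
    |                 return reply            # final answer
    |         return reply
    | 
    |     print(solve("What is 12 * 7?"))   # The answer is 84.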
 
      | pixl97 wrote:
      | This is pretty much what people are testing now with Auto-
      | GPT
 
      | sharemywin wrote:
      | That's the interesting thing. If LLMs start to just become
      | interfaces for other systems, you could probably just use
      | them to train smaller systems in the style of Alpaca and
      | other recent LLMs.
 
      | pmoriarty wrote:
      | _" For doing math we might as well give the LLM a pocket
      | calculator, why not couple it to a planner, theorem prover
      | and similar tools?"_
      | 
      | ChatGPT has already been hooked up to Wolfram Alpha.[1]
      | 
      | For hooking up other things, see HuggingGPT and
      | TaskMatrix.AI.[2][3]
      | 
      | [1] - https://writings.stephenwolfram.com/2023/03/chatgpt-
      | gets-its...
      | 
      | [2] - https://arxiv.org/pdf/2303.17580.pdf
      | 
      | [3] - https://arxiv.org/pdf/2303.16434.pdf
 
      | btbuildem wrote:
      | Precisely. LLMs are the Perl of the future
 
  | PartiallyTyped wrote:
  | Our context is hierarchical, at multiple different resolutions,
  | so I don't think it's comparable.
 
    | aqme28 wrote:
    | I was trying to describe a hierarchy of LLMs as an agent. I
    | don't think that's a uniquely human ability.
 
    | btbuildem wrote:
    | I've had good results using that approach with LLMs for
    | domain-specific problem solving.
 
    | letitgo12345 wrote:
    | Good chance OAI's is too: https://arxiv.org/abs/2110.13711
    | (the paper has the long-context lead at OAI as a co-author).
 
    | XorNot wrote:
    | I'd say actively trying to remember an event and "narrowing
    | it down" mentally is a process that feels suspiciously like a
    | dialogue with an LLM, where you keep asking for the answer to
    | be improved.
 
  | NhanH wrote:
  | Human experts rely on the "chunking" effect for expertise,
  | which is mostly not part of the context length (working
  | memory). For a pseudo-analogy, each human is fine-tuned from
  | raw intelligence (by training and education). In that sense a
  | generic LLM probably can't beat us yet, so don't despair!
 
    | onos wrote:
    | Any job focused on fact retrieval is at risk.
 
      | RandomLensman wrote:
      | Need to know that the facts exist in the first place,
      | though.
 
        | orbifold wrote:
        | GPT-4 already has graduate-level math and physics
        | knowledge, with a level of recall that most students can
        | only dream of.
 
        | RandomLensman wrote:
        | Do those actually matter in jobs? Most valuable facts I
        | encounter are far more niche, often in someone's head and
        | not written down, or stored privately somewhere, etc.
 
      | jimkoen wrote:
      | What kind of job is only focused on fact retrieval?
 
        | istjohn wrote:
        | My biggest blocker in web development is fact retrieval.
        | As someone who only dabbles, I don't struggle with how to
        | logically design my project, but I'm constantly
        | forgetting CSS and JS details like how to accomplish a
        | specific task with CSS flexbox or how to sort an array in
        | JS vs in Python. On old personal projects, I forget the
        | names of my own functions and whether they return a list
        | or a dict. Hell, I'll forget function signatures for
        | functions I just wrote. I forget external library and API
        | details. If I had perfect recall, I would 100x my web dev
        | productivity.
 
        | fnordpiglet wrote:
        | Jeopardy contestant
 
  | kmeisthax wrote:
  | The problem with comparing LLM and human context length is that
  | LLMs don't update their weights based on their input. The 32k
  | of context that they do have is their _only_ memory.
  | 
  | Humans have multiple layers of memory and can recall things and
  | concepts from years in the past - akin to millions of tokens'
  | worth of recall in an LLM. Yes, that memory is extremely lossy,
  | but it's there.
 
  | jacquesm wrote:
  | 'Superhuman' has many dimensions. It can be speed, it can be
  | the ability to retain a lot of information, it can be the
  | ability to deal with a large quantity of information at once
  | and many other dimensions besides. For the longest time Chess
  | was considered a domain where computers could be dilettantes
  | but could never dominate. Chess is now in 'superhuman'
  | territory and likely we will never see a reversal because any
  | insight that benefits humans also benefits computers but not
  | the other way around.
  | 
  | The fact that this is such a multi-dimensional problem is
  | frequently overlooked in the debate about AI/AGI etc, it may
  | not matter all that much if the non AGI AI is already
  | superhuman on enough dimensions other than the ones that the
  | 'but it isn't AGI' crowd cling to. The consequences are what
  | matters, _not_ the fine print or the implementation details,
  | and those consequences are directly tied to the number of
  | dimensions along which a computer can beat humanity.
  | 
  | To give an example: if a chess program was 3000 Elo before but
  | so slow that it would lose under competition rules then humans
  | still dominated chess. Likewise if it would be only 2500 Elo
  | but fast enough, it would still lose to the best humans. But
  | for a large fraction of society it would have already moved out
  | into 'superhuman' territory. A couple of technological leaps
  | later and we're _all_ looking at that AI as if it has moved
  | into superhuman regions.
  | 
  | This sort of thing will happen on many fronts, and all of those
  | fronts are moving, if enough of them go past the threshold then
  | whether it is AGI or not is irrelevant and for every person
  | that threshold is at different points. Maybe a computer will be
  | able to calculate faster and better than you can, maybe it will
  | be able to translate text faster and better than you, maybe it
  | will be able to organize information faster and better than
  | you. Saying at which point we cross the line into it being able
  | to _think_ faster and better than you is hard, but we can see
  | that we are getting close to that line without even knowing
  | exactly where that line is.
 
    | pmoriarty wrote:
    | _" Chess is now in 'superhuman' territory and likely we will
    | never see a reversal because any insight that benefits humans
    | also benefits computers but not the other way around."_
    | 
    | What is considered human is malleable. It is conceivable that
    | humans will be enhanced in various biological and non-
    | biological ways to a point that they can once again compete
    | with computers.
 
      | Mezzie wrote:
      | We may also just change the rules of chess.
      | 
      | "Chess" is a human created game after all.
 
        | evrimoztamur wrote:
        | Somebody actually tried with
        | https://en.m.wikipedia.org/wiki/Arimaa, but it didn't
        | take too long until it was also figured out!
 
        | dmd wrote:
        | Look I found the guy who keeps moving the goalposts!
 
        | anthomtb wrote:
        | I think you're making a joke.
        | 
        | But moving goalposts is an integral part of all
        | professional sports. The 3 point line in basketball and
        | engine size and aspiration in motorsports are obvious
        | examples. I don't see how adjusting chess rules to
        | disfavor AI competitors is any different.
 
        | Mezzie wrote:
        | It's a pretty common way for humans to deal with not
        | being able to do something/something not working.
 
    | andrepd wrote:
    | > For the longest time Chess was considered a domain where
    | computers could be dilettantes but could never dominate
    | 
    | I don't think this was ever true. Chess programs appeared
    | EXTREMELY early on, and everyone recognised that it was a
    | matter of time until hardware was quick enough to evaluate so
    | many positions per second that grandmasters could be defeated
    | by sheer calculation.
 
      | jacquesm wrote:
      | I was playing chess pretty fanatically when Sargon came out
      | and that was _my_ impression as a computer person, but the
      | chess people around me really didn't think that computers
      | would ever beat the top GMs.
 
        | macintux wrote:
        | Most major advances are predicted by someone, and often
        | seem obvious in hindsight, but it seems like we often
        | conflate those and remember them as having been obvious
        | before they happened.
 
        | jacquesm wrote:
        | The number of people working with computers back then was
        | but a handful, so that gave me a bit of a different
        | perspective, but I think that anybody who was both into
        | computers and into chess back then would have made the
        | same prediction. It still happened faster than I thought
        | it would.
 
    | ChatGTP wrote:
    | I'm curious, what is the point you're trying to convey?
 
    | jstanley wrote:
    | > if a chess program was 3000 Elo before but so slow that it
    | would lose under competition rules then humans still
    | dominated chess.
    | 
    | How are you working out that it has a 3000 Elo if it's not
    | winning games?
 
      | jacquesm wrote:
      | That's the way it is done right now: by playing it against
      | other software implementations and judging them in the
      | exact same way they would judge humans.
      | 
      | Stockfish has an Elo rating over 3500 in spite of no human
      | being even close to that.
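      | 
      | For reference, the rating gap maps to an expected score via
      | the standard Elo formula; a quick illustration (the 2850 for
      | a top human is approximate):
      | 
      |     def expected_score(r_a: float, r_b: float) -> float:
      |         # Standard Elo expected score of A against B.
      |         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
      | 
      |     print(expected_score(3500, 2850))  # ~0.98
      |     # An engine rated 3500 is expected to take ~98% of the
      |     # points against a 2850-rated human.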
 
    | skybrian wrote:
    | One dimension that I think is pretty important is compute
    | cost due to its effect on how they're used. The chatbots are
    | expensive to run, which means they're implemented as request-
    | response APIs that cost money, which means that loops are
    | expensive and they normally don't do any idle-time thinking.
    | 
    | When you play a turn-based game with a bot, that means you
    | don't need to worry about its reaction time. It's paused most
    | of the time, waiting on you. A sorcerer's apprentice scenario
    | isn't going to happen when you're single-stepping.
    | 
    | Moving to routine use of bots that run continuously with fast
    | reaction times will be much more dangerous.
 
      | jacquesm wrote:
      | Yes, that's a very good point. Effectively it is 'ping
      | pong' right now, when you get to always on + push things
      | will change quite a bit. Model efficiency is a very active
      | field.
 
    | m3kw9 wrote:
    | It's like a CPU; we've all seen that movie before. It's
    | superhuman at calculating numbers, but it's one-dimensional
    | nonetheless, like chess programs.
 
| mark_l_watson wrote:
| Interesting point about the Butterfly architecture for hardware
| FFT support. In the 1980s, DARPA provided two types of exotic
| hardware to the company I worked for: the first Connection
| Machine, and the Butterfly machine. I wrote Star Lisp code for
| the CM, but never touched the Butterfly machine.
| 
| Off topic, but I am curious what hardware Apple will release in
| the future for more direct AI support. Their Core ML libraries
| working with Apple Silicon have been very effective so far. The
| next step would likely be a built-in foundation LLM, extending
| what they have supported with BERT models, etc.
 
  | sroussey wrote:
  | I think it will be quite a few years before an LLM is built
  | into their silicon.
  | 
  | But I do see a Neural Engine 2.0 that will better handle these
  | things in the nearer term.
 
    | Sugimot0 wrote:
    | IIRC, creating new ML-optimized chips/chiplets was part of
    | the appeal of RISC-V, right? Is RISC-V relevant yet, or are
    | there any promising RISC-V chips on the way? I know there's a
    | lot of hype around them, so I'm curious how much is just
    | noise and what the real sentiment is from the
    | experts/industry.
 
      | [deleted]
 
  | cs702 wrote:
  | _> The next step would likely be a built-in foundation LLM,
  | extending what they have supported with BERT models, etc._
  | 
  | I'm thinking deeper. It wouldn't surprise me if _self-
  | attention_ itself becomes a _primitive building block_ of
  | future co-processors, e.g., with instructions and memory
  | layouts engineered to make ultra-low-precision self-attention
  | as compute- and memory-efficient as possible. I'm expecting
  | LLMs with hundreds of billions and eventually trillions of
  | parameters will be able to run locally on my laptop and mobile
  | phone, in the not-too-distant future.[a]
  | 
  | [a] If this sounds far-fetched, consider that you can _already_
  | run LLMs with tens of billions of parameters on mobile phones:
  | https://justine.lol/mmap/
 
    | fpgaminer wrote:
    | > I'm expecting LLMs with hundreds of billions and eventually
    | trillions of parameters will be able to run locally on my
    | laptop and mobile phone, in the not-too-distant future
    | 
    | Perhaps. There's been a lot of focus on training-compute
    | optimal models in the industry. Rightfully so, as proofs of
    | concept. That's what led to this perceived parameter count
    | race in published models.
    | 
    | But remember the other side of the scaling laws. For
    | inference, which is what we want to do on our phones, it's
    | better to be inference-compute optimal. That means smaller
    | models trained for longer.
    | 
    | As far as we know today there are no limits to the scaling
    | laws. A 1B parameter model _can_ beat a 1T parameter model,
    | if trained for long enough. Of course it's exponential, so
    | you'd have to pour incalculable training resources into such
    | an extreme example. But I find these extreme examples
    | elucidating.
    | 
    | My pet theory these days is that we'll discover some way of
    | "simulating" multiple parameters from one stored parameter.
    | We know that training-compute optimal models are extremely
    | over-parameterized. So it isn't the raw capacity of the model
    | that's important. It seems like during training the degrees
    | of freedom are what allow larger models to be more sample
    | efficient. If we can find a cheap way of having one parameter
    | simulate multiple degrees of freedom, it will likely give us
    | the ability to gain the advantages of larger models during
    | training, without the inference costs later.
    | 
    | I don't disagree that we're likely to see more and more
    | parameter capacity from our devices. I'm just pointing out
    | that the parameter count race is a bit of an illusion. OpenAI
    | discovered the scaling laws and needed a proof of concept. If
    | they could show AI reaching X threshold first, they could
    | capture the market. The fastest way to do that is to be
    | training-compute optimal. So they had to scale to 175B
    | parameters or more. Now that it's proven, and that there's a
    | market for such an AI, their and others' focus can be on
    | inference-optimal models which are smaller but just as smart.
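    | 
    | A rough illustration of that trade-off, using a Chinchilla-
    | style loss form L(N, D) = E + A/N^a + B/D^b with coefficients
    | close to the published fits (treat the numbers as
    | illustrative, not authoritative):
    | 
    |     def loss(n_params: float, n_tokens: float) -> float:
    |         # Illustrative Chinchilla-style scaling curve.
    |         E, A, B, a, b = 1.7, 400.0, 410.0, 0.34, 0.28
    |         return E + A / n_params**a + B / n_tokens**b
    | 
    |     print(loss(70e9, 1.4e12))  # compute-optimal-ish run
    |     print(loss(7e9, 30e12))    # 10x smaller, ~20x more data
    |     # Both land near the same loss (~1.95 here): a smaller
    |     # model trained much longer can match a bigger one.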
 
      | cs702 wrote:
      | _> I don't disagree that we're likely to see more and more
      | parameter capacity from our devices. I'm just pointing out
      | that the parameter count race is a bit of an illusion.
      | OpenAI discovered the scaling laws and needed a proof of
      | concept. If they could show AI reaching X threshold first,
      | they could capture the market. The fastest way to do that
      | is to be training-compute optimal. So they had to scale to
      | 175B parameters or more. Now that it's proven, and that
      | there's a market for such an AI, their and others' focus
      | can be on inference-optimal models which are smaller but
      | just as smart._
      | 
      | Good point. That could very well be what they're thinking
      | about, in addition to potential improvements in training
      | data and RLHF methods.
      | 
      | Also, I agree it would be great if anyone figures out how
      | to do something akin to "making a smaller model act as if
      | it were gigantic during training" OR "pruning a gigantic
      | model's 'dead paths' as it learns during training," to get
      | the benefits of scale in training without its costs at
      | inference.
 
| skepticATX wrote:
| Can someone with a better understanding than I have comment about
| the relationship between these results and this paper:
| https://arxiv.org/abs/2109.09115, which seems to demonstrate that
| a longer context length has diminishing returns?
 
  | rsfern wrote:
  | I'm not deeply familiar with all these papers, but two things
  | stand out to me
  | 
  | The model architectures are different, and in the very latest
  | paper they scale these non-transformer models to a sequence
  | length of 64k, whereas the paper you linked only considers up
  | to 8k.
 
| AvAn12 wrote:
| With longer (50k+) context lengths, is this just becoming a new
| form of search?
 
  | intalentive wrote:
  | Yes, now we just need it to provide citations.
 
  | jmole wrote:
  | I think the K,Q,V representation was what fundamentally gave
  | rise to LLMs (from "Attention is all you need"), and I'm
  | certain that it wouldn't have happened without the researchers
  | having a background in search @ Google.
  | 
  | Or in other words, it was always a new form of search.
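  | 
  | (The K,Q,V machinery itself is tiny; a minimal sketch of scaled
  | dot-product attention, whose n x n score matrix is the
  | quadratic-in-sequence-length part everyone is trying to
  | shrink:)
  | 
  |     import torch
  | 
  |     def attention(q, k, v):
  |         # softmax(Q K^T / sqrt(d)) V, per "Attention Is All
  |         # You Need". The (n, n) score matrix is the O(n^2) bit.
  |         d = q.shape[-1]
  |         scores = q @ k.transpose(-2, -1) / d**0.5
  |         return torch.softmax(scores, dim=-1) @ v
  | 
  |     n, d = 8, 16
  |     x = torch.randn(n, d)      # toy sequence of hidden states
  |     out = attention(x, x, x)   # learned Q/K/V projections
  |                                # omitted for brevity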
 
    | bckr wrote:
    | [Astronaut looking at earth with the logo of superhuman AI
    | superimposed]
    | 
    | Wait, it's all just search?
    | 
    | [astronaut with gun]
    | 
    | Always has been
 
| logophobia wrote:
| I've successfully applied the S4 operator to long-sequence video
| classification. It's massively more efficient than a similarly
| scaled transformer, but it doesn't train as well. Still, even
| with S4 I got some impressive results; looking forward to more.
 
| Buttons840 wrote:
| If I want to do sequence modelling, let's say, predict the 9th
| element of the following sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9] --
| that is, I know the 8 most recent tokens as my "context", and I
| want to predict the next, 9 in this case --
| 
| Can someone explain to me why a transformer or RNN is better at
| this than a simple linear layer with an equivalent number of
| parameters? A linear layer can receive the context [1, 2, 3, 4,
| 5, 6, 7, 8], properly one-hot encoded / embedded, etc, and
| predict the next sequence. Can a linear layer do just as well as
| a transformer? This setup allows linear layers to predict
| sequences with an arbitrary context size, so why so much hype
| about transformers and RNNs and other sequence focused
| architectures?
| 
| Perhaps the difference is that given the same number of
| parameters, the transformer uses those parameters to perform easy
| computations whereas the linear layer just does one gigantic
| matrix multiplication which isn't very efficient?
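| 
| Concretely, the comparison I have in mind looks something like
| this (sizes arbitrary, PyTorch just for illustration):
| 
|     import torch
|     import torch.nn as nn
| 
|     vocab, ctx, d = 100, 8, 32
| 
|     # Baseline: embed, flatten, one big linear map. The context
|     # size is baked into the weight shape (ctx * d inputs).
|     linear_model = nn.Sequential(
|         nn.Embedding(vocab, d),
|         nn.Flatten(),               # (batch, ctx * d)
|         nn.Linear(ctx * d, vocab),  # next-token logits
|     )
| 
|     # Transformer-ish: same embedding, one self-attention block
|     # whose weights are shared across positions.
|     class TinyAttn(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.emb = nn.Embedding(vocab, d)
|             self.attn = nn.MultiheadAttention(
|                 d, num_heads=4, batch_first=True)
|             self.out = nn.Linear(d, vocab)
| 
|         def forward(self, ids):
|             h = self.emb(ids)
|             h, _ = self.attn(h, h, h)
|             return self.out(h[:, -1])  # logits, last position
| 
|     tokens = torch.randint(0, vocab, (2, ctx))
|     print(linear_model(tokens).shape)  # (2, 100)
|     print(TinyAttn()(tokens).shape)    # (2, 100)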
 
  | QuadmasterXLII wrote:
  | The big difference is that the transformer is (approximately)
  | permutation equivariant, which makes a massive difference in
  | generalization and training speed.
 
    | Buttons840 wrote:
    | I see, so [1, 2, 3, 4, 5, 6, 7, 8] is treated as more similar
    | to [5, 3, 1, 7, 8, 2, 4, 6] than it would be by a linear
    | layer? That's what you mean by permutation equivariant?
    | 
    | I understand each context input is embedded with its
    | position, but I suppose the transformer can learn to ignore
    | the position and just look at the context as an unordered
    | set?
 
    | eachro wrote:
    | What makes it approximately permutation equivariant (vs
    | entirely)? As I understand things, if the order is jumbled,
    | the attention matrix does get its rows and cols permuted in
    | the way you'd expect so I'd have thought they'd be entirely
    | permutation equivariant.
 
      | YetAnotherNick wrote:
      | Inputs have position encodings in them.
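      | 
      | A quick numeric check (my own toy attention, with the
      | position encodings left out): permuting the inputs just
      | permutes the outputs.
      | 
      |     import torch
      | 
      |     def attn(x):
      |         # Bare self-attention, no position encodings.
      |         s = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
      |         return torch.softmax(s, dim=-1) @ x
      | 
      |     x = torch.randn(8, 16)
      |     p = torch.randperm(8)
      |     ok = torch.allclose(attn(x)[p], attn(x[p]), atol=1e-6)
      |     print(ok)  # True: exactly permutation equivariant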
 
| 1024core wrote:
| While the race to incorporate longer and longer context (2K PaLM
| -> 32K now) is interesting, I don't think that'll scale. It'll
| just add too much noise to the history: how do you establish a
| causal relationship between what you're holding in your hand and
| the million other things (context) you've seen in the past?
| You'll end up with spurious correlations.
| 
| What I think (and this is just me talking out of my ass) will be
| required is some form of associative long-term memory. Basically,
| give the model a way to store some embeddings in some form of
| memory, and then retrieve them based on context: so it doesn't
| matter if you encountered that item 2 tokens ago, or 2B.
| 
| At least this is what my current intuition tells me.
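| 
| A minimal sketch of that kind of associative store (cosine-
| similarity lookup over stored embeddings; how the embeddings get
| produced is left out, and everything here is just my own
| illustration):
| 
|     import numpy as np
| 
|     class AssociativeMemory:
|         # Store (embedding, payload) pairs; retrieve the nearest
|         # ones to the current context, whether they were written
|         # 2 steps ago or 2B.
|         def __init__(self, dim):
|             self.keys = np.zeros((0, dim))
|             self.values = []
| 
|         def write(self, key, value):
|             key = key / np.linalg.norm(key)
|             self.keys = np.vstack([self.keys, key])
|             self.values.append(value)
| 
|         def read(self, query, k=3):
|             query = query / np.linalg.norm(query)
|             sims = self.keys @ query
|             top = np.argsort(-sims)[:k]
|             return [self.values[i] for i in top]
| 
|     mem = AssociativeMemory(dim=64)
|     for i in range(1000):
|         mem.write(np.random.randn(64), f"event {i}")
|     print(mem.read(np.random.randn(64)))  # 3 nearest events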
 
  | lucidrains wrote:
  | That line of research is still going:
  | https://github.com/lucidrains/block-recurrent-transformer-py...
  | I think it is worth continuing research on both fronts.
 
    | 1024core wrote:
    | Of course, I'm not saying "don't do research". I'm just
    | saying that I don't think this context-length war will lead
    | us to long-term sustainable gains.
 
      | [deleted]
 
  | nathias wrote:
  | This, plus evaluation based on context that lets it mutate past
  | content, and we are set.
 
  | skybrian wrote:
  | On the other hand, it seems like training on large amounts of
  | text for next-token prediction would tend to reduce reliance on
  | spurious correlations? I don't think this intuitive sort of
  | speculation can predict what it will do.
 
| edulix wrote:
| Instead of long learning or long contexts, at some point
| artificial neural networks will have to transition to
| continuous/online learning - learning while using the network.
| That way, these limitations are overcome the way they are in our
| minds.
| 
| Similar to what Numenta HTM networks do, but scalable and
| performant for real use cases.
| 
| BTW, perhaps human-like consciousness emerges as a "self-
| attention-like" mechanism between context and learning. Just
| saying.
 
  | qumpis wrote:
  | Learn how? I think having infinite context is perfect - no need
  | to learn on my data online and risk exposing it to others.
 
  | thomasahle wrote:
  | Alternatively, we need the model to have a long-term memory and
  | be able to load stuff to/from that while reading.
 
| [deleted]
 
| jmole wrote:
| Oddly enough, I was reading their paper just last night:
| https://arxiv.org/pdf/2302.10866.pdf
| 
| I think we're going to see a lot more in the
| wavelet/convolution/fft space when thinking about how to increase
| context length.
| 
| I think there's also a lot of room for innovation in the
| positional encoding and how it's represented in transformer
| models. It seems like people have been trying lots of things and
| going with what works, but most of it amounts to "look, a new
| orthonormal basis!".
| 
| Hyena sort of seems like the first step in moving to positional
| _embeddings_ (or joint positional/attentional embeddings).
| 
| Very cool work.
 
  | cs702 wrote:
  | I agree this sort of approach looks promising. Maybe using FFTs
  | recurrently to approximate convolutions with input-length
  | filters is the way forward. It's a clever idea. I'm making my
  | way through the paper. Don't fully understand it yet.
  | 
  | The main issue I've seen with other wannabe-sub-quadratic-
  | replacements for self-attention is that they all rely on some
  | kind of low-rank/sparse approximation that in practice renders
  | LLMs incapable of modeling enough pairwise relationships
  | between tokens to achieve state-of-the-art performance.
  | 
  | I'm curious to see if this kind of approach solves the issue.
 
| imustachyou wrote:
| S4 and its class of state-space models are an impressive
| mathematical and signal-processing innovation, and I thought it
| was awesome how they destroyed previous baselines for long-range
| tasks.
| 
| Have there been any state-space models adapted for arbitrary text
| generation?
| 
| Language models like ChatGPT are trained to predict new words
| based on the previous ones and are excellent for generation, a
| harder task than translation or classification. I'm doubtful
| about the adaptability of text models that deal with fixed-size
| input/outputs and don't have an architecture that is as natural
| for generating indefinitely long sequences.
 
  | sdenton4 wrote:
  | Go read about S4, from these authors. It's about having a
  | learnable state-space model which can be efficiently
  | implemented as either an RNN or (very long) convolution,
  | according to the needs of training or inference.
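  | 
  | Roughly: the same linear state-space layer can be stepped like
  | an RNN or unrolled into one long convolution kernel. A toy
  | scalar sketch (parameters made up; real S4 parameterizes A
  | much more carefully):
  | 
  |     import numpy as np
  | 
  |     # Discrete SSM: x[t+1] = A x[t] + B u[t], y[t] = C x[t].
  |     A, B, C = 0.9, 1.0, 0.5
  |     u = np.random.randn(16)
  | 
  |     # Recurrent view: O(1) state per step, good for inference.
  |     x, y_rnn = 0.0, []
  |     for t in range(len(u)):
  |         y_rnn.append(C * x)
  |         x = A * x + B * u[t]
  | 
  |     # Convolutional view: unroll into a kernel, good for
  |     # training (and FFT-able for very long sequences).
  |     k = np.array([C * A ** (t - 1) * B if t > 0 else 0.0
  |                   for t in range(len(u))])
  |     y_conv = [np.dot(k[:t + 1][::-1], u[:t + 1])
  |               for t in range(len(u))]
  | 
  |     print(np.allclose(y_rnn, y_conv))  # True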
 
    | Buttons840 wrote:
    | Do these scale as well as transformers? My understanding is
    | that classic RNNs don't scale well, and that is one reason
    | why transformers became popular.
    | 
    | As a pleb who doesn't even own a data center, I've been
    | hoping that a superior machine learning architecture will be
    | discovered that doesn't scale well. We would be fortunate if
    | our personal computers end up being half as good as
    | Microsoft's or Amazon's best models; fortunate if the best
    | architecture gains little from an additional 10,000 GPUs.
    | This would help spread the benefits of AI evenly among anyone
    | with a phone or computer -- a utopia compared to the other
    | possibility, that everyone can learn how to build AI, but
    | only those with a few hundred million to throw at a data
    | center can actually control the means of production -- err, I
    | mean, the means of intelligence.
    | 
    | Philosophically, this wouldn't be unlike people. Humans are
    | still the greatest intelligence we're aware of, and humans
    | don't scale. I'm hoping computer intelligence ends up not
    | scaling well either.
 
      | sdenton4 wrote:
      | That's the point of having multiple realizations of the
      | same underlying model.
      | 
      | The (depthwise) convolutional realization is extremely
      | efficient for training, and the RNN is extremely efficient
      | for inference. The scaling in both of these cases is much
      | better than attention layers - as they discuss in the
      | article.
 
| 3327 wrote:
| [dead]
 
| pmontra wrote:
| Is this the way we work? We are told a fact only a few times and
| we remember it for the rest of our lives, no 32k or 32M context.
| 
| I think that they are following the easy path, much like the
| gigahertz race in CPUs, and will hit a wall. Maybe the wall will
| be so far away that it will give us an AGI, but maybe it will
| give us superhuman machines only in well-defined contexts. We'll
| have to squeeze our instructions into a prompt too small for some
| tasks and get a bot behaving like the main character of the movie
| Memento (he remembered only the last few minutes and very old
| memories).
 
  | [deleted]
 
| Ozzie_osman wrote:
| There will be those of us that understand how all these models
| work, and there will be those of us that simply use them.
 
  | istjohn wrote:
  | That's true of just about any technology. Most carpenters would
  | fail to explain why their hammer drives a nail into wood
  | instead of bouncing off, why it has a handle that extends past
  | the hand's grip, why the head of the hammer neither shatters
  | nor mushrooms over time, to say nothing of their nail guns and
  | circular saws.
 
| Herval_freire wrote:
| Someone in a previous comment said that, according to what he
| knew about LLM research, we were at a local maximum and no
| further improvement was likely possible.
| 
| I disagreed with him, and this article is evidence in favor of
| my point. If research like this continues to move forward, LLMs
| will improve at a rapid rate.
| 
| Different threads attract different groups of people with
| different areas of expertise, so I will sort of reiterate the
| topic here, as I'm interested. What are most people's thoughts on
| this "local maximum" idea? Have we actually hit a dead end,
| especially given the proliferation of effort towards producing
| research like the work shown here?
 
  | ChatGTP wrote:
  | I mean, I just read that article and it doesn't seem like a lot
  | will change. Sure, it can summarize a whole book, or read a
  | larger chunk of code to do things with, but I didn't really see
  | it talk about taking things to "the next level", so to speak.
  | 
  | The researchers are also excited:
  | 
  |  _We're especially motivated by applications that could benefit
  | from longer-sequence models - high-resolution imaging, new
  | modalities of data, language models that can read entire books.
  | Imagine giving a language model an entire book and having it
  | summarize the plot, or conditioning a code generation model on
  | all the code you've ever written. The possibilities are wild -
  | and we're excited._
 
    | Herval_freire wrote:
    | But this research came out mere weeks after the release of
    | GPT-4. That is in itself rapid. If small incremental changes
    | like this continue on a sort of monthly basis, the trendline
    | points towards something that's not a dead end. That's my
    | view of it.
    | 
    | As with most technology there isn't necessarily always a
    | constant influx of inflection points and paradigm shifts.
    | Improvement will likely creep up on us incrementally.
    | Suddenly one day it's clearly more intelligent than a human
    | and we can't point to when it happened.
 
      | [deleted]
 
      | jamilton wrote:
      | GPT-4's release doesn't seem like the relevant time marker,
      | since nothing in the article builds on it or depends on it.
      | The paper for H3 was submitted in December 2022.
      | 
      | The pace of the last few years definitely seems rapid; I
      | just don't want there to be a false impression.
 
  | intalentive wrote:
  | Once you exhaust the dataset of all written language, the next
  | step is multi-modal -- images, audio, video. What will next-
  | token prediction give us on such a dataset? Better versions of
  | what we have now -- style transfer, summarization, captioning,
  | translation, prompt-based generation, etc., but with
  | synesthesia.
  | 
  | There is still plenty of improvement ahead but I don't think
  | anything genuinely surprising will come from the current regime
  | of feedforward models. What is missing is action -- an
  | interactive feedback loop between agent and environment.
  | Progress in RL and robotics has been very slow by comparison
  | and unless we see a breakthrough there, I would guess the GPT
  | phase plateaus in the next 5-10 years.
 
    | skybrian wrote:
    | I expect progress will be much less predictable. Some kinds
    | of action look pretty easy; it depends on the domain.
    | 
    | For example, I expect skill at writing some kinds of code to
    | improve dramatically because running tests in a sandbox looks
    | easy. It's already being researched. [1] Extending that to
    | device drivers might be a bit harder. Fuzzing is already
    | mostly automated and smarter fuzzing could get pretty scary.
    | 
    | [1] https://nanothoughts.substack.com/p/reflecting-on-
    | reflexion
 
      | Salgat wrote:
      | Basically, if the knowledge exists online in a way that can
      | be pieced together in a straightforward manner, GPT will
      | figure it out, but for information that requires
      | experimentation and creating new information to derive
      | results, it won't be of much use. For example, GPT can't
      | iteratively try different programming techniques to speed
      | up a block of parallelizable code; it'll simply give you
      | the best guess that it can find off Google.
 
        | skybrian wrote:
        | It won't iterate on its own, but you can do it. You can
        | ask it for a list of things to try, and they will be
        | different alternatives. You can also tell it the result
        | of an experiment and it will often figure out what to
        | fix.
        | 
        | If you follow the link I shared, some researchers
        | automated asking GPT4 to write tests, running the tests
        | in a sandbox, and feeding the results back in.
 
    | UncleEntity wrote:
    | > What will next-token prediction give us on such a dataset?
    | 
    | Haven't we pretty much figured out these things are doing
    | more than just predicting the next token at this point?
    | 
    | There's probably a lot to be done with a "prediction
    | machine", birds aren't all that smart but can catch bugs in
    | midair.
 
      | skybrian wrote:
      | People keep underestimating what next-token prediction can
      | do, but they're not wrong that it's how LLMs work.
      | 
      | It's actually a good question: what will next-token
      | prediction be able to do on new datasets? The error is
      | thinking you can answer it, even in broad terms.
 
| cs702 wrote:
| This looks really interesting! If these guys succeed in bringing
| self-attention's computational cost down from O(n^2) to O(n log
| n), that would be a huge win. The quadratic cost makes it very
| difficult to increase sequence length on current hardware. I'm
| going to take a closer look.
| 
| There are other interesting ongoing efforts to increase sequence
| length. One that has worked for me is this dynamic routing
| algorithm, related to self-attention, that can handle sequences
| with 1M+ tokens on a single GPU:
| https://github.com/glassroom/heinsen_routing . Right now, you can
| take 1,000 sequences of hidden states computed by a pretrained
| transformer, each sequence with, say, 1024 tokens, concatenate
| them into a single ultra-long sequence with 1,024,000 hidden
| states, slap 1,024,000 position encodings on top, and feed the
| whole thing to that routing algorithm to predict the next token
| (or whatever other training objective you want to optimize for).
| It works. Search the README for "Very Long Sequences".
| 
| If anyone here has other suggestions for working with long
| sequences (hundreds of thousands to millions of tokens), _I'd
| love to learn about them_.
 
  | [deleted]
 
  | og_kalu wrote:
  | There are already linear attention advances. GPT-4-32k is
  | almost certainly using some form of FlashAttention.
  | 
  | Attention isn't really O(n^2) anymore.
 
    | cs702 wrote:
    | My understanding is that FlashAttention's memory use is
    | linear, or close to linear in practice, but computation is
    | still O(n^2). I'm unaware of anyone being able to apply
    | FlashAttention on, say, a million tokens, because it must
    | execute ~1/2 x 1,000,000^2 x n_head dot-products, each in a
    | subspace with d_head dimensions. That's not exactly
    | computationally cheap!
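    | 
    | Back-of-the-envelope (head count and head dimension are
    | typical assumed values, not anything GPT-4-specific):
    | 
    |     n, n_head, d_head = 1_000_000, 16, 64
    |     dots = 0.5 * n * n * n_head        # causal score matrix
    |     flops = dots * 2 * d_head          # mul + add per dim
    |     print(f"{dots:.1e} dot-products, ~{flops:.1e} FLOPs")
    |     # ~8.0e12 dot-products and ~1.0e15 FLOPs for one
    |     # attention layer, however little memory it needs.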
 
      | og_kalu wrote:
      | No you're right. I mistook you. Compute isn't linear yet.
 
    | lucidrains wrote:
    | It is only linear in terms of memory, not compute. Flash
    | attention is a big advance, but not enough for 1 million
    | tokens.
 
  | [deleted]
 
  | [deleted]
 
___________________________________________________________________
(page generated 2023-04-09 23:00 UTC)