|
| [deleted]
| marshmallowmad wrote:
| I don't quite understand why context length needs to keep
| growing. It seems to me like many tasks (e.g. customizing your
| LLM on your own data) would benefit from the model doing some
| sort of sped-up fine-tuning on any prompt that gets added. That
| way we wouldn't need all these hacks to surface the most
| relevant context and repeat the same info in prompts. Curious if
| anyone has insight here, as this has caused me some confusion
| lately!
| [deleted]
| raphlinus wrote:
| The thing that stuck out to me is the assertion that the FFT is
| poorly supported on modern GPUs. That's surprising to me, as
| there's cuFFT, officially supported by Nvidia, and vkFFT, which
| achieves similar performance portably using compute shaders. I
| believe these are based on f32 math, so perhaps the potential win
| is using tensor cores to compute the FFT at lower precision? It
| seems surprising to me that decomposing into matrix operations is
| the win here; I'd have thought you'd do better writing a kernel
| that makes use of the cooperative matrix (aka WMMA, tensor core,
| simd_matrix) capabilities of the GPU.
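|
| To illustrate what "decomposing into matrix operations" means
| here: the DFT is just a matrix multiply, which is the form that
| matmul hardware can chew on. A minimal PyTorch sketch, purely
| illustrative and not the paper's kernel:
|
|     import math
|     import torch
|
|     # The length-n DFT as an explicit matrix multiply:
|     # W[j, k] = exp(-2*pi*i*j*k/n),  X = W @ x
|     n = 8
|     rows = torch.arange(n, dtype=torch.float32).reshape(-1, 1)
|     cols = torch.arange(n, dtype=torch.float32).reshape(1, -1)
|     W = torch.polar(torch.ones(n, n), -2 * math.pi * rows * cols / n)
|
|     x = torch.randn(n, dtype=torch.complex64)
|     print(torch.allclose(W @ x, torch.fft.fft(x), atol=1e-4))  # True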
| svantana wrote:
| Looking at the source paper, they are claiming a 2.2x speedup
| over cuFFT for convolutions, so it's not an earth-shattering
| gain, but still.
| aqme28 wrote:
| It seems to me like these long context lengths are going to make
| a huge difference in the capabilities of things people are able
| to produce. Agents can use LLMs to subdivide a large corpus into
| manageable chunks. More context just makes them much, much
| better at that.
|
| Regardless, for most programming tasks, I doubt my equivalent
| human context length is any better than 32k tokens.
| jerpint wrote:
| The problem is not the short-term context length, but the long
| term. If you want to do things like long-term goal planning,
| keeping track of distant past events can be of high value.
| taneq wrote:
| Maybe that's what our inner monologue is for... to cycle
| relevant context through regularly so it stays in scope. I
| mean, that's not actually why we have one but it'd be cute.
| PaulHoule wrote:
| I'd argue that for planning the answer is to couple the LLM
| to some other system rather than try to improve the LLM
| itself. Combinatorial optimization is a well-understood
| problem: NP-complete in theory, but usually tractable in
| practice. For doing math we might as well give the LLM a pocket
| calculator; why not also couple it to a planner, a theorem
| prover, and similar tools?
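|
| As a toy illustration of the coupling idea (not any particular
| product's API): a Python sketch where the model's reply gets
| routed to a calculator "tool". The call_llm function is a
| hypothetical stand-in for whatever model API you use.
|
|     # Hypothetical sketch: route the model's output to an external tool.
|     def call_llm(prompt: str) -> str:
|         # Stand-in for a real model call; pretend the model chose the tool.
|         return "CALC: 37 * 41"
|
|     def calculator(expr: str) -> str:
|         return str(eval(expr, {"__builtins__": {}}))  # toy arithmetic tool
|
|     def answer(question: str) -> str:
|         reply = call_llm(question)
|         if reply.startswith("CALC:"):
|             return calculator(reply.removeprefix("CALC:").strip())
|         return reply
|
|     print(answer("What is 37 * 41?"))  # 1517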
| pixl97 wrote:
| This is pretty much what people are testing now with Auto-
| GPT
| sharemywin wrote:
| That's the interesting thing. If LLMs start to become just
| interfaces for other systems, you could probably use them to
| train smaller systems, in the style of Alpaca and other recent
| LLMs.
| pmoriarty wrote:
| _" For doing math we might as well give the LLM a pocket
| calculator, why not couple it to a planner, theorem prover
| and similar tools?"_
|
| ChatGPT has already been hooked up to Wolfram Alpha.[1]
|
| For hooking up other things, see HuggingGPT and
| TaskMatrix.AI.[2][3]
|
| [1] - https://writings.stephenwolfram.com/2023/03/chatgpt-
| gets-its...
|
| [2] - https://arxiv.org/pdf/2303.17580.pdf
|
| [3] - https://arxiv.org/pdf/2303.16434.pdf
| btbuildem wrote:
| Precisely. LLMs are the Perl of the future
| PartiallyTyped wrote:
| Our context is hierarchical, at multiple different resolutions,
| so I don't think it's comparable.
| aqme28 wrote:
| I was trying to describe a hierarchy of LLMs as an agent. I
| don't think that's a uniquely human ability.
| btbuildem wrote:
| I've had good results using that approach with LLMs for
| domain-specific problem solving.
| letitgo12345 wrote:
| Good chance so is OAI's https://arxiv.org/abs/2110.13711
| (paper has the long context lead at OAI as a co-author)
| XorNot wrote:
| I'd say that actively trying to remember an event and mentally
| "narrowing it down" is a process that feels suspiciously like a
| dialogue with an LLM where you keep asking for the answer to be
| improved.
| NhanH wrote:
| Human experts rely on the "chunking" effect for expertise,
| which is mostly not part of the context length (working
| memory). As a rough analogy, each human is fine-tuned from raw
| intelligence (by training and education). In that sense a
| generic LLM probably can't beat us yet, so don't despair!
| onos wrote:
| Any job focused on fact retrieval is at risk.
| RandomLensman wrote:
| Need to know that the facts exist in the first place,
| though.
| orbifold wrote:
| GPT-4 already has graduate-level math and physics
| knowledge, with a level of recall that most students can
| only dream of.
| RandomLensman wrote:
| Do those actually matter in jobs? The most valuable facts I
| encounter are far more niche: often in someone's head and
| not written down, or only stored privately somewhere, etc.
| jimkoen wrote:
| What kind of job is only focused on fact retrieval?
| istjohn wrote:
| My biggest blocker in web development is fact retrieval.
| As someone who only dabbles, I don't struggle with how to
| logically design my project, but I'm constantly
| forgetting CSS and JS details like how to accomplish a
| specific task with CSS flexbox or how to sort an array in
| JS vs in Python. On old personal projects, I forget the
| names of my own functions and whether they return a list
| or a dict. Hell, I'll forget function signatures for
| functions I just wrote. I forget external library and API
| details. If I had perfect recall, I would 100x my web dev
| productivity.
| fnordpiglet wrote:
| Jeopardy contestant
| kmeisthax wrote:
| The problem with comparing LLM and human context length is that
| LLMs don't update their weights based on their input. The 32k
| of context that they do have is their _only_ memory.
|
| Humans have multiple layers of memory and can recall things and
| concepts from years in the past - akin to millions of tokens'
| worth of recall in an LLM. Yes, that memory is extremely lossy,
| but it's there.
| jacquesm wrote:
| 'Superhuman' has many dimensions. It can be speed, it can be
| the ability to retain a lot of information, it can be the
| ability to deal with a large quantity of information at once,
| and many other dimensions besides. For the longest time Chess
| was considered a domain where computers could be dilettantes
| but could never dominate. Chess is now in 'superhuman'
| territory and likely we will never see a reversal because any
| insight that benefits humans also benefits computers but not
| the other way around.
|
| The fact that this is such a multi-dimensional problem is
| frequently overlooked in the debate about AI/AGI etc.; it may
| not matter all that much if non-AGI AI is already superhuman on
| enough dimensions other than the ones the 'but it isn't AGI'
| crowd clings to. The consequences are what matters, _not_ the
| fine print or the implementation details, and those
| consequences are directly tied to the number of dimensions
| along which a computer can beat humanity.
|
| To give an example: if a chess program was 3000 Elo before but
| so slow that it would lose under competition rules, then humans
| still dominated chess. Likewise, if it were only 2500 Elo but
| fast enough, it would still lose to the best humans, yet for a
| large fraction of society it would already have moved into
| 'superhuman' territory. A couple of technological leaps later,
| we're _all_ looking at that AI as if it has moved into
| superhuman territory.
|
| This sort of thing will happen on many fronts, and all of those
| fronts are moving. If enough of them pass the threshold, then
| whether it is AGI or not is irrelevant, and that threshold sits
| at a different point for every person. Maybe a computer will be
| able to calculate faster and better than you can, maybe it will
| be able to translate text faster and better than you, maybe it
| will be able to organize information faster and better than
| you. Pinpointing the moment we cross the line into saying that
| it can _think_ faster and better than you is hard, but we can
| see that we are getting close to that line without even knowing
| exactly where it is.
| pmoriarty wrote:
| _" Chess is now in 'superhuman' territory and likely we will
| never see a reversal because any insight that benefits humans
| also benefits computers but not the other way around."_
|
| What is considered human is malleable. It is conceivable that
| humans will be enhanced in various biological and non-
| biological ways to a point that they can once again compete
| with computers.
| Mezzie wrote:
| We may also just change the rules of chess.
|
| "Chess" is a human created game after all.
| evrimoztamur wrote:
| Somebody actually tried with
| https://en.m.wikipedia.org/wiki/Arimaa, but it didn't take
| too long until it was also figured out!
| dmd wrote:
| Look I found the guy who keeps moving the goalposts!
| anthomtb wrote:
| I think you're making a joke.
|
| But moving goalposts is an integral part of all
| professional sports. The 3 point line in basketball and
| engine size and aspiration in motorsports are obvious
| examples. I don't see how adjusting chess rules to dis-
| favor AI competitors is any different.
| Mezzie wrote:
| It's a pretty common way for humans to deal with not
| being able to do something/something not working.
| andrepd wrote:
| > For the longest time Chess was considered a domain where
| computers could be dilettantes but could never dominate
|
| I don't think this was ever true. Chess programs appeared
| EXTREMELY early on, and everyone recognised that it was a
| matter of time until hardware was quick enough to evaluate so
| many positions per second that grandmasters could be defeated
| by sheer calculation.
| jacquesm wrote:
| I was playing chess pretty fanatically when Sargon came out,
| and that was _my_ impression as a computer person, but the
| chess people around me really didn't think that computers
| would ever beat the top GMs.
| macintux wrote:
| Most major advances are predicted by someone, and often
| seem obvious in hindsight, but it seems like we often
| conflate the two and remember them as having been obvious
| before they happened.
| jacquesm wrote:
| The number of people working with computers back then was
| only a handful, which gave me a bit of a different
| perspective, but I think that anybody who was both into
| computers and into chess back then would have made the
| same prediction. It still happened faster than I thought
| it would.
| ChatGTP wrote:
| I'm curious, what is the point you're trying to convey?
| jstanley wrote:
| > if a chess program was 3000 Elo before but so slow that it
| would lose under competition rules then humans still
| dominated chess.
|
| How are you working out that it has a 3000 Elo if it's not
| winning games?
| jacquesm wrote:
| That's the way it is done right now: by playing it against
| other software implementations and rating them in exactly
| the same way humans are rated.
|
| Stockfish has an Elo rating over 3500, in spite of no human
| being even close to that.
| skybrian wrote:
| One dimension that I think is pretty important is compute
| cost due to its effect on how they're used. The chatbots are
| expensive to run, which means they're implemented as request-
| response APIs that cost money, which means that loops are
| expensive and they normally don't do any idle-time thinking.
|
| When you play a turn-based game with a bot, that means you
| don't need to worry about its reaction time. It's paused most
| of the time, waiting on you. A sorcerer's apprentice scenario
| isn't going to happen when you're single-stepping.
|
| Moving to routine use of bots that run continuously with fast
| reaction times will be much more dangerous.
| jacquesm wrote:
| Yes, that's a very good point. Effectively it is 'ping
| pong' right now, when you get to always on + push things
| will change quite a bit. Model efficiency is a very active
| field.
| m3kw9 wrote:
| It's like a CPU; we've all seen that movie before. It's
| superhuman at calculating numbers, but it's one-dimensional
| nonetheless, like chess programs.
| mark_l_watson wrote:
| Interesting about Butterfly architecture for hardware FFT
| support. In the 1980s, DARPA provided two types of exotic
| hardware to the company I worked for: the first Connection
| Machine, and the Butterfly machine. I wrote Star Lisp code for
| the CM, but never touched the Butterfly machine.
|
| Off topic, but I am curious what hardware Apple will release in
| the future for more direct AI support. Their Core ML libraries
| working with Apple Silicon have been very effective so far. The
| next step would likely be a built-in foundation LLM model,
| extending what they have supported with BERT models, etc.
| sroussey wrote:
| I think it will be quite a few years before an LLM is built
| into their silicon.
|
| But I do see a Neural Engine 2.0 in their future that will
| handle these things better in the nearer term.
| Sugimot0 wrote:
| IIRC creating new ML-optimized chips/chiplets was part of the
| appeal of RISC-V, right? Is RISC-V relevant yet, or are there
| any promising RISC-V chips on the way? I know there's a lot
| of hype around them, so I'm curious how much is just
| noise and what the real sentiment is from the
| experts/industry.
| [deleted]
| cs702 wrote:
| _> The next step would likely be a built in foundation LLM
| model, extending what they have supported with BERT models,
| etc._
|
| I'm thinking deeper. It wouldn't surprise me if _self-
| attention_ itself becomes a _primitive building block_ of
| future co-processors, e.g., with instructions and memory
| layouts engineered to make ultra-low-precision self-attention
| as compute- and memory-efficient as possible. I'm expecting
| LLMs with hundreds of billions and eventually trillions of
| parameters will be able to run locally on my laptop and mobile
| phone, in the not-too-distant future.[a]
|
| [a] If this sounds far-fetched, consider that you can _already_
| run LLMs with tens of billions of parameters on mobile phones:
| https://justine.lol/mmap/
| fpgaminer wrote:
| > I'm expecting LLMs with hundreds of billions and eventually
| trillions of parameters will be able to run locally on my
| laptop and mobile phone, in the not-too-distant future
|
| Perhaps. There's been a lot of focus on training-compute
| optimal models in the industry. Rightfully so, as proofs of
| concept. That's what led to this perceived parameter count
| race in published models.
|
| But remember the other side of the scaling laws. For
| inference, which is what we want to do on our phones, it's
| better to be inference-compute optimal. That means smaller
| models trained for longer.
|
| As far as we know today there are no limits to the scaling
| laws. A 1B-parameter model _can_ beat a 1T-parameter model,
| if trained for long enough. Of course it's exponential, so
| you'd have to pour incalculable training resources into such
| an extreme example. But I find these extreme examples
| elucidating.
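|
| For a back-of-the-envelope version, here's a sketch using a
| Chinchilla-style parametric loss fit, L(N, D) ~ E + A/N^alpha +
| B/D^beta. The constants below are rough values from memory and
| are purely illustrative:
|
|     # Illustrative only: Chinchilla-style loss fit, constants approximate.
|     E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
|
|     def loss(n_params, n_tokens):
|         return E + A / n_params**alpha + B / n_tokens**beta
|
|     print(loss(70e9, 1.4e12))  # a Chinchilla-scale model
|     print(loss(7e9, 50e12))    # 10x smaller, trained much longer: similar loss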
|
| My pet theory these days is that we'll discover some way of
| "simulating" multiple parameters from one stored parameter.
| We know that training-compute optimal models are extremely
| over-parameterized. So it isn't the raw capacity of the model
| that's important. It seems like during training the extra
| degrees of freedom are what allow larger models to be more
| sample-efficient. If we can find a cheap way of having one parameter
| simulate multiple degrees of freedom, it will likely give us
| the ability to gain the advantages of larger models during
| training, without the inference costs later.
|
| I don't disagree that we're likely to see more and more
| parameter capacity from our devices. I'm just pointing out
| that the parameter count race is a bit of an illusion. OpenAI
| discovered the scaling laws and needed a proof of concept. If
| they could show AI reaching X threshold first, they could
| capture the market. The fastest way to do that is to be
| training-compute optimal. So they had to scale to 175B
| parameters or more. Now that it's proven, and that there's a
| market for such an AI, their and others' focus can be on
| inference-optimal models which are smaller but just as smart.
| cs702 wrote:
| _> I don't disagree that we're likely to see more and more
| parameter capacity from our devices. I'm just pointing out
| that the parameter count race is a bit of an illusion.
| OpenAI discovered the scaling laws and needed a proof of
| concept. If they could show AI reaching X threshold first,
| they could capture the market. The fastest way to do that
| is to be training-compute optimal. So they had to scale to
| 175B parameters or more. Now that it's proven, and that
| there's a market for such an AI, their and others' focus
| can be on inference-optimal models which are smaller but
| just as smart._
|
| Good point. That could very well be what they're thinking
| about, in addition to potential improvements in training
| data and RLHF methods.
|
| Also, I agree it would be great if anyone figures out how
| to do something akin to "making a smaller model act as if
| it were gigantic during training" OR "pruning a gigantic
| model's 'dead paths' as it learns during training," to get
| the benefits of scale in training without its costs at
| inference.
| skepticATX wrote:
| Can someone with a better understanding than I have comment on
| the relationship between these results and this paper:
| https://arxiv.org/abs/2109.09115, which seems to demonstrate that
| a longer context length has diminishing returns?
| rsfern wrote:
| I'm not deeply familiar with all these papers, but two things
| stand out to me:
|
| The model architectures are different, and in the very latest
| paper they scale these non-transformer models to a sequence
| length of 64k, whereas the paper you linked only considers up to
| 8k.
| AvAn12 wrote:
| With longer (50k+) context lengths, is this just becoming a new
| form of search?
| intalentive wrote:
| Yes, now we just need it to provide citations.
| jmole wrote:
| I think the K,Q,V representation was what fundamentally gave
| rise to LLMs (from "Attention is all you need"), and I'm
| certain that it wouldn't have happened without the researchers
| having a background in search @ Google.
|
| Or in other words, it was always a new form of search.
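|
| The "search" framing is fairly literal: scaled dot-product
| attention is a soft key-value lookup. A minimal sketch, assuming
| PyTorch:
|
|     import torch
|     import torch.nn.functional as F
|
|     # Each query scores every key; the output is a score-weighted
|     # mix of the values, i.e. a soft retrieval over the context.
|     def attention(Q, K, V):
|         scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
|         return F.softmax(scores, dim=-1) @ V
|
|     L, d = 6, 8
|     Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
|     print(attention(Q, K, V).shape)  # torch.Size([6, 8])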
| bckr wrote:
| [Astronaut looking at earth with the logo of superhuman AI
| superimposed]
|
| Wait, it's all just search?
|
| [astronaut with gun]
|
| Always has been
| logophobia wrote:
| I've successfully applied the S4 operator to long-length video
| classification. It's massively more efficient than a similarly
| scaled transformer, but it doesn't train as well. Still, even
| with S4 I got some impressive results; looking forward to more.
| Buttons840 wrote:
| If I want to do sequence modelling, let's say, predict the 9th
| element of the following sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9] --
| that is, I know the 8 most recent tokens as my "context", and I
| want to predict the next, 9 in this case --
|
| Can someone explain to me why a transformer or RNN is better at
| this than a simple linear layer with an equivalent number of
| parameters? A linear layer can receive the context [1, 2, 3, 4,
| 5, 6, 7, 8], properly one-hot encoded / embedded, etc, and
| predict the next element. Can a linear layer do just as well as
| a transformer? This setup lets a linear layer predict
| sequences with an arbitrary (but fixed) context size, so why so
| much hype about transformers, RNNs, and other sequence-focused
| architectures?
|
| Perhaps the difference is that given the same number of
| parameters, the transformer uses those parameters to perform easy
| computations whereas the linear layer just does one gigantic
| matrix multiplication which isn't very efficient?
| QuadmasterXLII wrote:
| The big difference is that the transformer is (approximately)
| permutation equivariant, which makes a massive difference in
| generalization and training speed.
| Buttons840 wrote:
| I see, so [1, 2, 3, 4, 5, 6, 7, 8] is more similar to [5, 3,
| 1, 7, 8, 2, 4, 6] than with a linear layer? That's what you
| mean by permutation equivariant?
|
| I understand each context input is embedded with its
| position, but I suppose the transformer can learn to ignore
| the position and just look at the context as an unordered
| set?
| eachro wrote:
| What makes it approximately permutation equivariant (vs
| entirely)? As I understand things, if the order is jumbled,
| the attention matrix does get its rows and cols permuted in
| the way you'd expect so I'd have thought they'd be entirely
| permutation equivariant.
| YetAnotherNick wrote:
| Inputs have position encoding in them.
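|
| A quick way to see the point made above: without positional
| encodings, self-attention is permutation equivariant, i.e.
| permuting the input just permutes the output. A small PyTorch
| check (illustrative sketch only):
|
|     import torch
|
|     attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2,
|                                        batch_first=True).eval()
|     x = torch.randn(1, 10, 16)   # (batch, seq_len, dim), no pos. encoding
|     perm = torch.randperm(10)
|
|     y, _ = attn(x, x, x)
|     y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
|     print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True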
| 1024core wrote:
| While the race to incorporate longer and longer context (2K PaLM
| -> 32K now) is interesting, I don't think that'll scale. It'll
| just add too much noise to the history: how do you establish a
| causal relationship between what you're holding in your hand
| and the million other things (context) that you've seen in the
| past? You'll end up with spurious correlations.
|
| What I think (and this is just me talking out of my ass) will be
| required is some form of associative long-term memory. Basically,
| give the model a way to store some embeddings in some form of
| memory, and then retrieve them based on context: so it doesn't
| matter if you encountered that item 2 tokens ago, or 2B.
|
| At least this is what my current intuition tells me.
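|
| The simplest version of that idea is an embedding store with
| nearest-neighbour lookup; real systems would use an ANN index
| such as FAISS. A toy sketch, assuming PyTorch:
|
|     import torch
|     import torch.nn.functional as F
|
|     # Toy associative memory: store (embedding, payload) pairs and
|     # retrieve by cosine similarity to the current context embedding.
|     memory = []  # list of (vector, payload)
|
|     def store(vec, payload):
|         memory.append((F.normalize(vec, dim=0), payload))
|
|     def recall(query, k=2):
|         q = F.normalize(query, dim=0)
|         sims = torch.stack([v @ q for v, _ in memory])
|         top = sims.topk(min(k, len(memory))).indices
|         return [memory[int(i)][1] for i in top]
|
|     d = 32
|     for i in range(5):
|         store(torch.randn(d), f"memory item {i}")
|     print(recall(torch.randn(d)))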
| lucidrains wrote:
| That line of research is still going:
| https://github.com/lucidrains/block-recurrent-transformer-py...
| I think it is worth continuing research on both fronts.
| 1024core wrote:
| Of course, I'm not saying "don't do research". I'm just
| saying that I don't think this context-length war will lead
| us to long-term sustainable gains.
| [deleted]
| nathias wrote:
| This, plus evaluation based on context that lets it mutate past
| content, and we're set.
| skybrian wrote:
| On the other hand, it seems like training on large amounts of
| text for next-token prediction would tend to reduce reliance on
| spurious correlations? I don't think this intuitive sort of
| speculation can predict what it will do.
| edulix wrote:
| Instead of longer training or longer contexts, at some point
| artificial neural networks will have to transition to
| continuous/online learning: learning while the network is being
| used. That is how these limitations are broken in our own minds.
|
| Similar to what Numenta HTM networks do, but scalable and
| performant for real use cases.
|
| BTW, perhaps human-like consciousness emerges as a "self-
| attention-like" mechanism between context and learning. Just
| saying.
| qumpis wrote:
| Learn how? I think having infinite context is perfect - no need
| to learn on my data online and risk exposing it to others.
| thomasahle wrote:
| Alternatively we need the model to have a long-term memory, and
| be able to load stuff to/from it while reading.
| [deleted]
| jmole wrote:
| Oddly enough, I was reading their paper just last night:
| https://arxiv.org/pdf/2302.10866.pdf
|
| I think we're going to see a lot more in the
| wavelet/convolution/fft space when thinking about how to increase
| context length.
|
| I think there's also a lot of room for innovation in the
| positional encoding and how it's represented in transformer
| models; it seems like people have been trying lots of things and
| going with what works, but most of it is like: "look, a new
| orthonormal basis!".
|
| Hyena sort of seems like the first step in moving to positional
| _embeddings_ (or joint positional/attentional embeddings).
|
| Very cool work.
| cs702 wrote:
| I agree this sort of approach looks promising. Maybe using FFTs
| recurrently to approximate convolutions with input-length
| filters is the way forward. It's a clever idea. I'm making my
| way through the paper. Don't fully understand it yet.
|
| The main issue I've seen with other wannabe-sub-quadratic-
| replacements for self-attention is that they all rely on some
| kind of low-rank/sparse approximation that in practice renders
| LLMs incapable of modeling enough pairwise relationships
| between tokens to achieve state-of-the-art performance.
|
| I'm curious to see if this kind of approach solves the issue.
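|
| For anyone who hasn't seen the trick: a convolution with an
| input-length filter can be evaluated in O(n log n) by
| multiplying FFTs instead of doing the O(n^2) direct sum. A
| minimal sketch, assuming PyTorch and using circular convolution
| to keep it short:
|
|     import torch
|
|     n = 256
|     x = torch.randn(n)  # signal
|     k = torch.randn(n)  # an "input-length" filter
|
|     # O(n log n): pointwise-multiply the spectra, then invert.
|     y_fft = torch.fft.irfft(torch.fft.rfft(x) * torch.fft.rfft(k), n=n)
|
|     # O(n^2) reference: direct circular convolution.
|     y_ref = torch.stack([
|         (x * torch.roll(k.flip(0), shifts=t + 1, dims=0)).sum()
|         for t in range(n)
|     ])
|     print(torch.allclose(y_fft, y_ref, atol=1e-3))  # True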
| imustachyou wrote:
| S4 and its class of state-space models are an impressive
| mathematical and signal-processing innovation, and I thought it
| was awesome how they destroyed previous baselines for long-range
| tasks.
|
| Have there been any state-space models adapted for arbitrary text
| generation?
|
| Language models like ChatGPT are trained to predict new words
| based on the previous ones and are excellent for generation, a
| harder task than translation or classification. I'm doubtful
| about the adaptability of text models that deal with fixed-size
| inputs/outputs and don't have an architecture that is as natural
| for generating indefinitely long sequences.
| sdenton4 wrote:
| Go read about S4, from these authors. It's about having a
| learnable state-space model which can be efficiently
| implemented as either an RNN or a (very long) convolution,
| depending on whether you're training or doing inference.
| Buttons840 wrote:
| Do these scale as well as transformers? My understanding is
| that classic RNNs don't scale well, and that is one reason
| why transformers became popular.
|
| As a pleb who doesn't even own a data center, I've been
| hoping that a superior machine learning architecture will be
| discovered that doesn't scale well. We would be fortunate if
| our personal computers end up being half as good as
| Microsoft's or Amazon's best models; fortunate if the best
| architecture gains little from an additional 10,000 GPUs.
| This would help spread the benefits of AI evenly among anyone
| with a phone or computer -- a utopia compared to the other
| possibility, that everyone can learn how to build AI, but
| only those with a few hundred million to throw at a data
| center can actually control the means of production -- err, I
| mean, the means of intelligence.
|
| Philosophically, this wouldn't be unlike people. Humans are
| still the greatest intelligence we're aware of, and humans
| don't scale. I'm hoping computer intelligence ends up not
| scaling well either.
| sdenton4 wrote:
| That's the point of having multiple realizations of the
| same underlying model.
|
| The (depthwise) convolutional realization is extremely
| efficient for training, and the RNN is extremely efficient
| for inference. The scaling in both of these cases is much
| better than attention layers - as they discuss in the
| article.
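|
| The equivalence itself is easy to see in a toy example:
| unrolling the linear recurrence gives a convolution whose
| kernel is C A^(k-1) B. A small sketch, assuming PyTorch, with
| no learning involved:
|
|     import torch
|
|     # Toy SSM: x_{t+1} = A x_t + B u_t,  y_t = C x_t,  x_0 = 0.
|     d, T = 4, 16
|     A = 0.3 * torch.randn(d, d)
|     B = torch.randn(d, 1)
|     C = torch.randn(1, d)
|     u = torch.randn(T)
|
|     # Recurrent realization (cheap step-by-step inference).
|     x, y_rnn = torch.zeros(d, 1), []
|     for t in range(T):
|         y_rnn.append((C @ x).item())
|         x = A @ x + B * u[t]
|
|     # Convolutional realization: y_t = sum_k K_k u_{t-k},
|     # with K_0 = 0 and K_k = C A^(k-1) B for k >= 1.
|     K, M = torch.zeros(T), torch.eye(d)
|     for t in range(1, T):
|         K[t] = (C @ M @ B).item()
|         M = A @ M
|     y_conv = [sum(K[j].item() * u[t - j].item() for j in range(t + 1))
|               for t in range(T)]
|
|     print(torch.allclose(torch.tensor(y_rnn), torch.tensor(y_conv),
|                          atol=1e-3))  # True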
| 3327 wrote:
| [dead]
| pmontra wrote:
| Is this the way we work? We are told a fact only a few times and
| we remember it for the rest of our lives, with no 32k or 32M
| context.
|
| I think that they are following the easy path, much like the
| gigahertz race in CPUs, and will hit a wall. Maybe the wall will
| be so far away that it will give us an AGI, but maybe it will
| give us superhuman machines only in well-defined contexts. We'll
| have to squeeze our instructions into a prompt too small for
| some tasks and get a bot behaving like the main character of the
| movie Memento (he could remember only the last few minutes and
| very old memories).
| [deleted]
| Ozzie_osman wrote:
| There will be those of us that understand how all these models
| work, and there will be those of us that simply use them.
| istjohn wrote:
| That's true of just about any technology. Most carpenters would
| fail to explain why their hammer drives a nail into wood
| instead of bouncing off, why it has a handle that extends past
| the hand's grip, why the head of the hammer neither shatters
| nor mushrooms over time, to say nothing of their nail guns and
| circular saws.
| Herval_freire wrote:
| Someone in a previous comment on LLM research said that,
| according to what he knew about LLM research, we were at a local
| maximum and that no further improvement was likely possible.
|
| I disagreed with him, and this article is evidence in favor of
| my point. If research like this continues to move forward, LLMs
| will improve at a rapid rate.
|
| Different threads attract different groups of people with
| different areas of expertise, so I will sort of reiterate the
| topic here as I'm interested. What are most people's thoughts on
| this "local maximum" thing? Have we actually hit a dead end?
| Especially given the proliferation of effort towards producing
| research like the work shown here.
| ChatGTP wrote:
| I mean, I just read that article and it doesn't seem like a lot
| will change. Sure, it can summarize a whole book, or read a
| larger chunk of code to do things with, but I didn't really see
| it talk about taking things to "the next level", so to speak.
|
| The researchers are also excited:
|
| _We're especially motivated by applications that could benefit
| from longer-sequence models - high-resolution imaging, new
| modalities of data, language models that can read entire books.
| Imagine giving a language model an entire book and having it
| summarize the plot, or conditioning a code generation model on
| all the code you've ever written. The possibilities are wild -
| and we're excited._
| Herval_freire wrote:
| But this research came out mere weeks after the release of
| GPT-4. That is in itself rapid. If small incremental changes
| like this continue on a sort of monthly basis, the trendline
| points towards something that's not a dead end. That's my
| view of it.
|
| As with most technology there isn't necessarily always a
| constant influx of inflection points and paradigm shifts.
| Improvement will likely creep up on us incrementally.
| Suddenly one day it's clearly more intelligent than a human,
| and we can't point to when it happened.
| [deleted]
| jamilton wrote:
| GPT-4's release doesn't seem like the relevant time marker,
| since nothing in the article builds on it or depends on it.
| The paper for H3 was submitted in December 2022.
|
| The pace of the last few years definitely seems rapid; I just
| don't want there to be a false impression.
| intalentive wrote:
| Once you exhaust the dataset of all written language, the next
| step is multi-modal -- images, audio, video. What will next-
| token prediction give us on such a dataset? Better versions of
| what we have now -- style transfer, summarization, captioning,
| translation, prompt-based generation, etc., but with
| synesthesia.
|
| There is still plenty of improvement ahead but I don't think
| anything genuinely surprising will come from the current regime
| of feedforward models. What is missing is action -- an
| interactive feedback loop between agent and environment.
| Progress in RL and robotics has been very slow by comparison
| and unless we see a breakthrough there, I would guess the GPT
| phase plateaus in the next 5-10 years.
| skybrian wrote:
| I expect progress will be much less predictable. Some kinds
| of action look pretty easy; it depends on the domain.
|
| For example, I expect skill at writing some kinds of code to
| improve dramatically because running tests in a sandbox looks
| easy. It's already being researched. [1] Extending that to
| device drivers might be a bit harder. Fuzzing is already
| mostly automated and smarter fuzzing could get pretty scary.
|
| [1] https://nanothoughts.substack.com/p/reflecting-on-
| reflexion
| Salgat wrote:
| Basically, if the knowledge exists online in a way that can
| be pieced together in a straightforward manner, GPT will
| figure it out, but for information that requires
| experimentation and creating new information to derive
| results, it won't be of much use. For example, GPT can't
| iteratively try different programming techniques to speed
| up a block of parallelizable code; it'll simply give you
| the best guess that it can find off Google.
| skybrian wrote:
| It won't iterate on its own, but you can do it. You can
| ask it for a list of things to try, and they will be
| different alternatives. You can also tell it the result
| of an experiment and it will often figure out what to
| fix.
|
| If you follow the link I shared, some researchers
| automated asking GPT4 to write tests, running the tests
| in a sandbox, and feeding the results back in.
| UncleEntity wrote:
| > What will next-token prediction give us on such a dataset?
|
| Haven't we pretty much figured out these things are doing
| more than just predicting the next token at this point?
|
| There's probably a lot to be done with a "prediction
| machine"; birds aren't all that smart but can catch bugs in
| midair.
| skybrian wrote:
| People keep underestimating what next-token prediction can
| do, but they're not wrong that it's how LLMs work.
|
| It's actually a good question: what will next-token
| prediction be able to do on new datasets? The error is
| thinking you can answer it, even in broad terms.
| cs702 wrote:
| This looks really interesting! If these guys succeed in bringing
| self-attention's computational cost down from O(n^2) to O(n log
| n), that would be a huge win. The quadratic cost makes it very
| difficult to increase sequence length on current hardware. I'm
| going to take a closer look.
|
| There are other interesting ongoing efforts to increase sequence
| length. One that has worked for me is this dynamic routing
| algorithm, related to self-attention, that can handle sequences
| with 1M+ tokens in a single GPU:
| https://github.com/glassroom/heinsen_routing . Right now, you can
| take 1,000 sequences of hidden states computed by a pretrained
| transformer, each sequence with, say, 1024 tokens, concatenate
| them into a single ultra-long sequence with 1,024,000 hidden
| states, slap 1,024,000 position encodings on top, and feed the
| whole thing to that routing algorithm to predict the next token
| (or whatever other training objective you want to optimize for).
| It works. Search the README for "Very Long Sequences".
|
| If anyone here has other suggestions for working with long
| sequences (hundreds of thousands to millions of tokens), _I'd
| love to learn about them_.
| [deleted]
| og_kalu wrote:
| There are already linear attention advances. GPT-4-32k is
| almost certainly using some form of FlashAttention.
|
| Attention isn't really O(n^2) anymore.
| cs702 wrote:
| My understanding is that FlashAttention's memory use is
| linear, or close to linear in practice, but computation is
| still O(n^2). I'm unaware of anyone being able to apply
| FlashAttention on, say, a million tokens, because it must
| execute ~1/2 x 1,000,000^2 x n_head dot-products, each in a
| subspace with d_head dimensions. That's not exactly
| computationally cheap!
| og_kalu wrote:
| No you're right. I mistook you. Compute isn't linear yet.
| lucidrains wrote:
| It is only linear in terms of memory, not compute. Flash
| attention is a big advance, but not enough for 1 million
| tokens
| [deleted]
| [deleted]