[HN Gopher] From deep to long learning?
___________________________________________________________________
 
From deep to long learning?
 
Author : headalgorithm
Score  : 353 points
Date   : 2023-04-09 12:41 UTC (10 hours ago)
 
web link (hazyresearch.stanford.edu)
w3m dump (hazyresearch.stanford.edu)
 
| [deleted]
 
| marshmallowmad wrote:
| I don't quite understand why context length needs to keep
| growing. It seems to me like many tasks (e.g. customizing your
| LLM on your own data) would benefit from the model doing some
| sort of sped-up fine-tuning on any prompt that gets added. That
| way we don't have to rely on all these hacks to find the most
| relevant context and repeat the same info in prompts. Curious if
| anyone has insight here as this has caused me some confusion
| lately!
 
  | [deleted]
 
| raphlinus wrote:
| The thing that stuck out to me is the assertion that FFT is
| poorly supported on modern GPUs. That's surprising to me, as
| there's cuFFT officially supported by Nvidia, and vkFFT that
| achieves similar performance portably using compute shaders. I
| believe these are based on f32 math, so perhaps the potential win
| is using tensor cores to compute FFT at lower precision? It seems
| surprising to me that decomposing into matrix operations is the
| win here; it seems you'd do better writing a kernel that makes
| use of the cooperative matrix (aka WMMA, tensor core,
| simd_matrix) capabilities of the GPU.
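| 
| (For reference, the FFT use here is for computing very long,
| input-length convolutions, where the O(n log n) FFT route beats
| direct O(n^2) convolution. A minimal sketch of that pattern in
| PyTorch - my own illustration, not the paper's fused kernel:)
| 
|     import torch
| 
|     def fft_conv(u, k):
|         # Long convolution via FFT: O(n log n) vs O(n^2) direct.
|         # u: (batch, n) input, k: (n,) filter, zero-padded to 2n
|         # so the circular convolution becomes a linear one.
|         n = u.shape[-1]
|         U = torch.fft.rfft(u, n=2 * n)
|         K = torch.fft.rfft(k, n=2 * n)
|         return torch.fft.irfft(U * K, n=2 * n)[..., :n]
| 
|     u = torch.randn(4, 1024)   # batch of sequences
|     k = torch.randn(1024)      # filter as long as the input
|     y = fft_conv(u, k)         # same shape as u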
 
  | svantana wrote:
  | Looking at the source paper, they are claiming a 2.2x speedup
  | over cuFFT for convolutions, so it's not an earth-shattering
  | gain, but still.
 
| aqme28 wrote:
| It seems to me like these long context lengths are going to make
| a huge difference in the capabilities of things people are able
| to produce. Agents can use LLMs to subdivide a large corpus into
| manageable chunks. More context just makes them much much better
| at that.
| 
| Regardless, for most programming tasks, I doubt my equivalent
| human context length is any better than 32k tokens.
 
  | jerpint wrote:
  | The problem is not the short-term context length, but the long
  | term. If you want to do things like long-term goal planning,
  | keeping track of distant past events can be of high value.
 
    | taneq wrote:
    | Maybe that's what our inner monologue is for... to cycle
    | relevant context through regularly so it stays in scope. I
    | mean, that's not actually why we have one but it'd be cute.
 
    | PaulHoule wrote:
    | I'd argue for planning that the answer is to couple the LLM
    | to some other system rather than try to improve the LLM.
    | Combinatorial optimization is a well-understood problem that
    | is NP-complete in theory but usually tractable in practice.
    | For doing math we might as well give the LLM a pocket
    | calculator, why not couple it to a planner, theorem prover
    | and similar tools?
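    | 
    | A toy sketch of what that coupling could look like (the tool
    | names and the "CALL tool: arg" protocol here are made up,
    | purely to illustrate the loop):
    | 
    |     def calculator(expr: str) -> str:
    |         # Toy evaluator standing in for a real tool.
    |         return str(eval(expr, {"__builtins__": {}}))
    | 
    |     TOOLS = {"calculator": calculator}
    | 
    |     def call_llm(prompt: str) -> str:
    |         # Stand-in for a real LLM API: asks for arithmetic
    |         # once, then answers using the returned result.
    |         if "calculator returned" not in prompt:
    |             return "CALL calculator: 12 * 7"
    |         return "The answer is 84."
    | 
    |     def solve(task: str) -> str:
    |         prompt = task
    |         for _ in range(5):              # bounded loop
    |             reply = call_llm(prompt)
    |             if reply.startswith("CALL "):
    |                 name, _, arg = reply[5:].partition(":")
    |                 result = TOOLS[name.strip()](arg.strip())
    |                 prompt += f"\n{name} returned: {result}"
    |             else:
    |                 return reply            # final answer
    |         return reply
    | 
    |     print(solve("What is 12 * 7?"))   # The answer is 84.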
 
      | pixl97 wrote:
      | This is pretty much what people are testing now with Auto-
      | GPT
 
      | sharemywin wrote:
      | That's the interesting thing. If LLMs start to just become
      | interfaces for other systems, you could probably just use
      | them to train smaller systems in the style of Alpaca and
      | other recent LLMs.
 
      | pmoriarty wrote:
      | _" For doing math we might as well give the LLM a pocket
      | calculator, why not couple it to a planner, theorem prover
      | and similar tools?"_
      | 
      | ChatGPT has already been hooked up to Wolfram Alpha.[1]
      | 
      | For hooking up other things, see HuggingGPT and
      | TaskMatrix.AI.[2][3]
      | 
      | [1] - https://writings.stephenwolfram.com/2023/03/chatgpt-
      | gets-its...
      | 
      | [2] - https://arxiv.org/pdf/2303.17580.pdf
      | 
      | [3] - https://arxiv.org/pdf/2303.16434.pdf
 
      | btbuildem wrote:
      | Precisely. LLMs are the Perl of the future
 
  | PartiallyTyped wrote:
  | Our context is hierarchical, at multiple different resolutions,
  | so I don't think it's comparable.
 
    | aqme28 wrote:
    | I was trying to describe a hierarchy of LLMs as an agent. I
    | don't think that's a uniquely human ability.
 
    | btbuildem wrote:
    | I've had good results using that approach with LLMs for
    | domain-specific problem solving.
 
    | letitgo12345 wrote:
    | Good chance OAI's is too: https://arxiv.org/abs/2110.13711
    | (the paper has the long-context lead at OAI as a co-author).
 
    | XorNot wrote:
    | I'd say actively trying to remember an event and "narrowing
    | it down" mentally is a process that feels suspiciously like a
    | dialogue with an LLM, where you keep asking for the answer to
    | be improved.
 
  | NhanH wrote:
  | Human experts rely on the "chunking" effect for expertise,
  | which is mostly not part of the context length (working
  | memory). For a pseudo-analogy, each human is fine-tuned from
  | raw intelligence (by training and education). In that sense a
  | generic LLM probably can't beat us yet, so don't despair!
 
    | onos wrote:
    | Any job focused on fact retrieval is at risk.
 
      | RandomLensman wrote:
      | Need to know that the facts exist in the first place,
      | though.
 
        | orbifold wrote:
        | GPT-4 already has graduate-level math and physics
        | knowledge, with a level of recall that most students can
        | only dream of.
 
        | RandomLensman wrote:
        | Do those actually matter in jobs? Most valuable facts I
        | encounter are far more niche, often in someone's head and
        | not written down, or stored privately somewhere, etc.
 
      | jimkoen wrote:
      | What kind of job is only focused on fact retrieval?
 
        | istjohn wrote:
        | My biggest blocker in web development is fact retrieval.
        | As someone who only dabbles, I don't struggle with how to
        | logically design my project, but I'm constantly
        | forgetting CSS and JS details like how to accomplish a
        | specific task with CSS flexbox or how to sort an array in
        | JS vs in Python. On old personal projects, I forget the
        | names of my own functions and whether they return a list
        | or a dict. Hell, I'll forget function signatures for
        | functions I just wrote. I forget external library and API
        | details. If I had perfect recall, I would 100x my web dev
        | productivity.
 
        | fnordpiglet wrote:
        | Jeopardy contestant
 
  | kmeisthax wrote:
  | The problem with comparing LLM and human context length is that
  | LLMs don't update their weights based on their input. The 32k
  | of context that they do have is their _only_ memory.
  | 
  | Humans have multiple layers of memory and can recall things and
  | concepts from years in the past - akin to millions of tokens'
  | worth of recall in an LLM. Yes, that memory is extremely lossy,
  | but it's there.
 
  | jacquesm wrote:
  | 'Superhuman' has many dimensions. It can be speed, it can be
  | the ability to retain a lot of information, it can be the
  | ability to deal with a large quantity of information at once
  | and many other dimensions besides. For the longest time Chess
  | was considered a domain where computers could be dilettantes
  | but could never dominate. Chess is now in 'superhuman'
  | territory and likely we will never see a reversal because any
  | insight that benefits humans also benefits computers but not
  | the other way around.
  | 
  | The fact that this is such a multi-dimensional problem is
  | frequently overlooked in the debate about AI/AGI etc, it may
  | not matter all that much if the non AGI AI is already
  | superhuman on enough dimensions other than the ones that the
  | 'but it isn't AGI' crowd cling to. The consequences are what
  | matters, _not_ the fine print or the implementation details,
  | and those consequences are directly tied to the number of
  | dimensions along which a computer can beat humanity.
  | 
  | To give an example: if a chess program was 3000 Elo before but
  | so slow that it would lose under competition rules then humans
  | still dominated chess. Likewise if it would be only 2500 Elo
  | but fast enough, it would still lose to the best humans. But
  | for a large fraction of society it would have already moved out
  | into 'superhuman' territory. A couple of technological leaps
  | later and we're _all_ looking at that AI as if it has moved
  | into superhuman regions.
  | 
  | This sort of thing will happen on many fronts, and all of those
  | fronts are moving, if enough of them go past the threshold then
  | whether it is AGI or not is irrelevant and for every person
  | that threshold is at different points. Maybe a computer will be
  | able to calculate faster and better than you can, maybe it will
  | be able to translate text faster and better than you, maybe it
  | will be able to organize information faster and better than
  | you. Saying at which point we cross the line into it being able
  | to _think_ faster and better than you is hard, but we can see
  | that we are getting close to that line without even knowing
  | exactly where that line is.
 
    | pmoriarty wrote:
    | _" Chess is now in 'superhuman' territory and likely we will
    | never see a reversal because any insight that benefits humans
    | also benefits computers but not the other way around."_
    | 
    | What is considered human is malleable. It is conceivable that
    | humans will be enhanced in various biological and non-
    | biological ways to a point that they can once again compete
    | with computers.
 
      | Mezzie wrote:
      | We may also just change the rules of chess.
      | 
      | "Chess" is a human created game after all.
 
        | evrimoztamur wrote:
        | Somebody actually tried with
        | https://en.m.wikipedia.org/wiki/Arimaa, but it didn't
        | take too long until it was also figured out!
 
        | dmd wrote:
        | Look I found the guy who keeps moving the goalposts!
 
        | anthomtb wrote:
        | I think you're making a joke.
        | 
        | But moving goalposts is an integral part of all
        | professional sports. The 3 point line in basketball and
        | engine size and aspiration in motorsports are obvious
        | examples. I don't see how adjusting chess rules to
        | disfavor AI competitors is any different.
 
        | Mezzie wrote:
        | It's a pretty common way for humans to deal with not
        | being able to do something/something not working.
 
    | andrepd wrote:
    | > For the longest time Chess was considered a domain where
    | computers could be dilettantes but could never dominate
    | 
    | I don't think this was ever true. Chess programs appeared
    | EXTREMELY early on, and everyone recognised that it was a
    | matter of time until hardware was quick enough to evaluate so
    | many positions per second that grandmasters could be defeated
    | by sheer calculation.
 
      | jacquesm wrote:
      | I was playing chess pretty fanatically when Sargon came out
      | and that was _my_ impression as a computer person, but the
      | chess people around me really didn't think that computers
      | would ever beat the top GMs.
 
        | macintux wrote:
        | Most major advances are predicted by someone, and often
        | seem obvious in hindsight, but it seems like we often
        | conflate those and remember them as having been obvious
        | before they happened.
 
        | jacquesm wrote:
        | The number of people working with computers back then was
        | but a handful, so that gave me a bit of a different
        | perspective, but I think that anybody who was both into
        | computers and into chess back then would have made the
        | same prediction. It still happened faster than I thought
        | it would.
 
    | ChatGTP wrote:
    | I'm curious, what is the point you're trying to convey?
 
    | jstanley wrote:
    | > if a chess program was 3000 Elo before but so slow that it
    | would lose under competition rules then humans still
    | dominated chess.
    | 
    | How are you working out that it has a 3000 Elo if it's not
    | winning games?
 
      | jacquesm wrote:
      | That's the way it is done right now: by playing it against
      | other software implementations and judging them in the
      | exact same way they would judge humans.
      | 
      | Stockfish has an Elo rating over 3500 in spite of no human
      | being even close to that.
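      | 
      | For reference, the rating gap maps to an expected score via
      | the standard Elo formula; a quick illustration (the 2850 for
      | a top human is approximate):
      | 
      |     def expected_score(r_a: float, r_b: float) -> float:
      |         # Standard Elo expected score of A against B.
      |         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
      | 
      |     print(expected_score(3500, 2850))  # ~0.98
      |     # An engine rated 3500 is expected to take ~98% of the
      |     # points against a 2850-rated human.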
 
    | skybrian wrote:
    | One dimension that I think is pretty important is compute
    | cost due to its effect on how they're used. The chatbots are
    | expensive to run, which means they're implemented as request-
    | response APIs that cost money, which means that loops are
    | expensive and they normally don't do any idle-time thinking.
    | 
    | When you play a turn-based game with a bot, that means you
    | don't need to worry about its reaction time. It's paused most
    | of the time, waiting on you. A sorcerer's apprentice scenario
    | isn't going to happen when you're single-stepping.
    | 
    | Moving to routine use of bots that run continuously with fast
    | reaction times will be much more dangerous.
 
      | jacquesm wrote:
      | Yes, that's a very good point. Effectively it is 'ping
      | pong' right now, when you get to always on + push things
      | will change quite a bit. Model efficiency is a very active
      | field.
 
    | m3kw9 wrote:
    | It's like a CPU; we've all seen that movie before. It's
    | superhuman at calculating numbers, but it's one-dimensional
    | nonetheless, like chess programs.
 
| mark_l_watson wrote:
| Interesting point about the Butterfly architecture for hardware
| FFT support. In the 1980s, DARPA provided two types of exotic
| hardware to the company I worked for: the first Connection
| Machine, and the Butterfly machine. I wrote Star Lisp code for
| the CM, but never touched the Butterfly machine.
| 
| Off topic, but I am curious what hardware Apple will release in
| the future for more direct AI support. Their Core ML libraries
| working with Apple Silicon have been very effective so far. The
| next step would likely be a built-in foundation LLM, extending
| what they have supported with BERT models, etc.
 
  | sroussey wrote:
  | I think it will be quite a few years before an LLM is built
  | into their silicon.
  | 
  | But I do see a Neural Engine 2.0 that will better handle these
  | things in the nearer term.
 
    | Sugimot0 wrote:
    | IIRC, creating new ML-optimized chips/chiplets was part of
    | the appeal of RISC-V, right? Is RISC-V relevant yet, or are
    | there any promising RISC-V chips on the way? I know there's a
    | lot of hype around them, so I'm curious how much is just
    | noise and what the real sentiment is from the
    | experts/industry.
 
      | [deleted]
 
  | cs702 wrote:
  | _> The next step would likely be a built-in foundation LLM,
  | extending what they have supported with BERT models, etc._
  | 
  | I'm thinking deeper. It wouldn't surprise me if _self-
  | attention_ itself becomes a _primitive building block_ of
  | future co-processors, e.g., with instructions and memory
  | layouts engineered to make ultra-low-precision self-attention
  | as compute- and memory-efficient as possible. I'm expecting
  | LLMs with hundreds of billions and eventually trillions of
  | parameters will be able to run locally on my laptop and mobile
  | phone, in the not-too-distant future.[a]
  | 
  | [a] If this sounds far-fetched, consider that you can _already_
  | run LLMs with tens of billions of parameters on mobile phones:
  | https://justine.lol/mmap/
 
    | fpgaminer wrote:
    | > I'm expecting LLMs with hundreds of billions and eventually
    | trillions of parameters will be able to run locally on my
    | laptop and mobile phone, in the not-too-distant future
    | 
    | Perhaps. There's been a lot of focus on training-compute
    | optimal models in the industry. Rightfully so, as proofs of
    | concept. That's what led to this perceived parameter count
    | race in published models.
    | 
    | But remember the other side of the scaling laws. For
    | inference, which is what we want to do on our phones, it's
    | better to be inference-compute optimal. That means smaller
    | models trained for longer.
    | 
    | As far as we know today there are no limits to the scaling
    | laws. A 1B parameter model _can_ beat a 1T parameter model,
    | if trained for long enough. Of course it's exponential, so
    | you'd have to pour incalculable training resources into such
    | an extreme example. But I find these extreme examples
    | elucidating.
    | 
    | My pet theory these days is that we'll discover some way of
    | "simulating" multiple parameters from one stored parameter.
    | We know that training-compute optimal models are extremely
    | over-parameterized. So it isn't the raw capacity of the model
    | that's important. It seems like during training the degrees
    | of freedom are what allow larger models to be more sample
    | efficient. If we can find a cheap way of having one parameter
    | simulate multiple degrees of freedom, it will likely give us
    | the ability to gain the advantages of larger models during
    | training, without the inference costs later.
    | 
    | I don't disagree that we're likely to see more and more
    | parameter capacity from our devices. I'm just pointing out
    | that the parameter count race is a bit of an illusion. OpenAI
    | discovered the scaling laws and needed a proof of concept. If
    | they could show AI reaching X threshold first, they could
    | capture the market. The fastest way to do that is to be
    | training-compute optimal. So they had to scale to 175B
    | parameters or more. Now that it's proven, and that there's a
    | market for such an AI, their and others' focus can be on
    | inference-optimal models which are smaller but just as smart.
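    | 
    | A rough illustration of that trade-off, using a Chinchilla-
    | style loss form L(N, D) = E + A/N^a + B/D^b with coefficients
    | close to the published fits (treat the numbers as
    | illustrative, not authoritative):
    | 
    |     def loss(n_params: float, n_tokens: float) -> float:
    |         # Illustrative Chinchilla-style scaling curve.
    |         E, A, B, a, b = 1.7, 400.0, 410.0, 0.34, 0.28
    |         return E + A / n_params**a + B / n_tokens**b
    | 
    |     print(loss(70e9, 1.4e12))  # compute-optimal-ish run
    |     print(loss(7e9, 30e12))    # 10x smaller, ~20x more data
    |     # Both land near the same loss (~1.95 here): a smaller
    |     # model trained much longer can match a bigger one.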
 
      | cs702 wrote:
      | _> I don't disagree that we're likely to see more and more
      | parameter capacity from our devices. I'm just pointing out
      | that the parameter count race is a bit of an illusion.
      | OpenAI discovered the scaling laws and needed a proof of
      | concept. If they could show AI reaching X threshold first,
      | they could capture the market. The fastest way to do that
      | is to be training-compute optimal. So they had to scale to
      | 175B parameters or more. Now that it's proven, and that
      | there's a market for such an AI, their and others' focus
      | can be on inference-optimal models which are smaller but
      | just as smart._
      | 
      | Good point. That could very well be what they're thinking
      | about, in addition to potential improvements in training
      | data and RLHF methods.
      | 
      | Also, I agree it would be great if anyone figures out how
      | to do something akin to "making a smaller model act as if
      | it were gigantic during training" OR "pruning a gigantic
      | model's 'dead paths' as it learns during training," to get
      | the benefits of scale in training without its costs at
      | inference.
 
| skepticATX wrote:
| Can someone with a better understanding than I have comment about
| the relationship between these results and this paper:
| https://arxiv.org/abs/2109.09115, which seems to demonstrate that
| a longer context length has diminishing returns?
 
  | rsfern wrote:
  | I'm not deeply familiar with all these papers, but two things
  | stand out to me
  | 
  | The model architectures are different, and in the very latest
  | paper they scale these non-transformer models to a sequence
  | length of 64k, whereas the paper you linked only considers up
  | to 8k.
 
| AvAn12 wrote:
| With longer (50k+) context lengths, is this just becoming a new
| form of search?
 
  | intalentive wrote:
  | Yes, now we just need it to provide citations.
 
  | jmole wrote:
  | I think the K,Q,V representation was what fundamentally gave
  | rise to LLMs (from "Attention is all you need"), and I'm
  | certain that it wouldn't have happened without the researchers
  | having a background in search @ Google.
  | 
  | Or in other words, it was always a new form of search.
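  | 
  | (The K,Q,V machinery itself is tiny; a minimal sketch of scaled
  | dot-product attention, whose n x n score matrix is the
  | quadratic-in-sequence-length part everyone is trying to
  | shrink:)
  | 
  |     import torch
  | 
  |     def attention(q, k, v):
  |         # softmax(Q K^T / sqrt(d)) V, per "Attention Is All
  |         # You Need". The (n, n) score matrix is the O(n^2) bit.
  |         d = q.shape[-1]
  |         scores = q @ k.transpose(-2, -1) / d**0.5
  |         return torch.softmax(scores, dim=-1) @ v
  | 
  |     n, d = 8, 16
  |     x = torch.randn(n, d)      # toy sequence of hidden states
  |     out = attention(x, x, x)   # learned Q/K/V projections
  |                                # omitted for brevity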
 
    | bckr wrote:
    | [Astronaut looking at earth with the logo of superhuman AI
    | superimposed]
    | 
    | Wait, it's all just search?
    | 
    | [astronaut with gun]
    | 
    | Always has been
 
| logophobia wrote:
| I've successfully applied the S4 operator to long-sequence video
| classification. It's massively more efficient than a similarly
| scaled transformer, but it doesn't train as well. Still, even
| with S4 I got some impressive results; looking forward to more.
 
| Buttons840 wrote:
| If I want to do sequence modelling, let's say, predict the 9th
| element of the following sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9] --
| that is, I know the 8 most recent tokens as my "context", and I
| want to predict the next, 9 in this case --
| 
| Can someone explain to me why a transformer or RNN is better at
| this than a simple linear layer with an equivalent number of
| parameters? A linear layer can receive the context [1, 2, 3, 4,
| 5, 6, 7, 8], properly one-hot encoded / embedded, etc, and
| predict the next sequence. Can a linear layer do just as well as
| a transformer? This setup allows linear layers to predict
| sequences with an arbitrary context size, so why so much hype
| about transformers and RNNs and other sequence focused
| architectures?
| 
| Perhaps the difference is that given the same number of
| parameters, the transformer uses those parameters to perform easy
| computations whereas the linear layer just does one gigantic
| matrix multiplication which isn't very efficient?
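| 
| Concretely, the comparison I have in mind looks something like
| this (sizes arbitrary, PyTorch just for illustration):
| 
|     import torch
|     import torch.nn as nn
| 
|     vocab, ctx, d = 100, 8, 32
| 
|     # Baseline: embed, flatten, one big linear map. The context
|     # size is baked into the weight shape (ctx * d inputs).
|     linear_model = nn.Sequential(
|         nn.Embedding(vocab, d),
|         nn.Flatten(),               # (batch, ctx * d)
|         nn.Linear(ctx * d, vocab),  # next-token logits
|     )
| 
|     # Transformer-ish: same embedding, one self-attention block
|     # whose weights are shared across positions.
|     class TinyAttn(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.emb = nn.Embedding(vocab, d)
|             self.attn = nn.MultiheadAttention(
|                 d, num_heads=4, batch_first=True)
|             self.out = nn.Linear(d, vocab)
| 
|         def forward(self, ids):
|             h = self.emb(ids)
|             h, _ = self.attn(h, h, h)
|             return self.out(h[:, -1])  # logits, last position
| 
|     tokens = torch.randint(0, vocab, (2, ctx))
|     print(linear_model(tokens).shape)  # (2, 100)
|     print(TinyAttn()(tokens).shape)    # (2, 100)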
 
  | QuadmasterXLII wrote:
  | The big difference is that the transformer is (approximately)
  | permutation equivariant, which makes a massive difference in
  | generalization and training speed.
 
    | Buttons840 wrote:
    | I see, so [1, 2, 3, 4, 5, 6, 7, 8] is treated as more similar
    | to [5, 3, 1, 7, 8, 2, 4, 6] than it would be by a linear
    | layer? That's what you mean by permutation equivariant?
    | 
    | I understand each context input is embedded with its
    | position, but I suppose the transformer can learn to ignore
    | the position and just look at the context as an unordered
    | set?
 
    | eachro wrote:
    | What makes it approximately permutation equivariant (vs
    | entirely)? As I understand things, if the order is jumbled,
    | the attention matrix does get its rows and cols permuted in
    | the way you'd expect so I'd have thought they'd be entirely
    | permutation equivariant.
 
      | YetAnotherNick wrote:
      | Inputs have position encodings in them.
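      | 
      | A quick numeric check (my own toy attention, with the
      | position encodings left out): permuting the inputs just
      | permutes the outputs.
      | 
      |     import torch
      | 
      |     def attn(x):
      |         # Bare self-attention, no position encodings.
      |         s = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
      |         return torch.softmax(s, dim=-1) @ x
      | 
      |     x = torch.randn(8, 16)
      |     p = torch.randperm(8)
      |     ok = torch.allclose(attn(x)[p], attn(x[p]), atol=1e-6)
      |     print(ok)  # True: exactly permutation equivariant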
 
| 1024core wrote:
| While the race to incorporate longer and longer context (2K PaLM
| -> 32K now) is interesting, I don't think that'll scale. It'll
| just add too much noise to the history: how do you establish a
| causal relationship between what you're holding in your hand and
| the million other things (context) you've seen in the past?
| You'll end up with spurious correlations.
| 
| What I think (and this is just me talking out of my ass) will be
| required is some form of associative long-term memory. Basically,
| give the model a way to store some embeddings in some form of
| memory, and then retrieve them based on context: so it doesn't
| matter if you encountered that item 2 tokens ago, or 2B.
| 
| At least this is what my current intuition tells me.
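| 
| A minimal sketch of that kind of associative store (cosine-
| similarity lookup over stored embeddings; how the embeddings get
| produced is left out, and everything here is just my own
| illustration):
| 
|     import numpy as np
| 
|     class AssociativeMemory:
|         # Store (embedding, payload) pairs; retrieve the nearest
|         # ones to the current context, whether they were written
|         # 2 steps ago or 2B.
|         def __init__(self, dim):
|             self.keys = np.zeros((0, dim))
|             self.values = []
| 
|         def write(self, key, value):
|             key = key / np.linalg.norm(key)
|             self.keys = np.vstack([self.keys, key])
|             self.values.append(value)
| 
|         def read(self, query, k=3):
|             query = query / np.linalg.norm(query)
|             sims = self.keys @ query
|             top = np.argsort(-sims)[:k]
|             return [self.values[i] for i in top]
| 
|     mem = AssociativeMemory(dim=64)
|     for i in range(1000):
|         mem.write(np.random.randn(64), f"event {i}")
|     print(mem.read(np.random.randn(64)))  # 3 nearest events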
 
  | lucidrains wrote:
  | That line of research is still going:
  | https://github.com/lucidrains/block-recurrent-transformer-py...
  | I think it is worth continuing research on both fronts.
 
    | 1024core wrote:
    | Of course, I'm not saying "don't do research". I'm just
    | saying that I don't think this context-length war will lead
    | us to long-term sustainable gains.
 
      | [deleted]
 
  | nathias wrote:
  | This, plus evaluation based on context that lets it mutate past
  | content, and we are set.
 
  | skybrian wrote:
  | On the other hand, it seems like training on large amounts of
  | text for next-token prediction would tend to reduce reliance on
  | spurious correlations? I don't think this intuitive sort of
  | speculation can predict what it will do.
 
| edulix wrote:
| Instead of long learning or long contexts, at some point
| artificial neural networks will have to transition to
| continuous/online learning - learning while using the network.
| That way, these limitations are overcome the way they are in our
| minds.
| 
| Similar to what Numenta HTM networks do, but scalable and
| performant for real use cases.
| 
| BTW, perhaps human-like consciousness emerges as a "self-
| attention-like" mechanism between context and learning. Just
| saying.
 
  | qumpis wrote:
  | Learn how? I think having infinite context is perfect - no need
  | to learn on my data online and risk exposing it to others.
 
  | thomasahle wrote:
  | Alternatively, we need the model to have a long-term memory and
  | be able to load stuff to/from that while reading.
 
| [deleted]
 
| jmole wrote:
| Oddly enough, I was reading their paper just last night:
| https://arxiv.org/pdf/2302.10866.pdf
| 
| I think we're going to see a lot more in the
| wavelet/convolution/fft space when thinking about how to increase
| context length.
| 
| I think there's also a lot of room for innovation in the
| positional encoding and how it's represented in transformer
| models. It seems like people have been trying lots of things and
| going with what works, but most of it amounts to "look, a new
| orthonormal basis!".
| 
| Hyena sort of seems like the first step in moving to positional
| _embeddings_ (or joint positional/attentional embeddings).
| 
| Very cool work.
 
  | cs702 wrote:
  | I agree this sort of approach looks promising. Maybe using FFTs
  | recurrently to approximate convolutions with input-length
  | filters is the way forward. It's a clever idea. I'm making my
  | way through the paper. Don't fully understand it yet.
  | 
  | The main issue I've seen with other wannabe-sub-quadratic-
  | replacements for self-attention is that they all rely on some
  | kind of low-rank/sparse approximation that in practice renders
  | LLMs incapable of modeling enough pairwise relationships
  | between tokens to achieve state-of-the-art performance.
  | 
  | I'm curious to see if this kind of approach solves the issue.
 
| imustachyou wrote:
| S4 and its class of state-space models are an impressive
| mathematical and signal-processing innovation, and I thought it
| was awesome how they destroyed previous baselines for long-range
| tasks.
| 
| Have there been any state-space models adapted for arbitrary text
| generation?
| 
| Language models like ChatGPT are trained to predict new words
| based on the previous ones and are excellent for generation, a
| harder task than translation or classification. I'm doubtful
| about the adaptability of text models that deal with fixed-size
| input/outputs and don't have an architecture that is as natural
| for generating indefinitely long sequences.
 
  | sdenton4 wrote:
  | Go read about S4, from these authors. It's about having a
  | learnable state-space model which can be efficiently
  | implemented as either an RNN or (very long) convolution,
  | according to the needs of training or inference.
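  | 
  | Roughly: the same linear state-space layer can be stepped like
  | an RNN or unrolled into one long convolution kernel. A toy
  | scalar sketch (parameters made up; real S4 parameterizes A
  | much more carefully):
  | 
  |     import numpy as np
  | 
  |     # Discrete SSM: x[t+1] = A x[t] + B u[t], y[t] = C x[t].
  |     A, B, C = 0.9, 1.0, 0.5
  |     u = np.random.randn(16)
  | 
  |     # Recurrent view: O(1) state per step, good for inference.
  |     x, y_rnn = 0.0, []
  |     for t in range(len(u)):
  |         y_rnn.append(C * x)
  |         x = A * x + B * u[t]
  | 
  |     # Convolutional view: unroll into a kernel, good for
  |     # training (and FFT-able for very long sequences).
  |     k = np.array([C * A ** (t - 1) * B if t > 0 else 0.0
  |                   for t in range(len(u))])
  |     y_conv = [np.dot(k[:t + 1][::-1], u[:t + 1])
  |               for t in range(len(u))]
  | 
  |     print(np.allclose(y_rnn, y_conv))  # True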
 
    | Buttons840 wrote:
    | Do these scale as well as transformers? My understanding is
    | that classic RNNs don't scale well, and that is one reason
    | why transformers became popular.
    | 
    | As a pleb who doesn't even own a data center, I've been
    | hoping that a superior machine learning architecture will be
    | discovered that doesn't scale well. We would be fortunate if
    | our personal computers end up being half as good as
    | Microsoft's or Amazon's best models; fortunate if the best
    | architecture gains little from an additional 10,000 GPUs.
    | This would help spread the benefits of AI evenly among anyone
    | with a phone or computer -- a utopia compared to the other
    | possibility, that everyone can learn how to build AI, but
    | only those with a few hundred million to throw at a data
    | center can actually control the means of production -- err, I
    | mean, the means of intelligence.
    | 
    | Philosophically, this wouldn't be unlike people. Humans are
    | still the greatest intelligence we're aware of, and humans
    | don't scale. I'm hoping computer intelligence ends up not
    | scaling well either.
 
      | sdenton4 wrote:
      | That's the point of having multiple realizations of the
      | same underlying model.
      | 
      | The (depthwise) convolutional realization is extremely
      | efficient for training, and the RNN is extremely efficient
      | for inference. The scaling in both of these cases is much
      | better than attention layers - as they discuss in the
      | article.
 
| 3327 wrote:
| [dead]
 
| pmontra wrote:
| Is this the way we work? We are told a fact only a few times and
| we remember it for the rest of our lives, no 32k or 32M context.
| 
| I think that they are following the easy path, much like the
| gigahertz race in CPUs, and will hit a wall. Maybe the wall will
| be so far away that it will give us an AGI, but maybe it will
| give us superhuman machines only in well-defined contexts. We'll
| have to squeeze our instructions into a prompt too small for some
| tasks and get a bot behaving like the main character of the movie
| Memento (he remembered only the last few minutes and very old
| memories).
 
  | [deleted]
 
| Ozzie_osman wrote:
| There will be those of us that understand how all these models
| work, and there will be those of us that simply use them.
 
  | istjohn wrote:
  | That's true of just about any technology. Most carpenters would
  | fail to explain why their hammer drives a nail into wood
  | instead of bouncing off, why it has a handle that extends past
  | the hand's grip, why the head of the hammer neither shatters
  | nor mushrooms over time, to say nothing of their nail guns and
  | circular saws.
 
| Herval_freire wrote:
| Someone in a previous comment said that, according to what he
| knew about LLM research, we were at a local maximum and no
| further improvement was likely possible.
| 
| I disagreed with him, and this article is evidence in favor of
| my point. If research like this continues to move forward, LLMs
| will improve at a rapid rate.
| 
| Different threads attract different groups of people with
| different areas of expertise, so I will sort of reiterate the
| topic here, as I'm interested. What are most people's thoughts on
| this "local maximum" idea? Have we actually hit a dead end,
| especially given the proliferation of effort towards producing
| research like the work shown here?
 
  | ChatGTP wrote:
  | I mean, I just read that article and it doesn't seem like a lot
  | will change. Sure, it can summarize a whole book, or read a
  | larger chunk of code to do things with, but I didn't really see
  | it talk about taking things to "the next level", so to speak.
  | 
  | The researchers are also excited:
  | 
  |  _We're especially motivated by applications that could benefit
  | from longer-sequence models - high-resolution imaging, new
  | modalities of data, language models that can read entire books.
  | Imagine giving a language model an entire book and having it
  | summarize the plot, or conditioning a code generation model on
  | all the code you've ever written. The possibilities are wild -
  | and we're excited._
 
    | Herval_freire wrote:
    | But this research came out mere weeks after the release of
    | GPT-4. That is in itself rapid. If small incremental changes
    | like this continue on a sort of monthly basis, the trendline
    | points towards something that's not a dead end. That's my
    | view of it.
    | 
    | As with most technology there isn't necessarily always a
    | constant influx of inflection points and paradigm shifts.
    | Improvement will likely creep up on us incrementally.
    | Suddenly one day it's clearly more intelligent than a human
    | and we can't point to when it happened.
 
      | [deleted]
 
      | jamilton wrote:
      | GPT-4's release doesn't seem like the relevant time marker,
      | since nothing in the article builds on it or depends on it.
      | The paper for H3 was submitted in December 2022.
      | 
      | The pace of the last few years definitely seems rapid; I
      | just don't want there to be a false impression.
 
  | intalentive wrote:
  | Once you exhaust the dataset of all written language, the next
  | step is multi-modal -- images, audio, video. What will next-
  | token prediction give us on such a dataset? Better versions of
  | what we have now -- style transfer, summarization, captioning,
  | translation, prompt-based generation, etc., but with
  | synesthesia.
  | 
  | There is still plenty of improvement ahead but I don't think
  | anything genuinely surprising will come from the current regime
  | of feedforward models. What is missing is action -- an
  | interactive feedback loop between agent and environment.
  | Progress in RL and robotics has been very slow by comparison
  | and unless we see a breakthrough there, I would guess the GPT
  | phase plateaus in the next 5-10 years.
 
    | skybrian wrote:
    | I expect progress will be much less predictable. Some kinds
    | of action look pretty easy; it depends on the domain.
    | 
    | For example, I expect skill at writing some kinds of code to
    | improve dramatically because running tests in a sandbox looks
    | easy. It's already being researched. [1] Extending that to
    | device drivers might be a bit harder. Fuzzing is already
    | mostly automated and smarter fuzzing could get pretty scary.
    | 
    | [1] https://nanothoughts.substack.com/p/reflecting-on-
    | reflexion
 
      | Salgat wrote:
      | Basically, if the knowledge exists online in a way that can
      | be pieced together in a straightforward manner, GPT will
      | figure it out, but for information that requires
      | experimentation and creating new information to derive
      | results, it won't be of much use. For example, GPT can't
      | iteratively try different programming techniques to speed
      | up a block of parallelizable code; it'll simply give you
      | the best guess that it can find off Google.
 
        | skybrian wrote:
        | It won't iterate on its own, but you can do it. You can
        | ask it for a list of things to try, and they will be
        | different alternatives. You can also tell it the result
        | of an experiment and it will often figure out what to
        | fix.
        | 
        | If you follow the link I shared, some researchers
        | automated asking GPT4 to write tests, running the tests
        | in a sandbox, and feeding the results back in.
 
    | UncleEntity wrote:
    | > What will next-token prediction give us on such a dataset?
    | 
    | Haven't we pretty much figured out these things are doing
    | more than just predicting the next token at this point?
    | 
    | There's probably a lot to be done with a "prediction
    | machine", birds aren't all that smart but can catch bugs in
    | midair.
 
      | skybrian wrote:
      | People keep underestimating what next-token prediction can
      | do, but they're not wrong that it's how LLMs work.
      | 
      | It's actually a good question: what will next-token
      | prediction be able to do on new datasets? The error is
      | thinking you can answer it, even in broad terms.
 
| cs702 wrote:
| This looks really interesting! If these guys succeed in bringing
| self-attention's computational cost down from O(n^2) to O(n log
| n), that would be a huge win. The quadratic cost makes it very
| difficult to increase sequence length on current hardware. I'm
| going to take a closer look.
| 
| There are other interesting ongoing efforts to increase sequence
| length. One that has worked for me is this dynamic routing
| algorithm, related to self-attention, that can handle sequences
| with 1M+ tokens on a single GPU:
| https://github.com/glassroom/heinsen_routing . Right now, you can
| take 1,000 sequences of hidden states computed by a pretrained
| transformer, each sequence with, say, 1024 tokens, concatenate
| them into a single ultra-long sequence with 1,024,000 hidden
| states, slap 1,024,000 position encodings on top, and feed the
| whole thing to that routing algorithm to predict the next token
| (or whatever other training objective you want to optimize for).
| It works. Search the README for "Very Long Sequences".
| 
| If anyone here has other suggestions for working with long
| sequences (hundreds of thousands to millions of tokens), _I'd
| love to learn about them_.
 
  | [deleted]
 
  | og_kalu wrote:
  | There are already linear attention advances. GPT-4-32k is
  | almost certainly using some form of FlashAttention.
  | 
  | Attention isn't really O(n^2) anymore.
 
    | cs702 wrote:
    | My understanding is that FlashAttention's memory use is
    | linear, or close to linear in practice, but computation is
    | still O(n^2). I'm unaware of anyone being able to apply
    | FlashAttention on, say, a million tokens, because it must
    | execute ~1/2 x 1,000,000^2 x n_head dot-products, each in a
    | subspace with d_head dimensions. That's not exactly
    | computationally cheap!
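    | 
    | Back-of-the-envelope (head count and head dimension are
    | typical assumed values, not anything GPT-4-specific):
    | 
    |     n, n_head, d_head = 1_000_000, 16, 64
    |     dots = 0.5 * n * n * n_head        # causal score matrix
    |     flops = dots * 2 * d_head          # mul + add per dim
    |     print(f"{dots:.1e} dot-products, ~{flops:.1e} FLOPs")
    |     # ~8.0e12 dot-products and ~1.0e15 FLOPs for one
    |     # attention layer, however little memory it needs.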
 
      | og_kalu wrote:
      | No you're right. I mistook you. Compute isn't linear yet.
 
    | lucidrains wrote:
    | It is only linear in terms of memory, not compute. Flash
    | attention is a big advance, but not enough for 1 million
    | tokens.
 
  | [deleted]
 
  | [deleted]
 
___________________________________________________________________
(page generated 2023-04-09 23:00 UTC)