[HN Gopher] The Bitter Lesson (2019)
___________________________________________________________________
 
The Bitter Lesson (2019)
 
Author : winkywooster
Score  : 62 points
Date   : 2022-04-02 17:16 UTC (5 hours ago)
 
web link (www.incompleteideas.net)
w3m dump (www.incompleteideas.net)
 
| yamrzou wrote:
| Previous discussions:
| 
| 2019: https://news.ycombinator.com/item?id=19393432
| 
| 2020: https://news.ycombinator.com/item?id=23781400
 
  | dang wrote:
  | Thanks! Macroexpanded:
  | 
  |  _The Bitter Lesson (2019)_ -
  | https://news.ycombinator.com/item?id=23781400 - July 2020 (85
  | comments)
  | 
  |  _The Bitter Lesson_ -
  | https://news.ycombinator.com/item?id=19393432 - March 2019 (53
  | comments)
 
| civilized wrote:
| > Early methods conceived of vision as searching for edges, or
| generalized cylinders, or in terms of SIFT features. But today
| all this is discarded. Modern deep-learning neural networks use
| only the notions of convolution and certain kinds of invariances,
| and perform much better.
| 
| This assessment is a bit off.
| 
| First, convolution and invariance are definitely not the only
| things you need. Modern DL architectures use lots of very clever
| gadgets inspired by decades of interdisciplinary research.
| 
| Second, architecture still matters a lot in neural networks, and
| domain experts still make architectural decisions heavily
| informed by domain insights into what their goals are and what
| tools might make progress towards these goals. For example,
| convolution + max-pooling makes sense as a combination because of
| historically successful techniques in computer vision. It wasn't
| something randomly tried or brute forced.
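| 
| As a rough sketch of what I mean by that combination (assuming
| PyTorch here; the layer sizes are arbitrary):
| 
|     import torch
|     import torch.nn as nn
|     
|     # Convolution extracts local features; max-pooling adds a
|     # degree of translation invariance and downsamples.
|     block = nn.Sequential(
|         nn.Conv2d(3, 16, kernel_size=3, padding=1),
|         nn.ReLU(),
|         nn.MaxPool2d(kernel_size=2),
|     )
|     
|     x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
|     print(block(x).shape)          # torch.Size([1, 16, 16, 16])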
| 
| The role of domain expertise has not gone away. You just have to
| leverage it in ways that are lower-level, less obvious, and less
| explicitly connected to the goal than a human would expect based
| on high-level conceptual reasoning.
| 
| From what I've heard, the author's thesis is most true for chess.
| The game tree for chess isn't as huge as Go's, so it's more
| amenable to brute forcing. The breakthrough in Go was not from
| Moore's Law; it was from innovative DL/RL techniques.
| 
| Greater computation may enable more compute-heavy techniques,
| but that doesn't mean it's obvious what these techniques are, or
| that they are well-characterized as simpler or more "brute
| force" than past approaches.
 
  | a-dub wrote:
  | > First, convolution and invariance are definitely not the only
  | things you need. Modern DL architectures use lots of very
  | clever gadgets inspired by decades of interdisciplinary
  | research.
  | 
  | i have noticed this. rather than replacing feature engineering,
  | it seems that you find some of those ideas from psychophysics
  | just manually built into the networks.
 
| antiquark wrote:
| The author is applying the "past performance guarantees future
| results" fallacy.
 
| tejohnso wrote:
| This reminds me of an article I read describing George Hotz's
| Comma.ai end-to-end reinforcement learning approach vs Tesla's
| feature-engineering-based approach.
| 
| Hotz feels that "not only will comma outpace Tesla, but that
| Tesla will eventually adopt comma's method."[1]
| 
| [1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die
| 
| Previous discussion on the article:
| https://news.ycombinator.com/item?id=30738763
 
  | fnbr wrote:
  | I think an end-to-end RL approach will eventually work, but
  | _eventually_ could be a really long time from now. It's also a
  | question of scale: even if Comma's approach is fundamentally
  | better, how much better is it? If Tesla has 1000x more cars,
  | and their approach is 10x worse, they'll still improve 100x
  | faster than Comma.
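  | 
  | Back-of-the-envelope, using the hypothetical numbers above:
  | 
  |     fleet_ratio = 1000     # Tesla cars per Comma car (made up)
  |     quality_penalty = 10   # how much worse the approach is
  |     # net improvement rate relative to Comma
  |     print(fleet_ratio / quality_penalty)   # 100.0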
 
| jasfi wrote:
| So don't build your AI/AGI approach too high-level. But you still
| need to represent common sense somehow.
 
| kjksf wrote:
| And this is why I'm much less pessimistic than most about
| robotaxis.
| 
| Waymo has a working robotaxi in a limited area, and they got
| there with a fleet of 600 cars and mere millions of miles of
| driving data.
| 
| Now imagine they trained with 100x the cars, i.e. 60k cars, and
| billions of miles of driving data.
| 
| Guess what, Tesla already has FSD running, under human
| supervision, in 60k cars and that fleet is driving billions of
| miles.
| 
| They are collecting 100x the data as I write this.
| 
| We also continue to significantly improve hardware for both NN
| inference (Nvidia Drive, Tesla FSD chip) and training (Nvidia
| GPUs, Tesla Dojo, Google TPU, and 26 other startups working on
| AI hardware: https://www.ai-startups.org/top/hardware/).
| 
| If the bitter lesson extends to the problem of self-driving, we're
| doing everything right to solve it.
| 
| It's just a matter of time until they collect enough training
| data, have enough compute to train the neural network, and
| enough compute to run the network in the car.
 
  | Animats wrote:
  | Waymo is not a raw neural network. Waymo has an explicit
  | geometric world model, and you can look at it.
 
  | VHRanger wrote:
  | More data doesn't help if the additional data points don't add
  | information to the dataset.
  | 
  | At some point it's better to add features than simply more rows
  | of observations.
  | 
  | Arguably text and images are special cases here because we do
  | self-supervised learning (which you can't do for self-driving
  | for obvious reasons).
  | 
  | What TSLA should have done a long time ago is keep investing in
  | additional sensors to enrich data points, rather than blindly
  | collecting more of the same.
 
  | fxtentacle wrote:
  | You're not wrong, but I believe you're so far off on the
  | necessary scale that it'll never solve the problem.
  | 
  | For an AI to learn to play Bomberman at an acceptable level,
  | you need to run 2-3 billion training steps with RL, where the
  | AI is free to explore new actions to collect data
  | about how well they work. I'm part of team CloudGamepad and
  | we'll compete in the Bomberland AI challenge finals tomorrow,
  | so I do have some practical experience there. Before I looked
  | at things in detail, I also vastly overestimated reinforcement
  | learning's capabilities.
  | 
  | For an AI to learn a useful policy without the ability to
  | confirm what an action does, you need exponentially more data.
  | There are great papers by DeepMind and OpenAI that try to ease
  | the pain a
  | bit, but as-is, I don't think even a trillion miles driven
  | would be enough data. Letting the AI try out things, of course,
  | is dangerous, as we have seen in the past.
  | 
  | But the truly nasty part about AI and RL in particular is that
  | the AI will act as if anything that it didn't see often enough
  | during training simply doesn't exist. If it never sees a pink
  | truck from the side, no "virtual neurons" will grow to detect
  | this. AIs in general don't generalize. So if your driving
  | dataset lacks enough examples of 0.1% black swan events, you
  | can be sure that your AI is going to go totally haywire when
  | they happen. Like "I've never seen a truck sideways before =>
  | it doesn't exist => boom."
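  | 
  | To make the "free to explore" point concrete, here's a toy
  | epsilon-greedy loop (a 3-armed bandit, nothing like Bomberman):
  | the agent gets to try an action and observe its payoff, which
  | is exactly what a fixed driving dataset never gives you.
  | 
  |     import random
  |     
  |     q = [0.0, 0.0, 0.0]            # value estimate per action
  |     n = [0, 0, 0]
  |     true_reward = [0.2, 0.5, 0.8]  # unknown to the agent
  |     
  |     for step in range(100_000):   # real RL needs ~1e9+ steps
  |         if random.random() < 0.1:
  |             a = random.randrange(3)   # explore
  |         else:
  |             a = q.index(max(q))       # exploit
  |         r = true_reward[a] + random.gauss(0, 1)
  |         n[a] += 1
  |         q[a] += (r - q[a]) / n[a]     # running-mean update
  |     
  |     print(q)  # approaches true_reward only after many tries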
 
    | naveen99 wrote:
    | What were the new data augmentation methods for optical flow
    | you referred to in a previous comment on this topic?
 
    | shadowgovt wrote:
    | The sensors self-driving cars use are far less sensitive to
    | color than human eyes.
    | 
    | You can generalize your concept to the other sensors, but
    | sensor fusion compensates somewhat... The odds of an input
    | being something never seen across _all_ sensor modalities
    | become pretty low.
    | 
    | (And when it does see something weird, it can generally handle
    | it the way humans do... drive defensively.)
 
    | gwern wrote:
    | > But the truly nasty part about AI and RL in particular is
    | that the AI will act as if anything that it didn't see often
    | enough during training simply doesn't exist. If it never sees
    | a pink truck from the side, no "virtual neurons" will grow to
    | detect this. AIs in general don't generalize. So if your
    | driving dataset lacks enough examples of 0.1% black swan
    | events, you can be sure that your AI is going to go totally
    | haywire when they happen. Like "I've never seen a truck
    | sideways before => it doesn't exist => boom."
    | 
    | Let's not overstate the problem here. There are plenty of AI
    | things which would work well to recognize a sideways truck.
    | Look at CLIP, which can also be plugged into DRL agents (per
    | the cake); find an image of your pink truck and text prompt
    | CLIP with "a photograph of a pink truck" and a bunch of
    | random prompts, and I bet you it'll pick the correct one.
    | Small-scale DRL trained solely on a single task is extremely
    | brittle, yes, but trained over a diversity of tasks and you
    | start seeing transfer to new tasks and composition of
    | behaviors and flexibility (look at, say, Hide-and-Seek or
    | XLAND).
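    | 
    | Concretely, with the openai/CLIP package that's a few lines
    | of zero-shot classification (pink_truck.jpg being whatever
    | image you found):
    | 
    |     import torch
    |     import clip
    |     from PIL import Image
    |     
    |     model, preprocess = clip.load("ViT-B/32", device="cpu")
    |     image = preprocess(Image.open("pink_truck.jpg"))
    |     image = image.unsqueeze(0)
    |     prompts = ["a photograph of a pink truck",
    |                "a photograph of a cat",
    |                "a photograph of an empty road"]
    |     text = clip.tokenize(prompts)
    |     
    |     with torch.no_grad():
    |         logits_per_image, _ = model(image, text)
    |         probs = logits_per_image.softmax(dim=-1)
    |     
    |     # expect the highest probability on the truck prompt
    |     print(dict(zip(prompts, probs[0].tolist())))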
    | 
    | These are all in line with the bitter-lesson hypothesis that
    | much of what is wrong with them is not some fundamental
    | problem that will require special hand-designed
    | "generalization modules" bolted onto them by generations of
    | grad students laboring in the math mines, but simply that they
    | are still trained on problems that are too undiverse, for too
    | short a time, with too little data, using models that are too
    | small, and that just as we already see strikingly better
    | results in terms of generalization & composition & rare
    | datapoints from past scaling, we'll see more in the future.
    | 
    | What goes wrong with Tesla cars specifically, I don't know,
    | but I will point out that Waymo manages to kill many fewer
    | people and so we shouldn't consider Tesla performance to even
    | be SOTA on the self-driving task, much less tell us anything
    | about fundamental limits to self-driving cars and/or NNs.
 
      | mattnewton wrote:
      | > What goes wrong with Tesla cars specifically, I don't
      | know, but I will point out that Waymo manages to kill many
      | fewer people and so we shouldn't consider Tesla performance
      | to even be SOTA on the self-driving task, much less tell us
      | anything about fundamental limits to self-driving cars
      | and/or NNs.
      | 
      | Side note, but I think Waymo is treating this more like a
      | JPL "moon landing" style problem and Tesla is trying to
      | sell cars today. It's very different to start by making it
      | possible and then scale it down, vs trying to build
      | something by working backwards from the sensors and compute
      | that are economical to ship today.
 
| [deleted]
 
| fxtentacle wrote:
| I used to agree, but now I disagree. You don't need to look any
| further than Google's ubiquitous MobileNetV3 architecture. It
| needs a lot less compute but outperforms v1 and v2 in almost
| every way. It also outperforms most other image recognition
| encoders at 1% of the FLOPS.
| 
| And if you read the paper, there are experienced professionals
| explaining why they made each change. It's a deliberate
| handcrafted design. Sure, they used parameter sweeps, too, but
| that's more the AI equivalent of using Excel over paper tables.
 
  | vegesm wrote:
  | Actually, MobileNetV3 is a supporting example of the bitter
  | lesson and not the other way round. The point of Sutton's essay
  | is that it isn't worth adding inductive biases (specific loss
  | functions, handcrafted features, special architectures) to our
  | algorithm. Given lots of data, just put it into a generic
  | architecture and it will eventually outperform manually tuned
  | ones.
  | 
  | MobileNetV3 uses architecture search, which is a prime example
  | of the above: even the architecture hyperparameters are derived
  | from data. The handcrafted optimizations just concern speed and
  | do not include any inductive biases.
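  | 
  | A toy sketch of that idea (random search over two architecture
  | hyperparameters; the real MobileNetV3 pipeline used much more
  | sophisticated, platform-aware NAS):
  | 
  |     import random
  |     
  |     def score(width, depth):
  |         # stand-in for "train briefly, measure val accuracy"
  |         return -abs(width - 48) - 3 * abs(depth - 12)
  |     
  |     candidates = [(random.choice([16, 24, 32, 48, 64]),
  |                    random.randint(4, 20)) for _ in range(200)]
  |     best = max(candidates, key=lambda c: score(*c))
  |     print(best)  # hyperparameters picked from data, not by hand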
 
    | fxtentacle wrote:
    | "The handcrafted optimizations just concern speed"
    | 
    | That is the goal here. Efficient execution on mobile
    | hardware. MobileNet v1 and v2 did similar parameter sweeps,
    | but perform much worse. The main novel thing about v3 is
    | precisely the handcrafted changes. I'd treat that as an
    | indication that those handcrafted changes in v3 far exceed
    | what could be achieved with lots of compute in v1 and v2.
    | 
    | Also, I don't think any amount of compute can come up with
    | new efficient non-linearity formulas like hswish in v3.
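    | 
    | For reference, the h-swish from the v3 paper is just
    | x * ReLU6(x + 3) / 6, a cheap piecewise approximation of
    | swish (x * sigmoid(x)):
    | 
    |     import torch
    |     import torch.nn.functional as F
    |     
    |     def hswish(x):
    |         # MobileNetV3's hand-designed activation
    |         return x * F.relu6(x + 3.0) / 6.0
    |     
    |     print(hswish(torch.tensor([-4.0, 0.0, 4.0])))
    |     # tensor([-0., 0., 4.])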
 
  | sitkack wrote:
  | Sutton is talking about a long-term trend. Would Google have
  | been able to achieve this w/o a lot of computation? I don't
  | think it refutes the essay in any way. If anything, model
  | compression takes even more computation. We can't scale
  | heuristics, we can scale computation.
 
  | koeng wrote:
  | Link to the paper?
 
    | fxtentacle wrote:
    | https://arxiv.org/abs/1905.02244v5
 
  | fnbr wrote:
  | Right, but that's not a counterexample. The bitter lesson
  | suggests that, eventually, it'll be difficult for manual
  | approaches to outperform a learning system. It doesn't say that
  | this is always true. Deep Blue _was_ better than all other
  | chess players at the
  | time. But now, AlphaZero is better.
  | 
  | I believe the same is true for neural network architecture
  | search: at some point, learning systems will be better than all
  | humans. Maybe that's not true today, but I wouldn't bet on that
  | _always_ being false.
 
    | fxtentacle wrote:
    | The article says:
    | 
    | "We have to learn the bitter lesson that building in how we
    | think we think does not work in the long run."
    | 
    | And I would argue: It saves at least 100x in compute time. So
    | by hand-designing relevant areas, I can build an AI today
    | which otherwise would become possible due to Moore's law in
    | about 7 years. Those 7 years are the reason to do it. That's
    | plenty of time to create a startup and cash out.
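      | 
      | Where the ~7 years comes from, assuming effective compute
      | per dollar doubles roughly yearly (an assumption, not a
      | law):
      | 
      |     import math
      |     
      |     doubling_years = 1.0   # assumed doubling period
      |     print(round(math.log2(100) * doubling_years, 1))  # 6.6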
 
      | bee_rider wrote:
      | I think the "we" in this case is researchers and scientists
      | trying to advance human knowledge, not startup folks.
      | Startups of course expend lots of effort on doing things
      | that don't end up helping humanity in the long run.
 
___________________________________________________________________
(page generated 2022-04-02 23:00 UTC)