|
| yamrzou wrote:
| Previous discussions:
|
| 2019: https://news.ycombinator.com/item?id=19393432
|
| 2020: https://news.ycombinator.com/item?id=23781400
| dang wrote:
| Thanks! Macroexpanded:
|
| _The Bitter Lesson (2019)_ -
| https://news.ycombinator.com/item?id=23781400 - July 2020 (85
| comments)
|
| _The Bitter Lesson_ -
| https://news.ycombinator.com/item?id=19393432 - March 2019 (53
| comments)
| civilized wrote:
| > Early methods conceived of vision as searching for edges, or
| generalized cylinders, or in terms of SIFT features. But today
| all this is discarded. Modern deep-learning neural networks use
| only the notions of convolution and certain kinds of invariances,
| and perform much better.
|
| This assessment is a bit off.
|
| First, convolution and invariance are definitely not the only
| things you need. Modern DL architectures use lots of very clever
| gadgets inspired by decades of interdisciplinary research.
|
| Second, architecture still matters a lot in neural networks, and
| domain experts still make architectural decisions that are heavily
| informed by domain insight into their goals and into which tools
| might make progress toward them. For example, convolution +
| max-pooling makes sense as a combination because of historically
| successful techniques in computer vision. It wasn't something
| randomly tried or brute-forced.
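|
| To make that concrete, here's a minimal PyTorch-style sketch (my
| own toy illustration, not something from the essay): the
| convolution encodes the prior that visual features are local and
| translation-equivariant, and the max-pooling adds tolerance to
| small shifts.
|
|   import torch
|   import torch.nn as nn
|
|   # One "domain-informed" vision block: convolution = locality +
|   # translation equivariance, max-pooling = invariance to small
|   # shifts. Channel and kernel sizes here are arbitrary.
|   block = nn.Sequential(
|       nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
|                 padding=1),
|       nn.ReLU(),
|       nn.MaxPool2d(kernel_size=2),
|   )
|
|   x = torch.randn(1, 3, 32, 32)  # one fake 32x32 RGB image
|   print(block(x).shape)          # torch.Size([1, 16, 16, 16])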
|
| The role of domain expertise has not gone away. You just have to
| leverage it in ways that are lower-level, less obvious, and less
| explicitly connected to the goal than a human would expect based
| on high-level conceptual reasoning.
|
| From what I've heard, the author's thesis is most true for chess.
| The game tree for chess isn't as huge as Go's, so it's more
| amenable to brute-forcing. The breakthrough in Go did not come
| from Moore's Law; it came from innovative DL/RL techniques.
|
| Growing computation may enable more compute-heavy techniques, but
| that doesn't mean it's obvious what those techniques are, or that
| they are well characterized as simpler or more "brute force" than
| past approaches.
| a-dub wrote:
| > First, convolution and invariance are definitely not the only
| things you need. Modern DL architectures use lots of very
| clever gadgets inspired by decades of interdisciplinary
| research.
|
| i have noticed this. rather than replacing feature engineering,
| it seems that you find some of those ideas from psychophysics
| just manually built into the networks.
| antiquark wrote:
| The author is applying the "past performance guarantees future
| results" fallacy.
| tejohnso wrote:
| This reminds me of an article I read describing George Hotz's
| Comma.ai end-to-end reinforcement learning approach vs. Tesla's
| feature-engineering-based approach.
|
| Hotz feels that "not only will comma outpace Tesla, but that
| Tesla will eventually adopt comma's method."[1]
|
| [1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die
|
| Previous discussion on the article:
| https://news.ycombinator.com/item?id=30738763
| fnbr wrote:
| I think an end-to-end RL approach will eventually work, but
| _eventually_ could be a very long time from now. It's also a
| question of scale: even if Comma's approach is fundamentally
| better, how much better is it? If Tesla has 1000x more cars,
| and their approach is 10x worse, they'll still improve 100x
| faster than Comma.
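|
| As a toy back-of-the-envelope (made-up numbers, just to
| illustrate the scale argument): if improvement rate scales
| roughly linearly with fleet size times per-mile learning
| efficiency, then
|
|   # Illustrative only; neither factor is a real measured figure.
|   fleet_factor = 1000        # 1000x more cars on the road
|   method_factor = 1 / 10     # approach assumed 10x less efficient
|   print(fleet_factor * method_factor)  # 100.0 -> still ~100x faster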
| jasfi wrote:
| So don't build your AI/AGI approach too high-level. But you still
| need to represent common sense somehow.
| kjksf wrote:
| And this is why I'm much less pessimistic than most about
| robotaxis.
|
| Waymo has a working robotaxi service in a limited area, and they
| got there with a fleet of 600 cars and mere millions of miles of
| driving data.
|
| Now imagine they trained with 100x the cars, i.e. 60k cars, and
| billions of miles of driving data.
|
| Guess what, Tesla already has FSD running, under human
| supervision, in 60k cars and that fleet is driving billions of
| miles.
|
| They are collecting a 100x larger data set as I write this.
|
| We also continue to significantly improve hardware for both NN
| inference (Nvidia Drive, the Tesla FSD chip) and training (Nvidia
| GPUs, Tesla Dojo, Google TPUs, and 26 other startups working on AI
| hardware: https://www.ai-startups.org/top/hardware/).
|
| If the bitter lesson extends to the problem of self-driving,
| we're doing everything right to solve it.
|
| It's just a matter of time: collecting enough training data,
| having enough compute to train the neural network, and having
| enough compute to run the network in the car.
| Animats wrote:
| Waymo is not a raw neural network. Waymo has an explicit
| geometric world model, and you can look at it.
| VHRanger wrote:
| More data doesn't help if the additional data points don't add
| information to the dataset.
|
| At some point it's better to add features than simply more rows
| of observations.
|
| Arguably, text and images are special cases here because we do
| self-supervised learning (which you can't do for self-driving,
| for obvious reasons).
|
| What TSLA should have done a long time ago is keep investing in
| additional sensors to enrich data points, rather than blindly
| collecting more of the same.
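|
| A toy scikit-learn sketch of the rows-vs-features point (my own
| made-up example, not real driving data): duplicating rows that
| carry no new information leaves test error unchanged, while
| adding an informative feature actually helps.
|
|   import numpy as np
|   from sklearn.linear_model import LinearRegression
|
|   rng = np.random.default_rng(0)
|   n = 1000
|   x1 = rng.normal(size=n)
|   x2 = rng.normal(size=n)                   # feature we ignore at first
|   y = 2 * x1 + 3 * x2 + rng.normal(size=n)  # truth depends on both
|
|   x1_t = rng.normal(size=n)
|   x2_t = rng.normal(size=n)
|   y_t = 2 * x1_t + 3 * x2_t + rng.normal(size=n)
|
|   def mse(m, X, y):
|       return np.mean((m.predict(X) - y) ** 2)
|
|   # Baseline: x1 only.
|   m1 = LinearRegression().fit(x1[:, None], y)
|   # "More rows of the same": 100x duplicated data, no new information.
|   m2 = LinearRegression().fit(np.tile(x1, 100)[:, None], np.tile(y, 100))
|   # Add the informative feature instead.
|   m3 = LinearRegression().fit(np.column_stack([x1, x2]), y)
|
|   print(mse(m1, x1_t[:, None], y_t))                  # ~10
|   print(mse(m2, x1_t[:, None], y_t))                  # ~10, unchanged
|   print(mse(m3, np.column_stack([x1_t, x2_t]), y_t))  # ~1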
| fxtentacle wrote:
| You're not wrong, but I believe you're so far off on the
| necessary scale that it'll never solve the problem.
|
| For an AI to learn to play Bomberman at an acceptable level, you
| need to run 2-3 billion RL training steps, where the AI is free
| to explore new actions to collect data about how well they work.
| I'm part of team CloudGamepad and
| we'll compete in the Bomberland AI challenge finals tomorrow,
| so I do have some practical experience there. Before I looked
| at things in detail, I also vastly overestimated reinforcement
| learning's capabilities.
|
| For an AI to learn a useful policy without the ability to confirm
| what an action does, you need exponentially more data. There are
| great papers by DeepMind and OpenAI that try to ease the pain a
| bit, but as-is, I don't think even a trillion miles driven
| would be enough data. Letting the AI try out things, of course,
| is dangerous, as we have seen in the past.
|
| But the truly nasty part about AI and RL in particular is that
| the AI will act as if anything that it didn't see often enough
| during training simply doesn't exist. If it never sees a pink
| truck from the side, no "virtual neurons" will grow to detect
| this. AIs in general don't generalize. So if your driving
| dataset lacks enough examples of 0.1% black swan events, you
| can be sure that your AI is going to go totally haywire when
| they happen. Like "I've never seen a truck sideways before =>
| it doesn't exist => boom."
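|
| A tiny sketch of that failure mode (my own toy example, assuming
| scikit-learn): a classifier trained on two clusters will assign
| near-100% confidence to an input far outside anything it was
| trained on, instead of saying "I don't know".
|
|   import numpy as np
|   from sklearn.linear_model import LogisticRegression
|
|   rng = np.random.default_rng(0)
|   # Two training clusters: class 0 around (0, 0), class 1 around (4, 0).
|   a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
|   b = rng.normal(loc=[4.0, 0.0], scale=0.5, size=(200, 2))
|   X = np.vstack([a, b])
|   y = np.array([0] * 200 + [1] * 200)
|
|   clf = LogisticRegression().fit(X, y)
|
|   # An out-of-distribution input, nothing like the training data
|   # (the sideways pink truck).
|   weird = np.array([[100.0, 0.0]])
|   print(clf.predict_proba(weird))  # ~[[0., 1.]]: confidently "class 1"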
| naveen99 wrote:
| What were the new data augmentation methods for optical flow you
| referred to in a previous comment on this topic?
| shadowgovt wrote:
| The sensors self-driving cars use are far less sensitive to
| color than human eyes.
|
| You can generalize your concept to the other sensors, but
| sensor fusion compensates somewhat... The odds of an input
| being something never seen across _all_ sensor modalities
| become pretty low.
|
| (And when it does see something weird, it can generally handle
| it the way humans do... drive defensively.)
| gwern wrote:
| > But the truly nasty part about AI and RL in particular is
| that the AI will act as if anything that it didn't see often
| enough during training simply doesn't exist. If it never sees
| a pink truck from the side, no "virtual neurons" will grow to
| detect this. AIs in general don't generalize. So if your
| driving dataset lacks enough examples of 0.1% black swan
| events, you can be sure that your AI is going to go totally
| haywire when they happen. Like "I've never seen a truck
| sideways before => it doesn't exist => boom."
|
| Let's not overstate the problem here. There are plenty of AI
| things which would work well to recognize a sideways truck.
| Look at CLIP, which can also be plugged into DRL agents (per
| the cake); find an image of your pink truck and text prompt
| CLIP with "a photograph of a pink truck" and a bunch of
| random prompts, and I bet you it'll pick the correct one.
| Small-scale DRL trained solely on a single task is extremely
| brittle, yes, but train it over a diversity of tasks and you
| start seeing transfer to new tasks, composition of behaviors,
| and flexibility (look at, say, Hide-and-Seek or XLand).
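|
| For concreteness, a minimal zero-shot CLIP sketch (my own
| illustration, assuming the HuggingFace transformers checkpoint
| and a hypothetical local pink_truck.jpg):
|
|   from PIL import Image
|   from transformers import CLIPModel, CLIPProcessor
|
|   model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|   processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|   image = Image.open("pink_truck.jpg")  # hypothetical photo
|   prompts = [
|       "a photograph of a pink truck",
|       "a photograph of a dog",
|       "a photograph of a traffic light",
|   ]
|   inputs = processor(text=prompts, images=image,
|                      return_tensors="pt", padding=True)
|   probs = model(**inputs).logits_per_image.softmax(dim=1)
|   print(dict(zip(prompts, probs[0].tolist())))  # truck prompt should win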
|
| These are all in line with the bitter-lesson hypothesis that
| much of what is wrong with them is not some fundamental problem
| that will require special hand-designed "generalization modules"
| bolted onto them by generations of grad students laboring in the
| math mines, but simply that they are still trained on problems
| that are too undiverse, for too short a time, with too little
| data, using too-small models, and that just as we already see
| strikingly better results in terms of generalization &
| composition & rare datapoints from past scaling, we'll see
| more in the future.
|
| What goes wrong with Tesla cars specifically, I don't know,
| but I will point out that Waymo manages to kill many fewer
| people and so we shouldn't consider Tesla performance to even
| be SOTA on the self-driving task, much less tell us anything
| about fundamental limits to self-driving cars and/or NNs.
| mattnewton wrote:
| > What goes wrong with Tesla cars specifically, I don't
| know, but I will point out that Waymo manages to kill many
| fewer people and so we shouldn't consider Tesla performance
| to even be SOTA on the self-driving task, much less tell us
| anything about fundamental limits to self-driving cars
| and/or NNs.
|
| Side note, but I think Waymo is treating this more like a JPL
| "moon landing"-style problem, while Tesla is trying to sell cars
| today. Starting by making it possible and then scaling it down
| is very different from working backwards from the sensors and
| compute that are economical to ship today.
| [deleted]
| fxtentacle wrote:
| I used to agree, but now I disagree. You don't need to look any
| further than Google's ubiquitous MobileNetV3 architecture. It
| needs a lot less compute but outperforms v1 and v2 in almost
| every way. It also outperforms most other image recognition
| encoders at 1% of the FLOPS.
|
| And if you read the paper, there are experienced professionals
| explaining why they made each change. It's a deliberate,
| handcrafted design. Sure, they used parameter sweeps, too, but
| that's more the AI equivalent of using Excel over paper tables.
| vegesm wrote:
| Actually, MobileNetV3 is a supporting example of the bitter
| lesson and not the other way round. The point of Sutton's essay
| is that it isn't worth adding inductive biases (specific loss
| functions, handcrafted features, special architectures) to our
| algorithm. If you have lots of data, just put it into a generic
| architecture and it will eventually outperform manually tuned ones.
|
| MobileNetV3 uses architecture search, which is a prime example
| of the above: even the architecture hyperparameters are derived
| from data. The handcrafted optimizations just concern speed and
| do not include any inductive biases.
| fxtentacle wrote:
| "The handcrafted optimizations just concern speed"
|
| That is the goal here. Efficient execution on mobile
| hardware. MobileNet v1 and v2 did similar parameter sweeps,
| but perform much worse. The main novel thing about v3 is
| precisely the handcrafted changes. I'd treat that as an
| indication that those handcrafted changes in v3 far exceed
| what could be achieved with lots of compute in v1 and v2.
|
| Also, I don't think any amount of compute can come up with new,
| efficient non-linearity formulas like h-swish in v3.
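|
| For reference, a quick sketch of h-swish as the MobileNetV3
| paper defines it: h-swish(x) = x * ReLU6(x + 3) / 6, a cheap
| piecewise-linear stand-in for swish.
|
|   import numpy as np
|
|   def hard_swish(x):
|       # h-swish(x) = x * ReLU6(x + 3) / 6: a piecewise-linear,
|       # mobile-friendly approximation of swish (x * sigmoid(x)).
|       return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0
|
|   print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
|   # approx. [-0., -0.333, 0., 0.667, 4.]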
| sitkack wrote:
| Sutton is talking about a long-term trend. Would Google have
| been able to achieve this w/o a lot of computation? I don't
| think it refutes the essay in any way. If anything, model
| compression takes even more computation. We can't scale
| heuristics, but we can scale computation.
| koeng wrote:
| Link to the paper?
| fxtentacle wrote:
| https://arxiv.org/abs/1905.02244v5
| fnbr wrote:
| Right, but that's not a counterexample. The bitter lesson
| suggests that, eventually, it'll be difficult to outperform a
| learning system manually. It doesn't say that this is always
| true. Deep Blue _was_ better than all other chess players at the
| time. But now, AlphaZero is better.
|
| I believe the same is true for neural network architecture
| search: at some point, learning systems will be better than all
| humans. Maybe that's not true today, but I wouldn't bet on that
| _always_ being false.
| fxtentacle wrote:
| The article says:
|
| "We have to learn the bitter lesson that building in how we
| think we think does not work in the long run."
|
| And I would argue: it saves at least 100x in compute time. So by
| hand-designing the relevant areas, I can build an AI today that
| would otherwise only become possible, via Moore's law, in about
| 7 years. Those 7 years are the reason to do it. That's plenty of
| time to create a startup and cash out.
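|
| A quick sanity check of that timeline (my arithmetic, under the
| assumption that usable compute roughly doubles every year, which
| is more aggressive than classic 18-24 month Moore's law):
|
|   import math
|
|   speedup = 100              # hand-design assumed to save ~100x compute
|   doubling_time_years = 1.0  # assumed doubling period
|   print(round(math.log2(speedup) * doubling_time_years, 1))  # ~6.6 years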
| bee_rider wrote:
| I think the "we" in this case is researchers and scientists
| trying to advance human knowledge, not startup folks.
| Startups of course expend lots of effort on doing things
| that don't end up helping humanity in the long run.