|
| yamrzou wrote:
| Previous discussions:
|
| 2019: https://news.ycombinator.com/item?id=19393432
|
| 2020: https://news.ycombinator.com/item?id=23781400
| dang wrote:
| Thanks! Macroexpanded:
|
| _The Bitter Lesson (2019)_ -
| https://news.ycombinator.com/item?id=23781400 - July 2020 (85
| comments)
|
| _The Bitter Lesson_ -
| https://news.ycombinator.com/item?id=19393432 - March 2019 (53
| comments)
| civilized wrote:
| > Early methods conceived of vision as searching for edges, or
| generalized cylinders, or in terms of SIFT features. But today
| all this is discarded. Modern deep-learning neural networks use
| only the notions of convolution and certain kinds of invariances,
| and perform much better.
|
| This assessment is a bit off.
|
| First, convolution and invariance are definitely not the only
| things you need. Modern DL architectures use lots of very clever
| gadgets inspired by decades of interdisciplinary research.
|
| Second, architecture still matters a lot in neural networks, and
| domain experts still make architectural decisions that are heavily
| informed by domain insight into their goals and into which tools
| might make progress toward them. For example, convolution +
| max-pooling makes sense as a combination because of historically
| successful techniques in computer vision. It wasn't something
| randomly tried or brute-forced.
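|
| To make that concrete, here's a minimal PyTorch-style sketch (my
| own toy illustration, not something from the essay): the
| convolution encodes the prior that visual features are local and
| translation-equivariant, and the max-pooling adds tolerance to
| small shifts.
|
|   import torch
|   import torch.nn as nn
|
|   # One "domain-informed" vision block: convolution = locality +
|   # translation equivariance, max-pooling = invariance to small
|   # shifts. Channel and kernel sizes here are arbitrary.
|   block = nn.Sequential(
|       nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
|                 padding=1),
|       nn.ReLU(),
|       nn.MaxPool2d(kernel_size=2),
|   )
|
|   x = torch.randn(1, 3, 32, 32)  # one fake 32x32 RGB image
|   print(block(x).shape)          # torch.Size([1, 16, 16, 16])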
|
| The role of domain expertise has not gone away. You just have to
| leverage it in ways that are lower-level, less obvious, and less
| explicitly connected to the goal than a human would expect based
| on high-level conceptual reasoning.
|
| From what I've heard, the author's thesis is most true for chess.
| The game tree for chess isn't as huge as Go's, so it's more
| amenable to brute-forcing. The breakthrough in Go did not come
| from Moore's Law; it came from innovative DL/RL techniques.
|
| Growing computation may enable more compute-heavy techniques, but
| that doesn't mean it's obvious what those techniques are, or that
| they are well characterized as simpler or more "brute force" than
| past approaches.
| a-dub wrote:
| > First, convolution and invariance are definitely not the only
| things you need. Modern DL architectures use lots of very
| clever gadgets inspired by decades of interdisciplinary
| research.
|
| i have noticed this. rather than replacing feature engineering,
| it seems that you find some of those ideas from psychophysics
| just manually built into the networks.
| antiquark wrote:
| The author is applying the "past performance guarantees future
| results" fallacy.
| tejohnso wrote:
| This reminds me of an article I read describing George Hotz's
| Comma.ai end-to-end reinforcement learning approach vs. Tesla's
| feature-engineering-based approach.
|
| Hotz feels that "not only will comma outpace Tesla, but that
| Tesla will eventually adopt comma's method."[1]
|
| [1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die
|
| Previous discussion on the article:
| https://news.ycombinator.com/item?id=30738763
| fnbr wrote:
| I think an end-to-end RL approach will eventually work, but
| _eventually_ could be a very long time from now. It's also a
| question of scale: even if Comma's approach is fundamentally
| better, how much better is it? If Tesla has 1000x more cars,
| and their approach is 10x worse, they'll still improve 100x
| faster than Comma.
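|
| As a toy back-of-the-envelope (made-up numbers, just to
| illustrate the scale argument): if improvement rate scales
| roughly linearly with fleet size times per-mile learning
| efficiency, then
|
|   # Illustrative only; neither factor is a real measured figure.
|   fleet_factor = 1000        # 1000x more cars on the road
|   method_factor = 1 / 10     # approach assumed 10x less efficient
|   print(fleet_factor * method_factor)  # 100.0 -> still ~100x faster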
| jasfi wrote:
| So don't build your AI/AGI approach too high-level. But you still
| need to represent common sense somehow.
| kjksf wrote:
| And this is why I'm much less pessimistic than most about
| robotaxis.
|
| Waymo has a working robotaxi service in a limited area, and they
| got there with a fleet of 600 cars and mere millions of miles of
| driving data.
|
| Now imagine they trained with 100x the cars, i.e. 60k cars, and
| billions of miles of driving data.
|
| Guess what, Tesla already has FSD running, under human
| supervision, in 60k cars and that fleet is driving billions of
| miles.
|
| They are collecting a 100x larger data set as I write this.
|
| We also continue to significantly improve hardware for both NN
| inference (Nvidia Drive, the Tesla FSD chip) and training (Nvidia
| GPUs, Tesla Dojo, Google TPUs, and 26 other startups working on AI
| hardware: https://www.ai-startups.org/top/hardware/).
|
| If the bitter lesson extends to the problem of self-driving,
| we're doing everything right to solve it.
|
| It's just a matter of time: collecting enough training data,
| having enough compute to train the neural network, and having
| enough compute to run the network in the car.
| Animats wrote:
| Waymo is not a raw neural network. Waymo has an explicit
| geometric world model, and you can look at it.
| VHRanger wrote:
| More data doesn't help if the additional data points don't add
| information to the dataset.
|
| At some point it's better to add features than simply more rows
| of observations.
|
| Arguably, text and images are special cases here because we do
| self-supervised learning (which you can't do for self-driving,
| for obvious reasons).
|
| What TSLA should have done a long time ago is keep investing in
| additional sensors to enrich data points, rather than blindly
| collecting more of the same.
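|
| A toy scikit-learn sketch of the rows-vs-features point (my own
| made-up example, not real driving data): duplicating rows that
| carry no new information leaves test error unchanged, while
| adding an informative feature actually helps.
|
|   import numpy as np
|   from sklearn.linear_model import LinearRegression
|
|   rng = np.random.default_rng(0)
|   n = 1000
|   x1 = rng.normal(size=n)
|   x2 = rng.normal(size=n)                   # feature we ignore at first
|   y = 2 * x1 + 3 * x2 + rng.normal(size=n)  # truth depends on both
|
|   x1_t = rng.normal(size=n)
|   x2_t = rng.normal(size=n)
|   y_t = 2 * x1_t + 3 * x2_t + rng.normal(size=n)
|
|   def mse(m, X, y):
|       return np.mean((m.predict(X) - y) ** 2)
|
|   # Baseline: x1 only.
|   m1 = LinearRegression().fit(x1[:, None], y)
|   # "More rows of the same": 100x duplicated data, no new information.
|   m2 = LinearRegression().fit(np.tile(x1, 100)[:, None], np.tile(y, 100))
|   # Add the informative feature instead.
|   m3 = LinearRegression().fit(np.column_stack([x1, x2]), y)
|
|   print(mse(m1, x1_t[:, None], y_t))                  # ~10
|   print(mse(m2, x1_t[:, None], y_t))                  # ~10, unchanged
|   print(mse(m3, np.column_stack([x1_t, x2_t]), y_t))  # ~1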
| fxtentacle wrote:
| You're not wrong, but I believe you're so far off on the
| necessary scale that it'll never solve the problem.
|
| For an AI to learn to play Bomberman at an acceptable level, you
| need to run 2-3 billion RL training steps, where the AI is free
| to explore new actions to collect data about how well they work.
| I'm part of team CloudGamepad and
| we'll compete in the Bomberland AI challenge finals tomorrow,
| so I do have some practical experience there. Before I looked
| at things in detail, I also vastly overestimated reinforcement
| learning's capabilities.
|
| For an AI to learn a useful policy without the ability to confirm
| what an action does, you need exponentially more data. There are
| great papers by DeepMind and OpenAI that try to ease the pain a
| bit, but as-is, I don't think even a trillion miles driven
| would be enough data. Letting the AI try out things, of course,
| is dangerous, as we have seen in the past.
|
| But the truly nasty part about AI and RL in particular is that
| the AI will act as if anything that it didn't see often enough
| during training simply doesn't exist. If it never sees a pink
| truck from the side, no "virtual neurons" will grow to detect
| this. AIs in general don't generalize. So if your driving
| dataset lacks enough examples of 0.1% black swan events, you
| can be sure that your AI is going to go totally haywire when
| they happen. Like "I've never seen a truck sideways before =>
| it doesn't exist => boom."
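|
| A tiny sketch of that failure mode (my own toy example, assuming
| scikit-learn): a classifier trained on two clusters will assign
| near-100% confidence to an input far outside anything it was
| trained on, instead of saying "I don't know".
|
|   import numpy as np
|   from sklearn.linear_model import LogisticRegression
|
|   rng = np.random.default_rng(0)
|   # Two training clusters: class 0 around (0, 0), class 1 around (4, 0).
|   a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
|   b = rng.normal(loc=[4.0, 0.0], scale=0.5, size=(200, 2))
|   X = np.vstack([a, b])
|   y = np.array([0] * 200 + [1] * 200)
|
|   clf = LogisticRegression().fit(X, y)
|
|   # An out-of-distribution input, nothing like the training data
|   # (the sideways pink truck).
|   weird = np.array([[100.0, 0.0]])
|   print(clf.predict_proba(weird))  # ~[[0., 1.]]: confidently "class 1"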
| naveen99 wrote:
| What were the new data augmentation methods for optical flow you
| referred to in a previous comment on this topic?
| shadowgovt wrote:
| The sensors self-driving cars use are far less sensitive to
| color than human eyes.
|
| You can generalize your concept to the other sensors, but
| sensor fusion compensates somewhat... The odds of an input
| being something never seen across _all_ sensor modalities
| become pretty low.
|
| (And when it does see something weird, it can generally handle
| it the way humans do... drive defensively.)
| gwern wrote:
| > But the truly nasty part about AI and RL in particular is
| that the AI will act as if anything that it didn't see often
| enough during training simply doesn't exist. If it never sees
| a pink truck from the side, no "virtual neurons" will grow to
| detect this. AIs in general don't generalize. So if your
| driving dataset lacks enough examples of 0.1% black swan
| events, you can be sure that your AI is going to go totally
| haywire when they happen. Like "I've never seen a truck
| sideways before => it doesn't exist => boom."
|
| Let's not overstate the problem here. There are plenty of AI
| things which would work well to recognize a sideways truck.
| Look at CLIP, which can also be plugged into DRL agents (per
| the cake); find an image of your pink truck and text prompt
| CLIP with "a photograph of a pink truck" and a bunch of
| random prompts, and I bet you it'll pick the correct one.
| Small-scale DRL trained solely on a single task is extremely
| brittle, yes, but train it over a diversity of tasks and you
| start seeing transfer to new tasks, composition of behaviors,
| and flexibility (look at, say, Hide-and-Seek or XLand).
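|
| For concreteness, a minimal zero-shot CLIP sketch (my own
| illustration, assuming the HuggingFace transformers checkpoint
| and a hypothetical local pink_truck.jpg):
|
|   from PIL import Image
|   from transformers import CLIPModel, CLIPProcessor
|
|   model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|   processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|   image = Image.open("pink_truck.jpg")  # hypothetical photo
|   prompts = [
|       "a photograph of a pink truck",
|       "a photograph of a dog",
|       "a photograph of a traffic light",
|   ]
|   inputs = processor(text=prompts, images=image,
|                      return_tensors="pt", padding=True)
|   probs = model(**inputs).logits_per_image.softmax(dim=1)
|   print(dict(zip(prompts, probs[0].tolist())))  # truck prompt should win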
|
| These are all in line with the bitter-lesson hypothesis that
| much of what is wrong with them is not some fundamental problem
| that will require special hand-designed "generalization modules"
| bolted onto them by generations of grad students laboring in the
| math mines, but simply that they are still trained on problems
| that are too undiverse, for too short a time, with too little
| data, using too-small models, and that just as we already see
| strikingly better results in terms of generalization &
| composition & rare datapoints from past scaling, we'll see
| more in the future.
|
| What goes wrong with Tesla cars specifically, I don't know,
| but I will point out that Waymo manages to kill many fewer
| people and so we shouldn't consider Tesla performance to even
| be SOTA on the self-driving task, much less tell us anything
| about fundamental limits to self-driving cars and/or NNs.
| mattnewton wrote:
| > What goes wrong with Tesla cars specifically, I don't
| know, but I will point out that Waymo manages to kill many
| fewer people and so we shouldn't consider Tesla performance
| to even be SOTA on the self-driving task, much less tell us
| anything about fundamental limits to self-driving cars
| and/or NNs.
|
| Side note, but I think Waymo is treating this more like a JPL
| "moon landing"-style problem, while Tesla is trying to sell cars
| today. Starting by making it possible and then scaling it down
| is very different from working backwards from the sensors and
| compute that are economical to ship today.
| [deleted]
| fxtentacle wrote:
| I used to agree, but now I disagree. You don't need to look any
| further than Google's ubiquitous MobileNetV3 architecture. It
| needs a lot less compute but outperforms v1 and v2 in almost
| every way. It also outperforms most other image recognition
| encoders at 1% of the FLOPS.
|
| And if you read the paper, there are experienced professionals
| explaining why they made each change. It's a deliberate,
| handcrafted design. Sure, they used parameter sweeps, too, but
| that's more the AI equivalent of using Excel over paper tables.
| vegesm wrote:
| Actually, MobileNetV3 is a supporting example of the bitter
| lesson and not the other way round. The point of Sutton's essay
| is that it isn't worth adding inductive biases (specific loss
| functions, handcrafted features, special architectures) to our
| algorithm. If you have lots of data, just put it into a generic
| architecture and it will eventually outperform manually tuned ones.
|
| MobileNetV3 uses architecture search, which is a prime example
| of the above: even the architecture hyperparameters are derived
| from data. The handcrafted optimizations just concern speed and
| do not include any inductive biases.
| fxtentacle wrote:
| "The handcrafted optimizations just concern speed"
|
| That is the goal here. Efficient execution on mobile
| hardware. MobileNet v1 and v2 did similar parameter sweeps,
| but perform much worse. The main novel thing about v3 is
| precisely the handcrafted changes. I'd treat that as an
| indication that those handcrafted changes in v3 far exceed
| what could be achieved with lots of compute in v1 and v2.
|
| Also, I don't think any amount of compute can come up with new,
| efficient non-linearity formulas like h-swish in v3.
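|
| For reference, a quick sketch of h-swish as the MobileNetV3
| paper defines it: h-swish(x) = x * ReLU6(x + 3) / 6, a cheap
| piecewise-linear stand-in for swish.
|
|   import numpy as np
|
|   def hard_swish(x):
|       # h-swish(x) = x * ReLU6(x + 3) / 6: a piecewise-linear,
|       # mobile-friendly approximation of swish (x * sigmoid(x)).
|       return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0
|
|   print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
|   # approx. [-0., -0.333, 0., 0.667, 4.]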
| sitkack wrote:
| Sutton is talking about a long-term trend. Would Google have
| been able to achieve this w/o a lot of computation? I don't
| think it refutes the essay in any way. If anything, model
| compression takes even more computation. We can't scale
| heuristics, but we can scale computation.
| koeng wrote:
| Link to the paper?
| fxtentacle wrote:
| https://arxiv.org/abs/1905.02244v5
| fnbr wrote:
| Right, but that's not a counterexample. The bitter lesson
| suggests that, eventually, it'll be difficult to outperform a
| learning system manually. It doesn't say that this is always
| true. Deep Blue _was_ better than all other chess players at the
| time. But now, AlphaZero is better.
|
| I believe the same is true for neural network architecture
| search: at some point, learning systems will be better than all
| humans. Maybe that's not true today, but I wouldn't bet on that
| _always_ being false.
| fxtentacle wrote:
| The article says:
|
| "We have to learn the bitter lesson that building in how we
| think we think does not work in the long run."
|
| And I would argue: it saves at least 100x in compute time. So by
| hand-designing the relevant areas, I can build an AI today that
| would otherwise only become possible, via Moore's law, in about
| 7 years. Those 7 years are the reason to do it. That's plenty of
| time to create a startup and cash out.
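|
| A quick sanity check of that timeline (my arithmetic, under the
| assumption that usable compute roughly doubles every year, which
| is more aggressive than classic 18-24 month Moore's law):
|
|   import math
|
|   speedup = 100              # hand-design assumed to save ~100x compute
|   doubling_time_years = 1.0  # assumed doubling period
|   print(round(math.log2(speedup) * doubling_time_years, 1))  # ~6.6 years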
| bee_rider wrote:
| I think the "we" in this case is researchers and scientists
| trying to advance human knowledge, not startup folks.
| Startups of course expend lots of effort on doing things
| that don't end up helping humanity in the long run.