[HN Gopher] Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL...
___________________________________________________________________
 
Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL parsing
benchmark
 
Author : ibobev
Score  : 78 points
Date   : 2023-05-03 19:31 UTC (3 hours ago)
 
web link (lemire.me)
w3m dump (lemire.me)
 
| joseph_grobbles wrote:
| [dead]
 
| jeffbee wrote:
| To me it would be somewhat more interesting to compare head-to-
| head mobile CPUs instead of comparing laptops and servers. In
| this particular microbenchmark, mobile 12th and 13th-generation
| Core performance cores, and even the efficiency cores on the 13th
| generation, are faster than the M2.
 
  | scns wrote:
  | Even when they are faster, the M2 has the RAM on the same die,
  | and the bandwidth and latency are way better. That matters for
  | compilation, or am I mistaken?
 
    | jeffbee wrote:
    | You are mistaken, just like everyone else who has ever
    | repeated the myth that Apple puts large-scale, high-
    | performance logic and high-density DRAM on the same die,
    | which is impossible.
    | 
    | Apple uses LPDDR4 modules soldered to a PCB, sourced from the
    | same Korean company that everyone else uses. Intel has used
    | the exact same architecture since Cannon Lake, in 2018.
 
      | wtallis wrote:
      | Cannon Lake is a weird red herring to bring in to the
      | discussion, because it was only a real product in the
      | narrowest sense possible. It may have technically been the
      | first CPU Intel shipped with LPDDR4 support (or was it one
      | of their Atom-based chips?), but the exact generation of
      | LPDDR isn't really relevant because both Apple and Intel
      | have supported multiple generations of LPDDR over the years
      | and both have moved past 4 and 4x to 5 now.
      | 
      | What is somewhat relevant as the source of confusion here
      | is that Apple puts the DRAM on the same _package_ as the
      | processor rather than on the motherboard nearby like is
      | almost always done for x86 systems that use LPDDR. (But
      | there's at least one upcoming Intel system that's been
      | announced as putting the processor and LPDDR on a shared
      | module that is itself then soldered to the motherboard.)
      | That packaging detail probably doesn't matter much for the
      | entry-level Apple chips that use the same memory bus width
      | as x86 processors, but may be more important for the high-
      | end parts with GPU-like wide memory busses.
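      | 
      | To put rough numbers on the bus-width point: peak bandwidth
      | is roughly bus width times transfer rate. A quick sketch,
      | assuming LPDDR5-6400 in both cases (illustrative figures,
      | not from the article):
      | 
      |     // Peak-bandwidth arithmetic: bytes per transfer times
      |     // transfers per second. Figures are illustrative.
      |     #include <cstdio>
      |     int main() {
      |       auto gbps = [](int bus_bits, int mega_transfers) {
      |         return bus_bits / 8.0 * mega_transfers / 1000.0;
      |       };
      |       // 128-bit bus: entry-level Apple parts, typical x86 LPDDR
      |       std::printf("128-bit: %.1f GB/s\n", gbps(128, 6400)); // 102.4
      |       // 512-bit bus: the GPU-like wide high-end parts
      |       std::printf("512-bit: %.1f GB/s\n", gbps(512, 6400)); // 409.6
      |     }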
 
      | ricw wrote:
      | I don't think that is true.
      | 
      | Last I checked, Apple M1 Max chips have up to 800GB/s
      | throughput, whilst AMD's high-end chips top out at around
      | ~250GB/s or so, closer to what a standard M2 chip does (not
      | the Max or Pro version). At the top end they've got at least
      | 2x the memory bandwidth of other CPU vendors, and that's
      | likely the case further down too.
 
        | wmf wrote:
        | The Mx Max should be compared to a discrete CPU+GPU
        | combination that does have comparable total memory
        | bandwidth. It isn't automatically better to put
        | everything on one chip.
 
      | scns wrote:
      | Thank you for the correction then. Ah yes, the soldered
      | LPDDR4 dies allow higher bandwidths since more pins allow
      | parallel access.
 
  | Dalewyn wrote:
  | Intel 12th and 13th gen both use the same efficiency cores.
 
    | jeffbee wrote:
    | Well, on the ones I happen to have on hand, the 12th gen hits
    | 3300MHz and the 13th gen goes all the way to 4200MHz.
 
      | Dalewyn wrote:
      | Yeah well, the efficiency cores on a 12900K will be faster
      | than those on an N100, so what is your point?
      | 
      | We're discussing overall compute power differences between
      | CPU architectures; minute differences in performance between
      | identical-architecture CPU cores stemming from higher clock
      | speeds are outside the scope of this discussion.
 
| smoldesu wrote:
| > Note that you cannot blindly correct for frequency in this
| manner because it is not physically possible to just change the
| frequency as I did
| 
| They're also not the same core architecture? Comparing ARM chips
| that conform to the same spec won't necessarily scale the same
| across frequencies. Even if all of these CPUs did have scaling
| clock speeds, their core logic is not the same. Hell, even the
| Firestorm and Icestorm cores on the M1 SoC shouldn't be
| considered directly comparable if you scale the clock speeds.
 
  | [deleted]
 
  | wmf wrote:
  | That's the point. He knows they're different architectures
  | (although the X1 and V1 are related), so normalizing frequency
  | exposes the architectural differences.
 
| psanford wrote:
| It looks like there's been some good progress on getting Linux
| running natively on the Windows Dev Kit 2023 hardware[0]. There
| was a previous discussion here about this hardware back in
| 2022-11[1].
| 
| [0]: https://github.com/linux-surface/surface-pro-x/issues/43
| 
| [1]: https://news.ycombinator.com/item?id=33418044
 
| bushbaba wrote:
| Seems weird to compare the c7g.large vs the M2 rather than the
| largest VM sizes.
 
| Thaxll wrote:
| I think the performance of my oracle free instance (ARM CPU) is
| 10x worse than those results.
 
  | dylan604 wrote:
  | my oracle free instance uses mariadb instead of mysql, but i'm
  | guessing you meant that as the free instance provided by oracle
  | instead of an instance not using anything from oracle. =)
 
    | Thaxll wrote:
    | Yes I'm talking about: https://docs.oracle.com/en-
    | us/iaas/Content/FreeTier/freetier...
    | 
    | Ampere A1 Compute instances
 
| [deleted]
 
| seiferteric wrote:
| Not that it's certain, but the M3 is probably coming out late
| this year or early next year and will be on 3nm, once again
| giving Apple a huge node advantage. It seems like Apple will
| have the latest node before everyone else for the foreseeable
| future.
 
  | webaholic wrote:
  | Apple pays a premium to TSMC to reserve the early runs on the
  | next gen nodes. They can do this because they can charge their
  | users a premium for Apple devices. I am not sure the rest of
  | the players have that much pricing power or margins.
 
| monocasa wrote:
| Part of what they don't mention is that the Graviton 3 and the
| Snapdragon 8cx Gen 3 have pretty much the same processor core:
| the Neoverse V1 is only a slightly modified Cortex-X1. Hence the
| same results once you account for clock frequency.
 
| mhh__ wrote:
| Without more context about what the code actually does, this
| doesn't tell me all that much beyond what I could guess from the
| intended use cases of the chips.
| 
| The strength of Apple silicon is that it can crush benchmarks
| and transfer that power very well to real-world concurrent
| workloads too. E.g. is this basically just measuring the L1
| latency? If not, are the compilers generating the right
| instructions, etc.? (One would assume they are, but I have had
| issues getting good ARM codegen previously, only to find that
| the compiler couldn't work out what ISA to target beyond a
| conservative guess.)
 
  | renewiltord wrote:
  | Git repo available from post.
 
    | mhh__ wrote:
    | I ain't readin' all that. (OK, maybe I will, but lemire does
    | do this quite a lot; his blog is 40% gems, 60% slightly
    | sloppy borderline factoids that only make sense if you think
    | in exactly the same way he does.)
 
  | KerrAvon wrote:
  | Yes. Since it's Ada, I'm suspicious of codegen tuning being a
  | major factor here.
 
    | zimpenfish wrote:
    | "Ada is a fast and spec-compliant URL parser written in C++."
    | 
    | Wouldn't modern C++ compilers have decent codegen tuning for
    | all these platforms?
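    | 
    | So the codegen in question is ordinary C++. From memory of
    | the ada-url README (a rough sketch, not the benchmark's
    | actual harness), usage looks something like:
    | 
    |     // Rough sketch of the ada-url C++ API, from memory; not
    |     // the article's benchmark code.
    |     #include "ada.h"
    |     #include <iostream>
    | 
    |     int main() {
    |       auto url = ada::parse<ada::url_aggregator>(
    |           "https://user@example.com:8080/path?query#frag");
    |       if (!url) { return 1; }  // parse failure
    |       std::cout << url->get_host() << "\n";      // example.com:8080
    |       std::cout << url->get_pathname() << "\n";  // /path
    |     }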
 
    | wtallis wrote:
    | Is there something specific about this library that makes you
    | suspicious, or are you assuming from the name that this is
    | using the Ada programming language?
 
  | dan-robertson wrote:
  | It does seem the benchmark has its data in cache, based on the
  | timings.
  | 
  | If the benchmark were only measuring L1 latency, what would
  | that imply about the 'scaling by inverse clock speed' bit? My
  | guess is as follows. Chips with higher clock rates will be
  | penalised: (a) it is harder to decrease latencies (memory,
  | pipeline length, etc.) in absolute terms than it is to run at
  | a higher clock speed and maybe do non-memory things faster;
  | and (b) if
  | you're waiting 5ns to read some data, that hurts you more after
  | the scaling if your clock speed is higher. The fact that the M1
  | wins after the scaling despite the higher clock rate suggests
  | to me that either they have a big advantage on memory latency
  | or there's some non-memory-latency advantage in scheduling or
  | branch prediction that leads to more useful instructions being
  | retired per cycle.
  | 
  | But maybe I'm interpreting it the wrong way.
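  | 
  | To make (b) concrete, a toy calculation (made-up numbers):
  | give two chips the same work, 20ns of clock-bound work at 3GHz
  | plus a fixed 5ns memory stall, and apply the article's
  | normalization of multiplying time by (frequency / 3GHz):
  | 
  |     // Toy illustration of point (b); all numbers are made up.
  |     #include <cstdio>
  |     int main() {
  |       double stall_ns = 5.0;  // fixed latency, independent of clock
  |       double a_ghz = 3.0, b_ghz = 3.5;
  |       double a_ns = 20.0 + stall_ns;                  // 25.0ns
  |       double b_ns = 20.0 * (3.0 / b_ghz) + stall_ns;  // ~22.1ns
  |       // Normalize to 3GHz as the article does: t * (f / 3).
  |       std::printf("A: %.1f ns\n", a_ns * (a_ghz / 3.0)); // 25.0
  |       std::printf("B: %.1f ns\n", b_ns * (b_ghz / 3.0)); // ~25.8
  |     }
  | 
  | Chip B actually finishes sooner, but looks worse after the
  | scaling because its fixed stall is inflated by the higher
  | clock.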
 
| Kwpolska wrote:
| A comparison with x86_64 CPUs (e.g. those seen in comparable
| MacBooks and AWS machines) would be useful.
| 
| Also, I'm not sure if "correcting" the numbers for 3 GHz is
| reasonable and reflects real-life performance. Perhaps some
| throttling could be applied to test the CPUs using a common
| frequency?
 
  | deltaci wrote:
  | It's a benchmark of GitHub Actions (Azure) vs a really old
  | MacBook Pro 15. Not exactly what you are looking for, but it
  | gives you the vibe already.
  | 
  | https://buildjet.com/for-github-actions/blog/a-performance-r...
 
    | willcipriano wrote:
    | Sometimes when I run a lot of builds in a short period of
    | time I feel like I get demoted to the slower boxes.
 
    | 015a wrote:
    | This is a big, general problem with CI providers that I
    | don't hear talked about enough: because they charge per
    | minute, they are actively incentivized to run on old
    | hardware, slowing builds and milking more from customers in
    | the process. Doubly so when your CI is hosted by a major
    | cloud provider who would otherwise have to scrap these old
    | machines.
    | 
    | I wish this were only a theoretical concern, a theoretical
    | incentive, but it's not. GitHub Actions is slow, and GitLab
    | suffers from a similar problem; their hosted SaaS runners
    | are on GCP n1-standard-1 machines. The oldest machine type
    | in GCP's fleet, the n1-standard-1 is powered by a variety of
    | dusty old CPUs Google Cloud has no other use for, from Sandy
    | Bridge to Skylake. A Sandy Bridge part is a 12-year-old CPU.
 
      | AnthonyMouse wrote:
      | There are workloads where newer CPUs are dramatically
      | faster (e.g. AVX-512), but in general the difference isn't
      | huge. Most of what the newer CPUs get you is more cores and
      | higher power efficiency, which you don't care about when
      | you're paying per-vCPU. Which vCPU is faster, a ten year
      | old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon
      | Platinum 8352V at 2.1GHz? It depends on the workload. Which
      | has more memory bandwidth _per core_?
      | 
      | But the cloud provider prefers the latter because it has
      | 500% more cores for 50% more power. Which is why the latter
      | still goes for >$2000 and the former is <$15.
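      | 
      | Back-of-envelope on that last question, using approximate
      | published figures (quad-channel DDR3-1866 vs eight channels
      | at roughly DDR4-3200 rates); order-of-magnitude only:
      | 
      |     // Rough per-core bandwidth comparison; figures are
      |     // approximate spec-sheet numbers, not measurements.
      |     #include <cstdio>
      |     int main() {
      |       double old_xeon = 4 * 14.9 / 6.0;   // ~59.7 GB/s, 6 cores
      |       double new_xeon = 8 * 25.6 / 36.0;  // ~204.8 GB/s, 36 cores
      |       std::printf("E5-2643 v2: ~%.1f GB/s/core\n", old_xeon); // ~9.9
      |       std::printf("8352V:      ~%.1f GB/s/core\n", new_xeon); // ~5.7
      |     }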
 
  | stingraycharles wrote:
  | In my totally unscientific (but consistent) benchmarks for our
  | CI build servers, an m6g.8xlarge compiles our C++ codebase in
  | about 9.5 minutes, whereas an m6a.8xlarge takes about 11
  | minutes. The price difference is about 20% as well, IIRC, so
  | it's generally a good deal.
  | 
  | Of course the types of optimisations that a compiler may (or
  | may not) do on aarch64 vs x86_64 are completely different and
  | may explain the difference (we actually compile with
  | -march=haswell for x86_64), but generally Graviton seems like a
  | really good deal.
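  | 
  | One quick way to sanity-check what a build actually targets is
  | to look at the compiler's predefined macros; a minimal sketch
  | (GCC/Clang behaviour):
  | 
  |     // -march=haswell predefines __AVX2__ on GCC/Clang, while
  |     // aarch64 targets predefine __ARM_NEON.
  |     #include <iostream>
  |     int main() {
  |     #if defined(__AVX2__)
  |       std::cout << "x86_64 with AVX2 (haswell or newer)\n";
  |     #elif defined(__ARM_NEON)
  |       std::cout << "aarch64 with NEON\n";
  |     #else
  |       std::cout << "baseline target\n";
  |     #endif
  |     }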
 
    | foota wrote:
    | You're probably leaving a lot of performance on the floor if
    | you're building for haskell and running on skylakeish or
    | newer.
    | 
    | Edit: yes, haswell:-)
 
      | nickpeterson wrote:
      | *haswell, in case the Haskell people come after you.
 
        | speed_spread wrote:
        | "don't poke the endofunctor"
 
        | paulddraper wrote:
        | Ah, that makes more sense.
 
  | zamalek wrote:
  | > I'm not sure if "correcting" the numbers for 3 GHz is
  | reasonable and reflects real-life performance
  | 
  | It's not useful at all. It effectively measures IPC
  | (instructions per clock), which is just chip vendor bragging
  | rights.
  | 
  | Assuming that all the chips meet some baseline performance
  | criteria: for datacenter and portable devices, the real
  | benchmark would be "instructions per joule."
  | 
  | For desktop devices "instructions per dollar" would be most
  | relevant.
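  | 
  | In that spirit, the metric is just throughput divided by
  | power, something like (made-up numbers):
  | 
  |     // "Instructions per joule" sketch; both inputs are made up.
  |     #include <cstdio>
  |     int main() {
  |       double urls_per_sec = 6.0e6;   // hypothetical parse rate
  |       double package_watts = 15.0;   // hypothetical package power
  |       std::printf("%.0f URLs/joule\n", urls_per_sec / package_watts);
  |       // A chip that burns 2x the power for 1.2x the speed loses
  |       // on this metric despite winning on wall-clock time.
  |     }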
 
    | aylmao wrote:
    | > It's not useful at all. It effectively measures IPC
    | (instructions per clock), which is just chip vendor bragging
    | rights.
    | 
    | +1. Moreover the author then seems to conclude from this
    | benchmark:
    | 
    | > Overall, these numbers suggest that the Qualcomm processor
    | is competitive.
    | 
    | This is an odd conclusion to draw from this test and these
    | numbers, given how little this benchmark tests (just string
    | operations). Does this benchmark want to test raw CPU power?
    | Then why "normalize" to 3GHz? Does it want to test CPU
    | capabilities? If so why use such a "narrow" test?
    | 
    | IMO this benchmark makes for a good data point, but it's far
    | from enough to draw much of a conclusion from.
 
    | Octoth0rpe wrote:
    | > For desktop devices "instructions per dollar" would be most
    | relevant.
    | 
    | For cloud customers as well
 
      | cubefox wrote:
      | Which would probably include the cost of the chips in some
      | way, not just electricity.
 
      | zamalek wrote:
      | > For cloud customers as well
      | 
      | Cloud costs are dominated by power delivery and cooling.
      | Both of those are directly influenced by how much power the
      | chip uses to achieve its performance target.
      | 
      | I guess it does indirectly influence dollar cost, but I was
      | referring to MSRP of the chip. As a simple example: the
      | per-chip cost of Graviton is probably enormous (if you
      | factor R&D into the cost of a chip), but it's still cheaper
      | for Amazon customers. Why? Power and cooling.
 
___________________________________________________________________
(page generated 2023-05-03 23:00 UTC)