|
| jeffbee wrote:
| To me it would be somewhat more interesting to compare head-to-
| head mobile CPUs instead of comparing laptops and servers. In
| this particular microbenchmark, mobile 12th and 13th-generation
| Core performance cores, and even the efficiency cores on the 13th
| generation, are faster than the M2.
| scns wrote:
| Even when they are faster, the M2 is on the same die as the RAM,
| and the bandwidth and latency are way better. That matters for
| compilation, or am I mistaken?
| jeffbee wrote:
| You are mistaken. Just like everyone else who has ever
| repeated the myth that Apple puts large-scale, high-
| performance logic and high density DRAM on the same die,
| which is impossible.
|
| Apple uses LPDDR4 modules soldered to a PCB, sourced from the
| same Korean company that everyone else uses. Intel has used
| the exact same architecture since Cannon Lake, in 2018.
| wtallis wrote:
| Cannon Lake is a weird red herring to bring in to the
| discussion, because it was only a real product in the
| narrowest sense possible. It may have technically been the
| first CPU Intel shipped with LPDDR4 support (or was it one
| of their Atom-based chips?), but the exact generation of
| LPDDR isn't really relevant because both Apple and Intel
| have supported multiple generations of LPDDR over the years
| and both have moved past 4 and 4x to 5 now.
|
| What is somewhat relevant as the source of confusion here
| is that Apple puts the DRAM on the same _package_ as the
| processor rather than on the motherboard nearby like is
| almost always done for x86 systems that use LPDDR. (But
| there's at least one upcoming Intel system that's been
| announced as putting the processor and LPDDR on a shared
| module that is itself then soldered to the motherboard.)
| That packaging detail probably doesn't matter much for the
| entry-level Apple chips that use the same memory bus width
| as x86 processors, but may be more important for the high-
| end parts with GPU-like wide memory busses.
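|
| For rough numbers (a back-of-the-envelope sketch assuming
| LPDDR5-6400 on both sides; the bus widths are the commonly
| cited ones, not measured): peak bandwidth is just bus width
| times transfer rate, so the width is what dominates.
|
|     #include <cstdio>
|
|     int main() {
|       // Peak DRAM bandwidth ~= bus width (bytes) * transfer rate.
|       // 6400 MT/s is LPDDR5-6400; 128-bit is a typical laptop/x86
|       // bus, 512-bit is an M1/M2 Max-class bus. Theoretical peak.
|       const double mts = 6400e6;
|       for (int bits : {128, 512}) {
|         double gbps = (bits / 8.0) * mts / 1e9;
|         std::printf("%3d-bit bus: ~%.0f GB/s peak\n", bits, gbps);
|       }
|       return 0;
|     }
|
| That works out to roughly 102 GB/s vs 410 GB/s peak, which is the
| ~4x gap the wide-bus parts are built around.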
| ricw wrote:
| I don't think that is true.
|
| Last I checked, Apple's M1 Max chips have up to 800GB/s
| throughput, whilst AMD's high-end chips top out at around
| ~250GB/s or so, closer to what a standard M2 chip does (not
| the Max or Pro version). At the top end Apple has at least
| 2x the memory bandwidth of other CPU vendors, and that's
| likely the case further down too.
| wmf wrote:
| The Mx Max should be compared to a discrete CPU+GPU
| combination that does have comparable total memory
| bandwidth. It isn't automatically better to put
| everything on one chip.
| scns wrote:
| Thank you for the correction then. Ah yes, the soldered
| LPDDR4 dies allow higher bandwidths since more pins allow
| parallel access.
| Dalewyn wrote:
| Intel 12th and 13th gen both use the same efficiency cores.
| jeffbee wrote:
| Well, on the ones I happen to have on hand the 12th gen hits
| 3300MHz and the 13th gen goes all the way to 4200MHz.
| Dalewyn wrote:
| Yeah well, the efficiency cores on a 12900K will be faster
| than those on an N100, so what is your point?
|
| We're discussing overall compute power differences between
| CPU architectures; minute differences in performance
| between identical-architecture CPU cores stemming from
| higher clock speeds are outside the scope of this
| discussion.
| smoldesu wrote:
| > Note that you cannot blindly correct for frequency in this
| manner because it is not physically possible to just change the
| frequency as I did
|
| They're also not the same core architecture? Comparing ARM chips
| that conform to the same spec won't necessarily scale the same
| across frequencies. Even if all of these CPUs did have scaling
| clock speeds, their core logic is not the same. Hell, even the
| Firestorm and Icestorm cores on the M1 SoC shouldn't be
| considered directly comparable if you scale the clock speeds.
| wmf wrote:
| That's the point. He knows they're different architectures
| (although X1 and V1 are related) so normalizing frequency
| exposes the architectural differences.
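|
| Concretely, the "correction" amounts to converting each result
| into cycles per URL and then pretending every chip runs at 3 GHz.
| A minimal sketch with made-up numbers (not the post's actual
| data):
|
|     #include <cstdio>
|
|     int main() {
|       struct Chip { const char* name; double ns_per_url, ghz; };
|       // Illustrative values only, not figures from the article.
|       Chip chips[] = { {"chip A", 300.0, 3.5},
|                        {"chip B", 350.0, 2.8} };
|       for (const Chip& c : chips) {
|         double cycles = c.ns_per_url * c.ghz;  // cycles per URL
|         double at3ghz = cycles / 3.0;          // ns/URL at 3 GHz
|         std::printf("%s: %.0f cycles/URL, %.0f ns/URL at 3 GHz\n",
|                     c.name, cycles, at3ghz);
|       }
|       return 0;
|     }
|
| Chip A wins on wall-clock time, but chip B needs fewer cycles per
| URL, which is exactly the kind of difference the normalization is
| meant to surface.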
| psanford wrote:
| It looks like there's been some good progress on getting Linux
| running natively on the Windows Dev Kit 2023 hardware[0]. There
| was a previous discussion here about this hardware back in
| 2022-11[1].
|
| [0]: https://github.com/linux-surface/surface-pro-x/issues/43
|
| [1]: https://news.ycombinator.com/item?id=33418044
| bushbaba wrote:
| Seems weird to compare the c7g.large vs the M2 and not the largest
| VM sizes.
| Thaxll wrote:
| I think the performance of my Oracle free instance (ARM CPU) is
| 10x worse than those results.
| dylan604 wrote:
| my oracle free instance uses mariadb instead of mysql, but i'm
| guessing you meant that as the free instance provided by oracle
| instead of an instance not using anything from oracle. =)
| Thaxll wrote:
| Yes I'm talking about: https://docs.oracle.com/en-
| us/iaas/Content/FreeTier/freetier...
|
| Ampere A1 Compute instances
| seiferteric wrote:
| Not that it's for sure, but the M3 is probably coming out late this
| year or early next year and will be on 3nm, once again giving Apple
| a huge node advantage. It just seems like Apple will have the
| latest node before everyone else for the foreseeable future.
| webaholic wrote:
| Apple pays a premium to TSMC to reserve the early runs on the
| next gen nodes. They can do this because they can charge their
| users a premium for Apple devices. I am not sure the rest of
| the players have that much pricing power or margins.
| monocasa wrote:
| Part of what they don't mention is that Graviton 3 and the
| Snapdragon 8cx Gen 3 have pretty much the same processor core.
| The Neoverse V1 is only a slightly modified Cortex X1. Hence the
| same results when you account for clock frequency.
| mhh__ wrote:
| Without more context about what the code actually does, this
| doesn't tell me all that much beyond what I could guess from the
| intended use cases of the chips.
|
| The strength of Apple silicon is that it can crush benchmarks and
| carry that power over to real-world concurrent workloads very
| well. E.g. is this basically just measuring L1 latency? If not,
| are the compilers generating the right instructions, etc.? (One
| would assume they are, but I have previously had trouble getting
| good ARM codegen, only to find that the compiler couldn't work
| out what ISA to target beyond a conservative guess.)
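|
| One crude way to check that last point is to dump a few of the ARM
| feature-test macros the compiler defines and see what baseline it
| actually settled on (a sketch; which macros show up depends on
| your -march/-mcpu flags):
|
|     #include <cstdio>
|
|     int main() {
|       // ACLE feature-test macros reflect what the compiler
|       // believes it is targeting.
|     #ifdef __ARM_ARCH
|       std::printf("__ARM_ARCH: %d\n", __ARM_ARCH);
|     #endif
|     #ifdef __ARM_NEON
|       std::printf("NEON/ASIMD: yes\n");
|     #endif
|     #ifdef __ARM_FEATURE_SVE
|       std::printf("SVE: yes\n");
|     #endif
|       return 0;
|     }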
| renewiltord wrote:
| Git repo available from post.
| mhh__ wrote:
| I ain't readin' all that. (OK, maybe I will, but Lemire does do
| this quite a lot; his blog is 40% gems and 60% slightly sloppy,
| borderline factoids that only make sense if you think in
| exactly the same way he does.)
| KerrAvon wrote:
| Yes. Since it's Ada, I'm suspicious of codegen tuning being a
| major factor here.
| zimpenfish wrote:
| "Ada is a fast and spec-compliant URL parser written in C++."
|
| Wouldn't modern C++ compilers have decent codegen tuning for
| all these platforms?
| wtallis wrote:
| Is there something specific about this library that makes you
| suspicious, or are you assuming from the name that this is
| using the Ada programming language?
| dan-robertson wrote:
| It does seem the benchmark has its data in cache, based on the
| timings.
|
| If the benchmark were only measuring L1 latency, what would
| that imply about the 'scaling by inverse clock speed' bit? My
| guess is as follows. Chips with higher clock rates will be
| penalised: (a) it is harder to decrease latencies (memory,
| pipeline length, etc.) in absolute terms than to run at a higher
| clock speed and maybe do non-memory things faster; and (b) if
| you're waiting 5ns to read some data, that hurts you more after
| the scaling if your clock speed is higher. The fact that the M1
| wins after the scaling despite the higher clock rate suggests
| to me that either they have a big advantage on memory latency
| or there's some non-memory-latency advantage in scheduling or
| branch prediction that leads to more useful instructions being
| retired per cycle.
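|
| To put toy numbers on point (b): model per-URL time as a fixed
| memory stall plus some core work, then apply the same rescaling.
| All values below are invented purely for illustration:
|
|     #include <cstdio>
|
|     int main() {
|       // time = fixed memory stall (ns) + core work (cycles) / freq
|       const double stall_ns = 5.0;       // hypothetical stall per URL
|       const double work_cycles = 600.0;  // hypothetical work per URL
|       for (double ghz : {3.5, 2.6}) {
|         double raw_ns    = stall_ns + work_cycles / ghz;
|         double scaled_ns = raw_ns * ghz / 3.0;  // normalise to 3 GHz
|         std::printf("%.1f GHz: raw %.0f ns, scaled %.0f ns\n",
|                     ghz, raw_ns, scaled_ns);
|       }
|       return 0;
|     }
|
| The 3.5 GHz chip is clearly faster in wall-clock terms (~176 vs
| ~236 ns) yet ends up marginally behind after scaling (~206 vs
| ~204 ns), because the fixed 5 ns stall costs it more cycles.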
|
| But maybe I'm interpreting it the wrong way.
| Kwpolska wrote:
| A comparison with x86_64 CPUs (e.g. those seen in comparable
| MacBooks and AWS machines) would be useful.
|
| Also, I'm not sure if "correcting" the numbers for 3 GHz is
| reasonable and reflects real-life performance. Perhaps some
| throttling could be applied to test the CPUs using a common
| frequency?
| deltaci wrote:
| It's a benchmark of GitHub Actions (Azure) vs a really old
| MacBook Pro 15. Not exactly what you are looking for, but it
| gives you the general vibe already.
|
| https://buildjet.com/for-github-actions/blog/a-performance-r...
| willcipriano wrote:
| Sometimes when I run a lot of builds in a short period of
| time I feel like I get demoted to the slower boxes.
| 015a wrote:
| This is a big, general problem with CI providers I don't hear
| talked about enough: because they charge per-minute, they are
| actively incentivized to run on old hardware, slowing builds
| and milking more from customers in the process. Doubly so
| when your CI is hosted by a major cloud provider who would
| otherwise have to scrap these old machines.
|
| I wish this were only a theoretical concern, a theoretical
| incentive, but it's not. GitHub Actions is slow, and GitLab
| suffers from a similar problem; their hosted SaaS runners are
| on GCP n1-standard-1 machines. The oldest machine type in
| GCP's fleet, the n1-standard-1 is powered by a variety of
| dusty, old CPUs Google Cloud has no other use for, from Sandy
| Bridge to Skylake. That's a 12 year old CPU.
| AnthonyMouse wrote:
| There are workloads where newer CPUs are dramatically
| faster (e.g. AVX-512), but in general the difference isn't
| huge. Most of what the newer CPUs get you is more cores and
| higher power efficiency, which you don't care about when
| you're paying per-vCPU. Which vCPU is faster, a ten year
| old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon
| Platinum 8352V at 2.1GHz? It depends on the workload. Which
| has more memory bandwidth _per core_?
|
| But the cloud provider prefers the latter because it has
| 500% more cores for 50% more power. Which is why the latter
| still goes for >$2000 and the former is <$15.
| stingraycharles wrote:
| In my totally unscientific (but consistent) benchmarks for our
| CI build servers, an m6g.8xlarge compiles our C++ codebase in
| about 9.5 minutes, whereas an m6a.8xlarge takes about 11 minutes.
| The price difference is about 20% as well, IIRC, so it's
| generally a good deal.
|
| Of course the types of optimisations that a compiler may (or
| may not) do on aarch64 vs x86_64 are completely different and
| may explain the difference (we actually compile with
| -march=haswell for x86_64), but generally Graviton seems like a
| really good deal.
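|
| In cost terms (a back-of-the-envelope sketch using the build times
| above but placeholder hourly prices, so check your region's actual
| on-demand rates):
|
|     #include <cstdio>
|
|     int main() {
|       struct Runner { const char* name; double build_min, usd_hr; };
|       // Build times from above; hourly prices are placeholders.
|       Runner r[] = { {"m6g.8xlarge (Graviton)", 9.5, 1.00},
|                      {"m6a.8xlarge (x86_64)",  11.0, 1.20} };
|       for (const Runner& x : r) {
|         double cost = x.build_min / 60.0 * x.usd_hr;
|         std::printf("%-24s %.1f min/build, ~$%.2f/build\n",
|                     x.name, x.build_min, cost);
|       }
|       return 0;
|     }
|
| With those placeholder prices the Graviton box comes out roughly
| 25-30% cheaper per build, which is the sense in which it looks
| like a good deal.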
| foota wrote:
| You're probably leaving a lot of performance on the floor if
| you're building for haskell and running on skylakeish or
| newer.
|
| Edit: yes, haswell:-)
| nickpeterson wrote:
| *haswell, in case the Haskell people come after you.
| speed_spread wrote:
| "don't poke the endofunctor"
| paulddraper wrote:
| Ah, that makes more sense.
| zamalek wrote:
| > I'm not sure if "correcting" the numbers for 3 GHz is
| reasonable and reflects real-life performance
|
| It's not useful at all. It effectively measures IPC
| (instructions per clock), which is just chip vendor bragging
| rights.
|
| Assuming that all the chips meet some baseline performance
| criteria: for datacenter and portable devices, the real
| benchmark would be "instructions per joule."
|
| For desktop devices "instructions per dollar" would be most
| relevant.
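|
| For what those metrics look like side by side, here is a sketch
| with entirely invented throughput, power, and price figures; the
| point is only that the ranking can flip depending on which
| denominator you pick:
|
|     #include <cstdio>
|
|     int main() {
|       struct Chip { const char* name; double urls_s, watts, usd; };
|       // All numbers hypothetical, for illustration only.
|       Chip c[] = { {"laptop SoC",  3.0e6,  20.0,  600.0},
|                    {"server CPU", 12.0e6, 300.0, 2000.0} };
|       for (const Chip& x : c) {
|         std::printf("%-10s %5.0f kURLs/joule %6.0f URLs/s per $\n",
|                     x.name, x.urls_s / x.watts / 1e3,
|                     x.urls_s / x.usd);
|       }
|       return 0;
|     }
|
| Here the laptop part wins on work per joule while the server part
| wins on throughput per dollar, so which one "wins" depends
| entirely on which of those you actually pay for.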
| aylmao wrote:
| > It's not useful at all. It effectively measures IPC
| (instructions per clock), which is just chip vendor bragging
| rights.
|
| +1. Moreover the author then seems to conclude from this
| benchmark:
|
| > Overall, these numbers suggest that the Qualcomm processor
| is competitive.
|
| This is an odd conclusion to draw from this test and these
| numbers, given how little this benchmark tests (just string
| operations). Does this benchmark want to test raw CPU power?
| Then why "normalize" to 3GHz? Does it want to test CPU
| capabilities? If so why use such a "narrow" test?
|
| IMO this benchmark makes for a good data point, but it's far
| from enough to draw much of a conclusion from.
| Octoth0rpe wrote:
| > For desktop devices "instructions per dollar" would be most
| relevant.
|
| For cloud customers as well
| cubefox wrote:
| Which would probably include the cost of the chips in some
| way, not just electricity.
| zamalek wrote:
| > For cloud customers as well
|
| Cloud costs are dominated by power delivery and cooling.
| Both of those are directly influenced by how much power the
| chip uses to achieve its performance target.
|
| I guess it does indirectly influence dollar cost, but I was
| referring to MSRP of the chip. As a simple example: the
| per-chip cost of Graviton is probably enormous (if you
| factor R&D into the cost of a chip), but it's still cheaper
| for Amazon customers. Why? Power and cooling.