[HN Gopher] Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL...
___________________________________________________________________
 
Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL parsing
benchmark
 
Author : ibobev
Score  : 78 points
Date   : 2023-05-03 19:31 UTC (3 hours ago)
 
web link (lemire.me)
w3m dump (lemire.me)
 
| joseph_grobbles wrote:
| [dead]
 
| jeffbee wrote:
| To me it would be somewhat more interesting to compare head-to-
| head mobile CPUs instead of comparing laptops and servers. In
| this particular microbenchmark, mobile 12th and 13th-generation
| Core performance cores, and even the efficiency cores on the 13th
| generation, are faster than the M2.
 
  | scns wrote:
  | Even when they are faster, the M2 has the RAM on the same die,
  | and the bandwidth and latency are way better. That matters for
  | compilation, or am I mistaken?
 
    | jeffbee wrote:
    | You are mistaken, just like everyone else who has ever
    | repeated the myth that Apple puts large-scale, high-
    | performance logic and high-density DRAM on the same die,
    | which is impossible.
    | 
    | Apple uses LPDDR4 modules soldered to a PCB, sourced from the
    | same Korean company that everyone else uses. Intel has used
    | the exact same architecture since Cannon Lake, in 2018.
 
      | wtallis wrote:
      | Cannon Lake is a weird red herring to bring in to the
      | discussion, because it was only a real product in the
      | narrowest sense possible. It may have technically been the
      | first CPU Intel shipped with LPDDR4 support (or was it one
      | of their Atom-based chips?), but the exact generation of
      | LPDDR isn't really relevant because both Apple and Intel
      | have supported multiple generations of LPDDR over the years
      | and both have moved past 4 and 4x to 5 now.
      | 
      | What is somewhat relevant as the source of confusion here
      | is that Apple puts the DRAM on the same _package_ as the
      | processor rather than on the motherboard nearby like is
      | almost always done for x86 systems that use LPDDR. (But
      | there's at least one upcoming Intel system that's been
      | announced as putting the processor and LPDDR on a shared
      | module that is itself then soldered to the motherboard.)
      | That packaging detail probably doesn't matter much for the
      | entry-level Apple chips that use the same memory bus width
      | as x86 processors, but may be more important for the high-
      | end parts with GPU-like wide memory busses.
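      | 
      | To put rough numbers on the bus-width point: peak bandwidth
      | is roughly bus width times transfer rate. A quick sketch,
      | assuming LPDDR5-6400 in both cases (illustrative figures,
      | not from the article):
      | 
      |     // Peak-bandwidth arithmetic: bytes per transfer times
      |     // transfers per second. Figures are illustrative.
      |     #include <cstdio>
      |     int main() {
      |       auto gbps = [](int bus_bits, int mega_transfers) {
      |         return bus_bits / 8.0 * mega_transfers / 1000.0;
      |       };
      |       // 128-bit bus: entry-level Apple parts, typical x86 LPDDR
      |       std::printf("128-bit: %.1f GB/s\n", gbps(128, 6400)); // 102.4
      |       // 512-bit bus: the GPU-like wide high-end parts
      |       std::printf("512-bit: %.1f GB/s\n", gbps(512, 6400)); // 409.6
      |     }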
 
      | ricw wrote:
      | I don't think that is true.
      | 
      | Last I checked, Apple M1 Max chips have up to 800GB/s
      | throughput, whilst AMD's high-end chips top out at around
      | ~250GB/s or so, closer to what a standard M2 chip does (not
      | the Max or Pro version). At the top end they've got at least
      | 2x the memory bandwidth of other CPU vendors, and that's
      | likely the case further down too.
 
        | wmf wrote:
        | The Mx Max should be compared to a discrete CPU+GPU
        | combination that does have comparable total memory
        | bandwidth. It isn't automatically better to put
        | everything on one chip.
 
      | scns wrote:
      | Thank you for the correction then. Ah yes, the soldered
      | LPDDR4 dies allow higher bandwidths since more pins allow
      | parallel access.
 
  | Dalewyn wrote:
  | Intel 12th and 13th gen both use the same efficiency cores.
 
    | jeffbee wrote:
    | Well, on the ones I happen to have on hand, the 12th gen hits
    | 3300MHz and the 13th gen goes all the way to 4200MHz.
 
      | Dalewyn wrote:
      | Yeah well, the efficiency cores on a 12900K will be faster
      | than those on an N100, so what is your point?
      | 
      | We're discussing overall compute power differences between
      | CPU architectures; minute differences in performance between
      | identical-architecture CPU cores stemming from higher clock
      | speeds are outside the scope of this discussion.
 
| smoldesu wrote:
| > Note that you cannot blindly correct for frequency in this
| manner because it is not physically possible to just change the
| frequency as I did
| 
| They're also not the same core architecture? Comparing ARM chips
| that conform to the same spec won't necessarily scale the same
| across frequencies. Even if all of these CPUs did have scaling
| clock speeds, their core logic is not the same. Hell, even the
| Firestorm and Icestorm cores on the M1 SoC shouldn't be
| considered directly comparable if you scale the clock speeds.
 
  | [deleted]
 
  | wmf wrote:
  | That's the point. He knows they're different architectures
  | (although the X1 and V1 are related), so normalizing frequency
  | exposes the architectural differences.
 
| psanford wrote:
| It looks like there's been some good progress on getting Linux
| running natively on the Windows Dev Kit 2023 hardware[0]. There
| was a previous discussion here about this hardware back in
| 2022-11[1].
| 
| [0]: https://github.com/linux-surface/surface-pro-x/issues/43
| 
| [1]: https://news.ycombinator.com/item?id=33418044
 
| bushbaba wrote:
| Seems weird to compare the c7g.large vs the M2 rather than the
| largest VM sizes.
 
| Thaxll wrote:
| I think the performance of my oracle free instance (ARM CPU) is
| 10x worse than those results.
 
  | dylan604 wrote:
  | my oracle free instance uses mariadb instead of mysql, but i'm
  | guessing you meant that as the free instance provided by oracle
  | instead of an instance not using anything from oracle. =)
 
    | Thaxll wrote:
    | Yes I'm talking about: https://docs.oracle.com/en-
    | us/iaas/Content/FreeTier/freetier...
    | 
    | Ampere A1 Compute instances
 
| [deleted]
 
| seiferteric wrote:
| Not that it's certain, but the M3 is probably coming out late
| this year or early next year and will be on 3nm, once again
| giving Apple a huge node advantage. It seems like Apple will
| have the latest node before everyone else for the foreseeable
| future.
 
  | webaholic wrote:
  | Apple pays a premium to TSMC to reserve the early runs on the
  | next gen nodes. They can do this because they can charge their
  | users a premium for Apple devices. I am not sure the rest of
  | the players have that much pricing power or margins.
 
| monocasa wrote:
| Part of what they don't mention is that the Graviton 3 and the
| Snapdragon 8cx Gen 3 have pretty much the same processor core:
| the Neoverse V1 is only a slightly modified Cortex-X1. Hence the
| same results once you account for clock frequency.
 
| mhh__ wrote:
| Without more context about what the code actually does, this
| doesn't tell me all that much beyond what I could guess from the
| intended use cases of the chips.
| 
| The strength of Apple silicon is that it can crush benchmarks
| and transfer that power very well to real-world concurrent
| workloads too. E.g. is this basically just measuring the L1
| latency? If not, are the compilers generating the right
| instructions, etc.? (One would assume they are, but I have had
| issues getting good ARM codegen previously, only to find that
| the compiler couldn't work out what ISA to target beyond a
| conservative guess.)
 
  | renewiltord wrote:
  | Git repo available from post.
 
    | mhh__ wrote:
    | I ain't readin' all that. (OK, maybe I will, but lemire does
    | do this quite a lot; his blog is 40% gems, 60% slightly
    | sloppy borderline factoids that only make sense if you think
    | in exactly the same way he does.)
 
  | KerrAvon wrote:
  | Yes. Since it's Ada, I'm suspicious of codegen tuning being a
  | major factor here.
 
    | zimpenfish wrote:
    | "Ada is a fast and spec-compliant URL parser written in C++."
    | 
    | Wouldn't modern C++ compilers have decent codegen tuning for
    | all these platforms?
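    | 
    | So the codegen in question is ordinary C++. From memory of
    | the ada-url README (a rough sketch, not the benchmark's
    | actual harness), usage looks something like:
    | 
    |     // Rough sketch of the ada-url C++ API, from memory; not
    |     // the article's benchmark code.
    |     #include "ada.h"
    |     #include <iostream>
    | 
    |     int main() {
    |       auto url = ada::parse<ada::url_aggregator>(
    |           "https://user@example.com:8080/path?query#frag");
    |       if (!url) { return 1; }  // parse failure
    |       std::cout << url->get_host() << "\n";      // example.com:8080
    |       std::cout << url->get_pathname() << "\n";  // /path
    |     }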
 
    | wtallis wrote:
    | Is there something specific about this library that makes you
    | suspicious, or are you assuming from the name that this is
    | using the Ada programming language?
 
  | dan-robertson wrote:
  | It does seem the benchmark has its data in cache, based on the
  | timings.
  | 
  | If the benchmark were only measuring L1 latency, what would
  | that imply about the 'scaling by inverse clock speed' bit? My
  | guess is as follows. Chips with higher clock rates will be
  | penalised: (a) it is harder to decrease latencies (memory,
  | pipeline length, etc.) in absolute terms than it is to run at
  | a higher clock speed and maybe do non-memory things faster;
  | and (b) if
  | you're waiting 5ns to read some data, that hurts you more after
  | the scaling if your clock speed is higher. The fact that the M1
  | wins after the scaling despite the higher clock rate suggests
  | to me that either they have a big advantage on memory latency
  | or there's some non-memory-latency advantage in scheduling or
  | branch prediction that leads to more useful instructions being
  | retired per cycle.
  | 
  | But maybe I'm interpreting it the wrong way.
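  | 
  | To make (b) concrete, a toy calculation (made-up numbers):
  | give two chips the same work, 20ns of clock-bound work at 3GHz
  | plus a fixed 5ns memory stall, and apply the article's
  | normalization of multiplying time by (frequency / 3GHz):
  | 
  |     // Toy illustration of point (b); all numbers are made up.
  |     #include <cstdio>
  |     int main() {
  |       double stall_ns = 5.0;  // fixed latency, independent of clock
  |       double a_ghz = 3.0, b_ghz = 3.5;
  |       double a_ns = 20.0 + stall_ns;                  // 25.0ns
  |       double b_ns = 20.0 * (3.0 / b_ghz) + stall_ns;  // ~22.1ns
  |       // Normalize to 3GHz as the article does: t * (f / 3).
  |       std::printf("A: %.1f ns\n", a_ns * (a_ghz / 3.0)); // 25.0
  |       std::printf("B: %.1f ns\n", b_ns * (b_ghz / 3.0)); // ~25.8
  |     }
  | 
  | Chip B actually finishes sooner, but looks worse after the
  | scaling because its fixed stall is inflated by the higher
  | clock.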
 
| Kwpolska wrote:
| A comparison with x86_64 CPUs (e.g. those seen in comparable
| MacBooks and AWS machines) would be useful.
| 
| Also, I'm not sure if "correcting" the numbers for 3 GHz is
| reasonable and reflects real-life performance. Perhaps some
| throttling could be applied to test the CPUs using a common
| frequency?
 
  | deltaci wrote:
  | It's a benchmark of GitHub Actions (Azure) vs a really old
  | MacBook Pro 15. Not exactly what you are looking for, but it
  | gives you the vibe already.
  | 
  | https://buildjet.com/for-github-actions/blog/a-performance-r...
 
    | willcipriano wrote:
    | Sometimes when I run a lot of builds in a short period of
    | time I feel like I get demoted to the slower boxes.
 
    | 015a wrote:
    | This is a big, general problem with CI providers that I
    | don't hear talked about enough: because they charge per
    | minute, they are actively incentivized to run on old
    | hardware, slowing builds and milking more from customers in
    | the process. Doubly so when your CI is hosted by a major
    | cloud provider who would otherwise have to scrap these old
    | machines.
    | 
    | I wish this were only a theoretical concern, a theoretical
    | incentive, but it's not. GitHub Actions is slow, and GitLab
    | suffers from a similar problem; their hosted SaaS runners
    | are on GCP n1-standard-1 machines. The oldest machine type
    | in GCP's fleet, the n1-standard-1 is powered by a variety of
    | dusty old CPUs Google Cloud has no other use for, from Sandy
    | Bridge to Skylake. A Sandy Bridge part is a 12-year-old CPU.
 
      | AnthonyMouse wrote:
      | There are workloads where newer CPUs are dramatically
      | faster (e.g. AVX-512), but in general the difference isn't
      | huge. Most of what the newer CPUs get you is more cores and
      | higher power efficiency, which you don't care about when
      | you're paying per-vCPU. Which vCPU is faster, a ten year
      | old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon
      | Platinum 8352V at 2.1GHz? It depends on the workload. Which
      | has more memory bandwidth _per core_?
      | 
      | But the cloud provider prefers the latter because it has
      | 500% more cores for 50% more power. Which is why the latter
      | still goes for >$2000 and the former is <$15.
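      | 
      | Back-of-envelope on that last question, using approximate
      | published figures (quad-channel DDR3-1866 vs eight channels
      | at roughly DDR4-3200 rates); order-of-magnitude only:
      | 
      |     // Rough per-core bandwidth comparison; figures are
      |     // approximate spec-sheet numbers, not measurements.
      |     #include <cstdio>
      |     int main() {
      |       double old_xeon = 4 * 14.9 / 6.0;   // ~59.7 GB/s, 6 cores
      |       double new_xeon = 8 * 25.6 / 36.0;  // ~204.8 GB/s, 36 cores
      |       std::printf("E5-2643 v2: ~%.1f GB/s/core\n", old_xeon); // ~9.9
      |       std::printf("8352V:      ~%.1f GB/s/core\n", new_xeon); // ~5.7
      |     }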
 
  | stingraycharles wrote:
  | In my totally unscientific (but consistent) benchmarks for our
  | CI build servers, an m6g.8xlarge compiles our C++ codebase in
  | about 9.5 minutes, whereas an m6a.8xlarge takes about 11
  | minutes. The price difference is about 20% as well, IIRC, so
  | it's generally a good deal.
  | 
  | Of course the types of optimisations that a compiler may (or
  | may not) do on aarch64 vs x86_64 are completely different and
  | may explain the difference (we actually compile with
  | -march=haswell for x86_64), but generally Graviton seems like a
  | really good deal.
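  | 
  | One quick way to sanity-check what a build actually targets is
  | to look at the compiler's predefined macros; a minimal sketch
  | (GCC/Clang behaviour):
  | 
  |     // -march=haswell predefines __AVX2__ on GCC/Clang, while
  |     // aarch64 targets predefine __ARM_NEON.
  |     #include <iostream>
  |     int main() {
  |     #if defined(__AVX2__)
  |       std::cout << "x86_64 with AVX2 (haswell or newer)\n";
  |     #elif defined(__ARM_NEON)
  |       std::cout << "aarch64 with NEON\n";
  |     #else
  |       std::cout << "baseline target\n";
  |     #endif
  |     }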
 
    | foota wrote:
    | You're probably leaving a lot of performance on the floor if
    | you're building for haskell and running on skylakeish or
    | newer.
    | 
    | Edit: yes, haswell:-)
 
      | nickpeterson wrote:
      | *haswell, in case the Haskell people come after you.
 
        | speed_spread wrote:
        | "don't poke the endofunctor"
 
        | paulddraper wrote:
        | Ah, that makes more sense.
 
  | zamalek wrote:
  | > I'm not sure if "correcting" the numbers for 3 GHz is
  | reasonable and reflects real-life performance
  | 
  | It's not useful at all. It effectively measures IPC
  | (instructions per clock), which is just chip vendor bragging
  | rights.
  | 
  | Assuming that all the chips meet some baseline performance
  | criteria: for datacenter and portable devices, the real
  | benchmark would be "instructions per joule."
  | 
  | For desktop devices "instructions per dollar" would be most
  | relevant.
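  | 
  | In that spirit, the metric is just throughput divided by
  | power, something like (made-up numbers):
  | 
  |     // "Instructions per joule" sketch; both inputs are made up.
  |     #include <cstdio>
  |     int main() {
  |       double urls_per_sec = 6.0e6;   // hypothetical parse rate
  |       double package_watts = 15.0;   // hypothetical package power
  |       std::printf("%.0f URLs/joule\n", urls_per_sec / package_watts);
  |       // A chip that burns 2x the power for 1.2x the speed loses
  |       // on this metric despite winning on wall-clock time.
  |     }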
 
    | aylmao wrote:
    | > It's not useful at all. It effectively measures IPC
    | (instructions per clock), which is just chip vendor bragging
    | rights.
    | 
    | +1. Moreover the author then seems to conclude from this
    | benchmark:
    | 
    | > Overall, these numbers suggest that the Qualcomm processor
    | is competitive.
    | 
    | This is an odd conclusion to draw from this test and these
    | numbers, given how little this benchmark tests (just string
    | operations). Does this benchmark want to test raw CPU power?
    | Then why "normalize" to 3GHz? Does it want to test CPU
    | capabilities? If so why use such a "narrow" test?
    | 
    | IMO this benchmark makes for a good data point, but it's far
    | from enough to draw much of a conclusion from.
 
    | Octoth0rpe wrote:
    | > For desktop devices "instructions per dollar" would be most
    | relevant.
    | 
    | For cloud customers as well
 
      | cubefox wrote:
      | Which would probably include the cost of the chips in some
      | way, not just electricity.
 
      | zamalek wrote:
      | > For cloud customers as well
      | 
      | Cloud costs are dominated by power delivery and cooling.
      | Both of those are directly influenced by how much power the
      | chip uses to achieve its performance target.
      | 
      | I guess it does indirectly influence dollar cost, but I was
      | referring to MSRP of the chip. As a simple example: the
      | per-chip cost of Graviton is probably enormous (if you
      | factor R&D into the cost of a chip), but it's still cheaper
      | for Amazon customers. Why? Power and cooling.
 
___________________________________________________________________
(page generated 2023-05-03 23:00 UTC)