[HN Gopher] Parsing JSON faster with Intel AVX-512
___________________________________________________________________
 
Parsing JSON faster with Intel AVX-512
 
Author : ashvardanian
Score  : 97 points
Date   : 2022-05-25 21:29 UTC (1 day ago)
 
web link (lemire.me)
w3m dump (lemire.me)
 
| worewood wrote:
| Using specialized instructions doesn't always turn into
| performance improvements. Processors are pretty smart these days,
| and the generated u-ops may be the same.
 
| skavi wrote:
| Hopefully we'll see AVX-512 in Intel's little cores soon.
| Centaur's last CPU architecture proves that it is possible to
| implement the extension without a huge amount of area [0]. Once
| that happens, I expect we'll finally consistently see AVX-512 on
| new Intel processors. The masks really are a huge improvement to
| the design.
| 
| AMD should be implementing AVX-512 on their own cores soon as
| well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much
| be in a golden age of SIMD.
| 
| [0]: https://chipsandcheese.com/2022/04/30/examining-centaur-
| chas...
 
  | torginus wrote:
  | I'm kinda torn on AVX-512 (and SIMD in general). On one hand,
  | AVX-512 finally introduced a sane programming model with mask
  | registers for branching code, which makes the lives of
  | compilers much easier.
  | 
  | On the other hand, the tooling for turning high-level languages
  | into SIMD code is not there yet; ISPC refuses to support ARM,
  | and is still kind of a novelty tool.
  | 
  | Additionally, 512-bit wide vectors are just too big - the
  | resulting vector units take up too much die space even on _big_
  | cores, and the power consumption causes said dies to downclock.
  | It probably won't be viable on small cores.
 
    | dr_zoidberg wrote:
    | > Additionally, 512-bit wide vectors are just too big - the
    | resulting vector units take up too much die space even on big
    | cores, and the power consumption causes issues causing said
    | dies to downclock.
    | 
    | This is no longer true. Quoting [0]:
    | 
    | > At least, it means we need to adjust our mental model of
    | the frequency related cost of AVX-512 instructions. Rather
    | than the prior-generation verdict of "AVX-512 generally
    | causes significant downclocking", on these Ice Lake and
    | Rocket Lake client chips we can say that AVX-512 causes
    | insignificant (usually, none at all) license-based
    | downclocking and I expect this to be true on other ICL and
    | RKL client chips as well.
    | 
    | And we still have to see AMD's implementation of AVX-512 on
    | Zen 4 to know what behavior and limits it may have (if any).
    | 
    | [0] https://travisdowns.github.io/blog/2020/08/19/icl-
    | avx512-fre...
 
  | jeffbee wrote:
  | Considering that the execution units, register file, etc that
  | support AVX-512 are themselves nearly as large as the entire
  | Gracemont core ... don't hold your breath.
 
    | brigade wrote:
    | You don't need anything larger than the 128-bit ALUs or the
    | 207x128-bit register file Gracemont already has to implement
    | AVX-512. It doesn't make sense on its own with that backend,
    | but for ISA compatibility with a big core it does.
 
      | Dylan16807 wrote:
      | Can the shuffling instructions be reasonably efficient with
      | a small ALU?
 
        | brigade wrote:
        | Depends on what you consider reasonable. Worst case is
        | 512-bit vpermi2*, which could be implemented with 16x
        | 128-bit vpermi2-like uops, if the needed masking was
        | implicit.
        | 
        | Which to me is reasonable for ISA compatibility. (Also
        | considering that having to deal with ISA incompatibility
        | across active cores is _not_ reasonable at all.)
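 
The decomposition brigade describes can be modeled in a few lines of Python. This is a toy sketch of the idea only, not how any real decoder splits the instruction: a full 512-bit vpermi2b (64 byte lanes selecting from two 64-byte sources) is rebuilt from sixteen 128-bit vpermi2-like uops, each reading one 32-byte pair of source chunks, assuming the needed masking is implicit.

```python
def vpermi2b_512(idx, a, b):
    """Reference semantics: dst[i] = concat(a, b)[idx[i] & 127]."""
    src = a + b
    return [src[i & 127] for i in idx]

def vpermi2b_via_128bit_uops(idx, a, b):
    """Same result, built from 16 small 'uops'. Each uop handles one
    16-byte destination chunk against one pair of 16-byte source
    chunks (two 128-bit inputs, like a 128-bit vpermi2b), and only
    writes lanes whose index lands inside that pair."""
    src = a + b
    dst = [0] * 64
    uops = 0
    for d in range(4):           # 4 destination chunks of 16 bytes
        for p in range(4):       # 4 source-chunk pairs of 32 bytes
            uops += 1
            for j in range(16):  # one 128-bit vpermi2-like uop
                g = idx[16 * d + j] & 127
                if 32 * p <= g < 32 * (p + 1):
                    dst[16 * d + j] = src[g]
    assert uops == 16
    return dst
```

Every 7-bit index falls in exactly one of the four 32-byte windows, so the sixteen uops together cover all 64 lanes.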
 
      | jeffbee wrote:
      | I'm not sure that users would accept that. You could have a
      | situation where an ifunc is resolved on a fast core with a
      | slightly superior AVX-512 definition, but then the thread
      | migrates to an efficiency core and the AVX-512 definition
      | is dramatically slower than what could have been achieved
      | with AVX2 (e.g. if a microcoded permute was 16x slower).
 
        | brigade wrote:
        | Most reasonable would be a hypothetical AVX-256 that was
        | AVX-512VL minus ZMM registers. Intel chose against that.
        | 
        | So the only reasonable options for a big little system
        | are to not have little cores, or for nothing to support
        | AVX-512, or for the little cores to support AVX-512 as
        | best they can. Then thread director can weight AVX-512
        | usage even heavier than it already weights AVX2.
 
  | dragontamer wrote:
  | > we'll pretty much be in a golden age of SIMD.
  | 
  | We already are in the golden age of SIMD. NVidia and AMD GPUs
  | are easier and easier to program through standard interfaces.
  | 
  | Intel / AMD are pushing SIMD on a CPU, which is useful for
  | sure, but is always going to be smaller in scope than a
  | dedicated SIMD-processor like the A100, 3060, AMD Vega, AMD
  | 6800 XT and the like.
  | 
  | SIMD-on-a-CPU is useful because you can perform SIMD over the
  | L1 cache as communication (rather than traversing L1 -> L2 ->
  | L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and
  | back). But if you have a large-scale operation that can work
  | SIMD, the GPU-traversal absolutely works and is commonly done.
 
    | skavi wrote:
    | Good point. Should have clarified I was referring to CPU
    | SIMD.
 
      | dragontamer wrote:
      | AVX2 is not as good as AVX512. But AVX2 still has vgather
      | instructions, pshufb, and a few other useful tricks.
      | 
      | AVX512 and ARM SVE2 bring the CPU up to parity with maybe
      | 2010s-era GPUs or so (full gather/scatter, more permutation
      | instructions, etc. etc.). But GPUs continued to evolve.
      | Butterfly-shuffles are the generic any-to-any network
      | building block, and are exposed in PTX (NVidia assembly) as
      | shfl.bfly, and in AMD's DPP (data-parallel primitives).
      | 
      | Having a richer set of lane-to-lane shuffling (especially
      | ready-to-use butterfly networks) would be best. It really
      | is surprising how many problems require those rich-sets of
      | data-movement instructions, or otherwise benefit from them.
      | 
      | NEON and SVE had hard-coded data-movement for specific
      | applications. The general-purpose instruction (pshufb) is
      | kinda like permute/shfl from AMD/NVidia. A backwards-
      | permute IIRC doesn't exist yet on CPU-side.
      | 
      | And butterfly networks are the general-purpose solution,
      | capable of implementing any arbitrary data-movement in just
      | log(width) steps. (pshufb / permute instructions would be
      | the full-sized butterfly network, but some cases might be
      | "easier" and faster to execute with only a limited number
      | of butterfly swaps, such as what inevitably comes up in
      | sorting)
      | 
      | --------
      | 
      | Still, all of these operations can be implemented in AVX2
      | (albeit slower / less efficiently). So it's not like the
      | "language" of AVX2 / AVX is incomplete... it's just missing
      | a few general-purpose instructions that could lead to
      | better performance.
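 
The butterfly step described above (PTX's shfl.bfly) is easy to model: every lane i exchanges its value with lane i XOR mask, and composing steps with masks 1, 2, 4, ... realizes any XOR-style permutation, e.g. a full lane reversal in log2(width) steps. A toy Python sketch, not GPU code:

```python
def bfly(lanes, mask):
    """One butterfly shuffle step: lane i takes the value of lane
    i XOR mask (a self-inverse exchange, like PTX shfl.bfly)."""
    return [lanes[i ^ mask] for i in range(len(lanes))]

lanes = list(range(8))      # [0, 1, 2, 3, 4, 5, 6, 7]
for mask in (1, 2, 4):      # log2(8) = 3 butterfly steps
    lanes = bfly(lanes, mask)
print(lanes)                # -> [7, 6, 5, 4, 3, 2, 1, 0]
```

Composing the masks 1, 2 and 4 is the same as XOR with 7, i.e. i -> 7 - i, which is exactly the reversal.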
 
| PragmaticPulp wrote:
| > Could we do better? Assuredly. There are many AVX-512
| instructions that we are not using yet. We do not use ternary
| Boolean operations
| (vpternlog). We are not using the new powerful shuffle functions
| (e.g., vpermt2b). We have an example of coevolution: better
| hardware requires new software which, in turn, makes the hardware
| shine.
| 
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support
| 
| AVX-512 support can be confusing because it's often referred to
| as a single instruction set.
| 
| AVX-512 is actually a large family of instructions that have
| different availability depending on the CPU. It's not enough to
| say that a CPU has AVX-512 because it's not a binary question.
| You have to know _which_ AVX-512 instructions are supported on a
| particular CPU.
| 
| Wikipedia has a partial chart of AVX-512 support by CPU:
| https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
| 
| Note that some instructions available in one generation of CPU
| can actually be unavailable (superseded, usually) in the next
| generation. If you go deep enough into
| AVX-512 optimization, you essentially end up targeting a specific
| CPU for the code. This is not a big deal if you're deploying
| software to 10,000 carefully controlled cloud servers with known
| specifications, but it makes general use and especially consumer
| use much harder.
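 
To illustrate the "not a binary question" point, here is a small Python sketch of subset checking. The per-CPU feature sets below are illustrative approximations, not an authoritative table - consult the Wikipedia chart for real data:

```python
# Illustrative (NOT authoritative) AVX-512 subset tables per CPU family.
SUPPORTED = {
    "skylake-x": {"avx512f", "avx512cd", "avx512bw", "avx512dq",
                  "avx512vl"},
    "ice-lake":  {"avx512f", "avx512cd", "avx512bw", "avx512dq",
                  "avx512vl", "avx512vbmi", "avx512vbmi2"},
}

def can_run(cpu, required):
    """True only if every AVX-512 subset the kernel needs is present."""
    return required <= SUPPORTED.get(cpu, set())

# The article's code relies on VBMI2, so "has AVX-512" is not enough:
needed = {"avx512f", "avx512vbmi2"}
print(can_run("ice-lake", needed))   # -> True
print(can_run("skylake-x", needed))  # -> False: AVX-512 yes, VBMI2 no
```

The Skylake-X case is the trap the comment warns about: the CPU genuinely "has AVX-512", yet code built for a newer subset still cannot run on it.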
 
  | robocat wrote:
  | To add, they are using[2] the relatively recent VBMI2
  | instructions of AVX512. This article[1] talks about the
  | advantages of VBMI on IceLake released 2021.
  | 
  | [1] https://www.singlestore.com/blog/a-programmers-perspective/
  | comments https://news.ycombinator.com/item?id=28179111
  | 
  | [2] https://news.ycombinator.com/item?id=31522464
 
  | mikepurvis wrote:
  | Are there good libraries for doing runtime feature detection?
  | E.g., include three versions of hot function X in the binary, and
  | have it seamlessly insert the correct function pointer at
  | startup? Or have the function contain multiple bodies and just
  | JMP to the correct block of code?
  | 
  | I know you can do this yourself, but last time I looked it was
  | a heavily manual process-- you had to basically define a plugin
  | interface and dynamically load your selected implementation
  | from a separate shared object. What are the barriers to having
  | compilers able to be hinted into transparently generating
  | multiple versions of key functions?
 
    | bremac wrote:
    | I'm unsure about library support, but gcc and clang support
    | function multi-versioning (FMV), which resolves the function
    | based on CPUID the first time the function is called.
    | 
    | This LWN article has some additional information:
    | https://lwn.net/Articles/691932/
 
      | mikepurvis wrote:
      | TIL! I guess it makes sense that popular numeric libraries
      | like BLAS, Eigen, and so on would take advantage of this,
      | but I wonder how widely used it is overall.
 
    | loeg wrote:
    | GCC has offered Function Multiversioning for about a decade
    | now (GCC ~4.8 or 4.9). GCC 6's resolver apparently uses CPUID
    | to resolve the ifunc once at program start:
    | https://lwn.net/Articles/691932/ .
    | 
    | Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tool
    | s/clang/docs/AttributeRe...
    | 
    | A nice presentation on it:
    | https://llvm.org/devmtg/2014-10/Slides/Christopher-
    | Function%...
 
      | indygreg2 wrote:
      | While this IFUNC feature does exist and it is useful, when
      | I performed binary analysis on every package in Ubuntu in
      | January, I found that only ~11 distinct packages have
      | IFUNCs. It certainly looks like this ELF feature is not
      | really used much [in open source] outside of GNU toolchain-
      | level software!
      | 
      | https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-
      | linux-...
 
        | TkTech wrote:
        | I've wanted to use them many times in the past, but the
        | limited support on other compilers (looking at you MSVC)
        | always made it a non-starter. If I have to support some
        | other method of feature detection anyways, there's no
        | point.
 
    | colejohnson66 wrote:
    | Check out Agner Fog's vectorclass library:
    | https://github.com/vectorclass/version2
 
| beached_whale wrote:
| If only Intel wasn't dropping support for it on a lot of CPUs.
 
  | jrimbault wrote:
  | I'm using this comment as jumping point.
  | 
  | What's the cost/opportunity of optimizing for a specific
  | platform/instruction set? At what point is it worth doing, and
  | when isn't it? AVX-512 strikes me as something... "ephemeral".
 
    | skavi wrote:
    | Writing code for SIMD can get you absolutely massive
    | performance improvements. Whether it's worth the added
    | complexity depends on the situation. If your data is already
    | arranged in a cache friendly way (SoA), it shouldn't be
    | incredibly difficult to use SIMD intrinsics to optimize. I'd
    | first take a look at what the compiler is already generating
    | for you to see if manual intervention is worth it.
 
    | throwaway92394 wrote:
    | Well, I mean, this article is demoing a 28% improvement (if I
    | did my math right) for JSON parsing.
    | 
    | Sure, AVX-512 is only applicable to specific workloads, and
    | even for many of those the cost/opportunity of optimizing for
    | AVX-512 might not be worth it. But there
    | clearly ARE usecases that would benefit, and it might be
    | worth it for more consumer applications to optimize for
    | AVX-512 - but only if it can be used.
    | 
    | The way I see it is that the benefit of optimizing for
    | AVX-512 is far higher if it becomes normal for consumer CPUs
    | to have it. A 28% improvement is pretty decent, but it's only
    | worth implementing if enough people can utilize it.
 
    | beached_whale wrote:
    | For many, maybe not, but when writing the foundations of
    | software it is good to start fast. There are libraries that
    | abstract the various SIMD architectures now, too: simdjson
    | has its own, and there are ones like KUMI.
 
  | jcranmer wrote:
  | Intel is not dropping support for it on a lot of CPUs.
  | 
  | The only thing they've done is disable it in the hybrid Alder
  | Lake cores, presumably because the E-cores couldn't support it
  | (while the P-cores could), and they didn't want to deal with
  | the headaches of ISA extensions being supported only on some
  | cores in the system.
 
    | Aardwolf wrote:
    | > Intel is not dropping support for it on a lot of CPUs.
    | 
    | There are zero current-generation consumer CPUs from either
    | Intel or AMD that have it.
    | 
    | > The only thing they've done is disable it in the hybrid
    | Alder Lake cores
    | 
    | Which happen to be _all_ the current generation Intel CPUs
 
    | beached_whale wrote:
    | Ah, headlines foiled me. I read disabled in Alder lake all
    | together.
 
      | temac wrote:
      | It is disabled in all consumer Alder Lake parts (and I don't
      | remember if there will be Xeons of that generation with
      | P-cores only -- IIRC Intel stopped the AVX-512 validation
      | late on those cores, but before it was formally finished, so
      | probably not). At one point it worked with some BIOSes on
      | P-core-only chips, or if you disabled the E-cores on hybrid
      | ones, but with up-to-date Intel microcode it does not work
      | anymore.
 
    | coder543 wrote:
    | > The only thing they've done is disable it in the hybrid
    | Alder Lake cores
    | 
    | That is incorrect. You can buy Alder Lake CPUs that only have
    | one type of core (the i3 series only has P-cores, for
    | example), and those do not support AVX-512 either. They're
    | not "hybrid" in any way.
    | 
    | Some of their motherboard partners initially allowed you to
    | access AVX-512, but Intel has put a stop to this and the
    | feature is disabled on _all_ Alder Lake CPU SKUs, period.
    | 
    | Newer Alder Lake chips have AVX-512 fused off entirely:
    | https://www.tomshardware.com/news/intel-nukes-alder-lake-
    | avx...
    | 
    | > Intel is not dropping support for it on a lot of CPUs.
    | 
    | That seems like a pretty questionable statement. Intel might
    | keep AVX-512 around for Xeon, but it seems extremely dead on
    | the consumer market. If Intel decides to bring it back for
    | the next generation, that would be strange and very poor
    | planning.
 
      | gpderetta wrote:
      | It seems likely that the reason is that some Intel customers
      | are willing to pay a significant premium for the feature,
      | and Intel doesn't want it to be available for cheap.
  | nomel wrote:
  | Well, if it means more cores, it's almost certainly worth it,
  | in the grand scheme of things.
 
| bfrog wrote:
| At what point is JSON not the right option? Surely when trying to
| do this sort of thing?
| 
| At what point is it saner to use something like flatbuffers or
| capnproto-style message encoding instead?
 
  | smabie wrote:
  | Often you do not get the choice of whether you want to be
  | parsing JSON or not.
 
  | vardump wrote:
  | Sometimes you just don't have a choice when you need to
  | interface with a third party data feed or software.
  | 
  | Isn't it better to have all options open?
 
  | avg_dev wrote:
  | Good thought. If you are coding in C++, maybe you can use some
  | sort of binary serialization thing. Even in other languages,
  | if JSON parsing is a bottleneck, it can possibly be optimized
  | away through use of a binary wire format. That said, vector
  | operations being available to programmers is always a welcome
  | thing, I'd say. And who knows how much production JSON parsing
  | this library really does; it could be a ton.
  | 
  | I'm torn. I've worked at shops where we aim over time to
  | reduce response time while serving business logic and using
  | statistical models that get iterated on. Even there I haven't
  | seen a blatant need for non-JSON RPC. But I know my experience
  | doesn't mirror everyone's. And I like seeing and learning
  | about instruction sets. I'm currently taking a course in
  | parallel computing, and I just used AVX2 for the first time in
  | a toy program to subtract one vector from another in a single
  | instruction - which, while not particularly useful, is a
  | window into more interesting things and is still SIMD.
  | 
  | I think that, on the whole, making JSON parsing faster for a
  | large enough fraction of processors is probably a huge win for
  | the environment. But who is parsing JSON in C++?
 
    | ollien wrote:
    | > But who is parsing json in C++?
    | 
    | Well, Facebook for one! Folly has lots of utilities for this
    | (see folly::dynamic[1]). We make extensive use of this at my
    | (non-Facebook) job.
    | 
    | [1] https://github.com/facebook/folly/blob/master/folly/docs/
    | Dyn...
 
| timerol wrote:
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support and, evidently, you also
| need relatively recent C++ processors. Some of the recent laptop-
| class Intel processors do not support AVX-512 but you should be
| fine if you rely on AWS and have big Intel nodes.
| 
| What is meant by "relatively recent C++ processors"? Is that
| supposed to be "compilers"?
 
  | [deleted]
 
  | Narishma wrote:
  | It's supposed to be Intel, not C++.
 
| NegativeLatency wrote:
| "new" is relative since they've been out for almost 10 years:
| https://www.intel.com/content/www/us/en/developer/articles/t...
 
  | jeffbee wrote:
  | This code uses VBMI2, which came out quite recently.
 
___________________________________________________________________
(page generated 2022-05-26 23:00 UTC)