[HN Gopher] Parsing JSON faster with Intel AVX-512
___________________________________________________________________
 
Parsing JSON faster with Intel AVX-512
 
Author : ashvardanian
Score  : 97 points
Date   : 2022-05-25 21:29 UTC (1 day ago)
 
web link (lemire.me)
w3m dump (lemire.me)
 
| worewood wrote:
| Using specialized instructions doesn't always turn into
| performance improvements. Processors are pretty smart these days,
| and the generated u-ops may be the same.
 
| skavi wrote:
| Hopefully we'll see AVX-512 in Intel's little cores soon.
| Centaur's last CPU architecture proves that it is possible to
| implement the extension without a huge amount of area [0]. Once
| that happens, I expect we'll finally consistently see AVX-512 on
| new Intel processors. The masks really are a huge improvement to
| the design.
| 
| AMD should be implementing AVX-512 on their own cores soon as
| well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much
| be in a golden age of SIMD.
| 
| [0]: https://chipsandcheese.com/2022/04/30/examining-centaur-
| chas...
 
  | torginus wrote:
  | I'm kinda torn on AVX-512 (and SIMD in general). On one hand,
  | AVX-512 finally introduced a sane programming model with mask
  | registers for branching code, which makes the lives of
  | compilers much easier.
  | 
  | On the other hand, the tooling for turning high-level languages
  | into SIMD code is not there yet; ISPC refuses to support ARM,
  | and is still kind of a novelty tool.
  | 
  | Additionally, 512-bit wide vectors are just too big - the
  | resulting vector units take up too much die space even on _big_
  | cores, and the power consumption causes said dies to downclock.
  | It probably won't be viable on small cores.
 
    | dr_zoidberg wrote:
    | > Additionally, 512-bit wide vectors are just too big - the
    | resulting vector units take up too much die space even on big
    | cores, and the power consumption causes issues causing said
    | dies to downclock.
    | 
    | This is no longer true. Quoting [0]:
    | 
    | > At least, it means we need to adjust our mental model of
    | the frequency related cost of AVX-512 instructions. Rather
    | than the prior-generation verdict of "AVX-512 generally
    | causes significant downclocking", on these Ice Lake and
    | Rocket Lake client chips we can say that AVX-512 causes
    | insignificant (usually, none at all) license-based
    | downclocking and I expect this to be true on other ICL and
    | RKL client chips as well.
    | 
    | And we still have to see AMD's implementation of AVX-512 on
    | Zen 4 to know what behavior and limits it may have (if any).
    | 
    | [0] https://travisdowns.github.io/blog/2020/08/19/icl-
    | avx512-fre...
 
  | jeffbee wrote:
  | Considering that the execution units, register file, etc that
  | support AVX-512 are themselves nearly as large as the entire
  | Gracemont core ... don't hold your breath.
 
    | brigade wrote:
    | You don't need anything larger than the 128-bit ALUs or the
    | 207x128-bit register file Gracemont already has to implement
    | AVX-512. It doesn't make sense on its own with that backend,
    | but for ISA compatibility with a big core it does.
 
      | Dylan16807 wrote:
      | Can the shuffling instructions be reasonably efficient with
      | a small ALU?
 
        | brigade wrote:
        | Depends on what you consider reasonable. Worst case is
        | 512-bit vpermi2*, which could be implemented with 16x
        | 128-bit vpermi2-like uops, if the needed masking was
        | implicit.
        | 
        | Which to me is reasonable for ISA compatibility. (Also
        | considering that having to deal with ISA incompatibility
        | across active cores is _not_ reasonable at all.)
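 
The decomposition brigade describes can be modeled in a few lines of Python. This is a toy sketch of the idea only, not how any real decoder splits the instruction: a full 512-bit vpermi2b (64 byte lanes selecting from two 64-byte sources) is rebuilt from sixteen 128-bit vpermi2-like uops, each reading one 32-byte pair of source chunks, assuming the needed masking is implicit.

```python
def vpermi2b_512(idx, a, b):
    """Reference semantics: dst[i] = concat(a, b)[idx[i] & 127]."""
    src = a + b
    return [src[i & 127] for i in idx]

def vpermi2b_via_128bit_uops(idx, a, b):
    """Same result, built from 16 small 'uops'. Each uop handles one
    16-byte destination chunk against one pair of 16-byte source
    chunks (two 128-bit inputs, like a 128-bit vpermi2b), and only
    writes lanes whose index lands inside that pair."""
    src = a + b
    dst = [0] * 64
    uops = 0
    for d in range(4):           # 4 destination chunks of 16 bytes
        for p in range(4):       # 4 source-chunk pairs of 32 bytes
            uops += 1
            for j in range(16):  # one 128-bit vpermi2-like uop
                g = idx[16 * d + j] & 127
                if 32 * p <= g < 32 * (p + 1):
                    dst[16 * d + j] = src[g]
    assert uops == 16
    return dst
```

Every 7-bit index falls in exactly one of the four 32-byte windows, so the sixteen uops together cover all 64 lanes.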
 
      | jeffbee wrote:
      | I'm not sure that users would accept that. You could have a
      | situation where an ifunc is resolved on a fast core with a
      | slightly superior AVX-512 definition, but then the thread
      | migrates to an efficiency core and the AVX-512 definition
      | is dramatically slower than what could have been achieved
      | with AVX2 (e.g. if a microcoded permute was 16x slower).
 
        | brigade wrote:
        | Most reasonable would be a hypothetical AVX-256 that was
        | AVX-512VL minus ZMM registers. Intel chose against that.
        | 
        | So the only reasonable options for a big little system
        | are to not have little cores, or for nothing to support
        | AVX-512, or for the little cores to support AVX-512 as
        | best they can. Then thread director can weight AVX-512
        | usage even heavier than it already weights AVX2.
 
  | dragontamer wrote:
  | > we'll pretty much be in a golden age of SIMD.
  | 
  | We already are in the golden age of SIMD. NVidia and AMD GPUs
  | are easier and easier to program through standard interfaces.
  | 
  | Intel / AMD are pushing SIMD on a CPU, which is useful for
  | sure, but is always going to be smaller in scope than a
  | dedicated SIMD-processor like the A100, 3060, AMD Vega, AMD
  | 6800 XT and the like.
  | 
  | SIMD-on-a-CPU is useful because you can perform SIMD over the
  | L1 cache as communication (rather than traversing L1 -> L2 ->
  | L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and
  | back). But if you have a large-scale operation that can work
  | SIMD, the GPU-traversal absolutely works and is commonly done.
 
    | skavi wrote:
    | Good point. Should have clarified I was referring to CPU
    | SIMD.
 
      | dragontamer wrote:
      | AVX2 is not as good as AVX512. But AVX2 still has vgather
      | instructions, pshufb, and a few other useful tricks.
      | 
      | AVX512 and ARM SVE2 bring the CPU up to parity with maybe
      | 2010s-era GPUs or so (full gather/scatter, more permutation
      | instructions, etc. etc.). But GPUs continued to evolve.
      | Butterfly-shuffles are the generic any-to-any network
      | building block, and are exposed in PTX (NVidia assembly) as
      | shfl.bfly, and in AMD's DPP (data-parallel primitives).
      | 
      | Having a richer set of lane-to-lane shuffling (especially
      | ready-to-use butterfly networks) would be best. It really
      | is surprising how many problems require those rich-sets of
      | data-movement instructions, or otherwise benefit from them.
      | 
      | NEON and SVE had hard-coded data-movement for specific
      | applications. The general-purpose instruction (pshufb) is
      | kinda like permute/shfl from AMD/NVidia. A backwards-
      | permute IIRC doesn't exist yet on CPU-side.
      | 
      | And butterfly networks are the general-purpose solution,
      | capable of implementing any arbitrary data-movement in just
      | log(width) steps. (pshufb / permute instructions would be
      | the full-sized butterfly network, but some cases might be
      | "easier" and faster to execute with only a limited number
      | of butterfly swaps, such as what inevitably comes up in
      | sorting)
      | 
      | --------
      | 
      | Still, all of these operations can be implemented in AVX2
      | (albeit slower / less efficiently). So it's not like the
      | "language" of AVX2 / AVX is incomplete... it's just missing
      | a few general-purpose instructions that could lead to
      | better performance.
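 
The butterfly step described above (PTX's shfl.bfly) is easy to model: every lane i exchanges its value with lane i XOR mask, and composing steps with masks 1, 2, 4, ... realizes any XOR-style permutation, e.g. a full lane reversal in log2(width) steps. A toy Python sketch, not GPU code:

```python
def bfly(lanes, mask):
    """One butterfly shuffle step: lane i takes the value of lane
    i XOR mask (a self-inverse exchange, like PTX shfl.bfly)."""
    return [lanes[i ^ mask] for i in range(len(lanes))]

lanes = list(range(8))      # [0, 1, 2, 3, 4, 5, 6, 7]
for mask in (1, 2, 4):      # log2(8) = 3 butterfly steps
    lanes = bfly(lanes, mask)
print(lanes)                # -> [7, 6, 5, 4, 3, 2, 1, 0]
```

Composing the masks 1, 2 and 4 is the same as XOR with 7, i.e. i -> 7 - i, which is exactly the reversal.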
 
| PragmaticPulp wrote:
| > Could we do better? Assuredly. There are many AVX-512
| instructions that we are not using yet. We do not use ternary
| Boolean operations
| (vpternlog). We are not using the new powerful shuffle functions
| (e.g., vpermt2b). We have an example of coevolution: better
| hardware requires new software which, in turn, makes the hardware
| shine.
| 
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support
| 
| AVX-512 support can be confusing because it's often referred to
| as a single instruction set.
| 
| AVX-512 is actually a large family of instructions that have
| different availability depending on the CPU. It's not enough to
| say that a CPU has AVX-512 because it's not a binary question.
| You have to know _which_ AVX-512 instructions are supported on a
| particular CPU.
| 
| Wikipedia has a partial chart of AVX-512 support by CPU:
| https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
| 
| Note that some instructions available in one generation of CPU
| can actually be unavailable (superseded, usually) in the next
| generation. If you go deep enough into
| AVX-512 optimization, you essentially end up targeting a specific
| CPU for the code. This is not a big deal if you're deploying
| software to 10,000 carefully controlled cloud servers with known
| specifications, but it makes general use and especially consumer
| use much harder.
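 
To illustrate the "not a binary question" point, here is a small Python sketch of subset checking. The per-CPU feature sets below are illustrative approximations, not an authoritative table - consult the Wikipedia chart for real data:

```python
# Illustrative (NOT authoritative) AVX-512 subset tables per CPU family.
SUPPORTED = {
    "skylake-x": {"avx512f", "avx512cd", "avx512bw", "avx512dq",
                  "avx512vl"},
    "ice-lake":  {"avx512f", "avx512cd", "avx512bw", "avx512dq",
                  "avx512vl", "avx512vbmi", "avx512vbmi2"},
}

def can_run(cpu, required):
    """True only if every AVX-512 subset the kernel needs is present."""
    return required <= SUPPORTED.get(cpu, set())

# The article's code relies on VBMI2, so "has AVX-512" is not enough:
needed = {"avx512f", "avx512vbmi2"}
print(can_run("ice-lake", needed))   # -> True
print(can_run("skylake-x", needed))  # -> False: AVX-512 yes, VBMI2 no
```

The Skylake-X case is the trap the comment warns about: the CPU genuinely "has AVX-512", yet code built for a newer subset still cannot run on it.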
 
  | robocat wrote:
  | To add, they are using[2] the relatively recent VBMI2
  | instructions of AVX512. This article[1] talks about the
  | advantages of VBMI on IceLake released 2021.
  | 
  | [1] https://www.singlestore.com/blog/a-programmers-perspective/
  | comments https://news.ycombinator.com/item?id=28179111
  | 
  | [2] https://news.ycombinator.com/item?id=31522464
 
  | mikepurvis wrote:
  | Are there good libraries for doing runtime feature detection?
  | E.g., include three versions of hot function X in the binary, and
  | have it seamlessly insert the correct function pointer at
  | startup? Or have the function contain multiple bodies and just
  | JMP to the correct block of code?
  | 
  | I know you can do this yourself, but last time I looked it was
  | a heavily manual process-- you had to basically define a plugin
  | interface and dynamically load your selected implementation
  | from a separate shared object. What are the barriers to having
  | compilers able to be hinted into transparently generating
  | multiple versions of key functions?
 
    | bremac wrote:
    | I'm unsure about library support, but gcc and clang support
    | function multi-versioning (FMV), which resolves the function
    | based on CPUID the first time the function is called.
    | 
    | This LWN article has some additional information:
    | https://lwn.net/Articles/691932/
 
      | mikepurvis wrote:
      | TIL! I guess it makes sense that popular numeric libraries
      | like BLAS, Eigen, and so on would take advantage of this,
      | but I wonder how widely used it is overall.
 
    | loeg wrote:
    | GCC has offered Function Multiversioning for about a decade
    | now (GCC ~4.8 or 4.9). GCC 6's resolver apparently uses CPUID
    | to resolve the ifunc once at program start:
    | https://lwn.net/Articles/691932/ .
    | 
    | Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tool
    | s/clang/docs/AttributeRe...
    | 
    | A nice presentation on it:
    | https://llvm.org/devmtg/2014-10/Slides/Christopher-
    | Function%...
 
      | indygreg2 wrote:
      | While this IFUNC feature does exist and it is useful, when
      | I performed binary analysis on every package in Ubuntu in
      | January, I found that only ~11 distinct packages have
      | IFUNCs. It certainly looks like this ELF feature is not
      | really used much [in open source] outside of GNU toolchain-
      | level software!
      | 
      | https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-
      | linux-...
 
        | TkTech wrote:
        | I've wanted to use them many times in the past, but the
        | limited support on other compilers (looking at you MSVC)
        | always made it a non-starter. If I have to support some
        | other method of feature detection anyways, there's no
        | point.
 
    | colejohnson66 wrote:
    | Check out Agner Fog's vectorclass library:
    | https://github.com/vectorclass/version2
 
| beached_whale wrote:
| If only Intel wasn't dropping support for it on a lot of CPUs.
 
  | jrimbault wrote:
  | I'm using this comment as jumping point.
  | 
  | What's the cost/opportunity of optimizing for a specific
  | platform/instruction set? At what point is it worth doing, and
  | when isn't it? AVX-512 strikes me as something... "ephemeral".
 
    | skavi wrote:
    | Writing code for SIMD can get you absolutely massive
    | performance improvements. Whether it's worth the added
    | complexity depends on the situation. If your data is already
    | arranged in a cache friendly way (SoA), it shouldn't be
    | incredibly difficult to use SIMD intrinsics to optimize. I'd
    | first take a look at what the compiler is already generating
    | for you to see if manual intervention is worth it.
 
    | throwaway92394 wrote:
    | Well, I mean, this article is demoing a 28% improvement (if I
    | did my math right) for JSON parsing.
    | 
    | Sure, AVX-512 is only applicable to specific workloads, and
    | even for many of those the cost/opportunity of optimizing for
    | AVX-512 might not be worth it. But there
    | clearly ARE usecases that would benefit, and it might be
    | worth it for more consumer applications to optimize for
    | AVX-512 - but only if it can be used.
    | 
    | The way I see it is that the benefit of optimizing for
    | AVX-512 is far higher if it becomes normal for consumer CPUs
    | to have it. A 28% improvement is pretty decent, but it's only
    | worth implementing if enough people can utilize it.
 
    | beached_whale wrote:
    | For many, maybe not, but when writing the foundations of
    | software it is good to start fast. There are libraries that
    | abstract the various SIMD architectures now, too: simdjson
    | has its own, and there are ones like KUMI.
 
  | jcranmer wrote:
  | Intel is not dropping support for it on a lot of CPUs.
  | 
  | The only thing they've done is disable it in the hybrid Alder
  | Lake cores, presumably because the E-cores couldn't support it
  | (while the P-cores could), and they didn't want to deal with
  | the headaches of ISA extensions being supported only on some
  | cores in the system.
 
    | Aardwolf wrote:
    | > Intel is not dropping support for it on a lot of CPUs.
    | 
    | There are zero current-generation consumer CPUs from either
    | Intel or AMD that have it.
    | 
    | > The only thing they've done is disable it in the hybrid
    | Alder Lake cores
    | 
    | Which happen to be _all_ the current generation Intel CPUs
 
    | beached_whale wrote:
    | Ah, headlines foiled me. I read disabled in Alder lake all
    | together.
 
      | temac wrote:
      | It is disabled in all consumer Alder Lake parts (and I don't
      | remember if there will be Xeons of that generation with
      | P-cores only -- IIRC Intel stopped the AVX-512 validation
      | late on those cores, but before it was formally finished, so
      | probably not). At one point it worked with some BIOSes on
      | P-core-only chips, or if you disabled the E-cores on hybrid
      | ones, but with up-to-date Intel microcode it does not work
      | anymore.
 
    | coder543 wrote:
    | > The only thing they've done is disable it in the hybrid
    | Alder Lake cores
    | 
    | That is incorrect. You can buy Alder Lake CPUs that only have
    | one type of core (the i3 series only has P-cores, for
    | example), and those do not support AVX-512 either. They're
    | not "hybrid" in any way.
    | 
    | Some of their motherboard partners initially allowed you to
    | access AVX-512, but Intel has put a stop to this and the
    | feature is disabled on _all_ Alder Lake CPU SKUs, period.
    | 
    | Newer Alder Lake chips have AVX-512 fused off entirely:
    | https://www.tomshardware.com/news/intel-nukes-alder-lake-
    | avx...
    | 
    | > Intel is not dropping support for it on a lot of CPUs.
    | 
    | That seems like a pretty questionable statement. Intel might
    | keep AVX-512 around for Xeon, but it seems extremely dead on
    | the consumer market. If Intel decides to bring it back for
    | the next generation, that would be strange and very poor
    | planning.
 
      | gpderetta wrote:
      | It seems likely that the reason is that some Intel customers
      | are willing to pay a significant premium for the feature,
      | and Intel doesn't want it to be available for cheap.
  | nomel wrote:
  | Well, if it means more cores, it's almost certainly worth it,
  | in the grand scheme of things.
 
| bfrog wrote:
| At what point is JSON not the right option? Surely when trying to
| do this sort of thing?
| 
| At what point is it saner to use something like flatbuffers or
| capnproto-style message encoding instead?
 
  | smabie wrote:
  | Often you do not get the choice of whether you want to be
  | parsing JSON or not.
 
  | vardump wrote:
  | Sometimes you just don't have a choice when you need to
  | interface with a third party data feed or software.
  | 
  | Isn't it better to have all options open?
 
  | avg_dev wrote:
  | Good thought. If you are coding in C++, maybe you can use some
  | sort of binary serialization thing. Even in other languages,
  | if JSON parsing is a bottleneck, it can possibly be optimized
  | away through use of a binary wire format. That said, vector
  | operations being available to programmers is always a welcome
  | thing, I'd say. And who knows how much production JSON parsing
  | this library really does; it could be a ton.
  | 
  | I'm torn. I've worked at shops where we aim over time to
  | reduce response time while serving business logic and using
  | statistical models that get iterated on. Even there I haven't
  | seen a blatant need for non-JSON RPC. But I know my experience
  | doesn't mirror everyone's. And I like seeing and learning
  | about instruction sets. I'm currently taking a course in
  | parallel computing, and I just used AVX2 for the first time in
  | a toy program to subtract one vector from another in a single
  | instruction - which, while not particularly useful, is a
  | window into more interesting things and is still SIMD.
  | 
  | I think that, on the whole, making JSON parsing faster for a
  | large enough fraction of processors is probably a huge win for
  | the environment. But who is parsing JSON in C++?
 
    | ollien wrote:
    | > But who is parsing json in C++?
    | 
    | Well, Facebook for one! Folly has lots of utilities for this
    | (see folly::dynamic[1]). We make extensive use of this at my
    | (non-Facebook) job.
    | 
    | [1] https://github.com/facebook/folly/blob/master/folly/docs/
    | Dyn...
 
| timerol wrote:
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support and, evidently, you also
| need relatively recent C++ processors. Some of the recent laptop-
| class Intel processors do not support AVX-512 but you should be
| fine if you rely on AWS and have big Intel nodes.
| 
| What is meant by "relatively recent C++ processors"? Is that
| supposed to be "compilers"?
 
  | [deleted]
 
  | Narishma wrote:
  | It's supposed to be Intel, not C++.
 
| NegativeLatency wrote:
| "new" is relative since they've been out for almost 10 years:
| https://www.intel.com/content/www/us/en/developer/articles/t...
 
  | jeffbee wrote:
  | This code uses VBMI2, which came out quite recently.
 
___________________________________________________________________
(page generated 2022-05-26 23:00 UTC)