|
| worewood wrote:
| Using specialized instructions doesn't always turn into performance
| improvements. Processors are pretty smart these days and the
| generated u-ops may be the same.
| skavi wrote:
| Hopefully we'll see AVX-512 in Intel's little cores soon.
| Centaur's last CPU architecture proves that it is possible to
| implement the extension without a huge amount of area [0]. Once
| that happens, I expect we'll finally consistently see AVX-512 on
| new Intel processors. The masks really are a huge improvement to
| the design.
|
| AMD should be implementing AVX-512 on their own cores soon as
| well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much
| be in a golden age of SIMD.
|
| [0]: https://chipsandcheese.com/2022/04/30/examining-centaur-
| chas...
| torginus wrote:
| I'm kinda torn on AVX-512 (and SIMD in general). On one hand,
| AVX-512 finally introduced a sane programming model with mask
| registers for branching code, which makes the lives of
| compilers much easier.
|
| On the other hand, the tooling for turning high-level languages
| into SIMD code is not there yet; ISPC refuses to support ARM and
| is still kind of a novelty tool.
|
| Additionally, 512-bit wide vectors are just too big - the
| resulting vector units take up too much die space even on _big_
| cores, and the power consumption causes issues that force said
| dies to downclock. It probably won't be viable on small cores.
| dr_zoidberg wrote:
| > Additionally, 512-bit wide vectors are just too big - the
| resulting vector units take up too much die space even on big
| cores, and the power consumption causes issues causing said
| dies to downclock.
|
| This is no longer true, citing [0]:
|
| > At least, it means we need to adjust our mental model of
| the frequency related cost of AVX-512 instructions. Rather
| than the prior-generation verdict of "AVX-512 generally
| causes significant downclocking", on these Ice Lake and
| Rocket Lake client chips we can say that AVX-512 causes
| insignificant (usually, none at all) license-based
| downclocking and I expect this to be true on other ICL and
| RKL client chips as well.
|
| And we still have to see AMD's implementation of AVX-512 on
| Zen 4 to know what behavior and limits it may have (if any).
|
| [0] https://travisdowns.github.io/blog/2020/08/19/icl-
| avx512-fre...
| jeffbee wrote:
| Considering that the execution units, register file, etc that
| support AVX-512 are themselves nearly as large as the entire
| Gracemont core ... don't hold your breath.
| brigade wrote:
| You don't need anything larger than the 128-bit ALUs or the
| 207x128-bit register file Gracemont already has to implement
| AVX-512. It doesn't make sense on its own with that backend,
| but for ISA compatibility with a big core it does.
| Dylan16807 wrote:
| Can the shuffling instructions be reasonably efficient with
| a small ALU?
| brigade wrote:
| Depends on what you consider reasonable. Worst case is
| 512-bit vpermi2*, which could be implemented with 16x
| 128-bit vpermi2-like uops, if the needed masking was
| implicit.
|
| Which to me is reasonable for ISA compatibility. (Also
| considering that having to deal with ISA incompatibility
| across active cores is _not_ reasonable at all.)
| jeffbee wrote:
| I'm not sure that users would accept that. You could have a
| situation where an ifunc is resolved on a fast core with a
| slightly superior AVX-512 definition, but then the thread
| migrates to an efficiency core and the AVX-512 definition
| is dramatically slower than what could have been achieved
| with AVX2 (e.g. if a microcoded permute was 16x slower).
| brigade wrote:
| Most reasonable would be a hypothetical AVX-256 that was
| AVX-512VL minus ZMM registers. Intel chose against that.
|
| So the only reasonable options for a big little system
| are to not have little cores, or for nothing to support
| AVX-512, or for the little cores to support AVX-512 as
| best they can. Then thread director can weight AVX-512
| usage even heavier than it already weights AVX2.
| dragontamer wrote:
| > we'll pretty much be in a golden age of SIMD.
|
| We already are in the golden age of SIMD. NVidia and AMD GPUs
| are easier and easier to program through standard interfaces.
|
| Intel / AMD are pushing SIMD on a CPU, which is useful for
| sure, but is always going to be smaller in scope than a
| dedicated SIMD processor like the A100, 3060, AMD Vega, AMD
| 6800 XT, and the like.
|
| SIMD-on-a-CPU is useful because you can perform SIMD over the
| L1 cache as communication (rather than traversing L1 -> L2 ->
| L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and
| back). But if you have a large-scale operation that can work
| SIMD, the GPU-traversal absolutely works and is commonly done.
| skavi wrote:
| Good point. Should have clarified I was referring to CPU
| SIMD.
| dragontamer wrote:
| AVX2 is not as good as AVX512. But AVX2 still has vgather
| instructions, pshufb, and a few other useful tricks.
|
| AVX512 and ARM SVE2 bring the CPU up to parity with maybe
| 2010s-era GPUs or so (full gather/scatter, more permutation
| instructions, etc. etc.). But GPUs continued to evolve.
| Butterfly shuffles are the generic any-to-any network
| building block, and are exposed in PTX (NVidia's assembly) as
| shfl.bfly, and in AMD's DPP (data-parallel primitives).
|
| Having a richer set of lane-to-lane shuffling (especially
| ready-to-use butterfly networks) would be best. It really
| is surprising how many problems require those rich-sets of
| data-movement instructions, or otherwise benefit from them.
|
| NEON and SVE had hard-coded data-movement for specific
| applications. The general-purpose instruction (pshufb) is
| kinda like permute/shfl from AMD/NVidia. A backwards-
| permute IIRC doesn't exist yet on CPU-side.
|
| And butterfly networks are the general-purpose solution,
| capable of implementing any arbitrary data-movement in just
| log(width) steps. (pshufb / permute instructions would be
| the full-sized butterfly network, but some cases might be
| "easier" and faster to execute with only a limited number
| of butterfly swaps, such as what inevitably comes up in
| sorting)
|
| --------
|
| Still, all of these operations can be implemented in AVX2
| (albeit more slowly / less efficiently). So it's not like the
| "language" of AVX2 / AVX is incomplete... it's just missing
| a few general-purpose instructions that could lead to
| better performance.
| PragmaticPulp wrote:
| > Could we do better? Assuredly. There are many AVX-512
| instructions that we are not using yet. We do not use ternary
| Boolean operations
| (vpternlog). We are not using the new powerful shuffle functions
| (e.g., vpermt2b). We have an example of coevolution: better
| hardware requires new software which, in turn, makes the hardware
| shine.
|
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support
|
| AVX-512 support can be confusing because it's often referred to
| as a single instruction set.
|
| AVX-512 is actually a large family of instructions that have
| different availability depending on the CPU. It's not enough to
| say that a CPU has AVX-512 because it's not a binary question.
| You have to know _which_ AVX-512 instructions are supported on a
| particular CPU.
|
| Wikipedia has a partial chart of AVX-512 support by CPU:
| https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
|
| Note that some instructions that are available in one generation
| of CPU can actually be unavailable (superseded, usually) in
| the next generation of the CPU. If you go deep enough into
| AVX-512 optimization, you essentially end up targeting a specific
| CPU for the code. This is not a big deal if you're deploying
| software to 10,000 carefully controlled cloud servers with known
| specifications, but it makes general use and especially consumer
| use much harder.
| robocat wrote:
| To add, they are using[2] the relatively recent VBMI2
| instructions of AVX-512. This article[1] talks about the
| advantages of VBMI on Ice Lake, released in 2021.
|
| [1] https://www.singlestore.com/blog/a-programmers-perspective/
| comments https://news.ycombinator.com/item?id=28179111
|
| [2] https://news.ycombinator.com/item?id=31522464
| mikepurvis wrote:
| Are there good libraries for doing runtime feature detection?
| Eg, include three versions of hot function X in the binary, and
| have it seamlessly insert the correct function pointer at
| startup? Or have the function contain multiple bodies and just
| JMP to the correct block of code?
|
| I know you can do this yourself, but last time I looked it was
| a heavily manual process-- you had to basically define a plugin
| interface and dynamically load your selected implementation
| from a separate shared object. What are the barriers to having
| compilers able to be hinted into transparently generating
| multiple versions of key functions?
| bremac wrote:
| I'm unsure about library support, but gcc and clang support
| function multi-versioning (FMV), which resolves the function
| based on CPUID the first time the function is called.
|
| This LWN article has some additional information:
| https://lwn.net/Articles/691932/
| mikepurvis wrote:
| TIL! I guess it makes sense that popular numeric libraries
| like BLAS, Eigen, and so-on would take advantage of this,
| but I wonder how widely used it is overall.
| loeg wrote:
| GCC has offered Function Multiversioning for about a decade
| now (GCC ~4.8 or 4.9). GCC 6's resolver apparently uses CPUID
| to resolve the ifunc once at program start:
| https://lwn.net/Articles/691932/ .
|
| Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tool
| s/clang/docs/AttributeRe...
|
| A nice presentation on it:
| https://llvm.org/devmtg/2014-10/Slides/Christopher-
| Function%...
| indygreg2 wrote:
| While this IFUNC feature does exist and it is useful, when
| I performed binary analysis on every package in Ubuntu in
| January, I found that only ~11 distinct packages have
| IFUNCs. It certainly looks like this ELF feature is not
| really used much [in open source] outside of GNU toolchain-
| level software!
|
| https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-
| linux-...
| TkTech wrote:
| I've wanted to use them many times in the past, but the
| limited support on other compilers (looking at you MSVC)
| always made it a non-starter. If I have to support some
| other method of feature detection anyways, there's no
| point.
| colejohnson66 wrote:
| Check out Agner Fog's vectorclass library:
| https://github.com/vectorclass/version2
| beached_whale wrote:
| If only Intel wasn't dropping support for it on a lot of CPUs
| jrimbault wrote:
| I'm using this comment as jumping point.
|
| What's the cost/opportunity of optimizing for a specific
| platform/instruction set? At what point is it worth doing, and
| when isn't it worth doing? AVX-512 strikes me as something...
| "ephemeral".
| skavi wrote:
| Writing code for SIMD can get you absolutely massive
| performance improvements. Whether it's worth the added
| complexity depends on the situation. If your data is already
| arranged in a cache friendly way (SoA), it shouldn't be
| incredibly difficult to use SIMD intrinsics to optimize. I'd
| first take a look at what the compiler is already generating
| for you to see if manual intervention is worth it.
| throwaway92394 wrote:
| Well, I mean, this article is demoing a 28% improvement (if I
| did my math right) for json parsing.
|
| Sure, AVX-512 is only applicable to specific workloads, and
| even for many of those workloads the cost/opportunity of
| optimizing for AVX-512 might not be worth it. But there
| clearly ARE usecases that would benefit, and it might be
| worth it for more consumer applications to optimize for
| AVX-512 - but only if it can be used.
|
| The way I see it is that the benefit of optimizing for
| AVX-512 is far higher if it becomes normal for consumer CPUs
| to have it. A 28% improvement is pretty decent, but it's only
| worth implementing if enough people can utilize it.
| beached_whale wrote:
| For many maybe not, but when writing the foundations of
| software it is good to start fast. There are libraries that
| abstract various SIMD architectures now too. Simdjson has
| its own, and there are ones like KUMI.
| jcranmer wrote:
| Intel is not dropping support for it on a lot of CPUs.
|
| The only thing they've done is disable it in the hybrid Alder
| Lake cores, presumably because the E-cores couldn't support it
| (while the P-cores could), and they didn't want to deal with
| the headaches of ISA extensions being supported only on some
| cores in the system.
| Aardwolf wrote:
| > Intel is not dropping support for it on a lot of CPUs.
|
| There are zero current-generation consumer CPUs from either
| Intel or AMD that have it
|
| > The only thing they've done is disable it in the hybrid
| Alder Lake cores
|
| Which happen to be _all_ the current generation Intel CPUs
| beached_whale wrote:
| Ah, headlines foiled me. I read it as disabled on Alder Lake
| altogether.
| temac wrote:
| It is disabled in all consumer Alder lake (and I don't
| remember if there will be Xeon of that gen with P-core only
| -- IIRC Intel stop the AVX512 validation late on those
| cores, but it was still before it was formally finished, so
| probably not). At one point it worked with some Bios on
| P-core only chips or if you disabled E-core on hybrid ones,
| but with up-to-date Intel microcode it does not work
| anymore.
| coder543 wrote:
| > The only thing they've done is disable it in the hybrid
| Alder Lake cores
|
| That is incorrect. You can buy Alder Lake CPUs that only have
| one type of core (the i3 series only has P-cores, for
| example), and those do not support AVX-512 either. They're
| not "hybrid" in any way.
|
| Some of their motherboard partners initially allowed you to
| access AVX-512, but Intel has put a stop to this and the
| feature is disabled on _all_ Alder Lake CPU SKUs, period.
|
| Newer Alder Lake chips have AVX-512 fused off entirely:
| https://www.tomshardware.com/news/intel-nukes-alder-lake-
| avx...
|
| > Intel is not dropping support for it on a lot of CPUs.
|
| That seems like a pretty questionable statement. Intel might
| keep AVX-512 around for Xeon, but it seems extremely dead on
| the consumer market. If Intel decides to bring it back for
| the next generation, that would be strange and very poor
| planning.
| gpderetta wrote:
| It seems likely that the reason is that some Intel
| customers are willing to pay a significant premium for the
| feature and Intel doesn't want it to be available for cheap
| nomel wrote:
| Well, if it means more cores, it's almost certainly worth it,
| in the grand scheme of things.
| bfrog wrote:
| At what point is JSON not the right option? Surely when trying to
| do this sort of thing?
|
| At what point is it saner to use something like flatbuffers or
| capnproto style message encoding instead?
| smabie wrote:
| Often you do not get the choice of whether you want to be
| parsing json or not.
| vardump wrote:
| Sometimes you just don't have a choice when you need to
| interface with a third party data feed or software.
|
| Isn't it better to have all options open?
| avg_dev wrote:
| Good thought. If you are coding in C++ maybe you can use some
| sort of binary serialization thing. Even in other languages if
| json parsing is a bottleneck it can possibly be optimized away
| through use of a binary wire format. That said, vector
| operations being available to programmers is always a welcome
| thing, I'd say. And who knows how much production json parsing
| this library really does - it could be a ton.
|
| I'm torn. I've worked at shops where we aim over time to reduce
| response time while serving business logic and using
| statistical models that get iterated on. Even there I haven't
| seen a blatant need for non-JSON rpc. But I know my experience
| doesn't mirror everyone's. And I like seeing and learning about
| instruction sets. I'm currently taking a course in parallel
| computing and I just used AVX2 for the first time, in a toy
| program that subtracts one vector from another in a single
| instruction - which, while not particularly useful, is a window
| into more interesting things and is still SIMD.
|
| I think on the whole, making json parsing faster for a large
| enough fraction of processors is probably a huge win for the
| environment. But who is parsing json in C++?
| ollien wrote:
| > But who is parsing json in C++?
|
| Well, Facebook for one! Folly has lots of utilities for this
| (see folly::dynamic[1]). We make extensive use of this at my
| (non-Facebook) job.
|
| [1] https://github.com/facebook/folly/blob/master/folly/docs/
| Dyn...
| timerol wrote:
| > Of course, to get these new benefits, you need recent Intel
| processors with adequate AVX-512 support and, evidently, you also
| need relatively recent C++ processors. Some of the recent laptop-
| class Intel processors do not support AVX-512 but you should be
| fine if you rely on AWS and have big Intel nodes.
|
| What is meant by "relatively recent C++ processors"? Is that
| supposed to be "compilers"?
| [deleted]
| Narishma wrote:
| It's supposed to be Intel, not C++.
| NegativeLatency wrote:
| "new" is relative since they've been out for almost 10 years:
| https://www.intel.com/content/www/us/en/developer/articles/t...
| jeffbee wrote:
| This code uses VBMI2, which just came out quite recently.
___________________________________________________________________
(page generated 2022-05-26 23:00 UTC) |