
[HN Gopher] New SiFive RISC-V core P650 with 40% IPC increase
___________________________________________________________________
Author : FullyFunctional
Score  : 131 points
Date   : 2021-12-02 16:21 UTC (6 hours ago)
	web link (www.sifive.com)
| snvzz wrote:
| Some context: RISC-V Summit is next week, and RISC-V International has just approved a batch of important extensions[0]. With these extensions, RISC-V is not missing anything relative to the ARM and x86 ISAs in terms of functionality.
|
| I expect a lot of tape-outs to happen this month, as core vendors were probably holding off until the announced ratifications, for fear of last-minute changes. Next year is going to be exciting.
|
| [0]: https://riscv.org/announcements/2021/12/riscv-ratifies-15-ne...
|
| [deleted]
|
| socialdemocrat wrote:
| That is great news! Is there any friendly intro/coverage anywhere of the new vector extension?
|
| I am curious about the final design. It would be interesting to hear how people think it compares with ARM's Scalable Vector Extension.
|
| snvzz wrote:
| There have been a few talks on the topic. They're archived on e.g. YouTube.
|
| I like it. It's fairly simple and clean, yet powerful.
|
| There was also some discussion here on HN months ago, about an article comparing the RISC-V V extension and ARM SVE.
|
| The article itself got several things wrong about V, but the discussion[0] was interesting.
|
| [0] https://news.ycombinator.com/item?id=27063748
|
| [deleted]
|
| monocasa wrote:
| I wouldn't say RISC-V isn't missing anything. The lack of add/subtract with carry is an issue for efficient runtime of many JITed languages like JavaScript.
|
| That being said, I don't think it's the worst thing in the world like some do. The focus now should be on compiled code, since JITs by definition can make runtime decisions on whether some future extension that fixes this deficiency exists or not. The J extension has stalled for the moment, but with these other extensions ratified there should hopefully be more bandwidth available.
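
For readers unfamiliar with the carry complaint: RISC-V has no flags register, so a multi-word add has to recompute the carry with an unsigned compare. The following is a minimal sketch in Python, modeling 64-bit registers by masking. The function names are hypothetical, and the instruction sequence in the comments is one common way to write the pattern, not something quoted from the thread or from any particular JIT.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register by masking


def add_with_carry(a, b, carry_in):
    """Return (sum, carry_out) for 64-bit a + b + carry_in, using only
    wrapping adds and unsigned compares (no flags register)."""
    s = (a + b) & MASK64             # add   t0, a, b
    c1 = 1 if s < a else 0           # sltu  t1, t0, a    (wrapped => carry)
    t = (s + carry_in) & MASK64      # add   t0, t0, cin
    c2 = 1 if t < carry_in else 0    # sltu  t2, t0, cin  (wrapped => carry)
    return t, c1 | c2                # or    cout, t1, t2


def add_multiword(xs, ys):
    """Add two equal-length little-endian lists of 64-bit limbs."""
    out, carry = [], 0
    for x, y in zip(xs, ys):
        limb, carry = add_with_carry(x, y, carry)
        out.append(limb)
    return out, carry
```

Calling `add_multiword([lo_a, hi_a], [lo_b, hi_b])` gives a 128-bit add. On an ISA with a carry flag, each limb after the first would typically be a single add-with-carry instruction, which is the gap being discussed here.
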
|
| teruakohatu wrote:
| Can't vendors making desktop/mobile-class CPUs detect the equivalent pattern and optimize it in microcode or silicon?
|
| Or is that what we are trying to get away from?
|
| monocasa wrote:
| Maybe, but it's a leap, IMO. The equivalent patterns are 3x as long, and modify tons of arch-visible state for their intermediate results, which leaves more work for those combined instructions to do.
|
| The complaint is valid, IMO, and would show up on the filtration test they used to come up with ops if they were working with JITs too rather than just what's in AOT code.
|
| socialdemocrat wrote:
| Anyone able to put this in context? How fast are these cores compared to various ARM, Intel and AMD cores? At what level can they compete?
|
| sanxiyn wrote:
| > With a projected score of 11+ SPECInt2006/GHz, the SiFive Performance P650 brings RISC-V into a new category of high-end computing applications.
|
| 11+ SPECInt2006/GHz is comparable to the Apple Icestorm microarchitecture. The Apple Firestorm microarchitecture is roughly 2x better at 22 SPECInt2006/GHz.
|
| Symmetry wrote:
| How impressive that number is rather depends on how many GHz they're managing. In general, the slower you design your clock to run, the faster you can make all your caches. Plus, the slower you clock your core, designed in or not, the lower the number of clock cycles it takes to talk to main memory.
|
| pantalaimon wrote:
| Mind you that raw core performance is not everything; memory bandwidth and caches are crucial to make sure the CPU isn't waiting for data all the time.
|
| sanxiyn wrote:
| Yes, but SPECint includes all such effects. As long as SPECint benchmarks (such as GCC) are representative of your workload, it works fine.
|
| tlb wrote:
| I trust that the Apple benchmarks include all such effects. I'm less convinced that the RISC-V "projections" include them.
| SPECint2006 is supposed to be measured with real memory and an OS. Per-GHz numbers can't accurately reflect main memory latency, since its speed doesn't scale with the CPU clock.
|
| spear wrote:
| Right, and "per GHz" numbers are also not very useful because you can't just crank up the GHz when you need performance. Even with the same process technology, you can't assume different microarchitectures will max out at the same frequency.
|
| sebow wrote:
| If I recall correctly, the SiFive Unmatched is still pretty slow compared to ARM (https://www.phoronix.com/scan.php?page=article&item=hifive-u...). Now, this board is not the one in question (P650), but we'll have to watch for upcoming benchmarks (for which I recommend Phoronix).
|
| Obviously you can't even think about comparing it further with Intel & AMD, but when you look at the history of something like ARM (which I believe is 30-40 years old), RISC-V has come a long way pretty fast, and the good thing is that it's a solid choice for the future due to being open.
|
| sebow wrote:
| Sweet. Are there any resources on transitioning/migrating, or on the differences between x86_64 and RISC-V? Or are the ISAs so drastically different that it's just better to dive in head-first?
|
| bruce343434 wrote:
| > With a projected score of 11+ SPECInt2006/GHz
|
| That seems to imply a certain integer arithmetic performance, but I wonder what the floating point performance is. They could have just said "X flops".
|
| Comparing to other benchmarks at [1], I have no idea, because they all have denormalized results, so totals, rather than per GHz per core. Nice reporting.
|
| How fast is this thing? Pentium? First-gen i3? Current-gen Ryzen 5? The fact that they are being so obtuse about it leads me to believe performance isn't great.
|
| [1] https://www.spec.org/cgi-bin/osgresults?conf=cint2006;op=dum...
|
| wmf wrote:
| I'd compare it to an Atom "efficiency" core.
|
| marcodiego wrote:
| Faster than ARM A-77: https://www.phoronix.net/image.php?id=2021&image=sifive_p650... . Performance comparable to the Apple Icestorm architecture, the 'efficiency' cores in M1. Considering A-710 is the fastest ARM core currently available and its successor will only be available next year, SiFive is just a few years before real competition starts in an arena currently dominated by ARM.
|
| This will be beautiful to watch.
|
| [deleted]
|
| zozbot234 wrote:
| It will be interesting to see a comparison on power efficiency as well as performance. RISC-V implementations have shown a pretty sizeable advantage wrt. power use in the past, and we don't quite know how this advantage holds up in these larger, performance-focused designs.
|
| dmitrygr wrote:
| > just a few years before real competition starts
|
| Are you assuming the competition will just sit and do nothing?
|
| GhettoComputers wrote:
| "Good enough" matters more than benchmarks. They can make supercomputers, but it doesn't matter to someone who wants a $100 computer.
|
| dmitrygr wrote:
| All RISC-V thingies I see today are decidedly not $100. I do see plenty of ARM designs running Linux under $10, though.
|
| baybal2 wrote:
| This is something genuinely interesting from the RISC-V crowd for the first time.
|
| danielEM wrote:
| Once it gets to the shelves at a reasonable price, I will be happy to work with/on it.
|
| Curious how IP pricing compares to ARM in this case, and how much I would need to put on top of it to tape out my own batch of processors.
|
| snvzz wrote:
| The license to the ISA itself is free.
|
| There are several vendors besides RISC-V offering cores for licensing. There are even some OSHW cores that can be freely used.
|
| Even if we choose to ignore the technical prowess of being a true 5th-generation RISC ISA built with hindsight no other ISA has, what's IMHO a big deal in RISC-V is the mere availability of this market of cores.
|
| It poses a threat to ARM's business model, where ARM licenses cores and ISA, but nobody other than ARM can license cores to others.
|
| Teknoman117 wrote:
| As far as OSHW cores go, it's so very nice to be able to throw something together in Verilog and be able to inherit a compiler and not be trampling on someone else's copyright...
|
| dmitrygr wrote:
| > built with hindsight no other ISA has
|
| Why do all the RISC-V fans conveniently ignore aarch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.
|
| FullyFunctional wrote:
| I'm a fan of RISC-V, but the freedom is a large part of it. Aarch64 _is_ a very well designed ISA and _clearly_ has a lot of benefit of hindsight. The load pair/store pair instructions, the addressing modes, fixed 32-bit instruction size, etc. It all really helps. I suspect that Apple was actively part of designing it.
|
| I think, however, that RISC-V isn't that much worse, and because of the freedom we will almost certainly see more implementations of RISC-V. I'd be watching Tenstorrent, SiFive, Rivos, Esperanto, and maybe Alibaba/T-Head.
|
| brucehoult wrote:
| Aarch64 obviously _isn't_ a completely clean-sheet design. It was constrained by having to execute on the same CPU pipelines as 32 bit code, at least for the first decade or so. And the 32 bit mode has to perform well. There are tens of millions of Raspberry Pi 3s and 4s (and later model Pi 2s) which have 64 bit CPUs but have never seen a 64 bit instruction in their lives. Android phones have been supporting both 32 and 64 bit apps for a long time.
|
| The "by people who know what they are doing" thing is just pure FUD. Sure, ARM employs some competent people, but no more so than IBM, Intel, AMD or the various members of RISC-V International.
|
| snvzz wrote:
| > Why do all the RISC-V fans conveniently ignore aarch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.
|
| aarch64 seems poorly designed to me.
|
| ARMv7 had Thumb, but for some reason ARMv8 did not incorporate any lessons from that. As a result, code density is bad; ARMv8 binaries are huge.
|
| ARMv9, to be available in chips next year, is just a higher profile of required extensions, and does nothing to fix that.
|
| Ever wonder why M1 needs such huge L1 cache? Well, now you know.
|
| Considering ARMv9 will be competing against RVA22, I don't have much hope for ARM.
|
| dmitrygr wrote:
| > for some reason ARMv8 did not incorporate any lessons from that.
|
| I used to think so too, until I asked some more knowledgeable people about it. Turns out the lesson _IS_ that not having it is better. Fixed-size instructions make decoding significantly simpler, making it much easier to make very wide front ends.
|
| brucehoult wrote:
| A little easier, not much easier. A number of organisations are making very wide RISC-V implementations, and one has already published how their decoder works. It's modular, with each block looking at 48 bits of code (the first 16 overlapping with the previous block) and decoding either two 16 bit instructions, or one aligned 32 bit instruction, or one misaligned 32 bit instruction with a following 16 bit instruction, or one misaligned 32 bit instruction followed by an ignored start of another misaligned 32 bit instruction.
|
| You can put as many of these modules side by side as you want.
| There is a serial dependency between them in that each block has to tell the next block whether its last 16 bits are the start of a misaligned 32 bit instruction or not. That could become an issue with really really wide decoders, but for something decoding e.g. 16 bytes at a time (4 to 8 instructions) it's not an issue.
|
| There is a trade-off between a little bit of decoder complexity and a lot of improved code density -- but nowhere near to the same extent as say x86.
|
| adrian_b wrote:
| ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V.
|
| RISC-V has only one good feature for code density, the combined compare-and-branch instructions, but even this feature was designed poorly, because it does not have all the kinds of compare-and-branch that are needed, e.g. if you want safe code that checks for overflows, the number of required instructions and the code size explode. Only unsafe code, without run-time checks, can have an acceptable size in RISC-V.
|
| ARMv8 has adequate unused space in the branch opcode map, where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V, in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
|
| While the combined compare-and-branch instructions of RISC-V are good for code density, because branches are very frequent, the rest of the ISA is bad and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
|
| brucehoult wrote:
| I'm not sure how you missed RISC-V's big feature for code density -- the "C" extension, giving it arbitrarily mixed 16 and 32 bit opcodes.
|
| I've heard of that feature before somewhere else.
| It gave the company that invented it unparalleled code density in their 32 bit systems and propelled them to the heights of success in mobile devices. What was their name? Wait .. oh, yes ... ARM.
|
| Why they forgot this in their 64 bit ISA is a mystery. The best theory I can come up with is that they thought the industry had shaken out and amd64 was the only competition they were going to have, ever. Aarch64 does indeed have very good code density for a fixed-length 32 bit opcode ISA, and comes very close to matching amd64. They may have thought that was going to be good enough.
|
| Note: the RISC-V "C" extension is technically optional, but the only CPU cores I know of that don't implement it are academic toys, student projects, and tiny cores for use in FPGAs where they are running programs with only a few hundred instructions in them. Once you get over even maybe 1 KB of code it's cheaper in resources to implement "C" than to provide more program storage.
|
| zozbot234 wrote:
| The thing with lack of shifted indexed addressing is that it just might not matter all that much beyond toy examples. Address calculations can generally be folded in with other code, particularly in loops which are a common case. So it's only rarely that you actually need those extra instructions.
|
| adrian_b wrote:
| Shifted indexed addressing is needed more seldom, but indexed addressing, i.e. register + register, is needed in every loop that accesses memory.
|
| There are 2 ways of programming a loop that addresses memory with a minimum of instructions.
|
| One way, which is preferable e.g. on Intel/AMD, is to reuse the loop counter as the index into the data structure that is accessed, so each load/store needs a base register + index register addressing, which is missing in RISC-V.
|
| The second way, which is preferable e.g.
| on POWER and which is also available on ARM, is to use an addressing mode with auto-update, where the offset used in loads or stores is added into the base register. This is also missing in RISC-V.
|
| Because neither of the 2 methods works in RISC-V with a minimum number of instructions, like in all other CPUs, all such loops, which are very frequent, need pairs of instructions in RISC-V, corresponding to single instructions in the other CPUs.
|
| brucehoult wrote:
| A big difference here is that the RISC-V instructions are usually all 16 bits in size while the Aarch64 and POWER instructions are all 32 bits in size. So the code size is the same.
|
| Also, high performance Aarch64 and POWER implementations are likely to be splitting those instructions into two decoupled uops in the back end.
|
| Performance-critical loops are unrolled on all ISAs to minimise loop control overhead and also to allow scheduling instructions to allow for the several-cycle latency of loads from even L1 cache. When you do that, indexed addressing and auto-update addressing are still doing both operations for every load or store which, as well as being a lot of operations, introduces sequential dependency between the instructions. The RISC-V way allows the use of simple load/store with offset -- all of which are independent of each other -- with one merged update of each pointer at the end of the loop. POWER and Aarch64 compilers for high performance microarchitectures use the RISC-V structure for unrolled loops anyway.
|
| So indexed addressing and auto-update addressing give no advantage for code size, and don't help performance at the high end.
|
| snvzz wrote:
| > in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
|
| Many things could be said about ARMv8, but that it has good code size is not one of them. It does, in fact, have abysmal code density.
| Both RISC-V and x86-64 produce significantly smaller binaries. For RISC-V, we're talking about a 20% reduction of size.
|
| There's a wealth of papers on this, but you can verify this trivially yourself, by either compiling binaries for different architectures from the same sources, or comparing binaries in Linux distributions that support RISC-V and ARM.
|
| > where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V
|
| If your argument is that ARMv8 could get better over time, I hate to be the bearer of bad news. ARMv9 code density isn't any better.
|
| > and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
|
| These patterns are standardized, and they become one instruction after fusion.
|
| RISC-V, unlike the previous generation of ISAs, was thoroughly designed with hindsight on fusion. The simplest microarchitectures can of course omit it altogether, but the cost of fusion in RISC-V is low; I have seen it quoted at 400 gates.
|
| brucehoult wrote:
| Instruction fusion is a possibility for the future, which has been discussed academically, but no one implements it at present. I'm not sure anyone will -- it's too much complexity for simple cores, and not needed for big OoO cores.
|
| The one fusion implementation I'm aware of is the SiFive 7-series combining a conditional branch that jumps forward over exactly one instruction. It turns the instruction pair into predicated execution.
|
| I agree with everything else. In particular the code density. Anyone can download Ubuntu or Fedora images for the same release for amd64, arm64, and riscv64. Mount them and run "size" on any selection of binaries you want. The RISC-V ones are consistently and significantly smaller than the other two, with arm64 the biggest.
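
The "mount the images and run size" check can be scripted. Below is a rough sketch in Python, assuming the default Berkeley-style output of GNU `size` (columns `text data bss dec hex filename`). The helper names are hypothetical, and it only totals the text column, so it says nothing about data or bss.

```python
import os


def parse_size_output(text):
    """Map binary basename -> text-segment bytes from Berkeley-format
    `size` output (assumed columns: text data bss dec hex filename)."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split(None, 5)
        # Data rows have six columns and start with a number; the header
        # row starts with the word "text", so isdigit() skips it.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        sizes[os.path.basename(fields[5].strip())] = int(fields[0])
    return sizes


def text_size_ratio(baseline, other):
    """Total text bytes of `other` divided by `baseline`, counting only
    binaries whose basename appears in both maps."""
    common = baseline.keys() & other.keys()
    base_total = sum(baseline[name] for name in common)
    if base_total == 0:
        raise ValueError("no comparable binaries")
    return sum(other[name] for name in common) / base_total
```

Feed each function call the captured output of `size` run over the same set of binaries in two mounted images; a ratio below 1.0 means the second image's code is smaller than the first's for the binaries they share.
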
|
| pohl wrote:
| _Ever wonder why M1 needs such huge L1 cache? Well, now you know._
|
| I'm not sure I follow this, but it reminds me to ask: does RISC-V allow for designs to have both efficiency & performance cores like the ARM big.LITTLE concept? Has anyone made one yet?
|
| brucehoult wrote:
| Of course you can do it. SiFive has been allowing customers to configure core complexes with a mixture of different core types for years -- for example mixing U84 cores with U74 or U54. If you want to do a big.LITTLE thing with transferring a running program from one core type to another, that's just a software thing -- and using cores with the same ISA but different microarchitecture.
|
| To date the examples of this that have been shipped to the public have used cores with similar microarchitecture, but a different set of extensions.
|
| For example the U54-MC in the HiFive Unleashed and in the Microsemi PolarFire SoC FPGAs use four U54 cores plus one E51 core for "real time" tasks. The E51 doesn't have an FPU or MMU or Supervisor mode. The U74-MC in the HiFive Unmatched is similar.
|
| Alibaba's ICE SoC, which you may have seen videos of running Android, has two C910 Out-of-Order cores (similar to ARM A72/A73) implementing RV64GC, and a third C910 core that also has a vector processing unit with two pipes with 256 bit vector ALU each, plus 128 bit vector load and store pipes.
|
| [deleted]
|
| fartcannon wrote:
| So I guess we should expect to hear a lot of FUD about RISC-V over the coming years.
|
| marcodiego wrote:
| No need to wait. Already happened in 2018: https://www.theregister.com/2018/07/10/arm_riscv_website/
|
| https://www.extremetech.com/wp-content/uploads/2018/07/arm-r...
|
| snvzz wrote:
| And it is how many learned about RISC-V's existence.
|
| It will be a PR disaster long remembered. One for the textbooks.
|
| snvzz wrote:
| This is a real possibility, albeit a sad one.
|
| No amount of FUD will save ARM. Only pivoting into a different business model could.
|
| duskwuff wrote:
| Honestly, ARM is fine. They're no longer the only game in town, but they've still got a huge head start.
|
| snvzz wrote:
| They'll be fine if they focus on their microarchitectures rather than the ISA (where IMHO they've already lost), and make the process for obtaining a license much more streamlined; I've heard it takes no less than 18 months of long negotiations to license anything from ARM. That's not sustainable now that there's competition.
|
| duskwuff wrote:
| That's already where their focus is. Most of ARM's customers are licensing specific cores from ARM, not the ISA as a whole.
|
| jaas wrote:
| Who exactly are the customers for this chip?
___________________________________________________________________
(page generated 2021-12-02 23:01 UTC)