|
| stefan_ wrote:
| > The Tiled Vertex Buffer is the Parameter Buffer. PB is the
| PowerVR name, TVB is the public Apple name, and PB is still an
| internal Apple name.
|
| Patent lawyers love this one silly trick.
| robert_foss wrote:
| Seeing how Apple licensed the full PowerVR hardware before,
| they probably currently have a license for whatever hardware
| they based their design on.
| kimixa wrote:
| They originally claimed they completely redesigned it and
| announced they were therefore going to drop the PowerVR
| architecture license - that was the reason for the stock
| price crash and Imagination Technologies sale in 2017.
|
| They have since scrubbed the internet of all such claims and
| to this day pay for an architecture license. I think it's
| similar to an ARM architecture license - a license for any
| derived technology and patents, rather than actually being
| given the RTL for PowerVR-designed cores.
|
| I worked at PowerVR during that time (I have Opinions, but
| will try to keep them to myself), and my understanding was
| that Apple hadn't actually taken new PowerVR RTL for a number
| of years and had significant internal redesigns of large
| units (e.g. the shader ISA was rather different from the
| PowerVR designs of the time), but presumably they still use
| enough of the derived tech and ideas that paying the
| architecture license is necessary. This transfer was only one
| way - we never saw anything internal about Apple's designs,
| so reverse engineering efforts like this are still
| interesting.
|
| And as someone who worked on the PowerVR cores (not the Apple
| derivatives), I can assure you that everything discussed in
| the original post is _extremely_ familiar.
| pyb wrote:
| Apple's claim is that they designed it themselves.
| https://en.wikipedia.org/wiki/Talk:Apple_M1#[dubious_%E2%80%...
| gjsman-1000 wrote:
| There's no reason that couldn't be a half-truth - it could
| be a PowerVR with certain components replaced, or even the
| entire GPU replaced but with PowerVR-like commands and
| structure for compatibility reasons. Kind of like how AMD
| designed their own x86 chip despite it being x86 (Intel's
| architecture).
|
| Also, if you read Hector Martin's tweets (he's doing the
| reverse-engineering), Apple replacing the actual logic
| while maintaining the "API" of sorts is not unheard of.
| It's what they do with ARM themselves - using their own ARM
| designs instead of the stock Cortex ones while maintaining
| ARM compatibility.*
|
| *Thus, Apple has a right to the name "Apple Silicon"
| because the chip is designed by Apple, and just happens to
| be ARM-compatible. Other chips from almost everyone else
| use stock ARM designs from ARM themselves. Otherwise, by the
| same logic, we might as well call AMD chips "Intel designs"
| because they're x86.
| quux wrote:
| Didn't Apple have a large or even dominant role in the
| design of the ARM64/AArch64 architecture? I remember
| reading somewhere that they developed ARM64 and
| essentially "gave it" to ARM, who accepted, but nobody
| could understand at the time why a 64-bit extension to
| ARM was needed so urgently, or why some of the details
| of the architecture had been designed the way they were.
| Years later, with Apple Silicon, it all became clear.
| kalleboo wrote:
| The source is a former Apple engineer (now at Nvidia
| apparently)
|
| https://twitter.com/stuntpants/status/1346470705446092811
|
| > _arm64 is the Apple ISA, it was designed to enable
| Apple's microarchitecture plans. There's a reason Apple's
| first 64 bit core (Cyclone) was years ahead of everyone
| else, and it isn't just caches_
|
| > _Arm64 didn't appear out of nowhere, Apple contracted
| ARM to design a new ISA for its purposes. When Apple
| began selling iPhones containing arm64 chips, ARM hadn't
| even finished their own core design to license to
| others._
|
| > _ARM designed a standard that serves its clients and
| gets feedback from them on ISA evolution. In 2010 few
| cared about a 64-bit ARM core. Samsung & Qualcomm, the
| biggest mobile vendors, were certainly caught unaware by
| it when Apple shipped in 2013._
|
| > > _Samsung was the fab, but at that point they were
| already completely out of the design part. They likely
| found out that it was a 64 bit core from the diagnostics
| output. SEC and QCOM were aware of arm64 by then, but
| they hadn't anticipated it entering the mobile market
| that soon._
|
| > _Apple planned to go super-wide with low clocks, highly
| OoO, highly speculative. They needed an ISA to enable
| that, which ARM provided._
|
| > _M1 performance is not so because of the ARM ISA, the
| ARM ISA is so because of Apple core performance plans a
| decade ago._
|
| > > _ARMv8 is not arm64 (AArch64). The advantages over
| arm (AArch32) are huge. Arm is a nightmare of
| dependencies, almost every instruction can affect flow
| control, and must be executed and then dumped if its
| precondition is not met. Arm64 is made for reordering._
| quux wrote:
| Thanks!
| travisgriggs wrote:
| > > M1 performance is not so because of the ARM ISA, the
| ARM ISA is so because of Apple core performance plans a
| decade ago.
|
| This is such an interesting counterpoint to the
| occasional "Just ship it" screed (just one yesterday I
| think?) we see on HN.
|
| I have to say, I find this long-form delivery of tech to
| be enlightening. That kind of foresight has to mean some
| level of technical savviness at high decision-making
| levels, whereas many of us are stuck at companies with
| short-sighted, tech-naive leadership who clamor to just
| ship it so we can start making money and recoup what
| we're losing on these expensive tech-type developers.
| kif wrote:
| I think the "just ship it" method is necessary when
| you're small and starting out. Unless you're well
| funded, you can't afford to do what Apple did.
| pyb wrote:
| I haven't followed the announcements on the CPU side - do Apple
| clearly claim that they designed their own CPU (with an
| ARM instruction set)?
| daneel_w wrote:
| They are one of a handful of companies that hold a
| license allowing them both to customize the reference
| cores and to implement the Arm ISA in their own silicon
| designs. Everyone else's SoCs use the same Arm reference
| designs. Qualcomm also holds such a license, which is why
| their Snapdragon SoCs, like Apple's A- and M-series,
| occupy a performance tier above everything else Arm.
| happycube wrote:
| The _only_ Qualcomm-designed 64-bit mobile core so far
| was the Kryo core in the 820. They then assigned that
| team to server chips (Centriq), then sacked the whole
| team when they felt they needed to cut cash flow to stave
| off Avago/Broadcom. The "Kryo" cores from the 835 on are
| rebadged/adjusted ARM cores.
|
| IMO the Kryo/820 wasn't a _major_ failure; it turned out
| a lot better than the 810, which had A53/A57 cores.
|
| And _then_ they decided they needed a mobile CPU team
| again and bought Nuvia for ~US$1 billion.
| masklinn wrote:
| According to Hector Martin (the project lead of Asahi) in
| previous threads on the subject[0], Apple actually has an
| "architecture+" license which is completely exclusive to
| them, thanks to having literally been at the origins of
| ARM: not only can Apple implement the ISA on completely
| custom silicon rather than license ARM cores, they can
| _customise_ the ISA (as in add instructions, as well as
| opt out of mandatory ISA features).
|
| [0] https://news.ycombinator.com/item?id=29798744
| pyb wrote:
| Such a license is a big clue, but not quite what I was
| enquiring about...
| paulmd wrote:
| To be blunt, you're asking questions that could be
| answered with a quick Google search, and you're coming
| off as a bit of a jerk demanding very specific citations
| with exact wording for basic facts that, again, could be
| settled by looking through the Wikipedia article for
| "Apple silicon" and then bouncing to a specific source.
| People have answered your question and you're brushing
| them off because you want it answered in an exact,
| specific way.
|
| https://en.wikipedia.org/wiki/Apple_silicon
|
| https://www.anandtech.com/show/7335/the-iphone-5s-review/2
|
| > NVIDIA and Samsung, up to this point, have gone the
| processor license route. They take ARM designed cores
| (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate
| them into custom SoCs. In NVIDIA's case the CPU cores are
| paired with NVIDIA's own GPU, while Samsung licenses GPU
| designs from ARM and Imagination Technologies. Apple
| previously leveraged its ARM processor license as well.
| Until last year's A6 SoC, all Apple SoCs leveraged CPU
| cores designed by and licensed from ARM.
|
| > With the A6 SoC however, Apple joined the ranks of
| Qualcomm with leveraging an ARM architecture license. At
| the heart of the A6 were a pair of Apple designed CPU
| cores that implemented the ARMv7-A ISA. I came to know
| these cores by their leaked codename: Swift.
|
| Yes, Apple has been designing and using non-reference
| cores since the A6 era, and was one of the first to the
| table with ARMv8 (Apple engineers claim it was designed
| for them under contract, to their specifications, but
| _this_ part is difficult to verify with anything more
| than citations from individual engineers).
|
| I expect that Apple has said as much in their
| presentations somewhere, but if you're that keen on
| finding such an incredibly specific attribution, then
| knock yourself out. It'll be in an Apple conference
| somewhere, like WWDC. They probably have said "Apple-
| designed silicon" or "custom core" at some point, and
| that would be your citation - but they also sell
| products, not hardware, and they don't _extensively_ talk
| about their architectures since they're not really the
| product, so you probably won't find a deep-dive like
| AnandTech's from Apple directly where they say "we have
| 8-wide decode, 16-deep pipeline... etc." sorts of things.
| [deleted]
| gjsman-1000 wrote:
| Qualcomm did use their own design called _Kryo_ for a
| little while, but is now focusing on cores designed by
| Nuvia, which they recently bought, for future chips.
|
| As for Apple, they've designed their own cores since the
| Apple A6 which used the _Swift_ core. If you go to the
| Wikipedia page, you can actually see the names of their
| core designs, which they improve every year. For the M1
| and A14, they use _Firestorm_ High-Performance Cores and
| _Icestorm_ Efficiency Cores. The A15 uses _Avalanche_ and
| _Blizzard_. If you visit AnandTech, they have deep dives
| on the technical details of many of Apple's core designs
| and how they differ from other core designs, including
| stock ARM.
|
| The Apple A5 and earlier were stock ARM cores, the last
| one they used being Cortex A9.
|
| For this reason, Apple's chips are about as much ARM chips
| as AMD's are Intel chips: technically compatible, but with
| almost completely different implementations. It's also why
| Apple calls it "Apple Silicon" - that isn't just marketing,
| but about as justified as AMD not calling their chips Intel
| derivatives.
| GeekyBear wrote:
| > Qualcomm did use their own design called Kryo for a
| little while
|
| Before that, they had Scorpion and Krait, which were both
| quite successful 32-bit ARM-compatible cores at the time.
|
| Kryo started as an attempt to quickly launch a custom
| 64-bit ARM core, and the attempt failed badly enough that
| Qualcomm abandoned designing their own cores and turned
| to licensing semi-custom cores from ARM instead.
| amaranth wrote:
| Kryo started as custom but flopped in the Snapdragon 820,
| so they moved to a "semi-custom" design; it's unclear how
| different it really is from the stock Cortex designs.
| daneel_w wrote:
| The other-worldly performance-per-watt would be another.
| stephen_g wrote:
| They do, and their microarchitecture is unambiguously,
| hugely different to anything else (some details in [1]).
| The last Apple Silicon chip to use a standard Arm design
| was the A5X, whereas they were using customised PowerVR
| GPUs until I think the A11.
|
| 1. https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
| rjsw wrote:
| > Apple replacing the actual logic while maintaining the
| "API" of sorts is not unheard of.
|
| They did this with ADB: early PowerPC systems contained a
| controller chip with the same API that had been
| implemented in software on the 6502 IOP coprocessor in
| the IIfx/Q900/Q950.
| brian_herman wrote:
| Also lawyers that can keep it in court long enough for a
| redesign.
| tambourine_man wrote:
| Few things are more enjoyable than reading a good bug story, even
| when it's not one's area of expertise. Well done.
| alimov wrote:
| I had the same thought. I really enjoy following along and
| getting a glimpse into the thought process of people working
| through challenges.
| danw1979 wrote:
| Alyssa and the rest of the Asahi team are basically magicians as
| far as I can tell.
|
| What amazing work, and what great writing - it takes an absolute
| graphics layman (me) on a very technical journey that is still
| largely understandable.
| [deleted]
| nicoburns wrote:
| > Why the duplication? I have not yet observed Metal using
| different programs for each.
|
| I'm guessing whoever designed the system wasn't sure whether they
| would ever need to be different, and designed it so that they
| could be. It turned out that they didn't need to be, but it was
| either more work than it was worth to change it (considering that
| simply passing the same parameter twice is trivial), or they
| wanted to leave the flexibility in the system in case it's needed
| in future.
|
| I've definitely had APIs like this in a few places in my code
| before.
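|
| A minimal C sketch of that kind of interface (every name here is
| invented for illustration; the real AGX command structures are
| still being reverse engineered): the hardware leaves room for two
| different programs, and a driver can simply point both slots at
| the same shader.
|
|     #include <stdint.h>
|
|     /* Hypothetical command fields, not the real AGX layout. */
|     struct render_cmd {
|         uint64_t partial_store_program; /* runs when a partial render flushes a tile */
|         uint64_t final_store_program;   /* runs at the end of the render pass        */
|     };
|
|     static void fill_cmd(struct render_cmd *cmd, uint64_t store_shader)
|     {
|         /* Both slots get the same program today, but the split keeps
|            the option open for them to diverge later. */
|         cmd->partial_store_program = store_shader;
|         cmd->final_store_program   = store_shader;
|     }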
| pocak wrote:
| I don't understand why the programs are the same. The partial
| render store program has to write out both the color and the
| depth buffer, while the final render store should only write
| out color and throw away depth.
| kimixa wrote:
| Possibly pixel local storage - I think this can be accessed
| with extended raster order groups and image blocks in Metal.
|
| https://developer.apple.com/documentation/metal/resource_fun...
|
| E.g. in their example in the link above for deferred rendering
| (figure 4), the multiple G-buffers won't actually need to
| leave the on-chip tile buffer - unless there's a partial
| render before the final shading shader is run.
| hansihe wrote:
| Not necessarily, other render passes could need the depth
| data later.
| Someone wrote:
| So it seems it allows for optimization. If you know you
| don't need everything, one of the steps can do less than
| the other.
| johntb86 wrote:
| Most likely that would depend on what storeAction is set to:
| https://developer.apple.com/documentation/metal/mtlrenderpas...
| pocak wrote:
| Right, I had the article's bunny test program on my mind,
| which looks like it has only one pass.
|
| In OpenGL, the driver would have to scan the following
| commands to see if it can discard the depth data. If it
| doesn't see the depth buffer get cleared, it has to be
| conservative and save the data. I assume mobile GPU drivers
| in general do make the effort to do this optimization, as
| the bandwidth savings are significant.
|
| In Vulkan, the application explicitly specifies which
| attachments (i.e. stencil, depth, color buffers) must be
| persisted at the end of a render pass and which need not be.
| So that maps nicely to the "final render flush program".
|
| The quote is about Metal, though, which I'm not familiar
| with, but a sibling comment points out it's similar to
| Vulkan in this aspect.
|
| So that leaves me wondering: did Rosenzweig happen to only
| try Metal apps that always use _MTLStoreAction.store_ in
| passes that overflow the TVB, or is the Metal driver
| skipping a useful optimization, or neither? E.g. because
| the hardware has another control for this?
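|
| For reference, the Vulkan side of that looks roughly like this (a
| minimal sketch using the standard Vulkan C API; whether Metal and
| the AGX hardware map onto it this directly is exactly the open
| question above). Declaring the depth attachment DONT_CARE tells a
| tiler it never has to write depth back to memory in the final
| flush:
|
|     #include <vulkan/vulkan.h>
|
|     /* Color must survive the pass; depth is only needed while
|        rendering, so the final tile flush may simply discard it. */
|     static const VkAttachmentDescription color_att = {
|         .format         = VK_FORMAT_B8G8R8A8_UNORM,
|         .samples        = VK_SAMPLE_COUNT_1_BIT,
|         .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
|         .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,     /* persist color */
|         .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
|         .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
|         .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
|         .finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
|     };
|
|     static const VkAttachmentDescription depth_att = {
|         .format         = VK_FORMAT_D32_SFLOAT,
|         .samples        = VK_SAMPLE_COUNT_1_BIT,
|         .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
|         .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE, /* discard depth */
|         .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
|         .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
|         .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
|         .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
|     };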
| plekter wrote:
| I think multisampling may be the answer.
|
| For a partial render all samples must be written out, but
| for the final one you can resolve (average) them before
| writing out.
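|
| A toy version of that resolve step in C (assuming 4 samples per
| pixel stored contiguously; real hardware does this in fixed
| function, this is just to illustrate the bandwidth difference):
|
|     /* A partial render must spill all four samples per pixel; the
|        final flush can average them down to one value first. */
|     static void resolve_4x(const float *samples, float *resolved,
|                            int num_pixels)
|     {
|         for (int p = 0; p < num_pixels; p++) {
|             const float *s = &samples[p * 4];
|             resolved[p] = (s[0] + s[1] + s[2] + s[3]) * 0.25f;
|         }
|     }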
| [deleted]
| 542458 wrote:
| It's been said more than a few times in the past, but I cannot
| get over just how smart and motivated Alyssa Rosenzweig is -
| she's currently an undergraduate university student, and was
| leading the Panfrost project when she was still in high school!
| Every time I read something she wrote I'm astounded at how
| competent and eloquent she is.
| frostwarrior wrote:
| While I was reading I was already thinking that. I can't
| believe how smart she is and what an awesome developer she is.
| pciexpgpu wrote:
| Undergrad? I thought she was a staff SWE at an OSS company.
| Seriously impressive, and it ought to give anyone imposter
| syndrome.
| gjsman-1000 wrote:
| Well, Alyssa is - she works for Collabora while also being an
| undergrad.
| coverband wrote:
| I was about to post "very impressive", but that seems a huge
| understatement after finding out she's still in school...
| [deleted]
| aero-glide2 wrote:
| Have to admit, whenever I see people much younger than me do
| great things I get very depressed.
| kif wrote:
| I used to feel this way, too. However, every single one of us
| has their own unique circumstances.
|
| I can't give too many details unfortunately. But, there's a
| specific step I took in my career, which was completely
| random at the time. I was still a student, and I decided not
| to work somewhere. I resigned two weeks in. Had I not done
| that, I wouldn't be where I am today. My situation would be
| totally different.
|
| Yes, some people are very talented. But it does take quite a
| lot of work and dedication. And yes, sometimes you cannot
| afford to dedicate your time to learning something because
| life happens.
| cowvin wrote:
| No need to be depressed. It's not a competition between you and them.
| You can find inspiration in what others achieve and try to
| achieve more yourself.
| ip26 wrote:
| I get that. But then I remember at that age, I was only just
| cobbling together my very first computer from the scrap bin.
| An honest comparison is nearly impossible.
| pimeys wrote:
| And for me, her existence is enough to keep me from getting
| depressed about my industry. Whatever she's doing is keeping
| my hopes up for computer engineering.
| [deleted]
| ohgodplsno wrote:
| Be excited! This means amazing things are coming, from
| incredibly talented people. And even better when they put out
| their knowledge in public, in an easy to digest form, letting
| you learn from them.
| azinman2 wrote:
| Does anyone know if she has a proper interview somewhere? I'd
| love to know how she got so technical in high school to be able
| to reverse engineer a GPU -- something I would have no idea how
| to start even with many more years of experience (although
| admittedly I know very little about GPUs and don't do graphics
| work).
| daenz wrote:
| That image gave me flashbacks of gnarly shader debugging I did
| once. IIRC, I was dividing by zero in some very rare branch of a
| fragment shader, and it caused those black tiles to flicker in
| and out of existence. Excruciatingly painful to debug on a GPU.
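|
| A tiny C illustration of that failure mode (just the arithmetic,
| not a real shader): a zero denominator in a rare branch quietly
| produces inf or NaN, which then poisons everything it blends
| with, and clamping the denominator is the usual fix.
|
|     #include <math.h>
|     #include <stdio.h>
|
|     int main(void)
|     {
|         float weight  = 0.0f;                         /* the rare branch */
|         float inf_val = 1.0f / weight;                /* +inf            */
|         float nan_val = 0.0f / weight;                /* NaN             */
|         float safe    = 1.0f / fmaxf(weight, 1e-6f);  /* clamped divisor */
|         printf("%f %f %f (nan? %d)\n", inf_val, nan_val, safe,
|                isnan(nan_val));
|         return 0;
|     }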
| thanatos519 wrote:
| What an entertaining story!
| ninju wrote:
| > Comparing a trace from our driver to a trace from Metal,
| looking for any relevant difference, we eventually _stumble on
| the configuration required_ to make depth buffer flushes work.
|
| > And with that, we get our bunny.
|
| So what was the configuration that needed to change? Don't leave
| us hanging!!!
| [deleted]
| dry_soup wrote:
| Very interesting and easy to follow writeup, even for a graphics
| ignoramus like myself.
| Jasper_ wrote:
| Huh, I always thought tilers re-ran their vertex shaders multiple
| times -- once position-only to do binning, and then _again_ to
| compute all attributes for each tile; that's what the "forward
| tilers" like Adreno/Mali do. It's crazy that they dump all
| geometry to main memory rather than keeping it in the pipe. It
| explains why geometry is more of a limit on AGX/PVR than
| Adreno/Mali.
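|
| A toy C sketch of the binning step both kinds of tiler share
| (purely schematic - nothing here mirrors real AGX or Mali data
| structures): the difference discussed here is whether each bin
| holds only enough to re-run vertex shading later, or the fully
| shaded output spilled to a parameter buffer in main memory.
|
|     #include <stdio.h>
|
|     #define TILE    32            /* tile size in pixels */
|     #define TILES_X (640 / TILE)
|     #define TILES_Y (480 / TILE)
|
|     struct bin { int tris[64]; int count; };
|     static struct bin bins[TILES_Y][TILES_X];
|
|     /* Append a triangle to every tile its bounding box overlaps. */
|     static void bin_triangle(int id, int x0, int y0, int x1, int y1,
|                              int x2, int y2)
|     {
|         int minx = x0 < x1 ? (x0 < x2 ? x0 : x2) : (x1 < x2 ? x1 : x2);
|         int miny = y0 < y1 ? (y0 < y2 ? y0 : y2) : (y1 < y2 ? y1 : y2);
|         int maxx = x0 > x1 ? (x0 > x2 ? x0 : x2) : (x1 > x2 ? x1 : x2);
|         int maxy = y0 > y1 ? (y0 > y2 ? y0 : y2) : (y1 > y2 ? y1 : y2);
|
|         for (int ty = miny / TILE; ty <= maxy / TILE && ty < TILES_Y; ty++)
|             for (int tx = minx / TILE; tx <= maxx / TILE && tx < TILES_X; tx++)
|                 if (bins[ty][tx].count < 64)
|                     bins[ty][tx].tris[bins[ty][tx].count++] = id;
|     }
|
|     int main(void)
|     {
|         bin_triangle(0, 10, 10, 200, 40, 60, 300); /* one example triangle */
|         printf("tile (0,0) holds %d triangle(s)\n", bins[0][0].count);
|         return 0;
|     }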
| pocak wrote:
| That's what I thought, too, until I saw ARM's Hot Chips 2016
| slides. Page 24 shows that they write transformed positions to
| RAM, and later write varyings to RAM. That's for Bifrost, but
| it's implied Midgard is the same, except it doesn't filter out
| vertices from culled primitives.
|
| That makes me wonder whether the other GPUs with position-only
| shading - Intel and Adreno - do the same.
|
| As for PowerVR, I've never seen them described as doing
| position-only shading - I think they've always done full vertex
| processing upfront.
|
| edit: slides are at
| https://old.hotchips.org/wp-content/uploads/hc_archives/hc28...
| Jasper_ wrote:
| Mali's slides here still show them doing two vertex shading
| passes, one for positions and another for the other attributes.
| I'm guessing "memory" here means high-performance in-unit
| memory like TMEM, rather than a full frame's worth of data,
| but I'm not sure!
| atq2119 wrote:
| I was under that impression as well. If they write out all
| attributes, what is really the remaining difference from a
| traditional immediate-mode renderer? Nvidia reportedly has
| vertex attributes going through memory for many generations
| already (and they are at least partially tiled...).
|
| I suppose the difference is whether the render target lives in
| the "SM" and is explicitly loaded and flushed (by a shader, no
| less!) or whether it lives in a separate hardware block that
| acts as a cache.
| Jasper_ wrote:
| NV has vertex attributes "in-pipe" (hence mesh shaders), and
| the appearance of a tiler is a misread; it's just a change to
| the macro-rasterizer about which quads get dispatched first,
| not a true tiler.
|
| The big difference is the end of the pipe, as mentioned:
| whether you have ROPs, or whether your shader cores load/store
| from a framebuffer segment. Basically, whether framebuffer
| clears are expensive (assuming no fast-clear cheats) or free.
| [deleted]
| bob1029 wrote:
| I really appreciate the writing and work that was done here.
|
| It is amazing to me how complicated these systems have become. I
| am looking over the source for the single triangle demo. Most of
| this is just about getting information from point A to point B in
| memory. Over 500 lines worth of GPU protocol overhead... Granted,
| this is a one-time cost once you get it working, but it's still a
| lot to think about and manage over time.
|
| I've written software rasterizers that fit neatly within 200
| lines and provide very flexible pixel shading techniques.
| Certainly not capable of running a Cyberpunk 2077 scene, but
| interactive framerates otherwise. In the good case, I can go from
| a dead stop to final frame buffer in <5 milliseconds. Can you
| even get the GPU to wake up in that amount of time?
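|
| In that spirit, here is a roughly 35-line C toy (nothing to do
| with the AGX driver - just the classic barycentric edge-function
| approach such a rasterizer might use) that fills one triangle
| into an ASCII framebuffer:
|
|     #include <stdio.h>
|
|     #define W 64
|     #define H 32
|
|     static char fb[H][W];
|
|     /* Twice the signed area of (a, b, c); the sign says which side of
|        edge a->b the point c falls on. */
|     static int edge(int ax, int ay, int bx, int by, int cx, int cy)
|     {
|         return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
|     }
|
|     static void draw_tri(int x0, int y0, int x1, int y1, int x2, int y2,
|                          char c)
|     {
|         /* Test every pixel against the three edges; a real rasterizer
|            would restrict this to the triangle's bounding box. */
|         for (int y = 0; y < H; y++)
|             for (int x = 0; x < W; x++) {
|                 int w0 = edge(x1, y1, x2, y2, x, y);
|                 int w1 = edge(x2, y2, x0, y0, x, y);
|                 int w2 = edge(x0, y0, x1, y1, x, y);
|                 if (w0 >= 0 && w1 >= 0 && w2 >= 0) /* inside for this winding */
|                     fb[y][x] = c;
|             }
|     }
|
|     int main(void)
|     {
|         for (int y = 0; y < H; y++)
|             for (int x = 0; x < W; x++)
|                 fb[y][x] = '.';
|         draw_tri(5, 2, 60, 8, 20, 28, '#');
|         for (int y = 0; y < H; y++)
|             printf("%.*s\n", W, fb[y]);
|         return 0;
|     }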
| mef wrote:
| with great optimization comes great complexity
| [deleted]
| quux wrote:
| Impressive work and really interesting write up. Thanks!
| VyseofArcadia wrote:
| > Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a
| screaming fast desktop, but its unified memory and tiler GPU have
| roots in mobile phones.
|
| PowerVR has its roots in a desktop video card with somewhat
| limited release and impact. It really took off when it was used
| in the Sega Dreamcast home console and the Sega Naomi arcade
| board. It was only later that people started putting it in phones.
| robert_foss wrote:
| But since it's a tiled rendering architecture, which is normal
| for mobile applications and not how desktop GPUs are architected,
| it would be fair to call it a mobile GPU.
| Veliladon wrote:
| Nvidia appears to be an immediate-mode renderer to the user
| but has used a tiled rendering architecture under the hood
| since Maxwell.
| pushrax wrote:
| According to the sources I've read, it uses a tiled
| rasterizing architecture, but it's not deferred in the same
| way as a typical mobile TBDR, which bins all vertices before
| starting rasterization, defers all rasterization until after
| all vertex generation, and flushes each tile to the
| framebuffer once.
|
| NV seems to rasterize vertices in small batches (i.e.
| immediately) but buffers the rasterizer output on die in
| tiles. There can still be significant overlap between
| vertex generation and rasterization. Those tiles are
| flushed to the framebuffer, potentially before they are
| fully rendered, and potentially multiple times per draw
| call depending on the vertex ordering. They do some
| primitive reordering to try to avoid flushing as much, but
| it's not a fully deferred architecture.
| [deleted]
| monocasa wrote:
| Nvidia's is a tile-based immediate-mode rasterizer. It's
| more a cache-friendly immediate renderer than a TBDR.
| tomc1985 wrote:
| I actually had one of those cards! The only games I could get
| it to work with were Half-Life, glQuake, and Jedi Knight, and
| the bilinear texture filtering had some odd artifacting, IIRC.
| wazoox wrote:
| Unified memory was introduced by SGI with the O2 workstation in
| 1996; they then used it again with their x86 workstations, the
| SGI 320 and 540, in 1999. So it was a workstation-class
| technology before being a mobile one :)
| andrekandre wrote:
| Even the N64 had unified memory, way back in 1995.
| nwallin wrote:
| The N64's unified memory model had a pretty big asterisk
| though. The system had only 4kB for textures out of 4MB of
| total RAM. And textures are what uses the most memory in a
| lot of games.
| ChuckNorris89 wrote:
| The N64's chip was also SGI-designed.
| iforgotpassword wrote:
| Was it the Kyro 2? I had one of those but killed it by
| overclocking... It would make for a good retro system.
| smcl wrote:
| The Kyro and Kyro 2 were a little after the Dreamcast.
| sh33sh wrote:
| Really enjoyed the way it was written
| GeekyBear wrote:
| Alyssa's writing style steps you through a technical mystery in
| a way that remains compelling even if you lack the domain
| knowledge to solve the mystery yourself.