[HN Gopher] Ask HN: How does a CPU communicate with a GPU?
___________________________________________________________________
 
Ask HN: How does a CPU communicate with a GPU?
 
I've been learning about computer architecture [1] and I've become
comfortable with my understanding of how a processor communicates
with main memory - be it directly, with the presence of caches or
even virtual memory - and I/O peripherals.  But something that
seems weirdly absent from the courses I took and what I have found
online is how the CPU communicates with other processing units,
such as GPUs - not only that, but an in-depth description of
interconnecting different systems with buses (by in-depth I mean an
RTL example/description).  I understand that as you add more
hardware to a machine, complexity increases and software must
intervene - so a general answer won't exist and the answer will
depend on the implementation being talked about. That's fine by
me.  What I'm looking for is a description of how a CPU tells a
GPU to start executing a program. Through what means do they
communicate - a bus? What does such a communication instance look
like?  I'd love to get pointers to resources such as books and
lectures that are more hands-on/implementation-aware.  [1] Just so
that my background knowledge is clear: I've completed NAND2TETRIS,
watched and completed Berkeley's 2020 CS61C, and have read a good
chunk of H&P (both Computer Architecture: A Quantitative Approach
and Computer Organization and Design: RISC-V edition), and now am
moving on to Onur Mutlu's lectures on advanced computer
architecture.
 
Author : pedrolins
Score  : 58 points
Date   : 2022-03-30 20:17 UTC (2 hours ago)
 
| simne wrote:
| A lot of things happen there.
| 
| But most importantly, the PCIe bus is a serial bus with a
| virtualized interface, so there is no single dedicated physical
| channel for the communication; what happens is more similar to an
| Ethernet network, meaning each device exposes a few endpoints,
| each with its own controller, its own address, a few registers to
| store state and transitions, and memory buffer(s).
| 
| Video cards usually support several behaviors. In the simplest
| modes, they behave just like RAM mapped into a large chunk of the
| system address space, plus video registers to control video
| output, the address mapping of video RAM, and mode switching.
| 
| In more complex modes, video cards generate interrupts (just a
| special type of message on PCIe).
| 
| In 3D modes, which are the most complex, the video controller
| takes data from its own memory (which is mapped into the system
| address space), where a tree of graphics primitives is stored.
| Some are drawn directly from video RAM, but for others the bus
| mastering option of PCIe is used, in which the video controller
| reads additional data (textures) from predefined chunks of system
| RAM.
| 
| As for GPU operation: usually the CPU copies data directly into
| video RAM, then asks the video controller to run a program in
| video RAM; when it completes, the GPU issues an interrupt, and
| then the CPU copies the result back from video RAM.
| 
| Recent additions give the GPU the ability to read data from
| system disks, using the bus mastering mentioned before, but those
| additions are not yet widely implemented.
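| 
| Roughly, that sequence looks like this from the CPU side (a
| minimal sketch in C with made-up register offsets - a real driver
| uses the vendor's documented registers, DMA and proper interrupt
| handling):
| 
|   #include <stddef.h>
|   #include <stdint.h>
|   #include <string.h>
| 
|   /* Hypothetical register offsets - illustrative only. */
|   #define REG_START   0x108  /* writing 1 kicks off the program */
|   #define REG_STATUS  0x10C  /* bit 0 set by the GPU when done  */
| 
|   /* regs: mapped GPU registers; vram: mapped video RAM */
|   static void run_on_gpu(volatile uint32_t *regs,
|                          volatile uint8_t *vram,
|                          const uint8_t *in, uint8_t *out,
|                          size_t n)
|   {
|       memcpy((void *)vram, in, n);          /* 1. copy into VRAM    */
|       regs[REG_START / 4] = 1;              /* 2. ask GPU to run    */
|       while (!(regs[REG_STATUS / 4] & 1))   /* 3. wait; a driver    */
|           ;                                 /*    sleeps on the IRQ */
|       memcpy(out, (const void *)vram, n);   /* 4. copy result back  */
|   }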
 
  | simne wrote:
  | For a beginner, I think the best place to start is reading about
  | the Atari consoles, the Atari 65/130, and the NES, as their
  | ideas were later implemented in all commodity video cards, just
  | slightly extended.
  | 
  | BTW, all modern video cards use bank switching.
 
| melenaboija wrote:
| It is old and I am not sure everything still applies but I found
| this course useful to understand how GPUs work:
| 
| Intro to Parallel Programming:
| 
| https://classroom.udacity.com/courses/cs344
| 
| https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...
 
| aliasaria wrote:
| There is some good information on how PCI-Express works here:
| https://blog.ovhcloud.com/how-pci-express-works-and-why-you-...
 
| dragontamer wrote:
| I'm no expert on PCIe, but it's been described to me as a network.
| 
| PCIe has switches, addresses, and so forth. Very much like IP
| addresses, except PCIe operates at a significantly faster level.
| 
| At its lowest level, PCIe x1 is a single "lane", a singular
| stream of zeros-and-ones (with various framing / error correction
| on top). PCIe x2, x4, x8, and x16 are simply 2, 4, 8, or 16
| lanes running in parallel and independently.
| 
| -------
| 
| PCIe is a very large and complex protocol, however. These "serial"
| comms get abstracted into memory-mapped I/O. Instead of
| programming at the "packet" level, most PCIe operations are seen
| as just RAM.
| 
| > even virtual memory
| 
| So you understand virtual memory? PCIe abstractions go up to and
| include the virtual memory system. When your OS sets aside some
| virtual memory for PCIe devices and programs read/write those
| memory addresses, the OS (and PCIe bridge) will translate those
| RAM reads/writes into PCIe messages.
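| 
| To make that concrete: on Linux you can map a device's PCIe BAR
| into a process and poke it with ordinary loads/stores. A minimal
| sketch (the device path and register offset are placeholders, and
| you need the right permissions):
| 
|   #include <fcntl.h>
|   #include <stdint.h>
|   #include <stdio.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
| 
|   int main(void)
|   {
|       /* sysfs exposes each BAR as an mmap-able "resourceN" file */
|       const char *p = "/sys/bus/pci/devices/0000:01:00.0/resource0";
|       int fd = open(p, O_RDWR | O_SYNC);
|       if (fd < 0) { perror("open"); return 1; }
| 
|       volatile uint32_t *bar0 = mmap(NULL, 4096,
|                                      PROT_READ | PROT_WRITE,
|                                      MAP_SHARED, fd, 0);
|       if (bar0 == MAP_FAILED) { perror("mmap"); return 1; }
| 
|       /* This plain load goes out as a PCIe read to the device
|        * (0x10 is just an example offset). */
|       printf("reg 0x10 = 0x%08x\n", (unsigned)bar0[0x10 / 4]);
| 
|       munmap((void *)bar0, 4096);
|       close(fd);
|       return 0;
|   }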
| 
| --------
| 
| I'll now handwave a few details and note: GPUs do the same thing
| on their end. GPUs can also have a "virtual memory" that they
| read/write, and that translates into PCIe messages.
| 
| This leads to a system called "Shared Virtual Memory" which has
| become very popular in a lot of GPGPU programming circles. When
| the CPU (or GPU) reads/writes a memory address, the data is then
| automatically copied over to the other device as needed. Caching
| layers are added on top to improve efficiency (some SVM may live
| on the CPU side, so the GPU will fetch the data and store it in
| its own local memory / caches, but always rely upon the CPU as
| the "main owner" of the data. The reverse, GPU-side shared
| memory, also exists, where the CPU will communicate with the
| GPU).
| 
| To coordinate access to RAM properly, the entire set of atomic
| operations + memory barriers has been added to PCIe 3.0+. So you
| can perform a "compare-and-swap" on shared virtual memory, and
| read/write these virtual memory locations in a standardized way
| across all PCIe devices.
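| 
| As a rough host-side illustration (OpenCL 2.0 fine-grained SVM;
| error checking omitted, and it assumes a GPU that actually
| reports fine-grain SVM with atomics - far from all of them do):
| 
|   #define CL_TARGET_OPENCL_VERSION 200
|   #include <CL/cl.h>
|   #include <stdatomic.h>
|   #include <stdio.h>
| 
|   int main(void)
|   {
|       cl_platform_id plat;
|       cl_device_id dev;
|       clGetPlatformIDs(1, &plat, NULL);
|       clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
|       cl_context ctx = clCreateContext(NULL, 1, &dev,
|                                        NULL, NULL, NULL);
| 
|       /* One allocation, one address, visible to CPU and GPU. */
|       atomic_int *flag = clSVMAlloc(ctx,
|           CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER |
|           CL_MEM_SVM_ATOMICS, sizeof *flag, 0);
| 
|       atomic_store(flag, 0);
|       /* A kernel handed this pointer via
|        * clSetKernelArgSVMPointer() could CAS it from the GPU
|        * side and the CPU would observe the result directly.
|        * Here we just CAS it from the CPU to show the host view: */
|       int expected = 0;
|       atomic_compare_exchange_strong(flag, &expected, 1);
|       printf("flag = %d\n", atomic_load(flag));
| 
|       clSVMFree(ctx, flag);
|       clReleaseContext(ctx);
|       return 0;
|   }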
| 
| PCIe 4.0 and PCIe 5.0 are adding more and more features, making
| PCIe feel more and more like a "shared memory system", akin to
| the cache-coherence strategies that multi-CPU / multi-socket
| systems use to share RAM with each other. In the long term, I
| expect future PCIe standards to push the interface even further
| toward this "like a dual-CPU-socket" memory-sharing paradigm.
| 
| This is great because you can have 2-CPUs + 4 GPUs on one system,
| and when GPU#2 writes to Address#0xF1235122, the shared-virtual-
| memory system automatically translates that to its "physical"
| location (wherever it is), and the lower-level protocols pass the
| data to the correct location without any assistance from the
| programmer.
| 
| This means that a GPU can do things like perform a linked-list
| traversal (or tree traversal), even if the nodes of the tree/list
| are spread across CPU#1, CPU#2, GPU#4, and GPU#1. The shared-
| virtual-memory paradigm just handwaves all of that and lets the
| PCIe 3.0 / 4.0 / 5.0 protocols handle the details automatically.
 
  | simne wrote:
  | I agree that PCIe is mostly a shared-memory system.
  | 
  | But for video cards this sharing is unequal, because their RAM
  | sizes can exceed the 32-bit address space, and a lot of
  | still-used mainboards have a 32-bit PCIe controller, so all PCIe
  | addresses have to fit inside the 4GB address space. On Windows
  | machines this is seen as the total installed memory being not
  | the full amount, but minus approximately 0.5GB, of which 256MB
  | is the video RAM access window.
  | 
  | So in most cases the old rule remains in force: the video card
  | shares all of its memory through a 256MB window using bank
  | switching.
  | 
  | As for the GPU reading main system memory: usually this is of
  | little use, because VRAM is orders of magnitude faster, even
  | without considering the bus bandwidth used by other devices,
  | like HDDs/SSDs.
  | 
  | And in most cases, the only use of GPU access to main system
  | memory is the traditional reading of textures (for a 3D
  | accelerator) from system memory - for example, 3D software that
  | uses GPU rendering can only use video RAM for this; none of it
  | uses system RAM.
 
| roschdal wrote:
| Through the electrical wires in the PCI express port.
 
  | danielmarkbruce wrote:
  | I could be misunderstanding the context of the question, but I
  | think OP is imagining some sophisticated communication logic
  | involved at the chip level. The CPU doesn't know anything much
  | about the GPU other than it's there and data can be sent back
  | and forth to it. It doesn't know what any of the data means.
  | 
  | I think the logic OP imagines does exist, but it's actually in
  | the compiler (eg the CUDA compiler), figuring out exactly what
  | bytes to send which will start a program, etc.
 
    | coolspot wrote:
    | Not in the compiler but in the GPU driver. A graphics (or
    | compute) program just calls the driver's APIs
    | (DirectX/Vulkan/CUDA), and the driver knows how to do that at
    | a low level by writing to particular regions of RAM mapped to
    | GPU registers.
 
      | danielmarkbruce wrote:
      | Yes! This is correct. My bad, it's been too long. I guess
      | either way the point is that it's done in software, not
      | hardware.
 
        | lxgr wrote:
        | There are also odd/interesting architectures like one of
        | the earlier Raspberry Pis, where the GPU was actually
        | running its own operating system that would take care of
        | things like shader compilation.
        | 
        | In that case, what's actually being written to
        | shared/mapped memory is very high-level instructions that
        | are then compiled or interpreted on the GPU (which is
        | really an entire computer, CPU and all) itself.
 
  | alberth wrote:
  | Nit pick...
  | 
  | Technically it's not "through" the electrical wires, it's
  | actually through the electromagnetic field created _around_ the
  | wires.
  | 
  | Veritasium explains https://youtu.be/bHIhgxav9LY
 
    | tux3 wrote:
    | Nitpicking the nitpick: the energy is what's in the fields,
    | but the electrical wires aren't just for show, the electrons
    | do need to be able to move in the wire for there to be a
    | current, and the physical properties of the wire have a big
    | impact on the signal.
    | 
    | So things get very complicated and unintuitive, especially at
    | high frequencies, but it's okay to say through the wire!
 
      | a9h74j wrote:
      | And as you might be alluding to, particularly at high
      | frequencies: in the skin of the wire (via the skin effect)!
      | 
      | I'll confess I have never seen a plot of actual rms current
      | density vs radius related to skin effect.
 
| rayiner wrote:
| Typically CPU and GPU communicate over the PCI Express bus. (It's
| not technically a bus but a point to point connection.) From the
| perspective of software running on the CPU, these days, that
| communication is typically in the form of memory-mapped IO. The
| GPU has registers and memory mapped into the CPU address space
| using PCIE. A write to a particular address generates a message
| on the PCIE bus that's received by the GPU and produces a write
| to a GPU register or GPU memory.
| 
| The GPU also has access to system memory through the PCIE bus.
| Typically, the CPU will construct buffers in memory with data
| (textures, vertices), commands, and GPU code. It will then store
| the buffer address in a GPU register and ring some sort of
| "doorbell" by writing to another GPU register. The GPU
| (specifically, the GPU command processor) will then read the
| buffers from system memory, and start executing the commands.
| Those commands can include, for example, loading GPU shader
| programs into shader memory and triggering the shader cores to
| execute those programs.
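| 
| In code, that "store the buffer address, ring the doorbell" step
| is literally a couple of stores through the mapped registers. A
| sketch with made-up offsets (every real GPU defines its own
| layout and command format):
| 
|   #include <stdint.h>
| 
|   #define REG_RING_BASE_LO 0x2000 /* bus address of command buffer */
|   #define REG_RING_BASE_HI 0x2004
|   #define REG_DOORBELL     0x2010 /* "go look at the buffer"       */
| 
|   /* Hand a command buffer (already written to DMA-able system
|    * memory at bus address ring_addr) to the GPU. */
|   static void submit(volatile uint32_t *regs,
|                      uint64_t ring_addr, uint32_t n_cmds)
|   {
|       regs[REG_RING_BASE_LO / 4] = (uint32_t)ring_addr;
|       regs[REG_RING_BASE_HI / 4] = (uint32_t)(ring_addr >> 32);
| 
|       /* Each store becomes a PCIe memory write aimed at the GPU's
|        * BAR. The doorbell write wakes the command processor,
|        * which then DMAs the commands out of system memory. */
|       regs[REG_DOORBELL / 4] = n_cmds;
|   }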
 
  | Keyframe wrote:
  | If OP or anyone else wants to see this firsthand.. well shit, I
  | feel old now, but.. try an exercise in assembly programming on
  | the Commodore 64. Get the VICE emulator and dig into it for a
  | few weeks. It's real easy to get into: CPU (6502 based), video
  | chip (VIC-II), sound chip (the famous SID), ROM chips.. they all
  | live in the same address space (yeah, not mentioning pages), the
  | CPU has three registers.. it's also real fun to get into, even
  | to this day.
 
    | vletal wrote:
    | Nice exercise. Similarly, I learned the most about basic
    | computer architecture by programming the 8050 in ASM as well
    | as C.
    | 
    | And I'm 32. Am I old yet? I'm not, right? Right?
 
      | silisili wrote:
      | Sorry pal!
      | 
      | I remember playing Halo in my early 20s, and chatting with
      | a guy from LA who was 34. Wow, he was so old - why was he
      | still playing video games?
      | 
      | Here I sit in my late 30's...still playing games when I
      | have time, denying that I'm old, despite the noises I make
      | getting up and random aches and pains.
 
      | Keyframe wrote:
      | 40s are new thirties, my friend. Also, painkillers help.
 
    | jeroenhd wrote:
    | There's a nice guide by Ben Eater on Youtube about
    | breadboard computers: https://www.youtube.com/playlist?list=P
    | LowKtXNTBypFbtuVMUVXN...
    | 
    | It doesn't sport any modern features like DMA, but builds up
    | from the core basics: a 6502 chip, a clock, and a blinking
    | LED, all hooked up on a breadboard. He also built a basic VGA
    | card and explains protocols like PS/2, USB, and SPI. It's a
    | great introduction or refresher into the low level hardware
    | concepts behind computers. You can even buy kits to play
    | along at home!
 
    | zokier wrote:
    | Is my understanding correct that compared to those historical
    | architectures, modern GPUs are a lot more asynchronous?
    | 
    | What I mean is that these days you'd issue a data transfer or
    | program execution on the GPU, it will complete at its own
    | pace, and the CPU in the meanwhile continues executing other
    | code; in contrast, in those 8-bitters you'd poke a video
    | register or whatever and expect that to have a more immediate
    | effect, allowing those famous race-the-beam effects etc.?
 
      | Keyframe wrote:
      | There were interrupts telling you when certain things
      | happened. If anything, it was asynchronous. The big thing is
      | also that you had to tally the cost of what you were doing.
      | There was a budget of how many cycles you got per line and
      | per screen, and you had to fit whatever you were doing into
      | that. When playing sound it was common to draw a color while
      | you fed the music into the SID so you could tell, like a
      | crude debug/ad hoc printf, how many cycles your music
      | routines ate.
 
  | divbzero wrote:
  | Going one deeper, how does the communication work on a physical
  | level? I'm guessing the wires of the PCI Express bus passively
  | propagate the voltage and the CPU and GPU do "something" with
  | that voltage?
 
    | throw82473751 wrote:
    | Voltages, yes.. usually it's all binary digital signals,
    | running serial/parallel and following some communication
    | protocol. Maybe you should have a look at something really
    | simple/old like UART communication to get some idea of how
    | this works, and then study how this is scaled up over PCIe to
    | understand the chat between CPU/GPU?
    | 
    | Or maybe not - one does not need all the details, often just
    | the scaled-up concepts :)
    | 
    | https://en.m.wikipedia.org/wiki/Universal_asynchronous_recei.
    | ..
    | 
    | Edit: Wait, is it really already QAM over PCIe? Yeah, then
    | UART is a gross simplification, but maybe still a good one to
    | start with depending on knowledge level?
 
      | _3u10 wrote:
      | https://pcisig.com/sites/default/files/files/PCI_Express_El
      | e... It doesn't say QAM explicitly, but it has all the QAM
      | terminology like 128 codes, inter-symbol interference, etc.
      | I'm not an RF guy by any stretch, but it sounds like QAM to
      | me.
      | 
      | This is an old spec. I think it's roughly equivalent to
      | QAM-512 for PCIe 6.
 
      | rayiner wrote:
      | PCI-E isn't QAM. It's NRZ over a differential link, with
      | 128b/130b encoding, and then scrambled to reduce long runs
      | of 0s or 1s.
 
    | wyldfire wrote:
    | It might be easier to start with older or simpler/slower
    | buses. ISA, SPI, I2C. In some ways ISA is very different -
    | latching multiple parallel channels together instead of
    | ganging independent serial lanes. But it makes sense to start
    | off simple and consider the evolution. Modern PCIe layers
    | several awesome technologies together, especially FEC.
    | Originally they used 8b10b but I see now they're using
    | 242b256b.
 
    | rayiner wrote:
    | Before you get that deep, you need to step back for a bit.
    | The CPU is itself several different processors and
    | controllers. Look at a modern Intel CPU:
    | https://www.anandtech.com/show/3922/intels-sandy-bridge-
    | arch.... The individual x86 cores are connected via a ring
    | bus to a system agent. The ring bus is a kind of parallel
    | bus. In general, a parallel bus works by having every device
    | on the bus operating on a clock. At each clock tick (or after
    | some number of clock ticks), data can be transferred by
    | pulling address lines high or low to signify an address, and
    | pulling data lines high or low to signify the data value to
    | be written to that address.
    | 
    | The system agent then receives the memory operation and looks
    | at the system address map. If the target address is PCI-E
    | memory, it generates a PCI-E transaction using its built-in
    | PCI-E controller. The PCI-E bus is actually a multi-lane
    | serial bus. Each lane is a pair of wires using differential
    | signaling
    | (https://en.wikipedia.org/wiki/Differential_signalling). Bits
    | are sent on each lane according to a clock by manipulating
    | the voltages on the differential pairs. The voltage swings
    | don't correspond directly to 0s and 1s. Because of the data
    | rates involved and the potential for interference, cross-
    | talk, etc., an extremely complex mechanism is used to turn
    | bits into voltage swings on the differential pairs: https://p
    | cisig.com/sites/default/files/files/PCI_Express_Ele...
    | 
    | From the perspective of software, however, it's just bits
    | sent over a wire. The bits encode a PCI-E message packet:
    | https://www.semisaga.com/2019/07/pcie-tlp-header-packet-
    | form.... The packet has headers, address information, and
    | data information. But basically the packet can encode
    | transactions such as a memory write or read or register write
    | or read.
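    | 
    | For a feel of what such a packet carries, here's a sketch that
    | packs the 3-DW header of a memory-write TLP (32-bit
    | addressing). Field positions are recalled from the base spec,
    | TC/attribute bits are left zero, and on-the-wire byte ordering
    | is glossed over, so treat it as illustrative rather than
    | wire-accurate:
    | 
    |   #include <stdint.h>
    | 
    |   static void pack_mwr32_header(uint32_t hdr[3],
    |                                 uint16_t requester_id,
    |                                 uint8_t tag, uint32_t addr,
    |                                 uint16_t len_dw)
    |   {
    |       hdr[0] = (0x2u << 29)       /* Fmt: 3-DW header + data */
    |              | (0x0u << 24)       /* Type: memory request    */
    |              | (len_dw & 0x3FFu); /* payload length, in DWs  */
    |       hdr[1] = ((uint32_t)requester_id << 16) /* sender's ID */
    |              | ((uint32_t)tag << 8)
    |              | 0x0Fu;             /* first-DW byte enables   */
    |       hdr[2] = addr & ~0x3u;      /* DW-aligned target addr  */
    |   }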
 
    | tenebrisalietum wrote:
    | Older CPUs - the CPU had a bunch of A pins (address), a bunch
    | of D pins (data).
    | 
    | The A pins would be a binary representation of an address,
    | and the D pins would be the binary representation of data.
    | 
    | A couple of other pins would select behavior (read or write)
    | and allow handshaking.
    | 
    | Those pins were connected to everything else that needed to
    | talk with the CPU on a physical level, such as RAM, I/O
    | devices, and connectors for expansion. Think 10-base-T
    | networking where multiple nodes are physically modulating one
    | common wire on an electrical level. Same concept, but you
    | have many more wires (and they're way shorter).
    | 
    | Arbitration logic was needed so things didn't step on each
    | other. Sometimes things did anyway and you couldn't talk to
    | certain devices in certain ways or your system would lock up
    | or misbehave.
    | 
    | Were there "switches" to isolate and select among various
    | banks of components? Sure, they are known as "gate arrays" -
    | those could be ASICs or implemented with simple 74xxx ICs.
    | 
    | Things like NuBus and PCI came about - the bus controller is
    | directly connected to and addressable by the CPU as a device,
    | but everything else is connected to the bus controller, so
    | the new-style bus isn't tied to the CPU and can operate at a
    | different speed; CPU and bus speed are now decoupled. (This
    | was done with video controllers in the old 8-bit days as
    | well - to get to video RAM you had to talk to the video chip,
    | and couldn't talk to video RAM directly on some 8-bit
    | systems.)
    | 
    | PCIe is no longer a bus; it's more like switched Ethernet -
    | there are packets and switching, and data goes over what's
    | basically one wire. With advanced modulation schemes this
    | ends up being faster and more reliable than keeping multiple
    | wires in sync at high speeds. The controllers facing the CPU
    | still implement the same interface, though.
 
    | _3u10 wrote:
    | It's signaled similarly to QAM. Far more complicated than
    | GPIO-type stuff. Think FM radio / spread spectrum rather than
    | bitbanging / old-school serial / parallel ports.
    | 
    | Similar to old-school modems, if the link is noisy it can
    | drop to lower "baud" rates. You can manually try to recover
    | higher rates once the noise is gone, but it's simpler to just
    | reboot.
 
    | tux3 wrote:
    | Oh, that is _several_ levels deeper! PCIe is a big standard
    | with several layers of abstraction, and it's far from
    | passive.
    | 
    | The different versions of PCIe use different encodings, so
    | it's hard to sum it all up in a couple of sentences in terms
    | of what the voltage does.
 
  | monkeybutton wrote:
  | IMO memory-mapped IO is the coolest thing since sliced bread.
  | It's a great example in computing where many different kinds of
  | hardware can all be brought together under a relatively simple
  | abstraction.
 
    | the__alchemist wrote:
    | It was a glorious "click" when learning embedded programming.
    | Even when writing Rust in typical desktop uses, it all
    | feels... abstract. Computer program logic. Where does the
    | magic happen? Where do you go from abstract logic to making
    | things happen? The answer is in volatile memory reads and
    | writes to memory-mapped IO. You write a word to a memory
    | address, and a voltage changes. Etc.
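    | 
    | In C it's the classic one-liner (the address and bit below
    | are made up; the real ones come from your MCU's datasheet):
    | 
    |   #include <stdint.h>
    | 
    |   /* Hypothetical memory-mapped GPIO output register. */
    |   #define GPIO_OUT (*(volatile uint32_t *)0x40020014u)
    | 
    |   void led_on(void)
    |   {
    |       GPIO_OUT |= 1u << 5; /* this store drives a pin high */
    |   }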
 
| justsomehnguy wrote:
| TL;DR: bi-directional memory access with some means to notify the
| other part about "something has changed".
| 
| It's not that different from any other PCIe device, be it a
| network card or a disk/HBA/RAID controller.
| 
| If you want to understand how it came to this - look at the
| history of ISA, PCI/PCI-X, a short stint for AGP and finally
| PCI-E.
| 
| Other comments provide a good ELI15 for the topic.
| 
| A minor note about "bus" - for PCIe it is mostly a historic term,
| because it's a serial, P2P connection, though the process of
| enumerating and querying the devices is still very akin to what
| you would do on some bus-based system. E.g., SAS is a serial
| "bus" compared to SCSI, but you still operate with it as some
| "logical" bus, because it is easier for humans to grok it this
| way.
 
| dyingkneepad wrote:
| On my system, the CPU sees the GPU as a PCI device. The "PCI
| config space" [0] is a standard thing and so the CPU can read it
| and figure out its device ID, vendor ID, revision, class, etc.
| From that, the OS looks at its PCI drivers and tries to find
| which one claims to drive that specific PCI device_id/vendor_id
| combination (or class in case there's some kind of generic
| universal driver for a certain class).
| 
| From there, the driver pretty much knows what to do. But
| primarily the driver will map the registers to memory addresses,
| so accessing offset 0xF0 from that map is equivalent to accessing
| register 0xF0. The definition of what each register does is
| something that the HW developers provide to the SW developers
| [1].
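| 
| A stripped-down sketch of what that looks like in a Linux kernel
| driver's probe function (the usual pci_driver registration and
| error handling are omitted; 0xF0 is just the example offset from
| above):
| 
|   #include <linux/pci.h>
| 
|   static int toy_probe(struct pci_dev *pdev,
|                        const struct pci_device_id *id)
|   {
|       void __iomem *regs;
| 
|       pci_enable_device(pdev);
|       regs = pci_iomap(pdev, 0, 0);  /* map BAR 0: the registers */
| 
|       /* Reading offset 0xF0 of the mapping is reading register
|        * 0xF0: the access goes out on PCIe and the completion
|        * comes back with the value. */
|       dev_info(&pdev->dev, "reg 0xF0 = 0x%08x\n",
|                ioread32(regs + 0xF0));
|       return 0;
|   }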
| 
| Setting modes (screen resolution) and a lot of other stuff is
| done directly by reading and writing to these registers. At some
| point they also have to talk about memory (and virtual addresses)
| and there's quite a complicated dance to map GPU virtual memory
| to CPU virtual memory. On discrete GPUs the data is actually
| "sent" to the memory somehow through the PCI bus (I suppose the
| GPU can read directly from the memory without going through the
| CPU?), but in the driver this is usually abstracted to "this is
| another memory map". On integrated systems both the CPU and GPU
| read directly from the system memory, but they may not share all
| caches, so extra care is required here. In fact, caches may also
| mess up the communication on discrete graphics, so extra care is
| always required. The work in this paragraph is mostly done by the
| kernel driver in Linux.
| 
| At some point the CPU will tell the GPU that a certain region of
| memory is the framebuffer to be displayed. And then the CPU will
| formulate binary programs that are written in the GPU's machine
| code, and the CPU will submit those programs (batches) and the
| GPU will execute them. These programs are generally in the form
| of "I'm using textures from these addresses, this memory holds
| the fragment shader, this other holds the geometry shader, the
| configuration of threading and execution units is described in
| this structure as you specified, SSBO index 0 is at this address,
| now go and run everything". After everything is done the CPU may
| even get an interrupt from the GPU saying things are done, so it
| can notify user space. This paragraph mostly describes the
| work done by the user space driver (in Linux, this is Mesa),
| which implements OpenGL/Vulkan/etc abstractions.
| 
| [0]: https://en.wikipedia.org/wiki/PCI_configuration_space [1]:
| https://01.org/linuxgraphics/documentation/hardware-specific...
 
| derekzhouzhen wrote:
| Others have mentioned MMIO. MMIO comes in several kinds:
| 
| 1. CPU accessing GPU hardware with uncacheable MMIO, such as low-
| level register access
| 
| 2. GPU accessing CPU memory with cacheable MMIO, or DMA, such as
| command and data streams
| 
| 3. CPU accessing GPU memory with cacheable MMIO, such as
| textures
| 
| They all happen on the bus with different latency and bandwidth.
 
| ar_te wrote:
| And if you are looking for some strange architecture forgotten by
| time :) https://www.copetti.org/writings/consoles/sega-saturn/
 
| throwra620 wrote:
 
| brooksbp wrote:
| Woah there, my dude. Let's try to understand a simple model
| first.
| 
| A CPU can access memory. When a CPU performs loads & stores it
| initiates transactions containing the address of the memory.
| Therefore, it is a bus master--it initiates transactions. A slave
| accepts transactions and services them. The interconnect routes
| those transactions to the appropriate hardware, e.g. the DDR
| controller, based on the system address map.
| 
| Let's add a CPU, interconnect, and 2GB of DRAM memory:
|   +-------+
|   |  CPU  |
|   +---m---+
|       |
|   +---s--------------------+
|   |      Interconnect      |
|   +-------m----------------+
|           |
|      +----s-----------+
|      | DDR controller |
|      +----------------+
| 
|   System Address Map:
|   0x8000_0000 - 0x0000_0000  DDR controller
| 
| So, a memory access to 0x0004_0000 is going to DRAM memory
| storage.
| 
| Let's add a GPU:
| 
|   +-------+    +-------+
|   |  CPU  |    |  GPU  |
|   +---m---+    +---s---+
|       |            |
|   +---s------------m-------+
|   |      Interconnect      |
|   +-------m----------------+
|           |
|      +----s-----------+
|      | DDR controller |
|      +----------------+
| 
|   System Address Map:
|   0x9000_0000 - 0x8000_0000  GPU
|   0x8000_0000 - 0x0000_0000  DDR controller
| 
| Now the CPU can perform loads & stores from/to the GPU. The CPU
| can read/write registers in the GPU. But that's only one-way
| communication. Let's make the GPU a bus master as well:
|   +-------+    +-------+
|   |  CPU  |    |  GPU  |
|   +---m---+    +--s-m--+
|       |           |  |
|   +---s-----------m-s------+
|   |      Interconnect      |
|   +-------m----------------+
|           |
|      +----s-----------+
|      | DDR controller |
|      +----------------+
| 
|   System Address Map:
|   0x9000_0000 - 0x8000_0000  GPU
|   0x8000_0000 - 0x0000_0000  DDR controller
| 
| Now, the GPU can not only receive transactions, but it can also
| initiate transactions. Which also means it has access to DRAM
| memory too.
| 
| But this is still only one-way communication (CPU->GPU). How can
| the GPU communicate to the CPU? Well, both have access to DRAM
| memory. The CPU can store information in DRAM memory (0x8000_0000
| - 0x0000_0000) and then write to a register in the GPU
| (0x9000_0000 - 0x8000_0000) to inform the GPU that the
| information is ready. The GPU then reads that information from
| DRAM memory. In the other direction, the GPU can store
| information in DRAM memory, and then send an interrupt to the CPU
| to inform the CPU that the information is ready. The CPU then
| reads that information from DRAM memory. An alternative to using
| interrupts is to have the CPU poll. The GPU stores information in
| DRAM memory and then sets some bit in DRAM memory. The CPU polls
| on this bit in DRAM memory, and when it changes, the CPU knows
| that it can read the information in DRAM memory that was
| previously written by the GPU.
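| 
| The polling variant, as a tiny C sketch (the layout is
| illustrative; the ready flag and the result buffer both live in
| DRAM that the GPU can write via DMA):
| 
|   #include <stdint.h>
| 
|   struct mailbox {
|       volatile uint32_t ready;   /* set by the GPU when done */
|       uint32_t data[64];         /* result, DMA'd by the GPU */
|   };
| 
|   static void wait_for_gpu(struct mailbox *m, uint32_t *out)
|   {
|       while (m->ready == 0)
|           ;                      /* spin (or sleep on the IRQ)  */
|       for (int i = 0; i < 64; i++)
|           out[i] = m->data[i];   /* result in DRAM is now valid */
|       m->ready = 0;              /* re-arm the mailbox          */
|   }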
| 
| Hope this helps. It's very fun stuff!
 
| pizza234 wrote:
| You'll find a very good introduction in the comparch book "Write
| Great Code, Volume 1", chapter 12 ("Input and Output"), which
| also explains the history of system buses (therefore, you'll find
| an explanation of how ISA works).
| 
| Interestingly, there is a footnote explaining that "Computer
| Architecture: A Quantitative Approach provided a good chapter on
| I/O devices and buses; sadly, as it covered very old peripheral
| devices, the authors dropped the chapter rather than updating it
| in subsequent revisions."
 
| throwmeariver1 wrote:
| Everyone in tech should read the book "Understanding the Digital
| World" by Brian W. Kernighan.
 
  | arduinomancer wrote:
  | Is it very in-depth or more for layman readers?
 
    | throwmeariver1 wrote:
    | Most non-technical people would find it makes their heads
    | spin, while techies would nod along and sometimes say "huh...
    | so that's how it really works". It's in between, but a good
    | primer on the essentials.
 
  | dyingkneepad wrote:
  | Is this before or after they read Knuth?
 
| zoenolan wrote:
| Others are not wrong in saying memory-mapped IO. Taking a look at
| the Amiga Hardware Reference Manual [1], a simple example [2],
| or a NES programming guide [3] would be a good way to see this in
| operation.
| 
| A more modern CPU/GPU setup is likely to use a ring buffer. The
| buffer will be in CPU memory, and that memory is also mapped into
| the GPU address space. The driver on the CPU will write commands
| into the buffer, which the GPU will execute. These commands are
| different from the shader units' instruction set.
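| 
| A sketch of that submission path (the command format and the
| write-pointer register are made up; real GPUs each define their
| own):
| 
|   #include <stdint.h>
| 
|   #define RING_ENTRIES 256
| 
|   struct ring {
|       uint32_t cmds[RING_ENTRIES]; /* in CPU memory, GPU-visible */
|       uint32_t tail;               /* CPU produces here          */
|       volatile uint32_t *wptr_reg; /* mapped GPU write-ptr reg   */
|   };
| 
|   static void emit(struct ring *r, uint32_t cmd)
|   {
|       r->cmds[r->tail % RING_ENTRIES] = cmd; /* plain store */
|       r->tail++;
|   }
| 
|   static void kick(struct ring *r)
|   {
|       /* Publishing the new tail through the mapped register is
|        * what tells the GPU there is more work to fetch and run. */
|       *r->wptr_reg = r->tail;
|   }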
| 
| Commands would be things like setting some internal GPU register
| to a value: setting the output resolution, the framebuffer base
| pointer, or the mouse pointer position; referencing a texture
| from system memory; loading a shader; executing a shader; setting
| a fence value (useful for seeing when a resource - a texture or
| shader - is no longer in use).
| 
| Hierarchical DMA buffers are a useful feature of some DMA
| engines. You can think of them as similar to subroutines. The
| command buffer can contain an instruction to switch execution to
| another chunk of memory. This allows the driver to reuse common
| operations or expensive-to-generate sequences. OpenGL's display
| lists were commonly compiled down to a separate buffer.
| 
| [1] https://archive.org/details/amiga-hardware-reference-
| manual-...
| 
| [2] https://www.reaktor.com/blog/crash-course-to-amiga-
| assembly-...
| 
| [3] https://www.nesdev.org/wiki/Programming_guide
 
| chubot wrote:
| BTW I believe the memory maps are set up by the ioctl() system
| call (together with mmap() on the device file) on Unix (including
| OS X); ioctl() is kind of a "catch-all" hole poked through the
| kernel. Not sure about Windows.
| 
| I didn't understand that for a long time ...
| 
| I would like to see a "hello world GPU" example. I think you
| open() the device and then ioctl() it ... But what happens when
| things go wrong?
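| 
| A rough "hello GPU" along those lines on Linux, using the
| kernel's vendor-neutral DRM interface (needs the drm uapi headers
| - the include path may be <libdrm/drm.h> on your distro - and a
| /dev/dri node you're allowed to open):
| 
|   #include <drm/drm.h>  /* struct drm_version, DRM_IOCTL_VERSION */
|   #include <fcntl.h>
|   #include <stdio.h>
|   #include <string.h>
|   #include <sys/ioctl.h>
|   #include <unistd.h>
| 
|   int main(void)
|   {
|       int fd = open("/dev/dri/card0", O_RDWR);
|       if (fd < 0) { perror("open"); return 1; }
| 
|       char name[64] = {0};
|       struct drm_version v;
|       memset(&v, 0, sizeof v);
|       v.name = name;
|       v.name_len = sizeof name - 1;
| 
|       /* Ask the kernel driver behind this device who it is.
|        * Errors come back as plain errno values, like any ioctl. */
|       if (ioctl(fd, DRM_IOCTL_VERSION, &v) < 0) {
|           perror("ioctl");
|           return 1;
|       }
|       printf("driver: %s %d.%d.%d\n", name, v.version_major,
|              v.version_minor, v.version_patchlevel);
|       close(fd);
|       return 0;
|   }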
| 
| Similar to this "Hello JIT", where it shows you have to call
| mmap() to change permissions on the memory to execute dynamically
| generated code.
| 
| https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-...
| 
| I guess one problem is that this may be typically done in vendor
| code and they don't necessarily commit to an interface? They make
| you link their huge SDK.
 
___________________________________________________________________
(page generated 2022-03-30 23:01 UTC)