
	[HN Gopher] How fast can a 6502 transfer memory? ___________________________________________________________________ How fast can a 6502 transfer memory? Author : xmyatniyx Score : 164 points Date : 2022-06-16 11:24 UTC (11 hours ago)
	web link (imapenguin.com)
	w3m dump (imapenguin.com)
	\| forinti wrote: \| I once thought about reading HD floppies on a BBC Micro (which \| can handle SD and DD). But it turns out it can't handle the speed \| at which the bits come in (500kbps). \| \| SD and even DD are fine (125kbps and 250kbps), so you could read \| 360KB floppies from a PC. \| spc476 wrote: \| The 6502 doesn't have a pipeline, so it's quite easy to count \| instruction cycles and find out how much time a given piece of \| code will take. I did this technique to bit-bang a serial port \| back in the day (given two one-bit ports with the CPU doing all \| the work because the system was too cheap to have an actual \| UART). \| joosters wrote: \| To be really pedantic, there's a big difference between 'memory \| bandwidth' and 'memory transfer speed'. The former is just \| reading (or writing) to a block of memory, and the latter is \| copying data from one location to another. So a 'memory transfer \| speed' is going to be slower. \| jmull wrote: \| I think it's actually not that pedantic. \| \| "Memory bandwidth" is being used in marketing materials today, \| so it's a little useful to understand what it means. (The \| author of this article confuses it with memory transfer, \| probably others do as well.) \| bluemax wrote: \| Wow, this brings back memories from more than 3 decades ago. I \| created a routine on the C64 to copy memory and calculated the \| performance then around 25KB/sec. \| \| The first version contained a memory corrupting bug that took \| some time to figure out. Depending on the locations of the source \| and destination you have to start copying forwards from the \| beginning of the source, or backwards from the end. If there's an \| overlap you risk overwriting the source before it is copied to \| the destination. 
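The overlap hazard bluemax describes can be sketched in a few lines of Python. This is only an illustration, not the original C64 routine: the names `overlap_safe_copy` and `naive_forward_copy` are made up here, and a flat `bytearray` stands in for memory. The safe version copies forwards when the destination lies below the source and backwards otherwise, so no source byte is overwritten before it has been read.

```python
def overlap_safe_copy(mem: bytearray, src: int, dst: int, length: int) -> None:
    """Copy `length` bytes inside `mem` from `src` to `dst`, byte by byte,
    choosing the direction that never clobbers unread source bytes."""
    if dst == src or length <= 0:
        return
    if dst < src:
        # Destination is lower: walk forwards from the start of the source.
        for i in range(length):
            mem[dst + i] = mem[src + i]
    else:
        # Destination is higher: walk backwards from the end of the source.
        for i in reversed(range(length)):
            mem[dst + i] = mem[src + i]


def naive_forward_copy(mem: bytearray, src: int, dst: int, length: int) -> None:
    """Always copies forwards; corrupts data when dst overlaps above src."""
    for i in range(length):
        mem[dst + i] = mem[src + i]


if __name__ == "__main__":
    good = bytearray(b"ABCDEFGH")
    overlap_safe_copy(good, 0, 2, 4)
    print(good)  # the four bytes "ABCD" arrive intact at offset 2

    bad = bytearray(b"ABCDEFGH")
    naive_forward_copy(bad, 0, 2, 4)
    print(bad)   # "AB" gets repeated because the source was overwritten
```

With the regions overlapping by two bytes, the forward-only copy re-reads bytes it has already written, which is the kind of corruption described above.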
\| scarface74 wrote: \| I immediately noticed that in one of the code samples that he is \| loading and storing data from memory that's not on the first page \| - memory locations $00-$FF - memory access to and from the first \| page took one less clock cycle. \| \| The LDA, STA, etc operators for zero page access are different \| opcodes than their two byte address equivalents \| mmphosis wrote: \| This also misses a lot of other tricks: self-modifying unrolled \| code, keeping track of blocks of memory that don't need to be \| updated, or blocks of memory that have the same value. Memory \| moves may not be the fastest way. 0 HIMEM: 5608 \| 1DATA553256330092213902139021393213504523826353650303725262750384 \| 56153058173120922939029328358454788756458365750371488455503510839 \| 66132653706165276445377489621322135821322334213502130051520282036 \| 2DATAQLNZQLNZQAQQDSAQRDSAQQDSAQVDSAQXCKDNAPFXANAQXXANANXNXNXQXNXC \| QXKXNZQXKXNZQCQODMAQODMAUUYQXCKATAUVQXCKMNXQXKANXUAUTCQXKENXUMUSQ \| JHSAHFQZTXRDFQZTAOCQZTAOBQHDSAPDSAQADSANZNZKDSAQZDSAXZQZOZXZUAXZJ \| 3 READ L$: READ H$ 4 FOR I = 1 TO LEN (L$) \| 5POKE767+I,10*(ASC(MID$(H$,I,1))-65)+VAL(MID$(L$,I,1)) 6 \| NEXT RUN HGR : CALL 768: CALL 5608 \| metadat wrote: \| Is Mario Bros a ripoff of Sam's Journey? Or vice versa? \| \| Either way, beautiful. \| gs17 wrote: \| Sam's Journey looks to be a 2017 release, so it definitely \| didn't inspire Mario. \| \| [0] https://www.knightsofbytes.games/ \| cldellow wrote: \| Vice versa, Sam's Journey was released in 2017: \| https://www.knightsofbytes.games/samsjourney \| dusted wrote: \| I wouldn't call the 6502 a RISC CPU.. It was clearly designed for \| humans to program, it has multiple and complex addressing modes \| and instructions to make it easier for us.. \| \| Sure it is a small instruction set compared to modern CPUs, but \| RISC is an idea, not a number. 
\| \| I'd venture to say that RISC is designing with the goal of making \| very efficient instructions and allow very efficient compilers to \| be written for it.. It's the idea to have one faster way of doing \| something rather than multiple convenient ways, because the \| compiler doesn't care, and compiler vendors appreciate not having \| to choose between multiple almost similar instructions that may or \| may not be faster in some particular case if they don't have to. \| jsrcout wrote: \| Don't remember where I saw this, but someone said the 6502 was \| really an RTCC (Reduced Transistor Count Computer). \| [deleted] \| pvg wrote: \| It's a joke, as the article says. \| vardump wrote: \| Two nits to pick from this article: \| \| While the article does mention it ignores loop unrolling, it's a \| bit disingenuous, because that almost DOUBLES performance and \| it's what nearly all real world code is doing. \| \| Also Sam's Journey PAL version does not need any kind of DMA \| transfer tricks. NTSC version is just a tiny bit behind in \| timing, so to be glitch free, REU is used. PAL version still \| works with minor glitches on an NTSC system. \| \| This is because NTSC has 263 * 63 - 25 * 40 = 15569 cycles \| available per frame (ignoring those stolen by sprites) and PAL \| 263 * 63 - 25 * 40 = 18656 cycles (again, ignoring sprites). \| \| The difference is enough that the NTSC version can't move \| required 2000 bytes of color RAM and character RAM in time in the \| worst case without REU. \| Rediscover wrote: \| I'm remembering (possibly incorrectly) PAL being 312 (not 263). \| \| Is that what You intended? \| vardump wrote: \| Yeah. I accidentally left that value wrong after a copy \| paste. But the result is right. :-) \| Joyfield wrote: \| My DOCSIS 3.1 Internet connection has more download bandwidth \| than my Amiga 500 had to RAM. Latency, not so much. \| cmrdporcupine wrote: \| Anybody else interested in writing a WASM VM for the 6502 or \| 65816 etc? 
This was my brainwave this week. I think this would be \| a supremely nerdy fun thing to do. \| DeathArrow wrote: \| Beat that, Apple! \| cestith wrote: \| Apple used the very same processor family at one time. ;-) \| \| They've come a long way. \| NobodyNada wrote: \| For a very loose definition of "same processor family", they \| still do -- ARM is sort of a spiritual successor to the 6502: \| https://en.wikipedia.org/wiki/ARM_architecture_family#Histor. \| .. \| \| ARM was designed by the team at Acorn that had worked on the \| BBC Micro, which used a 6502. They decided to design a custom \| processor because they felt none of the 16- or 32-bit \| processors on the market at the time met the standard set by \| the 6502 for simplicity and low cost. So, they designed their \| own architecture which took cues from both the cutting-edge \| RISC research in academia, and the simple practicality of the \| 6502. \| \| (On a similar note: the 6502's main competitor, the Zilog \| Z80, is an early ancestor of x86! The Z80 is an enhanced \| clone of the Intel 8080, which of course the 8086 was heavily \| based on.) \| \| This legacy still shows up today in the instruction \| mnemonics: ARM uses "branch" naming (BEQ - branch if equal, \| BCS - branch if carry set, etc) because that's what the 6502 \| used, whereas x86 spells it "jump" (JEQ, JCS, etc.). ARM uses \| LDR/STR to load and store registers from memory (like the \| 6502's LDA/LDX/LDY/STA/STX/STY), whereas x86 just spells \| everything "MOV". ARM only uses memory-mapped I/O to access \| hardware, whereas x86 has separate input and output ports. \| cestith wrote: \| The 6502 was a clone-ish of the Motorola 6800 made to be \| lower cost. The 6800 led to the 6809 (another \| competitor, used by the Tandy CoCo and IIRC the Dragon) and \| to the 68000 series, used by Apple in the Mac, Sun in its \| early systems, NeXT, Amiga, Atari in their later systems, \| and more. 
That led to the PowerPC partnership of Motorola, \| Apple, and IBM. \| \| PowerPC was outliving its useful life due not to ISA, but \| manufacturing limitations. So Apple went to Intel, but that \| wasn't fit for mobile. Apple partnered with ARM to make \| their mobile chips. Then their mobile chips grew into the \| M1 and M2 along with ARM, bringing them back to a RISC-ish \| platform like they had with PowerPC. So it's sort of a dual \| path back to the same place. \| NobodyNada wrote: \| > So Apple went to Intel, but that wasn't fit for mobile. \| Apple partnered with ARM to make their mobile chips. \| \| There's a lot of interesting history there too: in 1990, \| after seeing the first-generation ARM CPU, Apple \| partnered with Acorn to co-found ARM Ltd and develop a \| mobile processor for the Apple Newton. Although the \| Newton was a failure, ARM was very successful and powered \| pretty much the entirety of the mobile device revolution \| -- including of course the iPod and iPhone. \| \| Apple's co-founder status gives them a lot of influence \| over the ARM architecture -- they led the AArch64 design \| process, and they seem to be allowed to do things that \| even other architectural licensees aren't allowed to do, \| like implementing custom instructions in their ARM cores: \| https://news.ycombinator.com/item?id=29783549 \| \| And Apple's iteration of ARM owes a lot to the PowerPC \| world as well -- Apple's processor design team was \| originally PA Semi, a company that designed PowerPC \| cores. \| klelatti wrote: \| > they led the AArch64 design process \| \| Interesting - is there a reference for this? \| NobodyNada wrote: \| Here's a Twitter thread from a former Apple engineer: \| https://twitter.com/stuntpants/status/1346470705446092811 \| \| > arm64 is the Apple ISA, it was designed to enable \| Apple's microarchitecture plans. There's a reason Apple's \| first 64 bit core (Cyclone) was years ahead of everyone \| else, and it isn't just caches. 
\| \| > Arm64 didn't appear out of nowhere, Apple contracted \| ARM to design a new ISA for its purposes. When Apple \| began selling iPhones containing arm64 chips, ARM hadn't \| even finished their own core design to license to others. \| \| > ARM designed a standard that serves its clients and \| gets feedback from them on ISA evolution. In 2010 few \| cared about a 64-bit ARM core. Samsung & Qualcomm, the \| biggest mobile vendors, were certainly caught unaware by \| it when Apple shipped in 2013. \| \| > Apple planned to go super-wide with low clocks, highly \| OoO, highly speculative. They needed an ISA to enable \| that, which ARM provided. \| \| > M1 performance is not so because of the ARM ISA, the \| ARM ISA is so because of Apple core performance plans a \| decade ago. \| klelatti wrote: \| Very interesting - many thanks! \| \| Edit: I'm a bit puzzled by the claim that Apple was \| selling Aarch64 before Arm had finished their first \| design - A7 announced at end 2013 but A53 appeared in \| 2012? \| NobodyNada wrote: \| It looks like A53 was _announced_ in October 2012, but \| I've found no indication of whether the design was \| actually finished by then [0]. And remember that ARM just \| sells IP and other companies are responsible for \| manufacturing it; it doesn't look like anyone actually \| produced A53 cores until 2015 [1] -- whereas Apple was \| shipping actual consumer products with A7's in them by \| October 2013. \| \| [0]: https://www.techspot.com/news/50656-arm- \| announces-64-bit-cor... \| \| [1]: https://en.wikichip.org/wiki/arm_holdings/microarchi \| tectures... \| klelatti wrote: \| Very fair point. OTOH there was a lot of detailed info on \| the A53 available in 2013 and SoCs were being announced \| with it. \| \| I suspect this thread may be slightly exaggerating the \| position but certainly the case that Apple were well \| ahead of all the competitors - and no doubt they were \| deeply involved in the ISA design. 
\| cmrdporcupine wrote: \| I honestly don't think there's any kind of straight line \| from the 6809 to the 68000. They share little in common \| other than the '68' prefix and coming from the same \| company and being big endian. The instruction sets are \| very different. Designed by different teams. The \| peripheral chip set and bus management was different too. \| \| The 68k shares more with 1970s minicomputers especially \| the PDP-11 and/or VAX architectures than any MPU that \| preceded it. \| DeathArrow wrote: \| >the Zilog Z80, is an early ancestor of x86! The Z80 is an \| enhanced clone of the Intel 8080, which of course the 8086 \| was heavily based on \| \| I owned an Z80 based computer when I was 8 to 10 years old. \| Its instruction set and memory access does not have any \| resemblance for me with 8086. \| \| They seem like very distant relatives. \| klelatti wrote: \| The 8086 was designed to allow automated translation of \| 8080 assembly to 8086 assembly - so the instruction set \| may 'look' different but in fact has a lot in common. \| \| Not quite right too to call the Z80 an ancestor of the \| 8086 but certainly closely related due to the common \| inheritance from the 8080. \| NobodyNada wrote: \| Yeah, perhaps more of an uncle than a direct ancestor :) \| klelatti wrote: \| Indeed - someone should do a family tree of CPUs! \| cestith wrote: \| We need to decide where the NEC v20, v30, v40, and v50 \| live. \| krallja wrote: \| And the NSC-800, which is like a Z80 with 8085 half- \| interrupts! \| klelatti wrote: \| It's the offspring of the marriage of Z80 and 8085! \| MarkusWandel wrote: \| This also misses loop unrolling, combined with an assembly \| language version of "Duff's device" to be able to do an arbitrary \| number of transfers even if your loop is unrolled to, say, 8 \| transfers. \| \| This stuff used to matter! I had an NCR5380 chip on an Amiga, \| simple, memory mapped I/O, no DMA or interrupts. 
To get a tape \| drive to stream (remember that?) the byte transfer loop really \| had to be tweaked. But once fully tweaked, "whoooooooosh" instead \| of "chugga chugga chugga". \| \| And truly heroic programming techniques had to be employed on the \| C64 to do X/Y smooth scrolling games. Often a static part of the \| screen, conveniently displaying scores etc, existed to make it \| work - there was just enough bandwidth to do 80% of the screen, \| say, so you find an excuse to keep the rest of it static. \| \| I kinda miss those days, and I kinda don't. I guess it was good \| to have experienced them. \| djmips wrote: \| Those days still exist! If you want your mind blown watch the \| Epic Games Nanite talk from last year's SIGGRAPH where the core \| rendering of the dense vertex data is done directly in Compute \| , IE software rendered, instead of using the hardware \| rasterization hardware which has a minimum 4 pixel invocation \| overhead which gets expensive with very small triangles. \| \| This is but one example of this that's happening every day, \| there is much much more like hair rendering in EA FIFA soccer \| or automatic trading financial software running on GPUs. \| \| There's a whole world of applications where people are still \| concerned with every last cycle of performance just like in the \| C64 days. \| djmips wrote: \| Here I'll save you the trouble of trying to find the video. \| \| https://www.youtube.com/watch?v=eviSykqSUUw&list=PLabw4gCouT. \| .. \| amelius wrote: \| We're now doing sort of the same tricks, but with power \| management on mobile devices. \| djmips wrote: \| Kind of off topic for 6502 memory copy speeds but with regard \| to scrolling, there became a pretty cool software hack for the \| C64 (called VSP) where you could trick the poor VIC chip into \| starting scanning out the screen position later in memory. Move \| the start one character and the whole screen shifts left by 8 \| pixels. 
You only need to repaint a vertical column for this \| 'coarse' scroll instead of moving the entire screen of \| characters. This is something that should have been built into \| the hardware and was very useful on other systems that had that \| ability (like the NES for example) \| \| With it you can reduce the amount of memory you need to copy \| every 8 pixels (the 8 pixel part can be done with smooth scroll \| registers). \| \| There's a thread and example code on github here. \| https://www.lemon64.com/forum/viewtopic.php?t=70539 \| \| Also note it's such a terrible hack on the DRAM that it doesn't \| work on all C64s and there's a technical discussion about that \| here. https://www.linusakesson.net/scene/safevsp/index.php \| \| Hardware mod if VSP doesn't work on your C64 and more technical \| details. http://wiki.icomp.de/wiki/VSP-Fix \| \| Also it makes mention of the C64-Reloaded which is a modern C64 \| product that includes the fix. \| js2 wrote: \| > This also misses loop unrolling. \| \| It's mentioned at the bottom of the article in the "Thoughts" \| section: \| \| _You could certainly use self modifying code and unroll this \| copy routine to get better performance at the price of \| flexibility and arguably understanding for the average casual \| 6502 assembly coder. Again, this was not a "how fast can we \| absolutely make it" but an everyday use examination._ \| hinkley wrote: \| Duff's device is a fixed size loop unrolling with an ugly \| hack to make it behave for arbitrary inputs. The assembly \| makes sense but the C code is rough. \| \| It's not quite as fast as self-modifying or custom compiled \| code, but it's pretty close. 
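The "fixed size unrolling that still behaves for arbitrary inputs" idea can be sketched in Python. This is not Duff's device itself: Python has no fall-through `switch`, so the jump into the middle of the loop body is replaced by a plain leftover loop, and the name `unrolled_copy` is made up for this sketch. It only shows the structure (handle `count % 8` single transfers first, then run the 8-way unrolled body `count // 8` times). Python gains no speed from it.

```python
def unrolled_copy(src, dst, count: int) -> None:
    """Copy `count` items from src to dst using an 8-way unrolled body.

    The leftover (count % 8) transfers are done first, mirroring the order
    in which Duff's device handles them; the rest run eight per iteration.
    """
    i = 0
    # Leftover transfers: 0 to 7 single copies.
    for _ in range(count % 8):
        dst[i] = src[i]
        i += 1
    # Unrolled body: eight transfers per trip, one loop test per eight bytes.
    for _ in range(count // 8):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        dst[i + 2] = src[i + 2]
        dst[i + 3] = src[i + 3]
        dst[i + 4] = src[i + 4]
        dst[i + 5] = src[i + 5]
        dst[i + 6] = src[i + 6]
        dst[i + 7] = src[i + 7]
        i += 8


if __name__ == "__main__":
    source = bytes(range(19))
    target = bytearray(19)
    unrolled_copy(source, target, 19)  # 3 leftover copies, then 2 unrolled trips
    print(bytes(target) == source)
```

On a 6502 the gain comes from paying the index update and branch once per eight bytes instead of once per byte.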
\| tialaramex wrote: \| Tom Duff's device was doing that because he's doing MMIO, you \| should not [I know you're not suggesting it, but just in case \| anybody reading thinks it's clever] do this today when you \| don't want MMIO, your compiler is very capable of just doing an \| actual copy quickly, so tell it that's what you want, don't \| write gymnastics like Duff's device. \| \| However, expressing these partially unrolled loops nicely is a \| nice performance-not-safety feature of WUFFS called "Iterate \| loops": \| \| https://github.com/google/wuffs/blob/main/doc/note/iterate-l... \| \| Well, I say performance not safety, as always they want both, \| but you _could_ safely just write the never unrolled case, \| while the existence of Iterate loops allows you to express a \| much faster special case but know the compiler will fix things \| up properly no matter what. \| vardump wrote: \| We're talking about C64 (and maybe Amiga). \| \| Compiler is not going to do absolutely anything for you on \| those retro platforms. \| tialaramex wrote: \| Aw, just needs a better compiler (with a 6502 target) :D \| \| Jason Turner's CppCon 2021 talk, "Your New Mental Model of \| constexpr" has half the presentation as a C64 program \| (though for practical reasons not actually running on a C64 \| but instead an emulator) because _most_ of the heavy \| lifting is done by the C++ 20 compiler. \| https://youtu.be/MdrfPSUtMVM?t=1422 \| \| Now, Jason's approach is not going to beat hand-crafted \| 6502 machine code _in a fair fight_ but he often doesn't \| need to fight fair and that's the point of his talk. \| localhost wrote: \| The C64 VIC-II chip would grab the address bus from the CPU \| every 8 scan lines on the screen. 
Some of the early "fast load" \| cartridges like the Epyx FastLoad cartridge that would \| accelerate loading games from the floppy drive would blank the \| entire screen during load so that their async data transfer \| routines wouldn't get interrupted by the VIC-II chip grabbing \| the bus. I wrote a similar (better?) cartridge where I would \| need to use the register on the VIC-II chip that reported the \| scan line as a sync marker to transfer 3 bytes asynchronously \| from the 1541 down the clock and data lines of the serial bus. \| Good times. \| MarkusWandel wrote: \| In my recollection Epyx Fastload did not blank the screen, \| though some earlier fast loaders did. \| \| I also remember the software voice synthesizer "SAM" needing \| to blank the screen to render glitch-free sampled audio. Then \| along came, what was it "Impossible Mission" ("Another \| visitor! Stay a while...") doing pretty clean sampled audio \| with the screen on. Not that the C64 SID chip was even \| remotely intended to be able to play sampled audio in the \| first place! \| \| The Amiga was unimaginably powerful by comparison. Even a \| basic configuration had 8x the memory a C64 had, and it had \| all those fancy DMA toys to offload the CPU. \| jimsmart wrote: \| > Not that the C64 SID chip was even remotely intended to \| be able to play sampled audio in the first place! \| \| I don't recall how SAM did it, but sample playing on the \| C64 SID chip was indeed a nice trick -- it was actually \| done by modulating the main output volume, which made a \| slight click when changing. \| \| Eventually this got used by some of the C64 musicians / \| music player libs, so one could play a channel of samples \| as well as the three regular synth channels on the SID. \| IIRC, Outrun used this particularly well in its title \| screen and/or loading music, having some vocal samples \| "O-O-Outrun!" (and skidding sound effects) as well a \| sampled drums. 
\| \| Annoyingly, IIRC, some revisions of the SID chip behaved \| slightly differently, and had louder or softer sample \| playback when this hack was used. But still: clever stuff. \| weinzierl wrote: \| ... and main output volume had only 16 levels so the \| samples were 4 bit quantized. It is a wonder that we \| could get understandable vocal samples with this hack at \| all. I distinctly remember "Goal!" from Peter Shilton's \| Football and "Accolade presents" from Test Drive [1]. In \| the examples one can hear the amount of quantization \| noise that low bit depth caused. \| \| [1] https://m.youtube.com/watch?v=L1u-WydiiCI \| vardump wrote: \| You can actually get about 6-7 bits resolution out of \| same SID volume register. 4 bits from the volume \| register, channel 3 disable and 3 filter bits. Requires \| some setup to get SID in a particular state first. \| \| For details, see: https://livet.se/mahoney/c64-files/Musi \| k_RunStop_Technical_D... \| le-mark wrote: \| Crypto mining with fpga was in the weeds down this path, or 100 \| Gbps signal processing. Two examples where low level stuff is \| still relevant, just not commodity and widely available like \| the 8 bit micros were. \| aerique wrote: \| I also read a high frequency trading blog that was posted \| here a few years ago. Same thing: hacking hardware and \| software so the first bytes of info could be grabbed from a \| stream and acted upon, instead of needing to wait for the \| whole package to have come in. \| \| Also when I was in the demo scene on the Atari ST one had to \| do specific timings in the assembly code to be able to draw \| outside the screen's borders (so the borders were on screen \| but couldn't be drawn on by code). \| Psyladine wrote: \| >I kinda miss those days, and I kinda don't. I guess it was \| good to have experienced them. \| \| There's a certain reflective quality and even satisfaction from \| using a chainsaw after coming up using a tree saw by hand. 
It \| feels progressive, even if it is just optimizing for time at \| the expense of energy. \| 6510 wrote: \| I eventually managed to do scroll texts fast enough to do \| them while the pixels of the scroll text were printed to the \| screen. It was even fast enough to have char combinations in \| the scroll text that modified its speed and direction with \| speeds like "one time scroll 1 pixel the next 2". One of the \| tricks was to do "poor" timing with NOP's by requiring an empty \| row of bits between each scroll text. (The text becomes \| unreadable anyway if there is no space between the lines) \| vidarh wrote: \| Most C64 games used character-based graphics (coupled with the \| smooth scrolling support in the VIC) which meant you'd at most \| move 2000 bytes to scroll the entire screen every 4 to 8 pixels \| scrolled. \| \| You can easily scroll the entire screen on a C64 if that's all \| you're doing. \| \| Some games did also scroll bitmaps. There the naive version \| requires moving 9000 bytes (40x25x8 for the bitmap data, 40x25 \| for the colour data) every time you need to scroll, and that \| indeed starts to bite. There are games which reduce this cost \| using a trick called AGSP ("Any Given Screen Position"). \| \| But you're right static parts of the screen were often larger \| to reduce the dynamic part. That was rarely down to just \| scrolling the screen in isolation, though, but because the \| overall budget of cycles you had to work with was tiny. Often \| you might also have a lot of other stuff which consumed lots of \| cycles _affected directly by the size of the playing field_. \| E.g. if you did sprite multiplexing (moving a sprite after it \| had been partially or fully rendered to reuse the same hardware \| sprite), you might well be keeping the CPU busy throughout the \| full rendering of the playing field. 
\| \| There was also the consideration of how much effort you wanted \| to go to in order to avoid glitches, since unless you could do \| the scrolling entirely while the VIC was rendering the parts of \| the screen outside the playing field, you'd need to make sure \| the rendering and copying didn't overlap, and of course just \| restricting playing field size was an easy workaround for that \| problem. \| MarkusWandel wrote: \| I didn't get that fancy. I got as far as a horizontal smooth \| scroller, but with the "move the screen memory during one \| 1/60 second redraw cycle" mentality - racing the redraw, and \| when it was just about caught up, whoa, time for the static \| bar at the bottom. \| \| Quite right, one could prepare the moved version in the \| background during the 7 steps where you're merely diddling \| the smooth scroll register, and then flip to it in an \| instant. But wait, was it possible to page flip the colour \| map? Also, always having the appropriate moved version ready \| even as the player is doing unpredictable things goes into \| the "heroic programming techniques" zone again. \| \| As for glitches, it's amazing what can be done if perfection \| is sacrificed and there were plenty of good games that did \| have them, e.g. sprite multiplexing. But I did mean \| "effortless looking perfect" smooth scrolling. \| jimsmart wrote: \| Ex-C64 games coder here: you are correct - no, you \| couldn't relocate/page-flip the colour map, like you could \| the character map. So you had to update it all somehow on \| the required frame. 
\| \| The fastest technique I saw for updating the colour map in \| a single go, was to have the whole thing as a huge block of \| immediate mode load-stores, then one could 'scroll' the \| data across the LDA instructions within that code, in \| advance, over n-frames, then call this self-modified code \| block when one did the character screen flip (immediate \| load-stores was faster than load-stores from colour ram). \| e.g.
\|     scroll_splat_colour:
\|         LDA #$00     # colour data for char
\|         STA $D800    # colour ram
\|         LDA #$00
\|         STA $D801
\|         # etc., for every visible char onscreen in scrolling area
\| \| And one would be updating/scrolling those values loaded \| into the A register, in chunks over previous frames, \| similar to:
\|         LDA scroll_splat_colour+6
\|         STA scroll_splat_colour+1
\|         LDA scroll_splat_colour+11
\|         STA scroll_splat_colour+6
\|         # etc., for every lda/sta in the above
\| \| Perhaps not the clearest explanation, but hopefully enough \| to communicate the idea. \| \| FWIW, I didn't invent that technique, it was an improvement \| Jon Williams made to my code, whilst we both worked for \| Images Software (now Climax). Not sure where he got it \| from, maybe he invented it himself, maybe he cribbed it \| from elsewhere. \| \| Related: I thought sprite multiplexing was awesome, and \| there were quite a few tricks there too to get it \| performant. But that's another far more complex topic. \| vidarh wrote: \| Another "obvious" trick is to narrow the playing field \| but animate the rest, and then save cycles by a \| combination of bands that require fewer updates and \| sprites. E.g. Pole Position is a classic example, where \| the graphics covers most of the screen, but only about \| half actually has gameplay. The rest consists of a very \| narrow band of mountains, and a couple of bands of \| clouds. 
I haven't looked at what they did for Pole \| Position, but that pattern of the actual gameplay being \| constrained to a much smaller portion of the screen than \| what looks like the playing area is pretty common. \| vardump wrote: \| How to handle the case when player changes direction to \| exact opposite immediately after the frame color data was \| transferred? Double buffering splatting code? Although \| one copy of it is 5001 bytes, ouch. \| jimsmart wrote: \| You generally don't handle that case! :) -- Instead you \| let the player move within a rectangular area onscreen, \| and decide on which way you are going to scroll the \| screen in advance (or rather: after the fact, depending \| on how one looks at things), based upon where the player \| is inside that rectangle. So the screen catches up with \| where the player is moving/pushing. \| \| Eight-way scrolling like this was always a massive pain \| on the C64 (and other systems that used buffered \| scrolling with no h/w, e.g. Atari ST), but that way (a \| box the player moved around inside) was the only \| realistic way of handling it if you had to do a bunch of \| work in advance before doing the actual scrolling. Turns \| out that having the player in a loose rectangle is also \| easier on the eye too, which is perhaps why it's also \| used on systems that don't suffer the same h/w \| restrictions. \| \| Yeah, the colour RAM update was a lot of bytes to move. \| But dedicating a big chunk of code to it meant one could \| be a little freer to use slightly slower techniques \| elsewhere in the update cycle. Side note: the C64 \| actually only had 39 visible char across the screen when \| in 'scrolling' mode, because the borders were shrunk in \| slightly (and slightly more than one expects). So one \| less char to worry about per line. 
That saved a tiny \| amount of code / memory / execution-time for the colour \| splat (and the scrolling of partial chunks - whether on \| back buffers, or the data within the colour splat code - \| over the other frames). Sure, it's only one less \| character. But it saved some cycles. And cycles mattered! \| Particularly when doing something with that much data to \| move / that took that much time. \| vardump wrote: \| > C64 actually only had 39 visible char across the screen \| \| But 40 color cells were still visible, unless horizontal \| scroll register was 0. \| jimsmart wrote: \| No, but that's an easy enough mistake to make :) -- It's \| called 38-column mode, and when enabled the VIC shrinks \| both borders in by 8 pixels, and then offsets the screen \| according to the x-scroll register bits. \| \| [Edit: another source says it's actually 7-pixels hidden \| on the left, and 9 on the right. But whatever: same \| principle, the screen is shrunk by 16 pixels in total \| horizontally] \| \| Which meant that at most only 39 characters were visible \| across the screen -- with two of those, one at each end \| of the row, being partially visible -- and that applies \| to both the character screen and its associated colour \| RAM. Only 38 characters were visible when the x scroll \| register was zero, and as soon as one shifted to a value \| of 1-7, the 39th column became visible (and the 1st one \| became partially offscreen). But the 40th column is never \| visible when in that mode. \| \| For more info see: \| \| http://www.devili.iki.fi/Computers/Commodore/C64/Programm \| ers... \| \| "When scrolling in the X direction, it is necessary to \| place the VIC-II chip into 38 column mode. This gives new \| data a place to scroll from. When scrolling LEFT, the new \| data should be placed on the right. When scrolling RIGHT \| the new data should be placed on the left. Please note \| that there are still 40 columns to screen memory, but \| only 38 are visible." 
\| \| -- But it's discussed on a handful of other pages too, if \| you google. \| vardump wrote: \| Oh damn... and I did a fair bit of coding on C64 back in \| the day. :-D \| \| Somehow I thought it hid 4 pixels both sides. Totally \| wrong. \| \| PS. Then it's so unfair bad line still takes 40 cycles! \| jimsmart wrote: \| Please stop giving the above comment downvotes because of \| this person's lack of knowledge: we all have to learn \| things -- there was once a time I didn't know this \| either. \| \| It's not like vardump here was being a dick about \| anything in their comment, cut them some slack! \| vardump wrote: \| Thanks. Although I really should have known better, wrote \| scrolling routines 35 years ago. \| \| It's scary time can corrupt memories we consider as \| facts. \| Luc wrote: \| I made an 8-way full-screen scroller. \| \| To avoid situations like this the player sprite at the \| center of the screen had momentum, i.e. the sprite had to \| rotate 180 degrees to change to the opposite direction, \| giving a few frames time to set everything up. \| jimsmart wrote: \| That's a nice little trick, cheers for sharing. (Not that \| I'll get a chance to use it these days, but still) \| 6510 wrote: \| > # etc., for every visible char onscreen in scrolling \| area \| \| For every changed char. (which is sometimes more and \| sometimes less) \| \| You could do them in order but if you're using only a few \| characters you need only 1 LDA for each char. (How to do \| this is left as a creative exercise for the reader) \| jimsmart wrote: \| But the overheads of tracking which characters might have \| been changed here completely outweighed simply scrolling \| / updating the whole thing. The code becomes too involved \| in tracking changes, and fudging about with- / rewriting- \| the splat code. \| \| You can leave it as 'a creative exercise for the reader', \| but that's because you can't solve this for the generic \| case (i.e. 
any map the graphics artists might give you) \| in fewer cycles than simply dealing with each and every \| character, which is the worst case. \| \| Processing that many bytes, and doing comparisons and \| extra branches, simply becomes overhead, and, very \| quickly, your code is slower than simply updating / \| scrolling everything simply. \| \| For the colour splat routine, having a giant, pre- \| assembled block of immediate-mode load/store pairs for \| every character is as optimal as it gets -- and handles \| all cases -- on the C64, you only have a frame to update \| the colour RAM (because it cannot be relocated/paged), \| and you are generally chasing the scan beam to move that \| much data before the next frame. \| \| You don't have the luxury of having extra cycles to re- \| write that block of code at runtime, and rewrite the code \| that scrolls the data within that code, nor do you have \| the luxury of having enough spare cycles to be comparing \| data, and branching conditionally depending on if it has \| changed or not. \| \| Perhaps you misunderstand the technique I describe, or \| perhaps you under-estimate the overheads required to \| perform what you describe. Or perhaps both. \| egypturnash wrote: \| That's... that's horrible. Beautiful, but horrible. \| \| Which kinda describes any advanced c64 technique, really. \| jimsmart wrote: \| Indeed, I totally agree on all points :) \| justinlloyd wrote: \| Yeah, compiled graphics and compiled colour tables, also, \| a routine that could self-modify code in regions of RAM \| to do the colour table writes. A slow set-up function at \| level start would build the code to be JSR'd later in the \| level. We did that on a few games on the C64 and the \| Speccy and Beeb and Atari. Later used the same techniques \| in DOS on PC. And of course, doing the same tricks but \| with D0 through D7 and A0 through A6 on Atari ST and \| Amiga. Also doing "stuff" in zero page because the \| address loads were shorter. 
And avoiding 256-byte page \| boundaries where possible because of the cycle penalty. \| jimsmart wrote: \| > A slow set-up function at level start would build the \| code \| \| Interesting, and good thinking :) \| \| IIRC, when we used this technique on the C64, we didn't \| build the code during init at runtime, we actually built \| the code in the dev environment, using macros, so it got \| built at assembly/compile time. So we skipped the small \| time hit at runtime init, at the expense of a slightly \| longer load time for the user (and a tiny bit longer on \| our assembly/compile times, although that was fairly \| negligible cos we were building on PCs). \| jimsmart wrote: \| Ex C64-games coder here! -- If your sprite multiplexer was \| taking most of CPU during the screen draw time, then honestly \| it was not a particularly great multiplexer! ;) \| \| Most decent multiplexers took just a scanline or two/three, \| multiple times down the screen (i.e. whenever relocating any \| already drawn sprites) -- often with decent sized gaps (time \| when the CPU wasn't involved in manipulating sprites and \| could do other things), with a larger chunk during the \| offscreen period / at the bottom of the screen, when one was \| prepping the data (mostly sorting the sprite's y-coords) for \| the next frame's screen draw. \| \| -- During debugging/etc, we'd often enable colour changes to \| the screen border, at the beginning and end of the \| multiplexer code (for both the interrupt stuff in the \| playfield, and the non-playfield section), so we could \| visually see how it was working/performing. \| vidarh wrote: \| Sure, the "nice" way of doing it is to rely on the raster \| interrupt. But I've also seen way too much C64 code where \| pretty much everything ran in the interrupt handler, with \| associated stupid busy waiting because it saved people from \| having to synchronise. 
I'd guess more commonly for cheap \| and cheerful ports from less capable machines, but it's \| been a couple of decades since I've actually looked at any \| of this code. \| cesaref wrote: \| 64 coder here too! \| \| The border changing thing has just reminded me how bad the \| development process was using the commodore assembler with \| a 1541 drive which was horribly slow. assemble, dump image, \| reboot, crash, reboot, load assembler, try and work out \| what had happened :) \| \| At some point I ended up with a PC running a system called, \| I think PDS, which was a cross assembler with dongle to \| push the image straight into the memory of the C64. I even \| think you could inspect and change memory on the running \| machine - it was amazing! \| jimsmart wrote: \| Yeah, we all used PDS too, although not originally. \| Pretty good system, particularly for that era, and \| cost/capability-wise (though they weren't that cheap, and \| folk eventually started cloning the boards for them, \| IIRC). \| \| I remember it was annoying to have only 8 main source \| files in PDS though, most big projects went past the 8 \| files of however many kb (although it could also handle \| include files, which was how one got around that limit). \| \| Although when I actually started out as a C64 games dev, \| my dev system was a BBC Micro B, linked to a C64. Not \| quite as cool as PDS, but it could assemble code 2x the \| speed of the C64 (the processor clocked twice the speed \| on the Beeb), and it was great having a separate 'host' \| system for development. \| jimsmart wrote: \| Here's a link to info about the PDS kit, in case anyone \| is interested: \| \| https://www.cpcwiki.eu/index.php/PDS_development_system \| mgkimsal wrote: \| Just watched a video of C64 "Seven Cities of Gold" with a \| colleague yesterday, trying to convey just how... exciting \| that was in 1984. Watching on YouTube, I had forgotten just \| how small the playing 'viewport' was. 
It seems like possibly \| a more extreme example - I don't remember too many other \| games having an action viewport that small. \| kken wrote: \| This doesn't even cover all the neat assembly tricks with self- \| modifying code that you would actually use on a 6502 to speed up \| memory transfer. \| MatthiasWandel wrote: \| For games that scrolled the screen, those had to happen \| essentially between scans, so a lot of tricks were employed. \| Fixed addresses in the code, unrolled loops, and self modifying \| code to avoid the expensive zero page indirect indexed \| addressing mode (the slowest instruction on the CPU). The other \| trick was to start moving the first line of screen just after \| it got displayed, which would give you nearly two jiffies to do \| it before the scan caught up to you on the next frame. \| vardump wrote: \| No need, it can easily happen during the scan. As long as the \| scan and update memory location never meet, there's \| absolutely no problem. \| natly wrote: \| It's crazy how much work went into those old games. I have a \| feeling those programmers weren't even paid that well \| considering how few people owned computers back then (so the \| market can't have been large). \| wkearney99 wrote: \| If you ever play(ed) the Atari 2600 version of River Raid, \| you got to witness some SERIOUS tweaking to work around the \| limits of that console. Every scanline processed on the fly \| during the vertical blanking interval. No screen buffer. \| The animation was soooo smooth. \| kabdib wrote: \| My first job out of college at Atari in 1982, writing game \| cartridges for the 400/800 computers, paid $25K a year. My \| first raise after a year was to $30K. \| \| There were programmers in other divisions making royalties \| off of their games. 
Tod Frye famously got $700K or so for \| his terrible version of 2600 Pac-Man (it was terrible not \| because he was a bad programmer, but because marketing \| decided that 2K of ROM had to be enough, and he was smart \| enough to pull off a miracle . . . of sorts). \| \| Also, the OP apparently doesn't know how to unroll loops, \| which is the first thing you do to your game's hot spots. \| (Never had to resort to self-modifying code). \| vikingerik wrote: \| I did this in a homebrew Atari 2600 game. For a Space Invaders \| grid of sprites. Each is triggered by writing to a register, as \| the electron beam scans through to display each sprite. \| \| The interval between sprites on the same scanline is 3 cpu \| cycles. That's a single 6502 instruction, the write to that \| register. How do you do any kind of load or compare instruction \| along with that to decide whether to display that sprite? \| \| The answer was to copy that stream of instructions to RAM ahead \| of time, and replace each write to a missing invader with a no- \| op. The code is here if anyone wants to see (the "inv3" demo): \| http://dos486.com/atari/ \| krallja wrote: \| > copy that stream of instructions to RAM ahead of time \| \| Even this is easier said than done: there are only 128 bytes \| of RAM in the entire machine, and that has to suffice for \| global variables and stack memory in addition to storing \| modified code like this! \| rasz wrote: \| Afaik it's <120KB/s with all the tricks. 6502 was hand designed \| and brain optimized for clever use of available silicon real- \| estate, roughly 20% of CPU bus cycles are dead/bogus/useless. \| RTS wastes 3 of its 6 cycles, RTI 2 of 6 wasted, JSR 1 of 6 \| wasted, all increments at least 1 cycle wasted etc. Sad to \| think state machine handling DMA transfers in REU is probably \| less than 50 macrocells, and Commodore ran its own fab, they \| could have built REU DMA into the C128 and it would cost cents. 
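The transfer rates quoted in this thread follow from simple arithmetic: bytes per second is the clock rate divided by the cycles spent per byte. A minimal sketch, assuming a round 1 MHz clock (real machines run slightly above or below that) and ignoring cycles stolen by video hardware; `bytes_per_second`, `COSTS` and `report` are made-up names for illustration:

```python
# Rough peak copy rates from cycles-per-byte figures.
# Assumption: a round 1 MHz clock and no cycles lost to video DMA.
CLOCK_HZ = 1_000_000

# Cycles spent per byte copied.
COSTS = {
    # LDA absolute (4 cycles) + STA absolute (4 cycles), fully unrolled
    "6502 unrolled LDA/STA": 4 + 4,
    # figure quoted in the thread for the 65816 block-move opcodes
    "65816 MVN/MVP": 7,
}


def bytes_per_second(clock_hz: float, cycles_per_byte: float) -> float:
    """Peak rate if every CPU cycle goes to the copy itself."""
    return clock_hz / cycles_per_byte


def report(clock_hz: float = CLOCK_HZ) -> dict:
    """Map each technique name to its peak rate in bytes per second."""
    return {name: bytes_per_second(clock_hz, c) for name, c in COSTS.items()}


if __name__ == "__main__":
    for name, rate in report().items():
        print(f"{name}: {rate / 1024:.1f} KiB/s")
```

At the assumed clock, 8 cycles per byte works out to 125,000 bytes per second, which is in the same ballpark as the "<120KB/s" estimate once lost cycles are taken into account.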
\| mywittyname wrote: \| Is there a way to make a compatible 6502 variant that doesn't \| have this waste? \| krallja wrote: \| "The 100 MHz 6502" does a different clever thing - it \| copies all the dedicated RAM and ROM into its own FPGA \| copy. Then it can perform 7 to 25 instructions before the \| next external read/write cycle! \| \| http://www.e-basteln.de/computing/65f02/65f02/ \| rasz wrote: \| https://en.wikipedia.org/wiki/CSG_65CE02#Pipeline_improveme \| n... fixed most painful ones, but afaik not all dead \| cycles. But it was 1988 and Commodore didn't bother putting \| it into anything other than some IO card for the AMIGA, not \| to mention it still did nothing to cover slowness of moving \| data around. Japanese decided to do something about it for \| TurboGrafx-16 in 1987 HuC6280 \| http://shu.emuunlim.com/download/pcedocs/pce_cpu.html \| \| Transfer Alternate Increment (TAI), Transfer Increment \| Alternate (TIA), Transfer Decrement Decrement (TDD), \| Transfer Increment Increment (TII) - pretty much x86 'rep \| movsb', except not great at 6 cycles per byte (~160KB/s). \| For contrast 5 years older 80286 already did 'rep movsw' at \| 2 cycles per byte. 6 years later Pentium did 'rep movsd' at \| 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb' \| full cachelines at a time at full cache/memory controller \| speed. \| JPLeRouzic wrote: \| I think there are tricks to rewrite the microcode on \| Pentium, do similar tricks exist for 80286, 386 or 68K? \| \| It would be fun to reconfigure one as a high speed 6502. \| cmrdporcupine wrote: \| The 65816's MVP/MVN opcodes can do bulk transfers a teeny bit \| faster. \| lscharen wrote: \| For more 16-bit 65816 context -- other than for space-savings, \| these instructions are never used when performance is needed \| due to the low effective throughput of 7 cycles per byte. 
A \| basic unrolled loop using 16-bit instructions is 20 - 30% \| faster and specialized graphics routines that are able to use \| the stack can approach 3 cycles per byte using the PEA and PEI \| instructions. \| cmrdporcupine wrote: \| I'll defer to you I guess, as you seem to know more about \| this than me. The only thing is searching through the \| 6502.org forums I don't see a consensus on this? Plenty of \| people talking about the advantages of MVN/MVP for bulk \| transfers. I seem to recall doing the cycle counting myself \| at one point, too, and finding it advantageous. \| \| One neat trick (I remember reading about from Alan Cox I \| believe) if you have control over the hardware is to memory \| map I/O devices like serial input / output such that \| incrementing addresses starting at a given address all point \| to the same physical device/register. E.g. allocate 256 \| contiguous bytes in your memory map to point to the same \| thing. This way you can do bulk I/O transfers to/from memory \| using MVP/MVN instead of "get a byte, put a byte" instruction \| by instruction. \| rasz wrote: \| The trick you describe was being used by Silicon Valley \| Computer ADP50L IDE controller from early nineties (1991). \| Memory mapped I/O instead of traditional x86 port access \| lets you skip doing manual loop for 'rep movsb', result can \| be 50% speed bump \| \| https://forum.vcfed.org/index.php?threads/performance-of- \| lo-... \| \| Port IO Read Speed : 219.39 KB/s \| \| MMIO Read Speed : 310.77 KB/s \| \| Some variants of XTIDE hardware also implement this, as \| does the free bios. \| ksherlock wrote: \| MVP/MVN are 7 cycles per byte. \| \| If you're moving memory around in bank 0 (or have memory \| mapping), you can use the direct page register to \| read/write anywhere in bank 0 and the stack to read/write \| anywhere in bank 0. \| \| 16-bit LDA dp, PHA is 4 + 4 = 8 cycles or 4 cycles per \| byte. 
Best case would be if you know it's constant data \| before hand, eg, LDA #0, PHA, PHA .... 2 cycles per byte! \| \| For general purpose copying MVP and MVN are easier and have \| better code density. \| mmphosis wrote: \| _2 cycles per byte!_ It takes 4 cycles for PHA to push \| the 16-bit Accumulator, two bytes, onto the stack. \| There's also 16-bit PHD, PHX and PHY. \| cmrdporcupine wrote: \| Ah here it is: \| http://forum.6502.org/viewtopic.php?f=2&t=5035 referencing \| a now-lost G+ post from Alan Cox: \| \| _" The emulator also has a fun hack for disk performance \| I'm hoping will get replicated in some of the upcoming \| retro 65C816 board design. Like the 6502 the 65C816 sucks \| at continually reading from an MMIO port and writing it to \| sequential memory locations. It sucks less than a 6502 \| because you've got 16bit index registers, but at the same \| clock it was doing about 100K/second that a Z80 can do 250K \| (with ini loops). The revised emulated disk interface has \| the same mmio port replicated across a chunk of address \| space and this allows a block move instruction (MVN) to do \| all the work at 6 clocks/byte. At that point the 65C816 \| suddenly jumps to twice as fast as the Z80 on disk I/O."_ \| [deleted] \| joosters wrote: \| On the original ZX Spectrum, you could measure the write \| bandwidth visually, because on startup it would write the value 2 \| into each byte in memory (which included the graphics RAM). It \| would then re-read and decrease the value of each byte twice, to \| check for any faulty memory. \| \| You could see these patterns on-screen as the reads and writes \| took place (I think it took about a couple of seconds to do this \| to 48k of RAM) \| becurious wrote: \| You could change the stack pointer to the top of the area of \| memory you wanted to fill and then use PUSH to fill at I think \| 11 clock cycles per two bytes. It was faster than unrolled LDI \| or LD (HL),A followed by INC HL. 
It would be filling memory in \| the wrong direction for a Rainbow processor but you could use \| it for repeating patterns. I think I did a checkerboard pattern \| that would shift every frame and it was pretty smooth. ___________________________________________________________________ (page generated 2022-06-16 23:01 UTC)