[HN Gopher] How fast can a 6502 transfer memory?
___________________________________________________________________
 
How fast can a 6502 transfer memory?
 
Author : xmyatniyx
Score  : 164 points
Date   : 2022-06-16 11:24 UTC (11 hours ago)
 
web link (imapenguin.com)
w3m dump (imapenguin.com)
 
| forinti wrote:
| I once thought about reading HD floppies on a BBC Micro (which
| can handle SD and DD). But it turns out it can't handle the speed
| at which the bits come in (500kbps).
| 
| SD and even DD are fine (125kbps and 250kbps), so you could read
| 360KB floppies from a PC.
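
To put those bit rates in CPU terms, here is a rough sketch of the cycle budget per byte. The 2 MHz clock is an assumption (the usual figure quoted for the BBC Micro's 6502), and the helper name is made up for illustration; it says nothing about which component actually limits the machine.

```python
def cycles_per_byte(cpu_hz: int, bits_per_second: int) -> float:
    """CPU cycles that elapse while one byte arrives at the given bit rate."""
    bytes_per_second = bits_per_second / 8
    return cpu_hz / bytes_per_second

CPU_HZ = 2_000_000  # assumed BBC Micro CPU clock

# Bit rates as quoted in the comment above.
for name, rate in (("SD", 125_000), ("DD", 250_000), ("HD", 500_000)):
    print(name, cycles_per_byte(CPU_HZ, rate))
```

Under these assumptions an HD stream leaves about 32 cycles per byte, against 64 for DD and 128 for SD, which is very little room for a per-byte transfer routine.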
 
| spc476 wrote:
| The 6502 doesn't have a pipeline, so it's quite easy to count
| instruction cycles and find out how much time a given piece of
| code will take. I used this technique to bit-bang a serial port
| back in the day (given two one-bit ports with the CPU doing all
| the work because the system was too cheap to have an actual
| UART).
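
A minimal sketch of the arithmetic behind cycle-counted bit-banging. The 1 MHz clock and the baud rates are illustrative assumptions, not details from the comment, and the function names are hypothetical.

```python
def cycles_per_bit(cpu_hz: int, baud: int) -> float:
    """CPU cycles available between successive serial bit edges."""
    return cpu_hz / baud

def timing_error_percent(cpu_hz: int, baud: int) -> float:
    """Bit-time error when the bit loop is padded to a whole number of cycles."""
    exact = cycles_per_bit(cpu_hz, baud)
    used = round(exact)  # a software loop can only take whole cycles
    return abs(used - exact) / exact * 100.0

CPU_HZ = 1_000_000  # assumed 1 MHz 6502

for baud in (300, 1200, 9600):
    print(baud, cycles_per_bit(CPU_HZ, baud), timing_error_percent(CPU_HZ, baud))
```

Because every instruction has a fixed cycle count, the send or receive loop can be padded (with NOPs, for example) until each pass takes exactly the rounded number of cycles.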
 
| joosters wrote:
| To be really pedantic, there's a big difference between 'memory
| bandwidth' and 'memory transfer speed'. The former is just
| reading (or writing) to a block of memory, and the latter is
| copying data from one location to another. So a 'memory transfer
| speed' is going to be slower.
 
  | jmull wrote:
  | I think it's actually not that pedantic.
  | 
  | "Memory bandwidth" is being used in marketing materials today,
  | so it's a little useful to understand what it means. (The
  | author of this article confuses it with memory transfer,
  | probably others do as well.)
 
| bluemax wrote:
| Wow, this brings back memories from more than 3 decades ago. I
| created a routine on the C64 to copy memory and calculated the
| performance then at around 25KB/sec.
| 
| The first version contained a memory corrupting bug that took
| some time to figure out. Depending on the locations of the source
| and destination you have to start copying forwards from the
| beginning of the source, or backwards from the end. If there's an
| overlap you risk overwriting the source before it is copied to
| the destination.
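
The direction rule described above can be sketched as follows. This is an illustration in Python rather than the original 6502 routine; `move_bytes` is a hypothetical name.

```python
def move_bytes(mem: bytearray, src: int, dst: int, n: int) -> None:
    """Copy n bytes within mem one byte at a time, safe for overlapping ranges."""
    if dst < src:
        # Destination lies below the source: copy forwards from the start,
        # so each source byte is read before it can be overwritten.
        for i in range(n):
            mem[dst + i] = mem[src + i]
    elif dst > src:
        # Destination lies above the source: copy backwards from the end.
        for i in range(n - 1, -1, -1):
            mem[dst + i] = mem[src + i]

ram = bytearray(b"ABCDEFGH")
move_bytes(ram, 0, 2, 5)  # overlapping move towards higher addresses
print(ram)                # bytearray(b'ABABCDEH')
```

Copying forwards in the second case would overwrite "C", "D" and "E" with copies of "A" and "B" before they were read, which is the kind of corruption the comment describes.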
 
| scarface74 wrote:
| I immediately noticed that in one of the code samples he is
| loading and storing data from memory that's not on the first
| page (memory locations $00-$FF). Memory access to and from the
| first page takes one less clock cycle.
| 
| The LDA, STA, etc. instructions for zero page access use
| different opcodes than their two-byte-address equivalents.
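
For reference, a small table of the difference being described, using the documented NMOS 6502 timings for plain (non-indexed) zero page and absolute addressing. The dictionary layout and `copy_cost` helper are illustrative only.

```python
# Documented cycle counts for the two addressing modes discussed above.
CYCLES = {
    ("LDA", "zeropage"): 3,
    ("LDA", "absolute"): 4,
    ("STA", "zeropage"): 3,
    ("STA", "absolute"): 4,
}

# Instruction length in bytes: opcode plus a one- or two-byte address.
LENGTH = {"zeropage": 2, "absolute": 3}

def copy_cost(mode):
    """Return (cycles, code bytes) for one LDA/STA pair in the given mode."""
    cycles = CYCLES[("LDA", mode)] + CYCLES[("STA", mode)]
    return cycles, 2 * LENGTH[mode]

print(copy_cost("zeropage"))  # (6, 4)
print(copy_cost("absolute"))  # (8, 6)
```

The saving only applies when both addresses are in page zero, which holds just 256 bytes, so it helps with pointers and small buffers rather than bulk copies.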
 
| mmphosis wrote:
| This also misses a lot of other tricks: self-modifying unrolled
| code, keeping track of blocks of memory that don't need to be
| updated, or blocks of memory that have the same value. Memory
| moves may not be the fastest way.
| 
| 0 HIMEM: 5608
| 1DATA553256330092213902139021393213504523826353650303725262750384
| 56153058173120922939029328358454788756458365750371488455503510839
| 66132653706165276445377489621322135821322334213502130051520282036
| 2DATAQLNZQLNZQAQQDSAQRDSAQQDSAQVDSAQXCKDNAPFXANAQXXANANXNXNXQXNXC
| QXKXNZQXKXNZQCQODMAQODMAUUYQXCKATAUVQXCKMNXQXKANXUAUTCQXKENXUMUSQ
| JHSAHFQZTXRDFQZTAOCQZTAOBQHDSAPDSAQADSANZNZKDSAQZDSAXZQZOZXZUAXZJ
| 3 READ L$: READ H$
| 4 FOR I = 1 TO  LEN (L$)
| 5POKE767+I,10*(ASC(MID$(H$,I,1))-65)+VAL(MID$(L$,I,1))
| 6 NEXT
| RUN
| HGR : CALL 768: CALL 5608
 
| metadat wrote:
| Is Mario Bros a ripoff of Sam's Journey? Or vice versa?
| 
| Either way, beautiful.
 
  | gs17 wrote:
  | Sam's Journey looks to be a 2017 release, so it definitely
  | didn't inspire Mario.
  | 
  | [0] https://www.knightsofbytes.games/
 
  | cldellow wrote:
  | Vice versa, Sam's Journey was released in 2017:
  | https://www.knightsofbytes.games/samsjourney
 
| dusted wrote:
| I wouldn't call the 6502 a RISC CPU.. It was clearly designed for
| humans to program, it has multiple and complex addressing modes
| and instructions to make it easier for us..
| 
| Sure it is a small instruction set compared to modern CPUs, but
| RISC is an idea, not a number.
| 
| I'd venture to say that RISC is designing with the goal of making
| very efficient instructions and allowing very efficient compilers
| to be written for it.. It's the idea of having one fast way of
| doing something rather than multiple convenient ways, because the
| compiler doesn't care, and compiler vendors appreciate not having
| to choose between multiple near-identical instructions that may
| or may not be faster in some particular case.
 
  | jsrcout wrote:
  | Don't remember where I saw this, but someone said the 6502 was
  | really an RTCC (Reduced Transistor Count Computer).
 
  | [deleted]
 
  | pvg wrote:
  | It's a joke, as the article says.
 
| vardump wrote:
| Two nits to pick from this article:
| 
| While the article does mention it ignores loop unrolling, it's a
| bit disingenuous, because that almost DOUBLES performance and
| it's what nearly all real world code is doing.
| 
| Also Sam's Journey PAL version does not need any kind of DMA
| transfer tricks. NTSC version is just a tiny bit behind in
| timing, so to be glitch free, REU is used. PAL version still
| works with minor glitches on an NTSC system.
| 
| This is because NTSC has 263 * 63 - 25 * 40 = 15569 cycles
| available per frame (ignoring those stolen by sprites) and PAL
| 263 * 63 - 25 * 40 = 18656 cycles (again, ignoring sprites).
| 
| The difference is enough that the NTSC version can't move
| required 2000 bytes of color RAM and character RAM in time in the
| worst case without REU.
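
The frame budget above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the PAL line count is taken as 312 (the replies below confirm 263 was a copy-paste slip), the `25 * 40` term is read here as 25 character rows each costing 40 cycles of video fetches, and the 8 cycles per byte is a hypothetical best case for a fully unrolled `LDA abs` / `STA abs` pair, not a figure from the comment.

```python
def cycles_per_frame(lines, cycles_per_line=63, rows=25, stolen_per_row=40):
    """Cycles left to the CPU per frame, using the formula from the comment."""
    return lines * cycles_per_line - rows * stolen_per_row

NTSC = cycles_per_frame(263)  # 15569
PAL = cycles_per_frame(312)   # 18656

BYTES_TO_MOVE = 2000          # colour RAM plus character RAM
COPY_CYCLES_PER_BYTE = 8      # assumption: unrolled LDA abs (4) + STA abs (4)
needed = BYTES_TO_MOVE * COPY_CYCLES_PER_BYTE  # 16000

print("NTSC fits:", needed <= NTSC)  # False
print("PAL fits:", needed <= PAL)    # True
```

Under these assumptions the copy alone needs 16000 cycles, which overshoots the NTSC budget and fits the PAL one, before counting any game logic or sprite overhead.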
 
  | Rediscover wrote:
  | I'm remembering (possibly incorrectly) PAL being 312 (not 263).
  | 
  | Is that what You intended?
 
    | vardump wrote:
    | Yeah. I accidentally left that value wrong after a copy
    | paste. But the result is right. :-)
 
| Joyfield wrote:
| My DOCSIS 3.1 Internet connection has more download bandwidth
| than my Amiga 500 had to RAM. Latency, not so much.
 
| cmrdporcupine wrote:
| Anybody else interested in writing a WASM VM for the 6502 or
| 65816 etc? This was my brainwave this week. I think this would be
| a supremely nerdy fun thing to do.
 
| DeathArrow wrote:
| Beat that, Apple!
 
  | cestith wrote:
  | Apple used the very same processor family at one time. ;-)
  | 
  | They've come a long way.
 
    | NobodyNada wrote:
    | For a very loose definition of "same processor family", they
    | still do -- ARM is sort of a spiritual successor to the 6502:
    | https://en.wikipedia.org/wiki/ARM_architecture_family#Histor.
    | ..
    | 
    | ARM was designed by the team at Acorn that had worked on the
    | BBC Micro, which used a 6502. They decided to design a custom
    | processor because they felt none of the 16- or 32-bit
    | processors on the market at the time met the standard set by
    | the 6502 for simplicity and low cost. So, they designed their
    | own architecture which took cues from both the cutting-edge
    | RISC research in academia, and the simple practicality of the
    | 6502.
    | 
    | (On a similar note: the 6502's main competitor, the Zilog
    | Z80, is an early ancestor of x86! The Z80 is an enhanced
    | clone of the Intel 8080, which of course the 8086 was heavily
    | based on.)
    | 
    | This legacy still shows up today in the instruction
    | mnemonics: ARM uses "branch" naming (BEQ - branch if equal,
    | BCS - branch if carry set, etc) because that's what the 6502
    | used, whereas x86 spells it "jump" (JE, JC, etc.). ARM uses
    | LDR/STR to load and store registers from memory (like the
    | 6502's LDA/LDX/LDY/STA/STX/STY), whereas x86 just spells
    | everything "MOV". ARM only uses memory-mapped I/O to access
    | hardware, whereas x86 has separate input and output ports.
 
      | cestith wrote:
      | The 6502 was a clone-ish of the Motorola 6800 made to be
      | lower cost. The 6800 led to the 6809 (another
      | competitor, used by the Tandy CoCo and IIRC the Dragon) and
      | to the 68000 series, used by Apple in the Mac, Sun in its
      | early systems, NeXT, Amiga, Atari in their later systems,
      | and more. That led to the PowerPC partnership of Motorola,
      | Apple, and IBM.
      | 
      | PowerPC was outliving its useful life due not to ISA, but
      | manufacturing limitations. So Apple went to Intel, but that
      | wasn't fit for mobile. Apple partnered with ARM to make
      | their mobile chips. Then their mobile chips grew into the
      | M1 and M2 along with ARM, bringing them back to a RISC-ish
      | platform like they had with PowerPC. So it's sort of a dual
      | path back to the same place.
 
        | NobodyNada wrote:
        | > So Apple went to Intel, but that wasn't fit for mobile.
        | Apple partnered with ARM to make their mobile chips.
        | 
        | There's a lot of interesting history there too: in 1990,
        | after seeing the first-generation ARM CPU, Apple
        | partnered with Acorn to co-found ARM Ltd and develop a
        | mobile processor for the Apple Newton. Although the
        | Newton was a failure, ARM was very successful and powered
        | pretty much the entirety of the mobile device revolution
        | -- including of course the iPod and iPhone.
        | 
        | Apple's co-founder status gives them a lot of influence
        | over the ARM architecture -- they led the AArch64 design
        | process, and they seem to be allowed to do things that
        | even other architectural licensees aren't allowed to do,
        | like implementing custom instructions in their ARM cores:
        | https://news.ycombinator.com/item?id=29783549
        | 
        | And Apple's iteration of ARM owes a lot to the PowerPC
        | world as well -- Apple's processor design team was
        | originally PA Semi, a company that designed PowerPC
        | cores.
 
        | klelatti wrote:
        | > they led the AArch64 design process
        | 
        | Interesting - is there a reference for this?
 
        | NobodyNada wrote:
        | Here's a Twitter thread from a former Apple engineer:
        | https://twitter.com/stuntpants/status/1346470705446092811
        | 
        | > arm64 is the Apple ISA, it was designed to enable
        | Apple's microarchitecture plans. There's a reason Apple's
        | first 64 bit core (Cyclone) was years ahead of everyone
        | else, and it isn't just caches.
        | 
        | > Arm64 didn't appear out of nowhere, Apple contracted
        | ARM to design a new ISA for its purposes. When Apple
        | began selling iPhones containing arm64 chips, ARM hadn't
        | even finished their own core design to license to others.
        | 
        | > ARM designed a standard that serves its clients and
        | gets feedback from them on ISA evolution. In 2010 few
        | cared about a 64-bit ARM core. Samsung & Qualcomm, the
        | biggest mobile vendors, were certainly caught unaware by
        | it when Apple shipped in 2013.
        | 
        | > Apple planned to go super-wide with low clocks, highly
        | OoO, highly speculative. They needed an ISA to enable
        | that, which ARM provided.
        | 
        | > M1 performance is not so because of the ARM ISA, the
        | ARM ISA is so because of Apple core performance plans a
        | decade ago.
 
        | klelatti wrote:
        | Very interesting - many thanks!
        | 
        | Edit: I'm a bit puzzled by the claim that Apple was
        | selling Aarch64 before Arm had finished their first
        | design - A7 announced at end 2013 but A53 appeared in
        | 2012?
 
        | NobodyNada wrote:
        | It looks like A53 was _announced_ in October 2012, but
        | I've found no indication of whether the design was
        | actually finished by then [0]. And remember that ARM just
        | sells IP and other companies are responsible for
        | manufacturing it; it doesn't look like anyone actually
        | produced A53 cores until 2015 [1] -- whereas Apple was
        | shipping actual consumer products with A7's in them by
        | October 2013.
        | 
        | [0]: https://www.techspot.com/news/50656-arm-
        | announces-64-bit-cor...
        | 
        | [1]: https://en.wikichip.org/wiki/arm_holdings/microarchi
        | tectures...
 
        | klelatti wrote:
        | Very fair point. OTOH there was a lot of detailed info on
        | the A53 available in 2013 and SoCs were being announced
        | with it.
        | 
        | I suspect this thread may be slightly exaggerating the
        | position, but it's certainly the case that Apple were well
        | ahead of all the competitors - and no doubt they were
        | deeply involved in the ISA design.
 
        | cmrdporcupine wrote:
        | I honestly don't think there's any kind of straight line
        | from the 6809 to the 68000. They share little in common
        | other than the '68' prefix and coming from the same
        | company and being big endian. The instruction sets are
        | very different. Designed by different teams. The
        | peripheral chip set and bus management was different too.
        | 
        | The 68k shares more with 1970s minicomputers especially
        | the PDP-11 and/or VAX architectures than any MPU that
        | preceded it.
 
      | DeathArrow wrote:
      | >the Zilog Z80, is an early ancestor of x86! The Z80 is an
      | enhanced clone of the Intel 8080, which of course the 8086
      | was heavily based on
      | 
      | I owned a Z80 based computer when I was 8 to 10 years old.
      | To me, its instruction set and memory access bear no
      | resemblance to the 8086's.
      | 
      | They seem like very distant relatives.
 
        | klelatti wrote:
        | The 8086 was designed to allow automated translation of
        | 8080 assembly to 8086 assembly - so the instruction set
        | may 'look' different but in fact has a lot in common.
        | 
        | It's also not quite right to call the Z80 an ancestor of
        | the 8086, but they are certainly closely related due to
        | the common inheritance from the 8080.
 
        | NobodyNada wrote:
        | Yeah, perhaps more of an uncle than a direct ancestor :)
 
        | klelatti wrote:
        | Indeed - someone should do a family tree of CPUs!
 
        | cestith wrote:
        | We need to decide where the NEC v20, v30, v40, and v50
        | live.
 
        | krallja wrote:
        | And the NSC-800, which is like a Z80 with 8085 half-
        | interrupts!
 
        | klelatti wrote:
        | It's the offspring of the marriage of Z80 and 8085!
 
| MarkusWandel wrote:
| This also misses loop unrolling, combined with an assembly
| language version of "Duff's device" to be able to do an arbitrary
| number of transfers even if your loop is unrolled to, say, 8
| transfers.
| 
| This stuff used to matter! I had an NCR5380 chip on an Amiga,
| simple, memory mapped I/O, no DMA or interrupts. To get a tape
| drive to stream (remember that?) the byte transfer loop really
| had to be tweaked. But once fully tweaked, "whoooooooosh" instead
| of "chugga chugga chugga".
| 
| And truly heroic programming techniques had to be employed on the
| C64 to do X/Y smooth scrolling games. Often a static part of the
| screen, conveniently displaying scores etc, existed to make it
| work - there was just enough bandwidth to do 80% of the screen,
| say, so you find an excuse to keep the rest of it static.
| 
| I kinda miss those days, and I kinda don't. I guess it was good
| to have experienced them.
 
  | djmips wrote:
  | Those days still exist! If you want your mind blown watch the
  | Epic Games Nanite talk from last year's SIGGRAPH where the core
  | rendering of the dense vertex data is done directly in Compute,
  | i.e. software rendered, instead of using the rasterization
  | hardware, which has a minimum 4 pixel invocation overhead that
  | gets expensive with very small triangles.
  | 
  | This is but one example of this that's happening every day,
  | there is much much more like hair rendering in EA FIFA soccer
  | or automatic trading financial software running on GPUs.
  | 
  | There's a whole world of applications where people are still
  | concerned with every last cycle of performance just like in the
  | C64 days.
 
    | djmips wrote:
    | Here I'll save you the trouble of trying to find the video.
    | 
    | https://www.youtube.com/watch?v=eviSykqSUUw&list=PLabw4gCouT.
    | ..
 
  | amelius wrote:
  | We're now doing sort of the same tricks, but with power
  | management on mobile devices.
 
  | djmips wrote:
  | Kind of off topic for 6502 memory copy speeds but with regard
  | to scrolling, there was a pretty cool software hack for the
  | C64 (called VSP) where you could trick the poor VIC chip into
  | starting scanning out the screen position later in memory. Move
  | the start one character and the whole screen shifts left by 8
  | pixels. You only need to repaint a vertical column for this
  | 'coarse' scroll instead of moving the entire screen of
  | characters. This is something that should have been built into
  | the hardware and was very useful on other systems that had that
  | ability (like the NES for example)
  | 
  | With it you can reduce the amount of memory you need to copy
  | every 8 pixels (the 8 pixel part can be done with smooth scroll
  | registers).
  | 
  | There's a thread and example code on github here.
  | https://www.lemon64.com/forum/viewtopic.php?t=70539
  | 
  | Also note it's such a terrible hack on the DRAM that it doesn't
  | work on all C64s and there's a technical discussion about that
  | here. https://www.linusakesson.net/scene/safevsp/index.php
  | 
  | Hardware mod if VSP doesn't work on your C64 and more technical
  | details. http://wiki.icomp.de/wiki/VSP-Fix
  | 
  | Also it makes mention of the C64-Reloaded which is a modern C64
  | product that includes the fix.
 
  | js2 wrote:
  | > This also misses loop unrolling.
  | 
  | It's mentioned at the bottom of the article in the "Thoughts"
  | section:
  | 
  |  _You could certainly use self modifying code and unroll this
  | copy routine to get better performance at the price of
  | flexibility and arguably understanding for the average casual
  | 6502 assembly coder. Again, this was not a "how fast can we
  | absolutely make it" but an everyday use examination._
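
To give a feel for what unrolling buys, here is a rough per-byte cycle comparison. The two instruction sequences are assumptions chosen for illustration (a plain indexed loop versus fully unrolled absolute loads and stores), using documented NMOS 6502 timings and ignoring page-cross penalties and cycles stolen by video hardware.

```python
# One pass of a simple indexed copy loop.
LOOP_BODY = {
    "LDA abs,X": 4,
    "STA abs,X": 5,
    "INX": 2,
    "BNE (taken)": 3,
}

# One byte of a fully unrolled copy with hard-coded addresses.
UNROLLED_BODY = {
    "LDA abs": 4,
    "STA abs": 4,
}

def bytes_per_second(cycles_per_byte, cpu_hz=1_000_000):
    """Copy throughput at the given cost per byte (1 MHz clock assumed)."""
    return cpu_hz / cycles_per_byte

loop_cycles = sum(LOOP_BODY.values())          # 14
unrolled_cycles = sum(UNROLLED_BODY.values())  # 8
speedup = loop_cycles / unrolled_cycles        # 1.75

print(loop_cycles, unrolled_cycles, speedup)
print(bytes_per_second(loop_cycles), bytes_per_second(unrolled_cycles))
```

The price is code size: the unrolled form spends six bytes of code per byte copied, which is the flexibility trade-off the quoted passage refers to.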
 
    | hinkley wrote:
    | Duff's device is a fixed size loop unrolling with an ugly
    | hack to make it behave for arbitrary inputs. The assembly
    | makes sense but the C code is rough.
    | 
    | It's not quite as fast as self-modifying or custom compiled
    | code, but it's pretty close.
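
As a sketch of the problem being solved, here is a copy unrolled eight times that still handles an arbitrary count. This is not Duff's device itself, which relies on C switch fall-through (or a computed jump in assembly) to enter the unrolled body part-way through; Python has neither, so the leftover items are handled in a separate step. The function name is hypothetical.

```python
def copy_unrolled8(dst, src, n):
    """Copy the first n items of src into dst using an 8x unrolled body."""
    i = 0
    # Handle the n % 8 leftover items first so the main loop only ever
    # runs whole groups of eight. Duff's device folds this step into the
    # unrolled body by jumping into the middle of it.
    for _ in range(n % 8):
        dst[i] = src[i]
        i += 1
    # Main loop: one loop-control overhead per eight items copied.
    for _ in range(n // 8):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        dst[i + 2] = src[i + 2]
        dst[i + 3] = src[i + 3]
        dst[i + 4] = src[i + 4]
        dst[i + 5] = src[i + 5]
        dst[i + 6] = src[i + 6]
        dst[i + 7] = src[i + 7]
        i += 8

source = list(range(21))
target = [None] * 21
copy_unrolled8(target, source, 21)
print(target == source)  # True
```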
 
  | tialaramex wrote:
  | Tom Duff's device was doing that because he's doing MMIO, you
  | should not [I know you're not suggesting it, but just in case
  | anybody reading thinks it's clever] do this today when you
  | don't want MMIO, your compiler is very capable of just doing an
  | actual copy quickly, so tell it that's what you want, don't
  | write gymnastics like Duff's device.
  | 
  | However, expressing these partially unrolled loops nicely is a
  | nice performance-not-safety feature of WUFFS called "Iterate
  | loops":
  | 
  | https://github.com/google/wuffs/blob/main/doc/note/iterate-l...
  | 
  | Well, I say performance not safety, as always they want both,
  | but you _could_ safely just write the never unrolled case,
  | while the existence of Iterate loops allows you to express a
  | much faster special case but know the compiler will fix things
  | up properly no matter what.
 
    | vardump wrote:
    | We're talking about C64 (and maybe Amiga).
    | 
    | Compiler is not going to do absolutely anything for you on
    | those retro platforms.
 
      | tialaramex wrote:
      | Aw, just needs a better compiler (with a 6502 target) :D
      | 
      | Jason Turner's CppCon 2021 talk, "Your New Mental Model of
      | constexpr" has half the presentation as a C64 program
      | (though for practical reasons not actually running on a C64
      | but instead an emulator) because _most_ of the heavy
      | lifting is done by the C++ 20 compiler.
      | https://youtu.be/MdrfPSUtMVM?t=1422
      | 
      | Now, Jason's approach is not going to beat hand-crafted
      | 6502 machine code _in a fair fight_ but he often doesn't
      | need to fight fair and that's the point of his talk.
 
  | localhost wrote:
  | The C64 VIC-II chip would grab the address bus from the CPU
  | every 8 scan lines on the screen. Some of the early "fast load"
  | cartridges like the Epyx FastLoad cartridge that would
  | accelerate loading games from the floppy drive would blank the
  | entire screen during load so that their async data transfer
  | routines wouldn't get interrupted by the VIC-II chip grabbing
  | the bus. I wrote a similar (better?) cartridge where I would
  | need to use the register on the VIC-II chip that reported the
  | scan line as a sync marker to transfer 3 bytes asynchronously
  | from the 1541 down the clock and data lines of the serial bus.
  | Good times.
 
    | MarkusWandel wrote:
    | In my recollection Epyx Fastload did not blank the screen,
    | though some earlier fast loaders did.
    | 
    | I also remember the software voice synthesizer "SAM" needing
    | to blank the screen to render glitch-free sampled audio. Then
    | along came, what was it "Impossible Mission" ("Another
    | visitor! Stay a while...") doing pretty clean sampled audio
    | with the screen on. Not that the C64 SID chip was even
    | remotely intended to be able to play sampled audio in the
    | first place!
    | 
    | The Amiga was unimaginably powerful by comparison. Even a
    | basic configuration had 8x the memory a C64 had, and it had
    | all those fancy DMA toys to offload the CPU.
 
      | jimsmart wrote:
      | > Not that the C64 SID chip was even remotely intended to
      | be able to play sampled audio in the first place!
      | 
      | I don't recall how SAM did it, but sample playing on the
      | C64 SID chip was indeed a nice trick -- it was actually
      | done by modulating the main output volume, which made a
      | slight click when changing.
      | 
      | Eventually this got used by some of the C64 musicians /
      | music player libs, so one could play a channel of samples
      | as well as the three regular synth channels on the SID.
      | IIRC, Outrun used this particularly well in its title
      | screen and/or loading music, having some vocal samples
      | "O-O-Outrun!" (and skidding sound effects) as well as
      | sampled drums.
      | 
      | Annoyingly, IIRC, some revisions of the SID chip behaved
      | slightly differently, and had louder or softer sample
      | playback when this hack was used. But still: clever stuff.
 
        | weinzierl wrote:
        | ... and main output volume had only 16 levels so the
        | samples were 4 bit quantized. It is a wonder that we
        | could get understandable vocal samples with this hack at
        | all. I distinctly remember "Goal!" from Peter Shilton's
        | Football and "Accolade presents" from Test Drive [1]. In
        | the examples one can hear the amount of quantization
        | noise that low bit depth caused.
        | 
        | [1] https://m.youtube.com/watch?v=L1u-WydiiCI
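
A minimal sketch of the 4-bit quantization being described, assuming unsigned 8-bit input samples. The function names are hypothetical, and the step that would write each value to the SID volume register at a steady rate is not shown.

```python
def to_volume_nibbles(samples_8bit):
    """Reduce unsigned 8-bit samples (0..255) to 4-bit values (0..15)."""
    return [s >> 4 for s in samples_8bit]

def quantization_error(samples_8bit):
    """Largest absolute error after expanding the 4-bit values back to 8 bits."""
    samples = list(samples_8bit)
    nibbles = to_volume_nibbles(samples)
    return max(abs(s - (n << 4)) for s, n in zip(samples, nibbles))

print(to_volume_nibbles([0, 15, 16, 128, 255]))  # [0, 0, 1, 8, 15]
print(quantization_error(range(256)))            # 15
```

Only 16 distinct output levels remain, so an error of up to 15 steps on a 256-step scale is audible as the noise mentioned above.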
 
        | vardump wrote:
        | You can actually get about 6-7 bits resolution out of
        | same SID volume register. 4 bits from the volume
        | register, channel 3 disable and 3 filter bits. Requires
        | some setup to get SID in a particular state first.
        | 
        | For details, see: https://livet.se/mahoney/c64-files/Musi
        | k_RunStop_Technical_D...
 
  | le-mark wrote:
  | Crypto mining with fpga was in the weeds down this path, or 100
  | Gbps signal processing. Two examples where low level stuff is
  | still relevant, just not commodity and widely available like
  | the 8 bit micros were.
 
    | aerique wrote:
    | I also read a high frequency trading blog that was posted
    | here a few years ago. Same thing: hacking hardware and
    | software so the first bytes of info could be grabbed from a
    | stream and acted upon, instead of needing to wait for the
    | whole package to have come in.
    | 
    | Also when I was in the demo scene on the Atari ST one had to
    | do specific timings in the assembly code to be able to draw
    | outside the screen's borders (so the borders were on screen
    | but couldn't be drawn on by code).
 
  | Psyladine wrote:
  | >I kinda miss those days, and I kinda don't. I guess it was
  | good to have experienced them.
  | 
  | There's a certain reflective quality and even satisfaction from
  | using a chainsaw after coming up using a tree saw by hand. It
  | feels progressive, even if it is just optimizing for time at
  | the expense of energy.
 
  | 6510 wrote:
  | I eventually managed to do scroll texts fast enough to do
  | them while the pixels of the scroll text were printed to the
  | screen. It was even fast enough to have char combinations in
  | the scroll text that modified its speed and direction with
  | speeds like "one time scroll 1 pixel the next 2". One of the
  | tricks was to do "poor" timing with NOP's by requiring an empty
  | row of bits between each scroll text. (The text becomes
  | unreadable anyway if there is no space between the lines)
 
  | vidarh wrote:
  | Most C64 games used character-based graphics (coupled with the
  | smooth scrolling support in the VIC) which meant you'd at most
  | move 2000 bytes to scroll the entire screen every 4 to 8 pixels
  | scrolled.
  | 
  | You can easily scroll the entire screen on a C64 if that's all
  | you're doing.
  | 
  | Some games did also scroll bitmaps. There the naive version
  | requires moving 9000 bytes (40x25x8 for the bitmap data, 40x25
  | for the colour data) every time you need to scroll, and that
  | indeed starts to bite. There are games which reduce this cost
  | using a trick called AGSP ("Any Given Screen Position").
  | 
  | But you're right static parts of the screen were often larger
  | to reduce the dynamic part. That was rarely down to just
  | scrolling the screen in isolation, though, but because the
  | overall budget of cycles you had to work with was tiny. Often
  | you might also have a lot of other stuff which consumed lots of
  | cycles _affected directly by the size of the playing field_.
  | E.g. if you did sprite multiplexing (moving a sprite after it
  | had been partially or fully rendered to reuse the same hardware
  | sprite), you might well be keeping the CPU busy throughout the
  | full rendering of the playing field.
  | 
  | There was also the consideration of how much effort you wanted
  | to go to in order to avoid glitches, since unless you could do
  | the scrolling entirely while the VIC was rendering the parts of
  | the screen outside the playing field, you'd need to make sure
  | the rendering and copying didn't overlap, and of course just
  | restricting playing field size was an easy workaround for that
  | problem.
 
    | MarkusWandel wrote:
    | I didn't get that fancy. I got as far as a horizontal smooth
    | scroller, but with the "move the screen memory during one
    | 1/60 second redraw cycle" mentality - racing the redraw, and
    | when it was just about caught up, whoa, time for the static
    | bar at the bottom.
    | 
    | Quite right, one could prepare the moved version in the
    | background during the 7 steps where you're merely diddling
    | the smooth scroll register, and then flip to it in an
    | instant. But wait, was it possible to page flip the colour
    | map? Also, always having the appropriate moved version ready
    | even as the player is doing unpredictable things goes into
    | the "heroic programming techniques" zone again.
    | 
    | As for glitches, it's amazing what can be done if perfection
    | is sacrificed and there were plenty of good games that did
    | have them, e.g. sprite multiplexing. But I did mean
    | "effortless looking perfect" smooth scrolling.
 
      | jimsmart wrote:
      | Ex-C64 games coder here: you are correct - no, you
      | couldn't relocate/page-flip the colour map, like you could
      | the character map. So you had to update it all somehow on
      | the required frame.
      | 
      | The fastest technique I saw for updating the colour map in
      | a single go, was to have the whole thing as a huge block of
      | immediate mode load-stores, then one could 'scroll' the
      | data across the LDA instructions within that code, in
      | advance, over n-frames, then call this self-modified code
      | block when one did the character screen flip (immediate
      | load-stores was faster than load-stores from colour ram).
      | e.g.
      | 
      |     scroll_splat_colour:
      |         LDA #$00    # colour data for char
      |         STA $D800   # colour ram
      |         LDA #$00
      |         STA $D801
      |         # etc., for every visible char onscreen in scrolling area
      | 
      | And one would be updating/scrolling those values loaded
      | into the A register, in chunks over previous frames,
      | similar to:
      | 
      |         LDA scroll_splat_colour+6
      |         STA scroll_splat_colour+1
      |         LDA scroll_splat_colour+11
      |         STA scroll_splat_colour+6
      |         # etc., for every lda/sta in the above
      | 
      | Perhaps not the clearest explanation, but hopefully enough
      | to communicate the idea.
      | 
      | FWIW, I didn't invent that technique, it was an improvement
      | Jon Williams made to my code, whilst we both worked for
      | Images Software (now Climax). Not sure where he got it
      | from, maybe he invented it himself, maybe he cribbed it
      | from elsewhere.
      | 
      | Related: I thought sprite multiplexing was awesome, and
      | there were quite a few tricks there too to get it
      | performant. But that's another far more complex topic.
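
To make the layout of that self-modified block concrete, here is a sketch that builds the machine-code bytes and scrolls the immediate operands. The opcodes are the standard 6502 ones; the trailing RTS and the helper names are assumptions for illustration, not details from the comment.

```python
LDA_IMM, STA_ABS, RTS = 0xA9, 0x8D, 0x60  # standard 6502 opcodes
COLOUR_RAM = 0xD800

def build_splat(colours):
    """Emit LDA #colour / STA colour_ram+i for every cell, then RTS."""
    code = bytearray()
    for i, colour in enumerate(colours):
        addr = COLOUR_RAM + i
        # 5 bytes per cell: opcode, operand, opcode, address low, address high.
        code += bytes([LDA_IMM, colour & 0x0F, STA_ABS, addr & 0xFF, addr >> 8])
    code.append(RTS)  # assumed: the block is called as a subroutine
    return code

def scroll_operands_left(code, cells):
    """Shift each immediate operand one cell to the left, in place.

    Operand i sits at offset 1 + 5*i, matching the +1, +6, +11 offsets
    in the comment. The last cell keeps its old value until new data is
    written into it.
    """
    for i in range(cells - 1):
        code[1 + 5 * i] = code[1 + 5 * (i + 1)]

code = build_splat([1, 2, 3])
scroll_operands_left(code, 3)
print(code.hex())
print(len(build_splat([0] * 1000)))  # 5001
```

Only the operand bytes ever change; the STA addresses stay fixed, so the expensive part (shuffling the operands) can be spread over several frames and the final call just runs straight-line code.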
 
        | vidarh wrote:
        | Another "obvious" trick is to narrow the playing field
        | but animate the rest, and then save cycles by a
        | combination of bands that require fewer updates and
        | sprites. E.g. Pole Position is a classic example, where
        | the graphics covers most of the screen, but only about
        | half actually has gameplay. The rest consists of a very
        | narrow band of mountains, and a couple of bands of
        | clouds. I haven't looked at what they did for Pole
        | Position, but that pattern of the actual gameplay being
        | constrained to a much smaller portion of the screen than
        | what looks like the playing area is pretty common.
 
        | vardump wrote:
        | How to handle the case when player changes direction to
        | exact opposite immediately after the frame color data was
        | transferred? Double buffering splatting code? Although
        | one copy of it is 5001 bytes, ouch.
 
        | jimsmart wrote:
        | You generally don't handle that case! :) -- Instead you
        | let the player move within a rectangular area onscreen,
        | and decide on which way you are going to scroll the
        | screen in advance (or rather: after the fact, depending
        | on how one looks at things), based upon where the player
        | is inside that rectangle. So the screen catches up with
        | where the player is moving/pushing.
        | 
        | Eight-way scrolling like this was always a massive pain
        | on the C64 (and other systems that used buffered
        | scrolling with no h/w, e.g. Atari ST), but that way (a
        | box the player moved around inside) was the only
        | realistic way of handling it if you had to do a bunch of
        | work in advance before doing the actual scrolling. Turns
        | out that having the player in a loose rectangle is also
        | easier on the eye too, which is perhaps why it's also
        | used on systems that don't suffer the same h/w
        | restrictions.
        | 
        | Yeah, the colour RAM update was a lot of bytes to move.
        | But dedicating a big chunk of code to it meant one could
        | be a little freer to use slightly slower techniques
        | elsewhere in the update cycle. Side note: the C64
        | actually only had 39 visible char across the screen when
        | in 'scrolling' mode, because the borders were shrunk in
        | slightly (and slightly more than one expects). So one
        | less char to worry about per line. That saved a tiny
        | amount of code / memory / execution-time for the colour
        | splat (and the scrolling of partial chunks - whether on
        | back buffers, or the data within the colour splat code -
        | over the other frames). Sure, it's only one less
        | character. But it saved some cycles. And cycles mattered!
        | Particularly when doing something with that much data to
        | move / that took that much time.
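
The "loose rectangle" approach described above can be sketched roughly as follows. This is only an illustration: the box coordinates and the function name are made up, and a real C64 game would do this in 6502 assembly and spread the resulting scroll work over several frames.

```python
# Hypothetical dead-zone box in screen pixels (assumed values, not from the thread).
BOX_LEFT, BOX_RIGHT = 120, 200
BOX_TOP, BOX_BOTTOM = 80, 120


def scroll_direction(player_x, player_y):
    """Return (dx, dy), each -1, 0 or +1: which way the screen should scroll next.

    While the player stays inside the box nothing scrolls, so a sudden
    reversal inside the box never invalidates scroll work already prepared.
    """
    dx = -1 if player_x < BOX_LEFT else (1 if player_x > BOX_RIGHT else 0)
    dy = -1 if player_y < BOX_TOP else (1 if player_y > BOX_BOTTOM else 0)
    return dx, dy
```

The scroll decision only changes when the player pushes against an edge of the box, which is what gives the code time to prepare buffers in advance.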
 
        | vardump wrote:
        | > C64 actually only had 39 visible char across the screen
        | 
        | But 40 color cells were still visible, unless horizontal
        | scroll register was 0.
 
        | jimsmart wrote:
        | No, but that's an easy enough mistake to make :) -- It's
        | called 38-column mode, and when enabled the VIC shrinks
        | both borders in by 8 pixels, and then offsets the screen
        | according to the x-scroll register bits.
        | 
        | [Edit: another source says it's actually 7-pixels hidden
        | on the left, and 9 on the right. But whatever: same
        | principle, the screen is shrunk by 16 pixels in total
        | horizontally]
        | 
        | Which meant that at most only 39 characters were visible
        | across the screen -- with two of those, one at each end
        | of the row, being partially visible -- and that applies
        | to both the character screen and its associated colour
        | RAM. Only 38 characters were visible when the x scroll
        | register was zero, and as soon as one shifted to a value
        | of 1-7, the 39th column became visible (and the 1st one
        | became partially offscreen). But the 40th column is never
        | visible when in that mode.
        | 
        | For more info see:
        | 
        | http://www.devili.iki.fi/Computers/Commodore/C64/Programm
        | ers...
        | 
        | "When scrolling in the X direction, it is necessary to
        | place the VIC-II chip into 38 column mode. This gives new
        | data a place to scroll from. When scrolling LEFT, the new
        | data should be placed on the right. When scrolling RIGHT
        | the new data should be placed on the left. Please note
        | that there are still 40 columns to screen memory, but
        | only 38 are visible."
        | 
        | -- But it's discussed on a handful of other pages too, if
        | you google.
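
The column counts in the comment above can be checked with a small model. It assumes the simplified picture of 8 hidden pixels on each side of a 320-pixel row (the comment's edit notes another source gives 7 and 9, which shifts the exact pixels but not the principle); the function name and parameters are invented for this sketch.

```python
def visible_columns(x_scroll, left_hidden=8, right_hidden=8):
    """List the character columns (0-39) that are at least partly visible.

    Simplified model: each column is 8 pixels wide, the row is shifted right
    by x_scroll pixels, and the borders cover left_hidden / right_hidden pixels.
    """
    first_visible = left_hidden
    last_visible = 320 - right_hidden - 1
    cols = []
    for col in range(40):
        start = col * 8 + x_scroll   # leftmost pixel of this column
        end = start + 7              # rightmost pixel of this column
        if end >= first_visible and start <= last_visible:
            cols.append(col)
    return cols
```

Under these assumptions the model gives 38 columns at scroll 0 and 39 at scroll values 1 to 7, and one edge column stays hidden at every scroll value, so 40 are never visible at once.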
 
        | vardump wrote:
        | Oh damn... and I did a fair bit of coding on C64 back in
        | the day. :-D
        | 
        | Somehow I thought it hid 4 pixels both sides. Totally
        | wrong.
        | 
        | PS. Then it's so unfair bad line still takes 40 cycles!
 
        | jimsmart wrote:
        | Please stop giving the above comment downvotes because of
        | this person's lack of knowledge: we all have to learn
        | things -- there was once a time I didn't know this
        | either.
        | 
        | It's not like vardump here was being a dick about
        | anything in their comment, cut them some slack!
 
        | vardump wrote:
        | Thanks. Although I really should have known better, wrote
        | scrolling routines 35 years ago.
        | 
        | It's scary time can corrupt memories we consider as
        | facts.
 
        | Luc wrote:
        | I made an 8-way full-screen scroller.
        | 
        | To avoid situations like this the player sprite at the
        | center of the screen had momentum, i.e. the sprite had to
        | rotate 180 degrees to change to the opposite direction,
        | giving a few frames time to set everything up.
 
        | jimsmart wrote:
        | That's a nice little trick, cheers for sharing. (Not that
        | I'll get a chance to use it these days, but still)
 
        | 6510 wrote:
        | > # etc., for every visible char onscreen in scrolling
        | area
        | 
        | For every changed char. (which is sometimes more and
        | sometimes less)
        | 
        | You could do them in order but if you're using only a few
        | characters you need only 1 LDA for each char. (How to do
        | this is left as a creative exercise for the reader)
 
        | jimsmart wrote:
        | But the overheads of tracking which characters might have
        | been changed here completely outweighed simply scrolling
        | / updating the whole thing. The code becomes too involved
        | in tracking changes, and fudging about with- / rewriting-
        | the splat code.
        | 
        | You can leave it as 'a creative exercise for the reader',
        | but that's because you can't solve this for the generic
        | case (i.e. any map the graphics artists might give you)
        | in fewer cycles than simply dealing with each and every
        | character, which is the worst case.
        | 
        | Processing that many bytes, and doing comparisons and
        | extra branches, simply becomes overheads, and, very
        | quickly, your code is slower than simply updating /
        | scrolling everything simply.
        | 
        | For the colour splat routine, having a giant, pre-
        | assembled block of immediate-mode load/store pairs for
        | every character is as optimal as it gets -- and handles
        | all cases -- on the C64, you only have a frame to update
        | the colour RAM (because it cannot be relocated/paged),
        | and you are generally chasing the scan beam to move that
        | much data before the next frame.
        | 
        | You don't have the luxury of having extra cycles to re-
        | write that block of code at runtime, and rewrite the code
        | that scrolls the data within that code, nor do you have
        | the luxury of having enough spare cycles to be comparing
        | data, and branching conditionally depending on if it has
        | changed or not.
        | 
        | Perhaps you misunderstand the technique I describe, or
        | perhaps you under-estimate the overheads required to
        | perform what you describe. Or perhaps both.
 
        | egypturnash wrote:
        | That's... that's horrible. Beautiful, but horrible.
        | 
        | Which kinda describes any advanced c64 technique, really.
 
        | jimsmart wrote:
        | Indeed, I totally agree on all points :)
 
        | justinlloyd wrote:
        | Yeah, compiled graphics and compiled colour tables, also,
        | a routine that could self-modify code in regions of RAM
        | to do the colour table writes. A slow set-up function at
        | level start would build the code to be JSR'd later in the
        | level. We did that on a few games on the C64 and the
        | Speccy and Beeb and Atari. Later used the same techniques
        | in DOS on PC. And of course, doing the same tricks but
        | with D0 through D7 and A0 through A6 on Atari ST and
        | Amiga. Also doing "stuff" in zero page because the
        | address loads were shorter. And avoiding 256-byte page
        | boundaries where possible because of the cycle penalty.
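
A rough sketch of what such a generated colour "splat" routine looks like as bytes, written as a Python generator rather than the setup code or assembler macros the commenters actually used. The function names are invented here; the opcodes and the $D800 colour RAM base are standard 6502/C64 values.

```python
COLOUR_RAM = 0xD800                        # C64 colour RAM base address
LDA_IMM, STA_ABS, RTS = 0xA9, 0x8D, 0x60   # 6502 opcodes


def build_colour_splat(colours, base=COLOUR_RAM):
    """Emit 'LDA #colour / STA address' for every cell, followed by RTS."""
    code = bytearray()
    for offset, colour in enumerate(colours):
        addr = base + offset
        # 5 bytes per cell: A9 cc 8D lo hi
        code += bytes([LDA_IMM, colour & 0x0F, STA_ABS, addr & 0xFF, addr >> 8])
    code.append(RTS)
    return bytes(code)


def splat_cycles(cell_count):
    """LDA #imm takes 2 cycles, STA abs takes 4, and the final RTS takes 6."""
    return cell_count * (2 + 4) + 6
```

For a full 40x25 screen this yields 1000 cells, which is where the 5001-byte figure mentioned earlier in the thread comes from (5 bytes per cell plus one RTS). Scrolling the colours then means rewriting the immediate operands inside this block.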
 
        | jimsmart wrote:
        | > A slow set-up function at level start would build the
        | code
        | 
        | Interesting, and good thinking :)
        | 
        | IIRC, when we used this technique on the C64, we didn't
        | build the code during init at runtime, we actually built
        | the code in the dev environment, using macros, so it got
        | built at assembly/compile time. So we skipped the small
        | time hit at runtime init, at the expense of a slightly
        | longer load time for the user (and a tiny bit longer on
        | our assembly/compile times, although that was fairly
        | negligible cos we were building on PCs).
 
    | jimsmart wrote:
    | Ex C64-games coder here! -- If your sprite multiplexer was
    | taking most of CPU during the screen draw time, then honestly
    | it was not a particularly great multiplexer! ;)
    | 
    | Most decent multiplexers took just a scanline or two/three,
    | multiple times down the screen (i.e. whenever relocating any
    | already drawn sprites) -- often with decent sized gaps (time
    | when the CPU wasn't involved in manipulating sprites and
    | could do other things), with a larger chunk during the
    | offscreen period / at the bottom of the screen, when one was
    | prepping the data (mostly sorting the sprite's y-coords) for
    | the next frame's screen draw.
    | 
    | -- During debugging/etc, we'd often enable colour changes to
    | the screen border, at the beginning and end of the
    | multiplexer code (for both the interrupt stuff in the
    | playfield, and the non-playfield section), so we could
    | visually see how it was working/performing.
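
The "sort by y, then reuse hardware sprites down the screen" idea can be sketched like this. It is a planning model only, with invented names: a real multiplexer runs from raster interrupts in assembly and needs extra scanlines of margin that are ignored here.

```python
SPRITE_HEIGHT = 21   # C64 hardware sprites are 21 pixels tall
HW_SPRITES = 8       # the VIC-II provides 8 hardware sprites


def plan_multiplex(y_coords):
    """Sort virtual sprites by y and assign each to a hardware slot.

    Returns (y, slot) pairs in display order. Raises ValueError if a slot
    would be needed again before its previous sprite has been fully drawn.
    """
    plan = []
    slot_free_at = [0] * HW_SPRITES  # first scanline at which each slot is reusable
    for i, y in enumerate(sorted(y_coords)):
        slot = i % HW_SPRITES
        if y < slot_free_at[slot]:
            raise ValueError(f"too many sprites overlap near line {y}")
        slot_free_at[slot] = y + SPRITE_HEIGHT
        plan.append((y, slot))
    return plan
```

Each entry in the plan corresponds to one short burst of register writes, which fits the description of the CPU being busy for only a few scanlines at a time.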
 
      | vidarh wrote:
      | Sure, the "nice" way of doing it is to rely on the raster
      | interrupt. But I've also seen way too much C64 code where
      | pretty much everything ran in the interrupt handler, with
      | associated stupid busy waiting because it saved people from
      | having to synchronise. I'd guess more commonly for cheap
      | and cheerful ports from less capable machines, but it's
      | been a couple of decades since I've actually looked at any
      | of this code.
 
      | cesaref wrote:
      | 64 coder here too!
      | 
      | The border changing thing has just reminded me how bad the
      | development process was using the commodore assembler with
      | a 1541 drive which was horribly slow. assemble, dump image,
      | reboot, crash, reboot, load assembler, try and work out
      | what had happened :)
      | 
      | At some point I ended up with a PC running a system called,
      | I think PDS, which was a cross assembler with dongle to
      | push the image straight into the memory of the C64. I even
      | think you could inspect and change memory on the running
      | machine - it was amazing!
 
        | jimsmart wrote:
        | Yeah, we all used PDS too, although not originally.
        | Pretty good system, particularly for that era, and
        | cost/capability-wise (though they weren't that cheap, and
        | folk eventually started cloning the boards for them,
        | IIRC).
        | 
        | I remember it was annoying to have only 8 main source
        | files in PDS though, most big projects went past the 8
        | files of however many kb (although it could also handle
        | include files, which was how one got around that limit).
        | 
        | Although when I actually started out as a C64 games dev,
        | my dev system was a BBC Micro B, linked to a C64. Not
        | quite as cool as PDS, but it could assemble code 2x the
        | speed of the C64 (the processor clocked twice the speed
        | on the Beeb), and it was great having a separate 'host'
        | system for development.
 
        | jimsmart wrote:
        | Here's a link to info about the PDS kit, in case anyone
        | is interested:
        | 
        | https://www.cpcwiki.eu/index.php/PDS_development_system
 
    | mgkimsal wrote:
    | Just watched a video of C64 "Seven Cities of Gold" with a
    | colleague yesterday, trying to convey just how... exciting
    | that was in 1984. Watching on YouTube, I had forgotten just
    | how small the playing 'viewport' was. It seems like possibly
    | a more extreme example - I don't remember too many other
    | games having an action viewport that small.
 
| kken wrote:
| This doesn't even cover all the neat assembly tricks with self-
| modifying code that you would actually use on a 6502 to speed up
| memory transfer.
 
  | MatthiasWandel wrote:
  | For games that scrolled the screen, those had to happen
  | essentially between scans, so a lot of tricks were employed.
  | Fixed addresses in the code, unrolled loops, and self modifying
  | code to avoid the expensive zero page indirect indexed
  | addressing mode (the slowest instruction on the CPU). The other
  | trick was to start moving the first line of screen just after
  | it got displayed, which would give you nearly two jiffies to do
  | it before the scan caught up to you on the next frame.
 
    | vardump wrote:
    | No need, it can easily happen during the scan. As long as the
    | scan and update memory location never meet, there's
    | absolutely no problem.
 
    | natly wrote:
    | It's crazy how much work went into those old games. I have a
    | feeling those programmers weren't even paid that well
    | considering how few people owned computers back then (so the
    | market can't have been large).
 
      | wkearney99 wrote:
      | If you ever play(ed) the Atari 2600 version of River Raid,
      | you got to witness some SERIOUS tweaking to work around the
      | limits of that console. Every scanline processed on the fly
      | during the vertical blanking interval. No screen buffer.
      | The animation was soooo smooth.
 
      | kabdib wrote:
      | My first job out of college at Atari in 1982, writing game
      | cartridges for the 400/800 computers, paid $25K a year. My
      | first raise after a year was to $30K.
      | 
      | There were programmers in other divisions making royalties
      | off of their games. Tod Frye famously got $700K or so for
      | his terrible version of 2600 Pac-Man (it was terrible not
      | because he was a bad programmer, but because marketing
      | decided that 2K of ROM had to be enough, and he was smart
      | enough to pull off a miracle . . . of sorts).
      | 
      | Also, the OP apparently doesn't know how to unroll loops,
      | which is the first thing you do to your game's hot spots.
      | (Never had to resort to self-modifying code).
 
  | vikingerik wrote:
  | I did this in a homebrew Atari 2600 game. For a Space Invaders
  | grid of sprites. Each is triggered by writing to a register, as
  | the electron beam scans through to display each sprite.
  | 
  | The interval between sprites on the same scanline is 3 cpu
  | cycles. That's a single 6502 instruction, the write to that
  | register. How do you do any kind of load or compare instruction
  | along with that to decide whether to display that sprite?
  | 
  | The answer was to copy that stream of instructions to RAM ahead
  | of time, and replace each write to a missing invader with a no-
  | op. The code is here if anyone wants to see (the "inv3" demo):
  | http://dos486.com/atari/
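
A hedged sketch of the patching idea, not taken from the inv3 source. It assumes the "no-op" must keep the same length and 3-cycle timing as the store it replaces, and uses BIT zero-page as a stand-in for that; the register address and function name are placeholders chosen for illustration.

```python
STA_ZP = 0x85        # STA zero-page: 2 bytes, 3 cycles
BIT_ZP = 0x24        # BIT zero-page: 2 bytes, 3 cycles, only affects flags
TRIGGER_REG = 0x1B   # assumed register address, for illustration only


def build_row(alive):
    """Return machine code with one 3-cycle instruction per invader slot.

    Present invaders get the real store; missing ones get a same-sized,
    same-timed instruction so the beam timing of later slots is unchanged.
    """
    code = bytearray()
    for present in alive:
        code += bytes([STA_ZP if present else BIT_ZP, TRIGGER_REG])
    return bytes(code)
```

Because every slot stays at 2 bytes and 3 cycles, killing an invader only requires changing one opcode byte in the RAM copy.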
 
    | krallja wrote:
    | > copy that stream of instructions to RAM ahead of time
    | 
    | Even this is easier said than done: there are only 128 bytes
    | of RAM in the entire machine, and that has to suffice for
    | global variables and stack memory in addition to storing
    | modified code like this!
 
  | rasz wrote:
  | Afaik it's <120KB/s with all the tricks. 6502 was hand designed
  | and brain optimized for clever use of available silicon real-
  | estate, roughly 20% of CPU bus cycles are dead/bogus/useless.
  | RTS wastes 3 of its 6 cycles, RTI 2 of 6 wasted, JSR 1 of 6
  | wasted, all increments at least 1 cycle wasted etc. Sad to
  | think state machine handling DMA transfers in REU is probably
  | less than 50 macrocells, and Commodore ran its own fab, they
  | could have built REU DMA into the C128 and it would cost cents.
 
    | mywittyname wrote:
    | Is there a way to make a compatible 6502 variant that doesn't
    | have this waste?
 
      | krallja wrote:
      | "The 100 MHz 6502" does a different clever thing - it
      | copies all the dedicated RAM and ROM into its own FPGA
      | copy. Then it can perform 7 to 25 instructions before the
      | next external read/write cycle!
      | 
      | http://www.e-basteln.de/computing/65f02/65f02/
 
      | rasz wrote:
      | https://en.wikipedia.org/wiki/CSG_65CE02#Pipeline_improveme
      | n... fixed most painful ones, but afaik not all dead
      | cycles. But it was 1988 and Commodore didn't bother putting
      | it into anything other than some IO card for the AMIGA, not
      | to mention it still did nothing to cover slowness of moving
      | data around. Japanese decided to do something about it for
      | TurboGrafx-16 in 1987 Hu6502
      | http://shu.emuunlim.com/download/pcedocs/pce_cpu.html
      | 
      | Transfer Alternate Increment (TAI), Transfer Increment
      | Alternate (TIA), Transfer Decrement Decrement (TDD),
      | Transfer Increment Increment (TII) - pretty much x86 'rep
      | movsb', except not great at 6 cycles per byte (~160KB/s).
      | For contrast 5 years older 80286 already did 'rep movsw' at
      | 2 cycles per byte. 6 years later Pentium did 'rep movsd' at
      | 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb'
      | full cachelines at a time at full cache/memory controller
      | speed.
 
        | JPLeRouzic wrote:
        | I think there are tricks to rewrite the microcode on
        | Pentium, do similar tricks exist for 80286, 386 or 68K?
        | 
        | It would be fun to reconfigure one as a high speed 6502.
 
| cmrdporcupine wrote:
| The 65816's MVP/MVN opcodes can do bulk transfers a teeny bit
| faster.
 
  | lscharen wrote:
  | For more 16-bit 65816 context -- other than for space-savings,
  | these instructions are never used when performance is needed
  | due to the low effective throughput of 7 cycles per byte. A
  | basic unrolled loop using 16-bit instructions is 20 - 30%
  | faster and specialized graphics routines that are able to use
  | the stack can approach 3 cycles per byte using the PEA and PEI
  | instructions.
 
    | cmrdporcupine wrote:
    | I'll defer to you I guess, as you seem to know more about
    | this than me. The only thing is searching through the
    | 6502.org forums I don't see a consensus on this? Plenty of
    | people talking about the advantages of MVN/MVP for bulk
    | transfers. I seem to recall doing the cycle counting myself
    | at one point, too, and finding it advantageous.
    | 
    | One neat trick (I remember reading about from Alan Cox I
    | believe) if you have control over the hardware is to memory
    | map I/O devices like serial input / output such that
    | incrementing addresses starting at a given address all point
    | to the same physical device/register. E.g. allocate 256
    | contiguous bytes in your memory map to point to the same
    | thing. This way you can do bulk I/O transfers to/from memory
    | using MVP/MVN instead of "get a byte, put a byte" instruction
    | by instruction.
 
      | rasz wrote:
      | The trick you describe was being used by Silicon Valley
      | Computer ADP50L IDE controller from early nineties (1991).
      | Memory mapped I/O instead of traditional x86 port access
      | lets you skip doing manual loop for 'rep movsb', result can
      | be 50% speed bump
      | 
      | https://forum.vcfed.org/index.php?threads/performance-of-
      | lo-...
      | 
      | Port IO Read Speed : 219.39 KB/s
      | 
      | MMIO Read Speed : 310.77 KB/s
      | 
      | Some variants of XTIDE hardware also implement this, as
      | does the free bios.
 
      | ksherlock wrote:
      | MVP/MVN are 7 cycles per byte.
      | 
      | If you're moving memory around in bank 0 (or have memory
      | mapping), you can use the direct page register to
      | read/write anywhere in bank 0 and the stack to read/write
      | anywhere in bank 0.
      | 
      | 16-bit LDA dp, PHA is 4 + 4 = 8 cycles or 4 cycles per
      | byte. Best case would be if you know it's constant data
      | before hand, eg, LDA #0, PHA, PHA .... 2 cycles per byte!
      | 
      | For general purpose copying MVP and MVN are easier and have
      | better code density.
 
        | mmphosis wrote:
        | _2 cycles per byte!_ It takes 4 cycles for PHA to push
        | the 16-bit Accumulator, two bytes, onto the stack.
        | There's also 16-bit PHD, PHX and PHY.
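
The per-byte figures quoted in this sub-thread can be tabulated as below. The cycle counts are the ones given in the comments; the table layout, function names and the example clock rate used in the note are assumptions added for illustration.

```python
from fractions import Fraction

# (total cycles, bytes moved) per repetition, as quoted in the comments above
TECHNIQUES = {
    "MVN/MVP block move": (7, 1),
    "16-bit LDA dp + PHA": (4 + 4, 2),
    "16-bit PHA of a constant": (4, 2),
}


def cycles_per_byte(name):
    """Exact cycles per byte for one of the techniques in the table."""
    cycles, moved = TECHNIQUES[name]
    return Fraction(cycles, moved)


def bytes_per_second(name, clock_hz):
    """Upper bound on throughput, ignoring loop and setup overhead."""
    return int(clock_hz / cycles_per_byte(name))
```

At an assumed 2.8 MHz clock, for example, the block move works out to 400,000 bytes per second under this model, while the constant-fill PHA case would be 3.5 times that.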
 
      | cmrdporcupine wrote:
      | Ah here it is:
      | http://forum.6502.org/viewtopic.php?f=2&t=5035 referencing
      | a now-lost G+ post from Alan Cox:
      | 
      |  _" The emulator also has a fun hack for disk performance
      | I'm hoping will get replicated in some of the upcoming
      | retro 65C816 board design. Like the 6502 the 65C816 sucks
      | at continually reading from an MMIO port and writing it to
      | sequential memory locations. It sucks less than a 6502
      | because you've got 16bit index registers, but at the same
      | clock it was doing about 100K/second that a Z80 can do 250K
      | (with ini loops). The revised emulated disk interface has
      | the same mmio port replicated across a chunk of address
      | space and this allows a block move instruction (MVN) to do
      | all the work at 6 clocks/byte. At that point the 65C816
      | suddenly jumps to twice as fast as the Z80 on disk I/O."_
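
A toy model of the mirrored-port trick quoted above. The window address, class and method names are all hypothetical; the point is only that when the address decoder ignores the low address bits, an incrementing block move keeps reading the same device register.

```python
class Bus:
    """Toy address space: RAM plus one data port mirrored over a 256-byte window."""

    PORT_BASE = 0xC000   # hypothetical window address
    PORT_SIZE = 0x100

    def __init__(self, device_bytes):
        self.ram = bytearray(0x10000)
        self.device = iter(device_bytes)   # stands in for the disk data stream

    def read(self, addr):
        if self.PORT_BASE <= addr < self.PORT_BASE + self.PORT_SIZE:
            return next(self.device)       # low address bits are ignored
        return self.ram[addr]

    def write(self, addr, value):
        self.ram[addr] = value

    def block_move(self, src, dst, count):
        """Stand-in for a block move: copy count bytes, incrementing both addresses."""
        for i in range(count):
            self.write(dst + i, self.read(src + i))
```

With a 256-byte window, a single block move in this model can pull up to 256 bytes from the device before the source address leaves the mirrored region.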
 
| joosters wrote:
| On the original ZX Spectrum, you could measure the write
| bandwidth visually, because on startup it would write the value 2
| into each byte in memory (which included the graphics RAM). It
| would then re-read and decrease the value of each byte twice, to
| check for any faulty memory.
| 
| You could see these patterns on-screen as the reads and writes
| took place (I think it took about a couple of seconds to do this
| to 48k of RAM)
 
  | becurious wrote:
  | You could change the stack pointer to the top of the area of
  | memory you wanted to fill and then use PUSH to fill at I think
  | 11 clock cycles per two bytes. It was faster than unrolled LDI
  | or LD (HL),A followed by INC HL. It would be filling memory in
  | the wrong direction for a Rainbow processor but you could use
  | it for repeating patterns. I think I did a checkerboard pattern
  | that would shift every frame and it was pretty smooth.
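
A small simulation of the stack-fill trick, assuming standard Z80 PUSH behaviour (high byte stored at SP-1, low byte at SP-2). The function name and the test pattern are invented for the sketch. For comparison, the usual documented timings are 11 T-states for PUSH (5.5 per byte), 16 for LDI, and 7 + 6 = 13 for LD (HL),A followed by INC HL.

```python
PUSH_T_STATES = 11  # per PUSH, i.e. 5.5 T-states per byte written


def push_fill(memory, top, pairs, value16):
    """Fill memory downward from `top` the way repeated PUSH would.

    Each simulated PUSH writes the high byte at SP-1 and the low byte at
    SP-2, then leaves SP decremented by 2. Returns the final SP.
    """
    sp = top
    hi, lo = value16 >> 8, value16 & 0xFF
    for _ in range(pairs):
        memory[sp - 1] = hi
        memory[sp - 2] = lo
        sp -= 2
    return sp
```

The fill proceeds from high addresses to low, which is the "wrong direction" mentioned above, and a two-byte value naturally produces a repeating pattern.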
 
___________________________________________________________________
(page generated 2022-06-16 23:01 UTC)