|
| forinti wrote:
| I once thought about reading HD floppies on a BBC Micro (which
| can handle SD and DD). But it turns out it can't handle the speed
| at which the bits come in (500kbps).
|
| SD and even DD are fine (125kbps and 250kbps), so you could read
| 360KB floppies from a PC.
| spc476 wrote:
| The 6502 doesn't have a pipeline, so it's quite easy to count
| instruction cycles and find out how much time a given piece of
| code will take. I used this technique to bit-bang a serial port
| back in the day (given two one-bit ports with the CPU doing all
| the work because the system was too cheap to have an actual
| UART).
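|
| A minimal sketch of the cycle-counting part (assuming a 1 MHz 6502;
| the baud rate and loop constant here are made up for illustration):
|
|     ; Burn roughly one bit time between transitions by adding up the
|     ; documented cycle count of every instruction. At 9600 baud on a
|     ; 1 MHz part a bit is ~104 cycles; pad or trim the caller with
|     ; NOPs (2 cycles each) to land on the exact figure.
|     bitdelay:
|             LDX #19         ; 2 cycles
|     bd1:    DEX             ; 2 cycles
|             BNE bd1         ; 3 cycles taken, 2 on the final pass
|             RTS             ; 6 cycles
|     ; JSR (6) + LDX (2) + 18*(2+3) + (2+2) + RTS (6) = 108 cycles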
| joosters wrote:
| To be really pedantic, there's a big difference between 'memory
| bandwidth' and 'memory transfer speed'. The former is just
| reading from (or writing to) a block of memory, while the latter is
| copying data from one location to another, so the 'memory transfer
| speed' is going to be lower.
| jmull wrote:
| I think it's actually not that pedantic.
|
| "Memory bandwidth" is being used in marketing materials today,
| so it's a little useful to understand what it means. (The
| author of this article confuses it with memory transfer,
| probably others do as well.)
| bluemax wrote:
| Wow, this brings back memories from more than 3 decades ago. I
| created a routine on the C64 to copy memory and calculated its
| performance at the time at around 25KB/sec.
|
| The first version contained a memory corrupting bug that took
| some time to figure out. Depending on the locations of the source
| and destination you have to start copying forwards from the
| beginning of the source, or backwards from the end. If there's an
| overlap you risk overwriting the source before it is copied to
| the destination.
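|
| A minimal 6502 sketch of that rule, copying a single 256-byte page
| (all labels are made up; src and dst are zero-page pointer pairs
| assumed to be set up beforehand):
|
|     src = $FB               ; zero-page pointer to the source
|     dst = $FD               ; zero-page pointer to the destination
|
|     ; forward copy - safe when the destination is below the source
|             LDY #0
|     fwd:    LDA (src),Y
|             STA (dst),Y
|             INY
|             BNE fwd         ; 256 bytes copied
|
|     ; backward copy - safe when the destination is above the source
|             LDY #255
|     bwd:    LDA (src),Y
|             STA (dst),Y
|             DEY
|             BNE bwd
|             LDA (src),Y     ; Y = 0: move the final byte
|             STA (dst),Y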
| scarface74 wrote:
| I immediately noticed that in one of the code samples he is
| loading and storing data from memory that's not in the first page
| - the zero page, memory locations $00-$FF - and access to and from
| the first page takes one less clock cycle.
|
| The LDA, STA, etc. instructions for zero-page access are different
| opcodes than their two-byte-address (absolute) equivalents.
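|
| For example (addresses picked arbitrarily):
|
|     LDA $80         ; zero-page opcode $A5: 2 bytes, 3 cycles
|     STA $81         ; zero-page opcode $85: 2 bytes, 3 cycles
|
|     LDA $0400       ; absolute opcode $AD:  3 bytes, 4 cycles
|     STA $0401       ; absolute opcode $8D:  3 bytes, 4 cycles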
| mmphosis wrote:
| This also misses a lot of other tricks: self-modifying unrolled
| code, keeping track of blocks of memory that don't need to be
| updated, or blocks of memory that have the same value. Memory
| moves may not be the fastest way.
|
| 0 HIMEM: 5608
| 1DATA553256330092213902139021393213504523826353650303725262750384
| 56153058173120922939029328358454788756458365750371488455503510839
| 66132653706165276445377489621322135821322334213502130051520282036
| 2DATAQLNZQLNZQAQQDSAQRDSAQQDSAQVDSAQXCKDNAPFXANAQXXANANXNXNXQXNXC
| QXKXNZQXKXNZQCQODMAQODMAUUYQXCKATAUVQXCKMNXQXKANXUAUTCQXKENXUMUSQ
| JHSAHFQZTXRDFQZTAOCQZTAOBQHDSAPDSAQADSANZNZKDSAQZDSAXZQZOZXZUAXZJ
| 3 READ L$: READ H$
| 4 FOR I = 1 TO LEN (L$)
| 5POKE767+I,10*(ASC(MID$(H$,I,1))-65)+VAL(MID$(L$,I,1))
| 6 NEXT
| RUN
| HGR : CALL 768: CALL 5608
| metadat wrote:
| Is Mario Bros a ripoff of Sam's Journey? Or vice versa?
|
| Either way, beautiful.
| gs17 wrote:
| Sam's Journey looks to be a 2017 release, so it definitely
| didn't inspire Mario.
|
| [0] https://www.knightsofbytes.games/
| cldellow wrote:
| Vice versa, Sam's Journey was released in 2017:
| https://www.knightsofbytes.games/samsjourney
| dusted wrote:
| I wouldn't call the 6502 a RISC CPU. It was clearly designed for
| humans to program; it has multiple, fairly complex addressing modes
| and instructions to make it easier for us.
|
| Sure, it's a small instruction set compared to modern CPUs, but
| RISC is an idea, not a number.
|
| I'd venture to say that RISC is designing with the goal of making
| very efficient instructions and allowing very efficient compilers to
| be written for it. It's the idea of having one fast way of doing
| something rather than multiple convenient ways, because the
| compiler doesn't care, and compiler vendors appreciate not having
| to choose between multiple almost-identical instructions that may
| or may not be faster in some particular case.
| jsrcout wrote:
| Don't remember where I saw this, but someone said the 6502 was
| really an RTCC (Reduced Transistor Count Computer).
| [deleted]
| pvg wrote:
| It's a joke, as the article says.
| vardump wrote:
| Two nits to pick from this article:
|
| While the article does mention it ignores loop unrolling, it's a
| bit disingenuous, because that almost DOUBLES performance and
| it's what nearly all real world code is doing.
|
| Also Sam's Journey PAL version does not need any kind of DMA
| transfer tricks. NTSC version is just a tiny bit behind in
| timing, so to be glitch free, REU is used. PAL version still
| works with minor glitches on an NTSC system.
|
| This is because NTSC has 263 * 63 - 25 * 40 = 15569 cycles
| available per frame (ignoring those stolen by sprites) and PAL
| 263 * 63 - 25 * 40 = 18656 cycles (again, ignoring sprites).
|
| The difference is enough that the NTSC version can't move
| required 2000 bytes of color RAM and character RAM in time in the
| worst case without REU.
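|
| A rough sketch of where the unrolling factor comes from (all labels
| are made up; cycle counts assume page-aligned data and no page
| crossings):
|
|     src    = $FB            ; zero-page pointer pair (source)
|     dst    = $FD            ; zero-page pointer pair (destination)
|     source = $C000          ; fixed, page-aligned source address
|     dest   = $E000          ; fixed, page-aligned destination address
|
|     ; rolled loop, indirect indexed: 5 + 6 + 2 + 3 = 16 cycles/byte
|             LDY #0
|     slow:   LDA (src),Y
|             STA (dst),Y
|             INY
|             BNE slow
|
|     ; unrolled 4x over one 256-byte page: 4 + 5 = 9 cycles/byte,
|     ; plus the DEY/BPL overhead spread over 4 bytes, ~10 cycles/byte
|             LDY #63
|     fast:   LDA source+0,Y
|             STA dest+0,Y
|             LDA source+64,Y
|             STA dest+64,Y
|             LDA source+128,Y
|             STA dest+128,Y
|             LDA source+192,Y
|             STA dest+192,Y
|             DEY
|             BPL fast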
| Rediscover wrote:
| I'm remembering (possibly incorrectly) PAL being 312 (not 263).
|
| Is that what you intended?
| vardump wrote:
| Yeah. I accidentally left that value wrong after a copy
| paste. But the result is right. :-)
| Joyfield wrote:
| My DOCSIS 3.1 Internet connection has more download bandwidth
| than my Amiga 500 had to RAM. Latency, not so much.
| cmrdporcupine wrote:
| Anybody else interested in writing a WASM VM for the 6502 or
| 65816 etc? This was my brainwave this week. I think this would be
| a supremely nerdy fun thing to do.
| DeathArrow wrote:
| Beat that, Apple!
| cestith wrote:
| Apple used the very same processor family at one time. ;-)
|
| They've come a long way.
| NobodyNada wrote:
| For a very loose definition of "same processor family", they
| still do -- ARM is sort of a spiritual successor to the 6502:
| https://en.wikipedia.org/wiki/ARM_architecture_family#Histor...
|
| ARM was designed by the team at Acorn that had worked on the
| BBC Micro, which used a 6502. They decided to design a custom
| processor because it felt none of the 16- or 32-bit
| processors on the market at the time met the standard set by
| the 6502 for simplicity and low cost. So, they designed their
| own architecture which took cues from both the cutting-edge
| RISC research in academia, and the simple practicality of the
| 6502.
|
| (On a similar note: the 6502's main competitor, the Zilog
| Z80, is an early ancestor of x86! The Z80 is an enhanced
| clone of the Intel 8080, which of course the 8086 was heavily
| based on.)
|
| This legacy still shows up today in the instruction
| mnemonics: ARM uses "branch" naming (BEQ - branch if equal,
| BCS - branch if carry set, etc) because that's what the 6502
| used, whereas x86 spells it "jump" (JE, JC, etc.). ARM uses
| LDR/STR to load and store registers from memory (like the
| 6502's LDA/LDX/LDY/STA/STX/STY), whereas x86 just spells
| everything "MOV". ARM only uses memory-mapped I/O to access
| hardware, whereas x86 has separate input and output ports.
| cestith wrote:
| The 6502 was a clone-ish of the Motorola 6800 made to be
| lower cost. The 6800 led to the 6809 (another
| competitor, used by the Tandy CoCo and IIRC the Dragon) and
| to the 68000 series, used by Apple in the Mac, Sun in its
| early systems, NeXT, Amiga, Atari in their later systems,
| and more. That led to the PowerPC partnership of Motorola,
| Apple, and IBM.
|
| PowerPC was outliving its useful life due not to ISA, but
| manufacturing limitations. So Apple went to Intel, but that
| wasn't fit for mobile. Apple partnered with ARM to make
| their mobile chips. Then their mobile chips grew into the
| M1 and M2 along with ARM, bringing them back to a RISC-ish
| platform like they had with PowerPC. So it's sort of a dual
| path back to the same place.
| NobodyNada wrote:
| > So Apple went to Intel, but that wasn't fit for mobile.
| Apple partnered with ARM to make their mobile chips.
|
| There's a lot of interesting history there too: in 1990,
| after seeing the first-generation ARM CPU, Apple
| partnered with Acorn to co-found ARM Ltd and develop a
| mobile processor for the Apple Newton. Although the
| Newton was a failure, ARM was very successful and powered
| pretty much the entirety of the mobile device revolution
| -- including of course the iPod and iPhone.
|
| Apple's co-founder status gives them a lot of influence
| over the ARM architecture -- they led the AArch64 design
| process, and they seem to be allowed to do things that
| even other architectural licensees aren't allowed to do,
| like implementing custom instructions in their ARM cores:
| https://news.ycombinator.com/item?id=29783549
|
| And Apple's iteration of ARM owes a lot to the PowerPC
| world as well -- Apple's processor design team was
| originally PA Semi, a company that designed PowerPC
| cores.
| klelatti wrote:
| > they led the AArch64 design process
|
| Interesting - is there a reference for this?
| NobodyNada wrote:
| Here's a Twitter thread from a former Apple engineer:
| https://twitter.com/stuntpants/status/1346470705446092811
|
| > arm64 is the Apple ISA, it was designed to enable
| Apple's microarchitecture plans. There's a reason Apple's
| first 64 bit core (Cyclone) was years ahead of everyone
| else, and it isn't just caches.
|
| > Arm64 didn't appear out of nowhere, Apple contracted
| ARM to design a new ISA for its purposes. When Apple
| began selling iPhones containing arm64 chips, ARM hadn't
| even finished their own core design to license to others.
|
| > ARM designed a standard that serves its clients and
| gets feedback from them on ISA evolution. In 2010 few
| cared about a 64-bit ARM core. Samsung & Qualcomm, the
| biggest mobile vendors, were certainly caught unaware by
| it when Apple shipped in 2013.
|
| > Apple planned to go super-wide with low clocks, highly
| OoO, highly speculative. They needed an ISA to enable
| that, which ARM provided.
|
| > M1 performance is not so because of the ARM ISA, the
| ARM ISA is so because of Apple core performance plans a
| decade ago.
| klelatti wrote:
| Very interesting - many thanks!
|
| Edit: I'm a bit puzzled by the claim that Apple was
| selling Aarch64 before Arm had finished their first
| design - A7 announced at end 2013 but A53 appeared in
| 2012?
| NobodyNada wrote:
| It looks like A53 was _announced_ in October 2012, but
| I've found no indication of whether the design was
| actually finished by then [0]. And remember that ARM just
| sells IP and other companies are responsible for
| manufacturing it; it doesn't look like anyone actually
| produced A53 cores until 2015 [1] -- whereas Apple was
| shipping actual consumer products with A7's in them by
| October 2013.
|
| [0]: https://www.techspot.com/news/50656-arm-announces-64-bit-cor...
|
| [1]: https://en.wikichip.org/wiki/arm_holdings/microarchitectures...
| klelatti wrote:
| Very fair point. OTOH there was a lot of detailed info on
| the A53 available in 2013 and SoCs were being announced
| with it.
|
| I suspect this thread may be slightly exaggerating the
| position, but it's certainly the case that Apple were well
| ahead of all the competitors - and no doubt they were
| deeply involved in the ISA design.
| cmrdporcupine wrote:
| I honestly don't think there's any kind of straight line
| from the 6809 to the 68000. They share little in common
| other than the '68' prefix and coming from the same
| company and being big endian. The instruction sets are
| very different. Designed by different teams. The
| peripheral chip set and bus management was different too.
|
| The 68k shares more with 1970s minicomputers, especially the
| PDP-11 and/or VAX architectures, than with any MPU that
| preceded it.
| DeathArrow wrote:
| >the Zilog Z80, is an early ancestor of x86! The Z80 is an
| enhanced clone of the Intel 8080, which of course the 8086
| was heavily based on
|
| I owned a Z80-based computer when I was 8 to 10 years old.
| To me, its instruction set and memory access bear no
| resemblance to the 8086's.
|
| They seem like very distant relatives.
| klelatti wrote:
| The 8086 was designed to allow automated translation of
| 8080 assembly to 8086 assembly - so the instruction set
| may 'look' different but in fact has a lot in common.
|
| It's not quite right to call the Z80 an ancestor of the
| 8086 either, but they're certainly closely related due to
| the common inheritance from the 8080.
| NobodyNada wrote:
| Yeah, perhaps more of an uncle than a direct ancestor :)
| klelatti wrote:
| Indeed - someone should do a family tree of CPUs!
| cestith wrote:
| We need to decide where the NEC v20, v30, v40, and v50
| live.
| krallja wrote:
| And the NSC-800, which is like a Z80 with 8085 half-
| interrupts!
| klelatti wrote:
| It's the offspring of the marriage of Z80 and 8085!
| MarkusWandel wrote:
| This also misses loop unrolling, combined with an assembly
| language version of "Duff's device" to be able to do an arbitrary
| number of transfers even if your loop is unrolled to, say, 8
| transfers.
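|
| A loose 6502 sketch of the entry-point part of that trick, in
| ca65-style syntax (everything here is made up: it copies "count"
| bytes, 1 to 8, by jumping into an unrolled chain of moves at the
| right spot; the partially unrolled version wraps a chain like this
| in a loop and only enters it mid-body on the first pass):
|
|     src    = $C000          ; assumed source and destination
|     dst    = $C100
|     count  = $02            ; zero-page: number of bytes to move (1-8)
|     jmpvec = $03            ; zero-page: 2-byte jump vector
|
|             LDX count
|             LDA entlo,X      ; look up the entry point for this count
|             STA jmpvec
|             LDA enthi,X
|             STA jmpvec+1
|             JMP (jmpvec)
|
|     ent8:   LDA src+7
|             STA dst+7
|     ent7:   LDA src+6
|             STA dst+6
|     ent6:   LDA src+5
|             STA dst+5
|     ent5:   LDA src+4
|             STA dst+4
|     ent4:   LDA src+3
|             STA dst+3
|     ent3:   LDA src+2
|             STA dst+2
|     ent2:   LDA src+1
|             STA dst+1
|     ent1:   LDA src+0
|             STA dst+0
|             RTS
|
|     entlo:  .byte 0, <ent1, <ent2, <ent3, <ent4, <ent5, <ent6, <ent7, <ent8
|     enthi:  .byte 0, >ent1, >ent2, >ent3, >ent4, >ent5, >ent6, >ent7, >ent8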
|
| This stuff used to matter! I had an NCR5380 chip on an Amiga,
| simple, memory mapped I/O, no DMA or interrupts. To get a tape
| drive to stream (remember that?) the byte transfer loop really
| had to be tweaked. But once fully tweaked, "whoooooooosh" instead
| of "chugga chugga chugga".
|
| And truly heroic programming techniques had to be employed on the
| C64 to do X/Y smooth scrolling games. Often a static part of the
| screen, conveniently displaying scores etc, existed to make it
| work - there was just enough bandwidth to do 80% of the screen,
| say, so you find an excuse to keep the rest of it static.
|
| I kinda miss those days, and I kinda don't. I guess it was good
| to have experienced them.
| djmips wrote:
| Those days still exist! If you want your mind blown, watch the
| Epic Games Nanite talk from last year's SIGGRAPH, where the core
| rendering of the dense vertex data is done directly in compute
| shaders, i.e. software rendered, instead of using the hardware
| rasterizer, which has a minimum 4-pixel invocation overhead that
| gets expensive with very small triangles.
|
| This is just one example of something that's happening every day;
| there is much, much more, like hair rendering in EA's FIFA soccer
| games or automated financial trading software running on GPUs.
|
| There's a whole world of applications where people are still
| concerned with every last cycle of performance just like in the
| C64 days.
| djmips wrote:
| Here I'll save you the trouble of trying to find the video.
|
| https://www.youtube.com/watch?v=eviSykqSUUw&list=PLabw4gCouT...
| amelius wrote:
| We're now doing sort of the same tricks, but with power
| management on mobile devices.
| djmips wrote:
| Kind of off topic for 6502 memory copy speeds, but with regard
| to scrolling, a pretty cool software hack emerged for the
| C64 (called VSP) where you could trick the poor VIC chip into
| starting to scan out the screen from a position later in memory.
| Move the start by one character and the whole screen shifts left
| by 8 pixels. You only need to repaint a vertical column for this
| 'coarse' scroll instead of moving the entire screen of
| characters. This is something that should have been built into
| the hardware, and it was very useful on other systems that had
| that ability (like the NES, for example).
|
| With it you can reduce the amount of memory you need to copy
| every 8 pixels (the 8 pixel part can be done with smooth scroll
| registers).
|
| There's a thread and example code on github here.
| https://www.lemon64.com/forum/viewtopic.php?t=70539
|
| Also note it's such a terrible hack on the DRAM that it doesn't
| work on all C64s and there's a technical discussion about that
| here. https://www.linusakesson.net/scene/safevsp/index.php
|
| Hardware mod if VSP doesn't work on your C64 and more technical
| details. http://wiki.icomp.de/wiki/VSP-Fix
|
| Also it makes mention of the C64-Reloaded which is a modern C64
| product that includes the fix.
| js2 wrote:
| > This also misses loop unrolling.
|
| It's mentioned at the bottom of the article in the "Thoughts"
| section:
|
| _You could certainly use self modifying code and unroll this
| copy routine to get better performance at the price of
| flexibility and arguably understanding for the average casual
| 6502 assembly coder. Again, this was not a "how fast can we
| absolutely make it" but an everyday use examination._
| hinkley wrote:
| Duff's device is fixed-size loop unrolling with an ugly
| hack to make it behave for arbitrary inputs. The assembly
| makes sense but the C code is rough.
|
| It's not quite as fast as self-modifying or custom compiled
| code, but it's pretty close.
| tialaramex wrote:
| Tom Duff's device was doing that because he was doing MMIO. You
| should not [I know you're not suggesting it, but just in case
| anybody reading thinks it's clever] do this today when you
| don't want MMIO: your compiler is very capable of just doing an
| actual copy quickly, so tell it that's what you want and don't
| write gymnastics like Duff's device.
|
| However, expressing these partially unrolled loops nicely is a
| nice performance-not-safety feature of WUFFS called "Iterate
| loops":
|
| https://github.com/google/wuffs/blob/main/doc/note/iterate-l...
|
| Well, I say performance not safety, as always they want both,
| but you _could_ safely just write the never unrolled case,
| while the existence of Iterate loops allows you to express a
| much faster special case but know the compiler will fix things
| up properly no matter what.
| vardump wrote:
| We're talking about C64 (and maybe Amiga).
|
| A compiler is not going to do anything at all for you on
| those retro platforms.
| tialaramex wrote:
| Aw, just needs a better compiler (with a 6502 target) :D
|
| Jason Turner's CppCon 2021 talk, "Your New Mental Model of
| constexpr" has half the presentation as a C64 program
| (though for practical reasons not actually running on a C64
| but instead an emulator) because _most_ of the heavy
| lifting is done by the C++ 20 compiler.
| https://youtu.be/MdrfPSUtMVM?t=1422
|
| Now, Jason's approach is not going to beat hand-crafted
| 6502 machine code _in a fair fight_ but he often doesn't
| need to fight fair and that's the point of his talk.
| localhost wrote:
| The C64 VIC-II chip would grab the address bus from the CPU
| every 8 scan lines on the screen. Some of the early "fast load"
| cartridges, like the Epyx FastLoad cartridge that accelerated
| loading games from the floppy drive, would blank the
| entire screen during loading so that their async data transfer
| routines wouldn't get interrupted by the VIC-II chip grabbing
| the bus. I wrote a similar (better?) cartridge where I would
| need to use the register on the VIC-II chip that reported the
| scan line as a sync marker to transfer 3 bytes asynchronously
| from the 1541 down the clock and data lines of the serial bus.
| Good times.
| MarkusWandel wrote:
| In my recollection Epyx Fastload did not blank the screen,
| though some earlier fast loaders did.
|
| I also remember the software voice synthesizer "SAM" needing
| to blank the screen to render glitch-free sampled audio. Then
| along came, what was it "Impossible Mission" ("Another
| visitor! Stay a while...") doing pretty clean sampled audio
| with the screen on. Not that the C64 SID chip was even
| remotely intended to be able to play sampled audio in the
| first place!
|
| The Amiga was unimaginably powerful by comparison. Even a
| basic configuration had 8x the memory a C64 had, and it had
| all those fancy DMA toys to offload the CPU.
| jimsmart wrote:
| > Not that the C64 SID chip was even remotely intended to
| be able to play sampled audio in the first place!
|
| I don't recall how SAM did it, but sample playing on the
| C64 SID chip was indeed a nice trick -- it was actually
| done by modulating the main output volume, which made a
| slight click when changing.
|
| Eventually this got used by some of the C64 musicians /
| music player libs, so one could play a channel of samples
| as well as the three regular synth channels on the SID.
| IIRC, Outrun used this particularly well in its title
| screen and/or loading music, having some vocal samples
| "O-O-Outrun!" (and skidding sound effects) as well a
| sampled drums.
|
| Annoyingly, IIRC, some revisions of the SID chip behaved
| slightly differently, and had louder or softer sample
| playback when this hack was used. But still: clever stuff.
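|
| A very rough sketch of the trick (labels and the sample address are
| made up; the careful timing and interrupt handling needed for a
| steady sample rate are omitted):
|
|     SIDVOL  = $D418         ; SID master volume / filter mode register
|     samples = $3000         ; assumed location of 4-bit sample data
|
|             LDY #0
|     play:   LDA samples,Y   ; one 4-bit sample (0-15) per byte
|             STA SIDVOL      ; modulating the volume produces the audio
|             LDX #20
|     pause:  DEX
|             BNE pause       ; crude delay setting the playback rate
|             INY
|             BNE play        ; play 256 samples, then fall through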
| weinzierl wrote:
| ... and the main output volume had only 16 levels, so the
| samples were quantized to 4 bits. It is a wonder that we
| could get understandable vocal samples with this hack at
| all. I distinctly remember "Goal!" from Peter Shilton's
| Football and "Accolade presents" from Test Drive [1]. In
| the examples one can hear the amount of quantization
| noise that the low bit depth caused.
|
| [1] https://m.youtube.com/watch?v=L1u-WydiiCI
| vardump wrote:
| You can actually get about 6-7 bits of resolution out of
| the same SID volume register: 4 bits from the volume
| register, the channel 3 disable bit and the 3 filter bits.
| It requires some setup to get the SID into a particular
| state first.
|
| For details, see:
| https://livet.se/mahoney/c64-files/Musik_RunStop_Technical_D...
| le-mark wrote:
| Crypto mining with FPGAs went deep into the weeds down this path,
| as did 100 Gbps signal processing. Two examples where low-level
| stuff is still relevant, just not commodity and widely available
| like the 8-bit micros were.
| aerique wrote:
| I also read a high frequency trading blog that was posted
| here a few years ago. Same thing: hacking hardware and
| software so the first bytes of info could be grabbed from a
| stream and acted upon, instead of needing to wait for the
| whole packet to have come in.
|
| Also when I was in the demo scene on the Atari ST one had to
| do specific timings in the assembly code to be able to draw
| outside the screen's borders (so the borders were on screen
| but couldn't be drawn on by code).
| Psyladine wrote:
| >I kinda miss those days, and I kinda don't. I guess it was
| good to have experienced them.
|
| There's a certain reflective quality and even satisfaction from
| using a chainsaw after coming up using a tree saw by hand. It
| feels progressive, even if it is just optimizing for time at
| the expense of energy.
| 6510 wrote:
| I eventually managed to do scroll texts fast enough to render
| them while the pixels of the scroll text were being drawn to the
| screen. It was even fast enough to have character combinations in
| the scroll text that modified its speed and direction, with
| speeds like "scroll 1 pixel one time, then 2 the next". One of the
| tricks was to get away with "poor" NOP-based timing by requiring
| an empty row of bits between each scroll text. (The text becomes
| unreadable anyway if there is no space between the lines.)
| vidarh wrote:
| Most C64 games used character-based graphics (coupled with the
| smooth scrolling support in the VIC) which meant you'd at most
| move 2000 bytes to scroll the entire screen every 4 to 8 pixels
| scrolled.
|
| You can easily scroll the entire screen on a C64 if that's all
| you're doing.
|
| Some games did also scroll bitmaps. There the naive version
| requires moving 9000 bytes (40x25x8 for the bitmap data, 40x25
| for the colour data) every time you need to scroll, and that
| indeed starts to bite. There are games which reduce this cost
| using a trick called AGSP ("Any Given Screen Position").
|
| But you're right that static parts of the screen were often made
| larger to reduce the dynamic part. That was rarely down to just
| scrolling the screen in isolation, though, but because the
| overall budget of cycles you had to work with was tiny. Often
| you might also have a lot of other stuff which consumed lots of
| cycles _affected directly by the size of the playing field_.
| E.g. if you did sprite multiplexing (moving a sprite after it
| had been partially or fully rendered to reuse the same hardware
| sprite), you might well be keeping the CPU busy throughout the
| full rendering of the playing field.
|
| There was also the consideration of how much effort you wanted
| to go to in order to avoid glitches, since unless you could do
| the scrolling entirely while the VIC was rendering the parts of
| the screen outside the playing field, you'd need to make sure
| the rendering and copying didn't overlap, and of course just
| restricting playing field size was an easy workaround for that
| problem.
| MarkusWandel wrote:
| I didn't get that fancy. I got as far as a horizontal smooth
| scroller, but with the "move the screen memory during one
| 1/60 second redraw cycle" mentality - racing the redraw, and
| when it was just about caught up, whoa, time for the static
| bar at the bottom.
|
| Quite right, one could prepare the moved version in the
| background during the 7 steps where you're merely diddling
| the smooth scroll register, and then flip to it in an
| instant. But wait, was it possible to page flip the colour
| map? Also, always having the appropriate moved version ready
| even as the player is doing unpredictable things goes into
| the "heroic programming techniques" zone again.
|
| As for glitches, it's amazing what can be done if perfection
| is sacrificed and there were plenty of good games that did
| have them, e.g. sprite multiplexing. But I did mean
| "effortless looking perfect" smooth scrolling.
| jimsmart wrote:
| Ex-C64 games coder here: you are correct - no, you
| couldn't relocate/page-flip the colour map, like you could
| the character map. So you had to update it all somehow on
| the required frame.
|
| The fastest technique I saw for updating the colour map in
| a single go, was to have the whole thing as a huge block of
| immediate mode load-stores, then one could 'scroll' the
| data across the LDA instructions within that code, in
| advance, over n-frames, then call this self-modified code
| block when one did the character screen flip (immediate
| load-stores was faster than load-stores from colour ram).
| e.g.
|
|     scroll_splat_colour:
|         LDA #$00    # colour data for char
|         STA $D800   # colour ram
|         LDA #$00
|         STA $D801
|         # etc., for every visible char onscreen in scrolling area
|
| And one would be updating/scrolling those values loaded
| into the A register, in chunks over previous frames,
| similar to:
|
|     LDA scroll_splat_colour+6
|     STA scroll_splat_colour+1
|     LDA scroll_splat_colour+11
|     STA scroll_splat_colour+6
|     # etc., for every lda/sta in the above
|
| Perhaps not the clearest explanation, but hopefully enough
| to communicate the idea.
|
| FWIW, I didn't invent that technique, it was an improvement
| Jon Williams made to my code, whilst we both worked for
| Images Software (now Climax). Not sure where he got it
| from, maybe he invented it himself, maybe he cribbed it
| from elsewhere.
|
| Related: I thought sprite multiplexing was awesome, and
| there were quite a few tricks there too to get it
| performant. But that's another far more complex topic.
| vidarh wrote:
| Another "obvious" trick is to narrow the playing field
| but animate the rest, and then save cycles by a
| combination of bands that require fewer updates, and
| sprites. E.g. Pole Position is a classic example, where
| the graphics covers most of the screen, but only about
| half actually has gameplay. The rest consists of a very
| narrow band of mountains, and a couple of bands of
| clouds. I haven't looked at what they did for Pole
| Position, but that pattern of the actual gameplay being
| constrained to a much smaller portion of the screen than
| what looks like the playing area is pretty common.
| vardump wrote:
| How do you handle the case when the player changes direction
| to the exact opposite immediately after the frame's color data
| was transferred? Double-buffer the splatting code? Although
| one copy of it is 5001 bytes, ouch.
| jimsmart wrote:
| You generally don't handle that case! :) -- Instead you
| let the player move within a rectangular area onscreen,
| and decide on which way you are going to scroll the
| screen in advance (or rather: after the fact, depending
| on how one looks at things), based upon where the player
| is inside that rectangle. So the screen catches up with
| where the player is moving/pushing.
|
| Eight-way scrolling like this was always a massive pain
| on the C64 (and other systems that used buffered
| scrolling with no h/w, e.g. Atari ST), but that way (a
| box the player moved around inside) was the only
| realistic way of handling it if you had to do a bunch of
| work in advance before doing the actual scrolling. Turns
| out that having the player in a loose rectangle is also
| easier on the eye too, which is perhaps why it's also
| used on systems that don't suffer the same h/w
| restrictions.
|
| Yeah, the colour RAM update was a lot of bytes to move.
| But dedicating a big chunk of code to it meant one could
| be a little freer to use slightly slower techniques
| elsewhere in the update cycle. Side note: the C64
| actually only had 39 visible chars across the screen when
| in 'scrolling' mode, because the borders were shrunk in
| slightly (and slightly more than one expects). So one
| less char to worry about per line. That saved a tiny
| amount of code / memory / execution-time for the colour
| splat (and the scrolling of partial chunks - whether on
| back buffers, or the data within the colour splat code -
| over the other frames). Sure, it's only one less
| character. But it saved some cycles. And cycles mattered!
| Particularly when doing something with that much data to
| move / that took that much time.
| vardump wrote:
| > C64 actually only had 39 visible char across the screen
|
| But 40 color cells were still visible, unless horizontal
| scroll register was 0.
| jimsmart wrote:
| No, but that's an easy enough mistake to make :) -- It's
| called 38-column mode, and when enabled the VIC shrinks
| both borders in by 8 pixels, and then offsets the screen
| according to the x-scroll register bits.
|
| [Edit: another source says it's actually 7-pixels hidden
| on the left, and 9 on the right. But whatever: same
| principle, the screen is shrunk by 16 pixels in total
| horizontally]
|
| Which meant that at most only 39 characters were visible
| across the screen -- with two of those, one at each end
| of the row, being partially visible -- and that applies
| to both the character screen and its associated colour
| RAM. Only 38 characters were visible when the x scroll
| register was zero, and as soon as one shifted to a value
| of 1-7, the 39th column became visible (and the 1st one
| became partially offscreen). But the 40th column is never
| visible when in that mode.
|
| For more info see:
|
| http://www.devili.iki.fi/Computers/Commodore/C64/Programmers...
|
| "When scrolling in the X direction, it is necessary to
| place the VIC-II chip into 38 column mode. This gives new
| data a place to scroll from. When scrolling LEFT, the new
| data should be placed on the right. When scrolling RIGHT
| the new data should be placed on the left. Please note
| that there are still 40 columns to screen memory, but
| only 38 are visible."
|
| -- But it's discussed on a handful of other pages too, if
| you google.
| vardump wrote:
| Oh damn... and I did a fair bit of coding on C64 back in
| the day. :-D
|
| Somehow I thought it hid 4 pixels both sides. Totally
| wrong.
|
| PS. Then it's so unfair that a bad line still takes 40 cycles!
| jimsmart wrote:
| Please stop giving the above comment downvotes because of
| this person's lack of knowledge: we all have to learn
| things -- there was once a time I didn't know this
| either.
|
| It's not like vardump here was being a dick about
| anything in their comment, cut them some slack!
| vardump wrote:
| Thanks. Although I really should have known better, wrote
| scrolling routines 35 years ago.
|
| It's scary how time can corrupt memories we consider to be
| facts.
| Luc wrote:
| I made an 8-way full-screen scroller.
|
| To avoid situations like this the player sprite at the
| center of the screen had momentum, i.e. the sprite had to
| rotate 180 degrees to change to the opposite direction,
| giving a few frames time to set everything up.
| jimsmart wrote:
| That's a nice little trick, cheers for sharing. (Not that
| I'll get a chance to use it these days, but still)
| 6510 wrote:
| > # etc., for every visible char onscreen in scrolling
| area
|
| For every changed char. (which is sometimes more and
| sometimes less)
|
| You could do them in order but if you're using only a few
| characters you need only 1 LDA for each char. (How to do
| this is left as a creative exercise for the reader)
| jimsmart wrote:
| But the overheads of tracking which characters might have
| been changed here completely outweighed simply scrolling
| / updating the whole thing. The code becomes too involved
| in tracking changes, and fudging about with- / rewriting-
| the splat code.
|
| You can leave it as 'a creative exercise for the reader',
| but that's because you can't solve this for the generic
| case (i.e. any map the graphics artists might give you)
| in less cycles than simply dealing with each and every
| character, which is the worst case.
|
| Processing that many bytes, and doing comparisons and
| extra branches, simply becomes overheads, and, very
| quickly, your code is slower than simply updating /
| scrolling everything simply.
|
| For the colour splat routine, having a giant, pre-
| assembled block of immediate-mode load-store pairs for
| every character is as optimal as it gets -- and handles
| all cases -- on the C64, you only have a frame to update
| the colour RAM (because it cannot be relocated/paged),
| and you are generally chasing the scan beam to move that
| much data before the next frame.
|
| You don't have the luxury of having extra cycles to re-
| write that block of code at runtime, and rewrite the code
| that scrolls the data within that code, nor do you have
| the luxury of having enough spare cycles to be comparing
| data, and branching conditionally depending on if it has
| changed or not.
|
| Perhaps you misunderstand the technique I describe, or
| perhaps you under-estimate the overheads required to
| perform what you describe. Or perhaps both.
| egypturnash wrote:
| That's... that's horrible. Beautiful, but horrible.
|
| Which kinda describes any advanced c64 technique, really.
| jimsmart wrote:
| Indeed, I totally agree on all points :)
| justinlloyd wrote:
| Yeah, compiled graphics and compiled colour tables, also,
| a routine that could self-modify code in regions of RAM
| to do the colour table writes. A slow set-up function at
| level start would build the code to be JSR'd later in the
| level. We did that on a few games on the C64 and the
| Speccy and Beeb and Atari. Later used the same techniques
| in DOS on PC. And of course, doing the same tricks but
| with D0 through D7 and A0 through A6 on Atari ST and
| Amiga. Also doing "stuff" in zero page because the
| address loads were shorter. And avoiding 256-byte page
| boundaries where possible because of the cycle penalty.
| jimsmart wrote:
| > A slow set-up function at level start would build the
| code
|
| Interesting, and good thinking :)
|
| IIRC, when we used this technique on the C64, we didn't
| build the code during init at runtime, we actually built
| the code in the dev environment, using macros, so it got
| built at assembly/compile time. So we skipped the small
| time hit at runtime init, at the expense of a slightly
| longer load time for the user (and a tiny bit longer on
| our assembly/compile times, although that was fairly
| negligible cos we were building on PCs).
| jimsmart wrote:
| Ex C64-games coder here! -- If your sprite multiplexer was
| taking most of CPU during the screen draw time, then honestly
| it was not a particularly great multiplexer! ;)
|
| Most decent multiplexers took just a scanline or two/three,
| multiple times down the screen (i.e. whenever relocating any
| already drawn sprites) -- often with decent sized gaps (time
| when the CPU wasn't involved in manipulating sprites and
| could do other things), with a larger chunk during the
| offscreen period / at the bottom of the screen, when one was
| preping the data (mostly sorting the sprite's y-coords) for
| the next frame's screen draw.
|
| -- During debugging/etc, we'd often enable colour changes to
| the screen border, at the beginning and end of the
| multiplexer code (for both the interrupt stuff in the
| playfield, and the non-playfield section), so we could
| visually see how it was working/performing.
| vidarh wrote:
| Sure, the "nice" way of doing it is to rely on the raster
| interrupt. But I've also seen way too much C64 code where
| pretty much everything ran in the interrupt handler, with
| associated stupid busy waiting because it saved people from
| having to synchronise. I'd guess more commonly for cheap
| and cheerful ports from less capable machines, but it's
| been a couple of decades since I've actually looked at any
| of this code.
| cesaref wrote:
| 64 coder here too!
|
| The border changing thing has just reminded me how bad the
| development process was using the Commodore assembler with
| a 1541 drive, which was horribly slow: assemble, dump image,
| reboot, crash, reboot, load assembler, try and work out
| what had happened :)
|
| At some point I ended up with a PC running a system called,
| I think PDS, which was a cross assembler with dongle to
| push the image straight into the memory of the C64. I even
| think you could inspect and change memory on the running
| machine - it was amazing!
| jimsmart wrote:
| Yeah, we all used PDS too, although not originally.
| Pretty good system, particularly for that era, and
| cost/capability-wise (though they weren't that cheap, and
| folk eventually started cloning the boards for them,
| IIRC).
|
| I remember it was annoying to have only 8 main source
| files in PDS though, most big projects went past the 8
| files of however many kb (although it could also handle
| include files, which was how one got around that limit).
|
| Although when I actually started out as a C64 games dev,
| my dev system was a BBC Micro B, linked to a C64. Not
| quite as cool as PDS, but it could assemble code at 2x the
| speed of the C64 (the processor clocked twice the speed
| on the Beeb), and it was great having a separate 'host'
| system for development.
| jimsmart wrote:
| Here's a link to info about the PDS kit, in case anyone
| is interested:
|
| https://www.cpcwiki.eu/index.php/PDS_development_system
| mgkimsal wrote:
| Just watched a video of C64 "Seven Cities of Gold" with a
| colleague yesterday, trying to convey just how... exciting
| that was in 1984. Watching on YouTube, I had forgotten just
| how small the playing 'viewport' was. It seems like possibly
| a more extreme example - I don't remember too many other
| games having an action viewport that small.
| kken wrote:
| This doesn't even cover all the neat assembly tricks with self-
| modifying code that you would actually use on a 6502 to speed up
| memory transfer.
| MatthiasWandel wrote:
| For games that scrolled the screen, those had to happen
| essentially between scans, so a lot of tricks were employed.
| Fixed addresses in the code, unrolled loops, and self modifying
| code to avoid the expensive zero page indirect indexed
| addressing mode (one of the slowest on the CPU). The other
| trick was to start moving the first line of screen just after
| it got displayed, which would give you nearly two jiffies to do
| it before the scan caught up to you on the next frame.
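|
| A small sketch of the self-modifying variant (labels are made up,
| and the routine has to live in RAM so its own operand bytes can be
| patched):
|
|     pages = $02             ; zero-page counter: 256-byte pages to copy
|
|     copy:   LDY #0
|     rd:     LDA $C000,Y     ; 4 cycles - operand high byte is at rd+2
|     wr:     STA $E000,Y     ; 5 cycles - operand high byte is at wr+2
|             INY
|             BNE rd          ; ~14 cycles/byte vs ~16 for LDA/STA (zp),Y
|             INC rd+2        ; self-modify: step the source to the next page
|             INC wr+2        ; self-modify: step the destination too
|             DEC pages
|             BNE copy
|
| Unrolling the rd/wr pair several times inside the inner loop is what
| pushes this well under 10 cycles per byte.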
| vardump wrote:
| No need, it can easily happen during the scan. As long as the
| scan and update memory location never meet, there's
| absolutely no problem.
| natly wrote:
| It's crazy how much work went into those old games. I have a
| feeling those programmers weren't even paid that well
| considering how few people owned computers back then (so the
| market can't have been large).
| wkearney99 wrote:
| If you ever play(ed) the Atari 2600 version of River Raid,
| you got to witness some SERIOUS tweaking to work around the
| limits of that console. Every scanline processed on the fly,
| with the game logic squeezed into the vertical blanking
| interval. No screen buffer.
| The animation was soooo smooth.
| kabdib wrote:
| My first job out of college at Atari in 1982, writing game
| cartridges for the 400/800 computers, paid $25K a year. My
| first raise after a year was to $30K.
|
| There were programmers in other divisions making royalties
| off of their games. Tod Frye famously got $700K or so for
| his terrible version of 2600 Pac-Man (it was terrible not
| because he was a bad programmer, but because marketing
| decided that 2K of ROM had to be enough, and he was smart
| enough to pull off a miracle . . . of sorts).
|
| Also, the OP apparently doesn't know how to unroll loops,
| which is the first thing you do to your game's hot spots.
| (Never had to resort to self-modifying code).
| vikingerik wrote:
| I did this in a homebrew Atari 2600 game. For a Space Invaders
| grid of sprites. Each is triggered by writing to a register, as
| the electron beam scans through to display each sprite.
|
| The interval between sprites on the same scanline is 3 cpu
| cycles. That's a single 6502 instruction, the write to that
| register. How do you do any kind of load or compare instruction
| along with that to decide whether to display that sprite?
|
| The answer was to copy that stream of instructions to RAM ahead
| of time, and replace each write to a missing invader with a no-
| op. The code is here if anyone wants to see (the "inv3" demo):
| http://dos486.com/atari/
| krallja wrote:
| > copy that stream of instructions to RAM ahead of time
|
| Even this is easier said than done: there are only 128 bytes
| of RAM in the entire machine, and that has to suffice for
| global variables and stack memory in addition to storing
| modified code like this!
| rasz wrote:
| Afaik it's <120KB/s with all the tricks. The 6502 was hand-designed
| and brain-optimized for clever use of available silicon real
| estate; roughly 20% of CPU bus cycles are dead/bogus/useless.
| RTS wastes 3 of its 6 cycles, RTI wastes 2 of 6, JSR wastes 1 of 6,
| all increments waste at least 1 cycle, etc. Sad to think the
| state machine handling DMA transfers in the REU is probably
| less than 50 macrocells, and Commodore ran its own fab; they
| could have built REU-style DMA into the C128 and it would have
| cost cents.
| mywittyname wrote:
| Is there a way to make a compatible 6502 variant that doesn't
| have this waste?
| krallja wrote:
| "The 100 MHz 6502" does a different clever thing - it
| copies all the dedicated RAM and ROM into its own FPGA
| copy. Then it can perform 7 to 25 instructions before the
| next external read/write cycle!
|
| http://www.e-basteln.de/computing/65f02/65f02/
| rasz wrote:
| https://en.wikipedia.org/wiki/CSG_65CE02#Pipeline_improvemen...
| fixed the most painful ones, but afaik not all dead
| cycles. But it was 1988 and Commodore didn't bother putting
| it into anything other than some IO card for the Amiga, not
| to mention it still did nothing to cover the slowness of moving
| data around. The Japanese decided to do something about it for
| the TurboGrafx-16 in 1987 with the HuC6280:
| http://shu.emuunlim.com/download/pcedocs/pce_cpu.html
|
| Transfer Alternate Increment (TAI), Transfer Increment
| Alternate (TIA), Transfer Decrement Decrement (TDD),
| Transfer Increment Increment (TII) - pretty much x86 'rep
| movsb', except not great at 6 cycles per byte (~160KB/s).
| For contrast 5 years older 80286 already did 'rep movsw' at
| 2 cycles per byte. 6 years later Pentium did 'rep movsd' at
| 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb'
| full cachelines at a time at full cache/memory controller
| speed.
| JPLeRouzic wrote:
| I think there are tricks to rewrite the microcode on the
| Pentium; do similar tricks exist for the 80286, 386 or 68K?
|
| It would be fun to reconfigure one as a high speed 6502.
| cmrdporcupine wrote:
| The 65816's MVP/MVN opcodes can do bulk transfers a teeny bit
| faster.
| lscharen wrote:
| For more 16-bit 65816 context -- other than for space-savings,
| these instructions are never used when performance is needed
| due to the low effective throughput of 7 cycles per byte. A
| basic unrolled loop using 16-bit instructions is 20 - 30%
| faster and specialized graphics routines that are able to use
| the stack can approach 3 cycles per byte using the PEA and PEI
| instructions.
| cmrdporcupine wrote:
| I'll defer to you I guess, as you seem to know more about
| this than me. The only thing is searching through the
| 6502.org forums I don't see a consensus on this? Plenty of
| people talking about the advantages of MVN/MVP for bulk
| transfers. I seem to recall doing the cycle counting myself
| at one point, too, and finding it advantageous.
|
| One neat trick (I remember reading about from Alan Cox I
| believe) if you have control over the hardware is to memory
| map I/O devices like serial input / output such that
| incrementing addresses starting at a given address all point
| to the same physical device/register. E.g. allocate 256
| contiguous bytes in your memory map to point to the same
| thing. This way you can do bulk I/O transfers to/from memory
| using MVP/MVN instead of "get a byte, put a byte" instruction
| by instruction.
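|
| A hedged 65816 sketch of that layout (the device, addresses and bank
| values are all made up, and the exact MVN operand syntax varies by
| assembler; X/Y hold the source/destination addresses and the 16-bit
| accumulator holds the byte count minus 1):
|
|     ; Read a 256-byte sector from a device whose data register is
|     ; mirrored across $DF00-$DFFF, straight into a RAM buffer at $2000.
|             REP #$30        ; native mode with 16-bit A/X/Y assumed
|             LDX #$DF00      ; source: the mirrored I/O window
|             LDY #$2000      ; destination buffer
|             LDA #$00FF      ; move 256 bytes (the count minus 1)
|             MVN $00, $00    ; source bank 0 -> destination bank 0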
| rasz wrote:
| The trick you describe was used by the Silicon Valley
| Computer ADP50L IDE controller from the early nineties (1991).
| Memory-mapped I/O instead of traditional x86 port access
| lets you skip the manual port-read loop in favour of
| 'rep movsb'; the result can be a 50% speed bump.
|
| https://forum.vcfed.org/index.php?threads/performance-of-
| lo-...
|
| Port IO Read Speed : 219.39 KB/s
|
| MMIO Read Speed : 310.77 KB/s
|
| Some variants of XTIDE hardware also implement this, as
| does the free bios.
| ksherlock wrote:
| MVP/MVN are 7 cycles per byte.
|
| If you're moving memory around in bank 0 (or have memory
| mapping), you can use the direct page register to read from
| anywhere in bank 0 and the stack pointer to write to
| anywhere in bank 0.
|
| 16-bit LDA dp, PHA is 4 + 4 = 8 cycles or 4 cycles per
| byte. Best case would be if you know it's constant data
| before hand, eg, LDA #0, PHA, PHA .... 2 cycles per byte!
|
| For general purpose copying MVP and MVN are easier and have
| better code density.
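|
| A minimal sketch of the constant-fill case (assuming native mode, a
| destination in bank 0, and made-up addresses; interrupts have to be
| off while the stack pointer is parked on the destination):
|
|     savesp = $04            ; zero-page scratch for the real stack pointer
|
|             SEI
|             REP #$30        ; 16-bit accumulator and index registers
|             TSC
|             STA savesp      ; save the real stack pointer
|             LDA #$7FFF      ; top of the area to fill (pushes go downward)
|             TCS
|             LDA #$0000      ; fill pattern
|             PHA             ; 4 cycles per 16-bit push = 2 cycles/byte
|             PHA
|             PHA
|             PHA             ; ...unrolled as far as the fill requires
|             LDA savesp
|             TCS             ; restore the stack
|             CLI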
| mmphosis wrote:
| _2 cycles per byte!_ It takes 4 cycles for PHA to push
| the 16-bit accumulator, two bytes, onto the stack. There's
| also 16-bit PHD, PHX and PHY.
| cmrdporcupine wrote:
| Ah here it is:
| http://forum.6502.org/viewtopic.php?f=2&t=5035 referencing
| a now-lost G+ post from Alan Cox:
|
| _" The emulator also has a fun hack for disk performance
| I'm hoping will get replicated in some of the upcoming
| retro 65C816 board design. Like the 6502 the 65C816 sucks
| at continually reading from an MMIO port and writing it to
| sequential memory locations. It sucks less than a 6502
| because you've got 16bit index registers, but at the same
| clock it was doing about 100K/second that a Z80 can do 250K
| (with ini loops). The revised emulated disk interface has
| the same mmio port replicated across a chunk of address
| space and this allows a block move instruction (MVN) to do
| all the work at 6 clocks/byte. At that point the 65C816
| suddenly jumps to twice as fast as the Z80 on disk I/O."_
| [deleted]
| joosters wrote:
| On the original ZX Spectrum, you could measure the write
| bandwidth visually, because on startup it would write the value 2
| into each byte in memory (which included the graphics RAM). It
| would then re-read and decrease the value of each byte twice, to
| check for any faulty memory.
|
| You could see these patterns on-screen as the reads and writes
| took place (I think it took about a couple of seconds to do this
| to 48k of RAM)
| becurious wrote:
| You could change the stack pointer to the top of the area of
| memory you wanted to fill and then use PUSH to fill at I think
| 11 clock cycles per two bytes. It was faster than unrolled LDI
| or LD (HL),A followed by INC HL. It would be filling memory in
| the wrong direction for a Rainbow processor but you could use
| it for repeating patterns. I think I did a checkerboard pattern
| that would shift every frame and it was pretty smooth.
___________________________________________________________________
(page generated 2022-06-16 23:01 UTC) |