example.mov
This project uses the ACME assembler for no particular reason other than that it was the least complicated and easiest assembler to find. I used ACME 0.97.
For testing I found it best to use VICE. I had a few problems launching .PRG files with the default configuration, so I created my own launch script which enables autostart PRG mode.
Note: the script assumes that ROM files (KERNAL, BASIC, character set) are present in the ROM/ directory. Make sure to place them there. They can be found in the release download of VICE.
cd src
make run
The screen should turn black and green dots should appear.
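Roughly, what the launch script boils down to is something like the following Python sketch; the exact VICE command-line options and ROM file names may differ depending on your VICE version, so treat the details as assumptions:

```python
#!/usr/bin/env python3
# Sketch of what the launch script does: start x64sc with autostart-PRG mode
# enabled and point it at the ROM images in ROM/. Option names follow the VICE
# manual; double-check them against your VICE version.
import os
import subprocess

emulator = os.environ.get("X64SC", "x64sc")   # honour the X64SC override
subprocess.run([
    emulator,
    "-autostartprgmode", "1",      # inject the PRG directly instead of building a disk image
    "-kernal",  "ROM/kernal",      # ROM images copied from the VICE release download
    "-basic",   "ROM/basic",
    "-chargen", "ROM/chargen",
    "-autostart", "myprg.prg",     # the assembled program from src/
], check=True)
```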
Note that normally you would need to start the program yourself using a SYS 2049 statement (or whatever the initial address is) to jump into the code. We add a header containing a bit of BASIC code that does this for us.
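For illustration, such a BASIC stub typically looks like this (a sketch assuming the machine code starts directly after it at $080D, i.e. SYS 2061; the actual address used in this project may differ):

```python
# The PRG starts with the load address, followed by the one-line BASIC program
# "10 SYS 2061" so that RUN jumps straight into the machine code behind it.
stub = bytes([
    0x01, 0x08,                    # PRG load address $0801
    0x0B, 0x08,                    # pointer to the next BASIC line ($080B)
    0x0A, 0x00,                    # line number 10
    0x9E, 0x32, 0x30, 0x36, 0x31,  # SYS token ($9E) followed by "2061"
    0x00,                          # end of line
    0x00, 0x00,                    # end of BASIC program; code starts at $080D
])
```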
If you want to run a specific version of VICE, set the X64SC environment variable.
I used an SD2IEC adapter and copied the src/myprg.prg file to the root of the SD card. Then it is just a matter of loading the program:
LOAD "myprg.prg",8
RUN
I had problems where the SD2IEC adapter didn't even load the directory listing (LOAD "$",8) and simply flashed the error LED. To diagnose this I ran the following code:
10 OPEN 15,8,15: INPUT #15,A$,B$,C$,D$
20 CLOSE 15
30 PRINT A$,B$,C$,D$
RUN
The message I got was "74,Drive not ready,12,0", which was caused by an intermittent fault of the SD card socket (card detect or write protect was not triggered correctly). Re-inserting the SD card a few times solved the issue.
There are four keys you can press:
- R to clear the screen and reset the X/Y/Z values
- N to play a melodic sound based on the current trajectory
- M to play a bass sound based on the current trajectory
- F to toggle fast multiplication
FMULT from the BASIC ROM is not the fastest multiplication routine there could be, and it is by far the worst offender in terms of runtime. According to VICE profiling, around 55% of the total runtime is spent in FMULT alone, and that is after we already used tricks like shifting the exponent directly to multiply by dt:
(C:$b9b2) prof graph depth 6
Total % Self %
------------- ------ ------------- ------
[1] START 36.745.342 100,0% 0 0,0%
[2] RST -> fce2 36.363.781 99,0% 0 0,0%
[3] e39a -> e422 36.363.781 99,0% 0 0,0%
[4] a7e7 -> a7ed 36.363.781 99,0% 0 0,0%
[5] a7e7 -> a7ed 36.363.781 99,0% 71.098 0,2%
[8] 0c79 -> .xyz_step 35.824.933 97,5% 1.275.328 3,5%
[83] 0e52 -> .FMULT 4.680.938 12,7% 291.002 0,8%
[9] 0dbf -> .FMULT 4.210.284 11,5% 299.894 0,8%
[95] 0e34 -> .FMULT 4.155.646 11,3% 295.502 0,8%
[122] 0de4 -> .FMULT 4.146.617 11,3% 294.084 0,8%
[38] 0d68 -> .FMULT 4.042.304 11,0% 293.582 0,8%
[111] 0e05 -> .FACINX 917.413 2,5% 33.696 0,1%
[68] 0d14 -> .FACINX 867.592 2,4% 33.678 0,1%
[27] 0d89 -> .FACINX 855.761 2,3% 33.678 0,1%
[... skipped the rest of the context dump ...]
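As a side note, the "shifting the exponent" trick mentioned above works because a packed CBM float keeps its biased exponent in the first byte, so multiplying by a power of two only needs to touch that byte. A minimal Python sketch of the idea (ignoring exponent overflow/underflow; the exact dt value here is just an example):

```python
def mul_pow2(cbm5, k):
    """Multiply a packed 5-byte CBM float by 2**k by adjusting only the exponent byte."""
    exp = cbm5[0]
    if exp == 0:
        return cbm5                       # an exponent byte of 0 encodes the value 0
    return bytes([exp + k]) + cbm5[1:]

ten = bytes([0x84, 0x20, 0x00, 0x00, 0x00])   # 10.0 in CBM format
print(mul_pow2(ten, -6).hex())                # 10/64 = 0.15625 -> exponent drops to $7e
```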
Ever since I stumbled upon this writeup about multiplying floats by adding them I wanted to try the same for CBM floats, and after lots of simulation in Python I finally reached an implementation that can draw the Lorenz attractor. The Python implementation that simulates the CBM floats is not 100% accurate, as it doesn't simulate the presence of the always-set MSB in the mantissa used for normalization (bit 31). But since that bit is always set in RAM, the simulated mantissa should be equivalent. At least the algorithm in assembler doesn't diverge much from the Python implementation.
Drawing f(x) = 2/3 x + 0.5, fast mult implementation vs. FMULT:
The implementation in the original writeup uses a bias to correct the mantissa for the approximation error on average. I decided to go with the naïve implementation at a higher error (I think around 10%). If you want, you can of course add the bias correction.
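For reference, here is a minimal Python sketch of the naïve variant, without the bias and ignoring sign and zero handling; the real routine of course works on the CBM byte representation in assembler:

```python
import math

def split(x):
    """Decompose a positive float into (excess-$81 exponent, 31-bit mantissa fraction),
    i.e. the packed CBM layout with the implied leading mantissa bit dropped."""
    e = math.floor(math.log2(x)) + 0x81
    f = int((x / 2.0 ** (e - 0x81) - 1.0) * 2**31)
    return e, f

def join(e, f):
    return (1.0 + f / 2**31) * 2.0 ** (e - 0x81)

def fast_mult(a, b):
    """Approximate a*b by adding the two representations and removing the bias:
    the bit pattern is roughly proportional to log2(x), so adding patterns ~ multiplying."""
    ea, fa = split(a)
    eb, fb = split(b)
    f = fa + fb
    e = ea + eb - 0x81 + (f >> 31)        # a mantissa carry bumps the exponent
    return join(e, f & (2**31 - 1))

print(fast_mult(2.0 / 3.0, 1.5), 2.0 / 3.0 * 1.5)   # ~0.917 vs. the exact 1.0
```

In the worst case (both mantissas around 0.5) the relative error of this naïve variant is roughly 11%, in line with the ~10% mentioned above.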
Now the profile looks like this:
(C:$b9b2) prof graph depth 6
Total % Self %
------------- ------ ------------- ------
[1] START 51.054.907 100,0% 0 0,0%
[2] RST -> fce2 50.524.736 99,0% 0 0,0%
[3] e39a -> e422 50.524.736 99,0% 0 0,0%
[4] a7e7 -> a7ed 50.524.736 99,0% 0 0,0%
[5] a7e7 -> a7ed 50.524.736 99,0% 211.356 0,4%
[6] 0c79 -> .xyz_step 48.922.880 95,8% 3.734.688 7,3%
[26] 0e05 -> .FACINX 2.916.418 5,7% 100.104 0,2%
[21] 0e1d -> .FADD 2.559.430 5,0% 935.282 1,8%
[12] 0e42 -> .FADD 2.391.310 4,7% 796.493 1,6%
[8] 0e5c -> .QINT 1.626.714 3,2% 200.196 0,4%
[31] 0df8 -> .FSUBT 1.612.184 3,2% 1.137.599 2,2%
[10] 0e59 -> .fast_mult 1.126.188 2,2% 686.869 1,3%
[16] 0e3b -> .fast_mult 1.124.436 2,2% 685.117 1,3%
[18] 0e24 -> .MOVMF 617.207 1,2% 450.441 0,9%
[... skipped the rest of the context dump ...]
FMULT - or rather fast_mult - is not even in the top three now. We effectively removed the main bottleneck and are still drawing the equation rather nicely considering the gain.
System with fast multiplication, errors are visible but OK:
comparison.mp4
What about the future? What are the main offenders now?
(C:$0cc2) prof flat 10
Total % Self %
------------- ------ ------------- ------
2.426.112 18,0% 2.426.112 18,0% b999
1.511.033 11,2% 1.511.033 11,2% .CONUPK
1.479.900 11,0% 1.479.900 11,0% b9b0
3.005.589 22,2% 1.194.185 8,8% .FADD
1.786.270 13,2% 1.088.937 8,1% .fast_mult
12.947.888 95,8% 987.944 7,3% .xyz_step
1.072.242 7,9% 702.461 5,2% .FSUB
647.328 4,8% 647.328 4,8% .MOVFA
617.890 4,6% 617.890 4,6% .MOVFM
842.430 6,2% 585.855 4,3% .FSUBT
$b999 is, according to this helpful page, a subroutine for float to integer conversion. .CONUPK is used in the fast mult for convenience, to unpack the CBM float from RAM into FAC1. Both could probably be optimized for this use case, but I think the error is already large enough that I don't want to risk adding more.
This project used several resources for which I am really grateful:
- 6502.org Instruction reference
- 6502.org Addressing reference
- 6502.org Opcodes reference
- c64-wiki BASIC floating point ops
- c64-wiki Zeropage layout
- codebase64 floating point ops
- codebase64 screen mode doc
- Christian Bauer's VIC-II screen doc (it seems to be the most correct one I found)
- codebase64 micro tracker example
- cbm64 memory assignment map
There are three Python scripts that were helpful in the development of the final assembler code:
- attraktor.py - a reference implementation that was subsequently optimized down to bit operations
- encode_xy.py - a way of testing & reverse engineering the bit addressing in the 320x200 (but actually 40x25 byte) screen of the C64
- cbm_to_float.py - a script for converting C64 floating point numbers (5 bytes) to Python floats for easy debugging with the VICE monitor (a sketch of the conversion follows below)
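To give an idea of the format cbm_to_float.py deals with: a packed CBM float is five bytes, an exponent byte (zero encodes the value 0) followed by four mantissa bytes, where the top bit of the first mantissa byte doubles as the sign and the mantissa's leading 1 is implied. A minimal sketch of the conversion (not the script itself):

```python
def cbm_to_float(b):
    """Convert a packed 5-byte CBM float (exponent byte + 4 mantissa bytes) to a Python float.
    The top bit of the first mantissa byte is the sign; the mantissa's leading 1 is implied."""
    exp = b[0]
    if exp == 0:
        return 0.0                                   # exponent 0 encodes zero
    sign = -1.0 if b[1] & 0x80 else 1.0
    mantissa = ((b[1] | 0x80) << 24) | (b[2] << 16) | (b[3] << 8) | b[4]
    return sign * mantissa * 2.0 ** (exp - 0x80 - 32)

assert cbm_to_float(bytes([0x81, 0x00, 0x00, 0x00, 0x00])) == 1.0
assert cbm_to_float(bytes([0x84, 0x20, 0x00, 0x00, 0x00])) == 10.0
assert cbm_to_float(bytes([0x81, 0x80, 0x00, 0x00, 0x00])) == -1.0
```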