-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Call background tasks only once per ms #2297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
To benefit from gcc's "once-only headers" implementation, the "wrapper-#ifndef" must be the first non-comment part of the file, according to the manual for various gcc/cpp versions.
.. this means that when we want to modify RUN_BACKGROUND_TASKS, only one change is needed instead of 3
This code is shared by most parts, except where not all the #ifdefs inside the tick function were present in all ports. This mostly would have broken gamepad tick support on non-samd ports. The "ms32" and "ms64" variants of the tick functions are introduced because there is no 64-bit atomic read. Disabling interrupts avoids a low probability bug where milliseconds could be off by ~49.5 days once every ~49.5 days (2^32 ms). Avoiding disabling interrupts when only the low 32 bits are needed is a minor optimization. Testing performed: on metro m4 express, USB still works and time.monotonic_ns() still counts up
If you define MONITOR_BACKGROUND_TASK, then a physical output pin (Metro M4 Express's "SCL" pin by default) will be set HIGH while in the background task and LOW at other times
This improves performance of running python code by 34%, based on the "pystone" benchmark on metro m4 express at 5000 passes (1127.65 -> 1521.6 passes/second). In addition, by instrumenting the tick function and monitoring on an oscilloscope, the time actually spent in run_background_tasks() on the metro m4 decreases from average 43% to 0.5%. (however, there's some additional overhead that is moved around and not accounted for in that "0.5%" figure, each time supervisor_run_background_tasks_if_tick is called but no tick has occurred) On the CPB, it increases pystone from 633 to 769, a smaller percentage increase of 21%. I did not measure the time actually spent in run_background_tasks() on CPB. Testing performed: on metro m4 and cpb, run pystone adapted from python3.4 (change time.time to time.monotonic for sub-second resolution) Besides running a 5000 pass test, I also ran a 50-pass test while scoping how long an output pin was set. Average: 34.59ms or 1445/s on m4, 67.61ms or 739/s on cbp, both matching the other pystone result reasonably well. import pystone import board import digitalio import time d = digitalio.DigitalInOut(board.D13) d.direction = digitalio.Direction.OUTPUT while True: d.value = 0 time.sleep(.01) d.value = 1 pystone.main(50)
@tannewt will have comments about 1ms displayio frame rate resolution. He wants to achieve certain frame rates precisely which may not be possible at 1ms granularity |
It'd be nice to have more resolution for displays but that isn't urgent enough to block this. I am a little worried this will impact USB performance though because it will increase the time between receiving usb data and queueing up another. Could we have the usb interrupt also set a |
@tannewt that seems a good idea. Suggestions on naming an API that would trigger the background tasks to run as soon as possible? |
Could be In other words, each thing that might want to run soon can ask that it itself be run, without affecting the scheduling of the other tasks. |
The build failures are because not all MCUs support atomic operations on values of type bool, sigh. I will implement this feature in a way that does not require |
I'd start with a single global to guard if they run at all. Most of them already do some checking to see if they need to run, it's just costly at the moment. |
@tannewt is this USB interrupt something that exists now? If so, can you suggest what to search for? I came up empty. |
@jepler I think the USB interrupt is in TinyUSB although we could move it out. Is this ready for another look? |
@tannewt I think your assessment about the USB interrupt is right: as it's contained in tinyusb right now, we'd need some more changes before we can use it to trigger background tasks sooner. It looks like I have some failed builds to attend to but they're shallow (use of ticks_ms global need to be converted to function calls), so it should be ready for stylistic checks. Any suggestions on a way to benchmark the possible USB performance regression you're concerned about? That would be good to know before accepting this PR. |
I would just time a couple 100k transfer as a sanity check. Once this is merged we'll have more testing on it. |
@jepler do you feel this is ready? I did a casual review already. |
Testing on pyportal, sitting at the ">>>" REPL. $ dd if=/dev/zero of=/media/jepler/CIRCUITPY/zero bs=512 count=200 Adafruit CircuitPython 5.0.0-beta.0 Adafruit CircuitPython 5.0.0-beta.0-42-gbfdfe0e68 Difference: -0.62%, probably not statistically significant |
Has this been tested on an M0 board? I see no particular reason why it would not be fine, just thought I'd mention it in case CPX/Gemma M0/Feather M0 haven's seen this code... BTW, does your test rx code on the Adafruit board confirm that the exact amount of data sent has been received? E.g. for your example |
@kevinjwalters wanna help test it? :) we have build artifacts over at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you!
Just ran |
On a simple loop, it's nearly twice as fast:
|
@dhalbert wanna re-try that test that the person compared a samd51 with pyboard? we should be at near-parity? |
We were half the MicroPython speed after the initial speed-up, so I think this is very close, if not now completely equivalent. |
@dhalbert yep, wanna re-run it on a samd51? or i can |
This adapts the "inline assembler" code from the UF2 bootloader, which in turn is said to be adapted from the arduino neopixel library. This requires the cache remain ON when using M0, and be turned OFF on M4 (determined by trial and error) Testing performed on a Metro M4: * measured timings using o'scope and found all values within datasheet tolerance. * Drove a string of 96 neopixels without visible glitches * on-board neopixel worked Testing performed on a Circuit Playground Express (M0): * Color wheel code works on built-in neopixels * Color wheel code works on 96 neopixel strip As a bonus, this may have freed up a bit of flash on M0 targets. (2988 -> 3068 bytes free on Trinket M0) Closes: adafruit#2297
samd: neopixel: Fix neopixels after #2297
run_background_tasks: Do nothing unless there has been a tick
To do this, we need the refactored tick (moved to supervisor), so that we can add the atomic flag setting in just one site instead of in all ports.
This improves performance of running python code on a samd51 34%, based on the "pystone" benchmark on metro m4 express at 5000 passes (1127.65 -> 1521.6 passes/second).
In addition, by instrumenting the tick function and monitoring on an oscilloscope, the time actually spent in run_background_tasks() on the metro m4 decreases from average 43% to 0.5%. (however, there's some additional overhead that is moved around and not accounted for in that "0.5%" figure, each time supervisor_run_background_tasks_if_tick is called but no tick has occurred)
On the CPB, it increases pystone from 633 to 769, a smaller percentage increase of 21%. I did not measure the time actually spent in run_background_tasks() on CPB.
Testing performed: on metro m4 and cpb, run pystone adapted from python3.4 (change time.time to time.monotonic for sub-second resolution)
Besides running a 5000 pass test, I also ran a 50-pass test while scoping how long an output pin was set. Average: 34.59ms or 1445/s on m4, 67.61ms or 739/s on cbp, both matching the other pystone result reasonably well.