Skip to content

Call background tasks only once per ms #2297

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Dec 3, 2019
Merged

Conversation

jepler
Copy link

@jepler jepler commented Nov 18, 2019

run_background_tasks: Do nothing unless there has been a tick

To do this, we need the refactored tick (moved to supervisor), so that we can add the atomic flag setting in just one site instead of in all ports.

This improves performance of running python code on a samd51 34%, based on the "pystone" benchmark on metro m4 express at 5000 passes (1127.65 -> 1521.6 passes/second).

In addition, by instrumenting the tick function and monitoring on an oscilloscope, the time actually spent in run_background_tasks() on the metro m4 decreases from average 43% to 0.5%. (however, there's some additional overhead that is moved around and not accounted for in that "0.5%" figure, each time supervisor_run_background_tasks_if_tick is called but no tick has occurred)

On the CPB, it increases pystone from 633 to 769, a smaller percentage increase of 21%. I did not measure the time actually spent in run_background_tasks() on CPB.

Testing performed: on metro m4 and cpb, run pystone adapted from python3.4 (change time.time to time.monotonic for sub-second resolution)

Besides running a 5000 pass test, I also ran a 50-pass test while scoping how long an output pin was set. Average: 34.59ms or 1445/s on m4, 67.61ms or 739/s on cbp, both matching the other pystone result reasonably well.

To benefit from gcc's "once-only headers" implementation, the
"wrapper-#ifndef" must be the first non-comment part of the file,
according to the manual for various gcc/cpp versions.
.. this means that when we want to modify RUN_BACKGROUND_TASKS, only one
change is needed instead of 3
This code is shared by most parts, except where not all the #ifdefs
inside the tick function were present in all ports.  This mostly would
have broken gamepad tick support on non-samd ports.

The "ms32" and "ms64" variants of the tick functions are introduced
because there is no 64-bit atomic read.  Disabling interrupts avoids
a low probability bug where milliseconds could be off by ~49.5 days
once every ~49.5 days (2^32 ms).

Avoiding disabling interrupts when only the low 32 bits are needed is a minor
optimization.

Testing performed: on metro m4 express, USB still works and
time.monotonic_ns() still counts up
If you define MONITOR_BACKGROUND_TASK, then a physical output pin
(Metro M4 Express's "SCL" pin by default) will be set HIGH while in
the background task and LOW at other times
This improves performance of running python code by 34%, based
on the "pystone" benchmark on metro m4 express at 5000 passes
(1127.65 -> 1521.6 passes/second).

In addition, by instrumenting the tick function and monitoring on an
oscilloscope, the time actually spent in run_background_tasks() on
the metro m4 decreases from average 43% to 0.5%. (however, there's
some additional overhead that is moved around and not accounted for
in that "0.5%" figure, each time supervisor_run_background_tasks_if_tick
is called but no tick has occurred)

On the CPB, it increases pystone from 633 to 769, a smaller percentage
increase of 21%.  I did not measure the time actually spent in
run_background_tasks() on CPB.

Testing performed: on metro m4 and cpb, run pystone adapted from python3.4
(change time.time to time.monotonic for sub-second resolution)

Besides running a 5000 pass test, I also ran a 50-pass test while
scoping how long an output pin was set.  Average: 34.59ms or 1445/s on m4,
67.61ms or 739/s on cbp, both matching the other pystone result reasonably
well.

import pystone
import board
import digitalio
import time

d = digitalio.DigitalInOut(board.D13)
d.direction = digitalio.Direction.OUTPUT

while True:
    d.value = 0
    time.sleep(.01)
    d.value = 1
    pystone.main(50)
@ladyada ladyada requested a review from dhalbert November 18, 2019 17:36
@dhalbert
Copy link
Collaborator

@tannewt will have comments about 1ms displayio frame rate resolution. He wants to achieve certain frame rates precisely which may not be possible at 1ms granularity

@dhalbert dhalbert requested a review from tannewt November 18, 2019 17:52
@tannewt
Copy link
Member

tannewt commented Nov 19, 2019

It'd be nice to have more resolution for displays but that isn't urgent enough to block this.

I am a little worried this will impact USB performance though because it will increase the time between receiving usb data and queueing up another. Could we have the usb interrupt also set a run_background flag in addition to the tick doing so?

@jepler
Copy link
Author

jepler commented Nov 19, 2019

@tannewt that seems a good idea. Suggestions on naming an API that would trigger the background tasks to run as soon as possible?

@dhalbert
Copy link
Collaborator

Could be run_background_soon(). But I wonder if we should have a flag for each background task, instead of forcing all to run. So run_displayio_background_now, run_usb_background_now, etc. or something like that.

In other words, each thing that might want to run soon can ask that it itself be run, without affecting the scheduling of the other tasks.

@jepler
Copy link
Author

jepler commented Nov 20, 2019

The build failures are because not all MCUs support atomic operations on values of type bool, sigh. I will implement this feature in a way that does not require atomic_bool.

@tannewt
Copy link
Member

tannewt commented Nov 20, 2019

I'd start with a single global to guard if they run at all. Most of them already do some checking to see if they need to run, it's just costly at the moment.

@jepler
Copy link
Author

jepler commented Nov 20, 2019

@tannewt is this USB interrupt something that exists now? If so, can you suggest what to search for? I came up empty.

@tannewt
Copy link
Member

tannewt commented Nov 21, 2019

@jepler I think the USB interrupt is in TinyUSB although we could move it out.

Is this ready for another look?

@jepler
Copy link
Author

jepler commented Nov 21, 2019

@tannewt I think your assessment about the USB interrupt is right: as it's contained in tinyusb right now, we'd need some more changes before we can use it to trigger background tasks sooner.

It looks like I have some failed builds to attend to but they're shallow (use of ticks_ms global need to be converted to function calls), so it should be ready for stylistic checks.

Any suggestions on a way to benchmark the possible USB performance regression you're concerned about? That would be good to know before accepting this PR.

@tannewt
Copy link
Member

tannewt commented Nov 22, 2019

I would just time a couple 100k transfer as a sanity check. Once this is merged we'll have more testing on it.

@dhalbert
Copy link
Collaborator

@jepler do you feel this is ready? I did a casual review already.

@jepler
Copy link
Author

jepler commented Nov 29, 2019

Testing on pyportal, sitting at the ">>>" REPL.

$ dd if=/dev/zero of=/media/jepler/CIRCUITPY/zero bs=512 count=200

Adafruit CircuitPython 5.0.0-beta.0
Before background changes, best of 3: 48.7 kB/s

Adafruit CircuitPython 5.0.0-beta.0-42-gbfdfe0e68
After background changes, best of 3: 48.4 kB/s

Difference: -0.62%, probably not statistically significant

@jepler
Copy link
Author

jepler commented Nov 29, 2019

@dhalbert If it passes CI this time, I think it's ready for review. I added the performance figures that @tannewt requested, and merged master (there were conflicts due to uart changes)

@kevinjwalters
Copy link

kevinjwalters commented Nov 29, 2019

Has this been tested on an M0 board? I see no particular reason why it would not be fine, just thought I'd mention it in case CPX/Gemma M0/Feather M0 haven's seen this code...

BTW, does your test rx code on the Adafruit board confirm that the exact amount of data sent has been received? E.g. for your example 102400 received.

@ladyada
Copy link
Member

ladyada commented Nov 29, 2019

Copy link
Member

@tannewt tannewt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you!

@tannewt tannewt merged commit fce81e6 into adafruit:master Dec 3, 2019
@kevinjwalters
Copy link

kevinjwalters commented Dec 3, 2019

Just ran adafruit-circuitpython-circuitplayground_express-en_US-20191129-7eea65d.uf2 up on a CPX. It runs ok. I copied a file (writes at about 80kB a second) from Windoze and that was ok. @jepler were you doing other read tests or just writes into the FAT filesystem?

@dhalbert
Copy link
Collaborator

dhalbert commented Dec 3, 2019

On a simple loop, it's nearly twice as fast:

>>> import time
>>> def t(c):
...     start = time.monotonic()
...     for _ in range(c):
...         pass
...     print(time.monotonic() - start)
...
>>> t(5000)
0.0609741
>>> t(100000)
1.2301
>>> t(1000000)
12.351
>>> def t(c):
...     start = time.monotonic()
...     for _ in range(c):
...         pass
...     print(time.monotonic() - start)
...     
>>> t(5000
... )
0.0350037
>>> t(100000)
0.695007
>>> t(1000000)
6.92801

@ladyada
Copy link
Member

ladyada commented Dec 3, 2019

@dhalbert wanna re-try that test that the person compared a samd51 with pyboard? we should be at near-parity?

@dhalbert
Copy link
Collaborator

dhalbert commented Dec 3, 2019

We were half the MicroPython speed after the initial speed-up, so I think this is very close, if not now completely equivalent.

@ladyada
Copy link
Member

ladyada commented Dec 3, 2019

@dhalbert yep, wanna re-run it on a samd51? or i can

jepler added a commit to jepler/circuitpython that referenced this pull request Dec 8, 2019
This adapts the "inline assembler" code from the UF2 bootloader, which
in turn is said to be adapted from the arduino neopixel library.

This requires the cache remain ON when using M0, and be turned OFF on M4
(determined by trial and error)

Testing performed on a Metro M4:
 * measured timings using o'scope and found all values within
   datasheet tolerance.
 * Drove a string of 96 neopixels without visible glitches
 * on-board neopixel worked

Testing performed on a Circuit Playground Express (M0):
 * Color wheel code works on built-in neopixels
 * Color wheel code works on 96 neopixel strip

As a bonus, this may have freed up a bit of flash on M0 targets. (2988 ->
3068 bytes free on Trinket M0)

Closes: adafruit#2297
tannewt added a commit that referenced this pull request Dec 10, 2019
samd: neopixel: Fix neopixels after #2297
@jepler jepler deleted the tick-refactor branch November 3, 2021 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants