Optimize tickless tick computation #9786

Merged: 2 commits merged into ARMmbed:master on Feb 27, 2019

Conversation

@c1728p9 (Contributor) commented Feb 21, 2019

Description

Optimize the tick computation by computing the elapsed ticks based on relative time rather than absolute time. On the NUCLEO_L073RZ this reduces the execution time of SysTimer::resume from over 35us to less than 20us.

Also, to ensure the relative time calculations do not overflow, the maximum time the OS can suspend for is set to 1 hour.
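For illustration, a simplified before/after of the computation (a sketch using the names from the SysTimer diff discussed below, not the exact merged code):

// Before: ticks are derived from absolute time since the timer started, so a
// 64-bit multiply-then-divide runs on an ever-growing value.
uint64_t new_tick = (ticker_read_us(_ticker_data) - _start_time) * OS_TICK_FREQ / 1000000;

// After: ticks are derived from the time elapsed since the point already
// accounted for, which stays small and needs only one division by the tick period.
uint64_t elapsed_ticks = (ticker_read_us(_ticker_data) - _time_us) / US_IN_TICK;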

Pull request type

[x] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@pan-

@ciarmcom (Member)

@c1728p9, thank you for your changes.
@pan- @ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

@ciarmcom ciarmcom requested a review from a team February 21, 2019 04:00
@kjbracey (Contributor) left a comment

This saves a fair bit on the M0, where it converts a 64-bit software division to a 32-bit software division, but on M3 and M4 it should save a lot more - it should be optimised into a "multiply by reciprocal" using UMULL. (I know GCC and ARMC5 do this, hopefully IAR and ARMC6 do too).
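For illustration, a minimal sketch of that "multiply by reciprocal" transformation, assuming US_IN_TICK is 1000 (a 1ms tick); the constant 0x10624DD3 is ceil(2^38 / 1000), which compilers derive automatically, so real code just writes the division:

#include <stdint.h>

static inline uint32_t div_by_1000(uint32_t x)
{
    // On ARMv7-M this compiles to a UMULL plus a shift rather than a call
    // into the software division helper; the result equals x / 1000 for all
    // 32-bit x.
    return (uint32_t)(((uint64_t)x * 0x10624DD3u) >> 38);
}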

This is slightly at odds with something I'm looking to do with SysTimer - that was actually going to make it more 64-bit, to help support a manual suspend API, where I was anticipating long sleep times. But I think I could deal with that by having conditional behaviour - switch between optimised 32-bit and heavy 64-bit computations depending on whether deep sleep is locked. If deep sleep is permitted, we have free rein on interrupt latency.
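A hypothetical sketch of that conditional idea (not code from this PR); it assumes mbed's sleep_manager_can_deep_sleep() query and reuses the _ticker_data/_time_us/US_IN_TICK names from SysTimer:

uint64_t elapsed_us = ticker_read_us(_ticker_data) - _time_us;
uint64_t elapsed_ticks;
if (!sleep_manager_can_deep_sleep()) {
    // Deep sleep locked: sleeps are short, so a cheap 32-bit division
    // (optimisable to a multiply) keeps interrupt latency low.
    elapsed_ticks = (uint32_t)elapsed_us / US_IN_TICK;
} else {
    // Deep sleep allowed: latency is not critical, so a full 64-bit division
    // can handle arbitrarily long suspensions.
    elapsed_ticks = elapsed_us / US_IN_TICK;
}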

@jeromecoutant (Collaborator)

@janjongboom
Maybe you can also have a look, as I think you are using tickless for the L072 LoRa board?

@LMESTM

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from 86cef5b to bf011e0 Compare February 21, 2019 17:11
// 32 bit division is used here for better performance since this is in a critical section.
const timestamp_t elapsed_us = (timestamp_t)(ticker_read_us(_ticker_data) - _time_us);
uint32_t elapsed_ticks = (elapsed_us + _remainder_us) / US_IN_TICK;
_remainder_us = elapsed_us - elapsed_ticks * US_IN_TICK;
@c1728p9 (Contributor, Author) commented:

I think this remainder calculation is wrong.

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from bf011e0 to 7f3e4b2 Compare February 22, 2019 00:32
@c1728p9 (Contributor, Author) commented Feb 22, 2019

I updated this PR to preserve the long sleep times without having an impact on interrupt latency. Bringing back 64-bit division brought the execution time of SysTimer::resume up to a bit under 20us. I also added a separate commit to remove the critical section in places where it wasn't needed, such as SysTimer::resume.

@cmonr cmonr requested a review from kjbracey February 22, 2019 01:28

uint64_t new_tick = (ticker_read_us(_ticker_data) - _start_time) * OS_TICK_FREQ / 1000000;
if (new_tick > _tick) {
uint64_t elapsed_ticks = (ticker_read_us(_ticker_data) - _time_us) / US_IN_TICK;
@kjbracey (Contributor) commented:

Gah!!! I briefly wondered whether this might be a solution, but immediately shot it down because I assumed SysTimer::resume was being called in a critical section from outside. This is great.

I think it's still worth optimising for the common short case - if this is taking 20us every millisecond, that is 2% of CPU time just for the division. (On a 4MHz system I saw about 10% of CPU time being eaten up by the millisecond tickers.)

How about:

uint64_t elapsed_us = ticker_read_us(_ticker_data) - _time_us;
uint64_t elapsed_ticks;
if (elapsed_us > 0xFFFFFFFF) {
    // This will typically invoke full 64/64 software division
    elapsed_ticks = elapsed_us / US_IN_TICK;
} else {
    // This will invoke 32/32 software division on ARMv6-M or ARMv8-M baseline, and
    // on ARMv7-M or ARMv8-M mainline can be optimised to a UMULL by 2^n/US_IN_TICK, or
    // at worst be a UDIV instruction. Thus it will be faster, maybe greatly.
    elapsed_ticks = (uint32_t) elapsed_us / US_IN_TICK;
}

@c1728p9 (Contributor, Author) commented Feb 22, 2019:

I assumed SysTimer::resume was being called in a critical section from outside.

It used to be in a critical section, but had to be changed when the RTX API it relied on became unavailable. I missed it at the time as well.

Initially I had it avoid division altogether in the fast case - when only a single tick had elapsed:

if (elapsed_time < US_IN_TICK) {
    elapsed_ticks = 0;
} else if (elapsed_time < US_IN_TICK * 2) {
    elapsed_ticks = 1;
} else {
    elapsed_ticks = elapsed_time / US_IN_TICK;
}

I ended up going with the simpler implementation you see now since I was just worried about reducing worst-case execution time. Also, I was only set up to profile worst-case execution time, which did not change much with this change.

It sounds like you have profiled the average execution time on a couple of devices. Would you be able to profile the different optimizations and create a PR for the one which looks the most promising?

@kjbracey (Contributor) commented:

I'll have to see if those extra cases help too. My feeling is that division tends to have early termination, so it might not help much, but it's worth checking.

I was just worried about reducing worst case execution time.

But we don't care about worst-case any more if we're outside of the critical section, right?

@c1728p9 (Contributor, Author) commented:

But we don't care about worst-case any more if we're outside of the critical section, right?

True, worst case doesn't matter anymore. At the time I was only set up to profile worst-case, so I wasn't able to evaluate the impact of this change.

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from 7f3e4b2 to cd0f5bb Compare February 22, 2019 16:31
@c1728p9 (Contributor, Author) commented Feb 22, 2019

Rebased onto master and updated the US_IN_TICK calculation/compiler assert.

Optimize the tick computation in the following ways:
1. Use relative time rather than absolute time
2. Replace multiplication/division pair with just multiplication
    or division:
    "* 1000000ULL / OS_TICK_FREQ"   ->   "* US_IN_TICK"
    "* OS_TICK_FREQ / 1000000"      ->   "/ US_IN_TICK"
Remove unnecessary critical sections from the SysTimer code since
the access should already be serialized.
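
For illustration, a minimal sketch of how the shared constant and its compile-time guard might look (the exact form in the PR may differ); it assumes OS_TICK_FREQ divides 1000000 evenly, which the assert enforces:

#define US_IN_TICK (1000000 / OS_TICK_FREQ)
MBED_STATIC_ASSERT(1000000 % OS_TICK_FREQ == 0, "OS_TICK_FREQ must divide 1000000 evenly");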
@c1728p9 c1728p9 force-pushed the tickless_optimization branch from cd0f5bb to f6ed7ce Compare February 23, 2019 23:45
@janjongboom (Contributor)

@jeromecoutant I don't feel comfortable reviewing this at this level. But I trust @c1728p9

@cmonr (Contributor) commented Feb 25, 2019

CI started whilst waiting on reviews.

@mbed-ci commented Feb 26, 2019

Test run: SUCCESS

Summary: 12 of 12 test jobs passed
Build number: 1
Build artifacts

@cmonr (Contributor) commented Feb 26, 2019

@mbed-os-core @pan- Thoughts?

@cmonr cmonr merged commit 3352b43 into ARMmbed:master Feb 27, 2019