Optimize tickless tick computation #9786

Merged: 2 commits merged into ARMmbed:master on Feb 27, 2019

Conversation

@c1728p9 (Contributor) commented Feb 21, 2019

Description

Optimize the tick computation by computing the elapsed ticks based on relative time rather than absolute time. On the NUCLEO_L073RZ this reduces the execution time of SysTimer::resume from over 35us to less than 20us.

Also, to ensure the relative time calculations do not overflow, the maximum time the OS can suspend for is set to 1 hour.
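For illustration, a simplified before/after of the computation (a sketch using the names from the SysTimer diff discussed below, not the exact merged code):

// Before: ticks are derived from absolute time since the timer started, so a
// 64-bit multiply-then-divide runs on an ever-growing value.
uint64_t new_tick = (ticker_read_us(_ticker_data) - _start_time) * OS_TICK_FREQ / 1000000;

// After: ticks are derived from the time elapsed since the point already
// accounted for, which stays small and needs only one division by the tick period.
uint64_t elapsed_ticks = (ticker_read_us(_ticker_data) - _time_us) / US_IN_TICK;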

Pull request type

[x] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@pan-

@ciarmcom (Member)

@c1728p9, thank you for your changes.
@pan- @ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

@ciarmcom ciarmcom requested a review from a team February 21, 2019 04:00
@kjbracey (Contributor) left a comment

This saves a fair bit on the M0, where it converts a 64-bit software division to a 32-bit software division, but on M3 and M4 it should save a lot more - it should be optimised into a "multiply by reciprocal" using UMULL. (I know GCC and ARMC5 do this, hopefully IAR and ARMC6 do too).
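For illustration, a minimal sketch of that "multiply by reciprocal" transformation, assuming US_IN_TICK is 1000 (a 1ms tick); the constant 0x10624DD3 is ceil(2^38 / 1000), which compilers derive automatically, so real code just writes the division:

#include <stdint.h>

static inline uint32_t div_by_1000(uint32_t x)
{
    // On ARMv7-M this compiles to a UMULL plus a shift rather than a call
    // into the software division helper; the result equals x / 1000 for all
    // 32-bit x.
    return (uint32_t)(((uint64_t)x * 0x10624DD3u) >> 38);
}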

This is slightly at odds with something I'm looking to do with SysTimer - that was actually going to make it more 64-bit, to help support a manual suspend API, where I was anticipating long sleep times. But I think I could deal with that by having conditional behaviour - switch between optimised 32-bit and heavy 64-bit computations depending on whether deep sleep is locked. If deep sleep is permitted, we have free rein on interrupt latency.
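A hypothetical sketch of that conditional idea (not code from this PR); it assumes mbed's sleep_manager_can_deep_sleep() query and reuses the _ticker_data/_time_us/US_IN_TICK names from SysTimer:

uint64_t elapsed_us = ticker_read_us(_ticker_data) - _time_us;
uint64_t elapsed_ticks;
if (!sleep_manager_can_deep_sleep()) {
    // Deep sleep locked: sleeps are short, so a cheap 32-bit division
    // (optimisable to a multiply) keeps interrupt latency low.
    elapsed_ticks = (uint32_t)elapsed_us / US_IN_TICK;
} else {
    // Deep sleep allowed: latency is not critical, so a full 64-bit division
    // can handle arbitrarily long suspensions.
    elapsed_ticks = elapsed_us / US_IN_TICK;
}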

@jeromecoutant (Collaborator)

@janjongboom
Maybe you can also have a look, as I think you are using tickless for the L072 LoRa board?

@LMESTM

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from 86cef5b to bf011e0 Compare February 21, 2019 17:11
// 32 bit division is used here for better performance since this is in a critical section.
const timestamp_t elapsed_us = (timestamp_t)(ticker_read_us(_ticker_data) - _time_us);
uint32_t elapsed_ticks = (elapsed_us + _remainder_us) / US_IN_TICK;
_remainder_us = elapsed_us - elapsed_ticks * US_IN_TICK;
@c1728p9 (Contributor, Author) commented:

I think this remainder calculation is wrong.

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from bf011e0 to 7f3e4b2 Compare February 22, 2019 00:32
@c1728p9 (Contributor, Author) commented Feb 22, 2019

I updated this PR to preserve the long sleep times without having an impact on interrupt latency. Bringing back 64-bit division brought the execution time of SysTimer::resume up to a bit under 20us. I also added a separate commit to remove the critical section in places where it wasn't needed, such as SysTimer::resume.

@cmonr cmonr requested a review from kjbracey February 22, 2019 01:28

uint64_t new_tick = (ticker_read_us(_ticker_data) - _start_time) * OS_TICK_FREQ / 1000000;
if (new_tick > _tick) {
uint64_t elapsed_ticks = (ticker_read_us(_ticker_data) - _time_us) / US_IN_TICK;
@kjbracey (Contributor) commented:

Gah!!! I briefly wondered whether this might be a solution, but immediately shot it down because I assumed SysTimer::resume was being called in a critical section from outside. This is great.

I think it's still worth optimising for the common short case - if this is taking 20us every millisecond, that is 2% of CPU time just for the division. (On a 4MHz system I saw about 10% of CPU time being eaten up by the millisecond tickers.)

How about:

uint64_t elapsed_us = ticker_read_us(_ticker_data) - _time_us;
uint64_t elapsed_ticks;
if (elapsed_us > 0xFFFFFFFF) {
    // This will typically invoke full 64/64 software division
    elapsed_ticks = elapsed_us / US_IN_TICK;
} else {
    // This will invoke 32/32 software division on ARMv6-M or ARMv8-M baseline, and
    // on ARMv7-M or ARMv8-M mainline can be optimised to a UMULL by 2^n/US_IN_TICK, or
    // at worst be a UDIV instruction. Thus it will be faster, maybe greatly.
    elapsed_ticks = (uint32_t) elapsed_us / US_IN_TICK;
}

@c1728p9 (Contributor, Author) commented Feb 22, 2019:

I assumed SysTimer::resume was being called in a critical section from outside.

It used to be in a critical section, but had to be changed when the RTX API it relied on became unavailable. I missed it at the time as well.

Initially I had it avoid division altogether in the fast case - when only a single tick had elapsed:

if (elapsed_time < US_IN_TICK) {
    elapsed_ticks = 0;
} else if (elapsed_time < US_IN_TICK * 2) {
    elapsed_ticks = 1;
} else {
    elapsed_ticks = elapsed_time / US_IN_TICK;
}

I ended up going with the simpler implementation you see now since I was just worried about reducing worst-case execution time. Also, I was only set up to profile worst-case execution time, which did not change much with this change.

It sounds like you have profiled the average execution time on a couple of devices. Would you be able to profile the different optimizations and create a PR for the one which looks the most promising?

@kjbracey (Contributor) commented:

I'll have to see if those extra cases help too. My feeling is that division tends to have early termination, so it might not help much, but it's worth checking.

I was just worried about reducing worst case execution time.

But we don't care about worst-case any more if we're outside of the critical section, right?

@c1728p9 (Contributor, Author) commented:

But we don't care about worst-case any more if we're outside of the critical section, right?

True, worst case doesn't matter anymore. At the time I was only set up to profile worst-case, so I wasn't able to evaluate the impact of this change.

@c1728p9 c1728p9 force-pushed the tickless_optimization branch from 7f3e4b2 to cd0f5bb Compare February 22, 2019 16:31
@c1728p9 (Contributor, Author) commented Feb 22, 2019

Rebased onto master and updated the US_IN_TICK calculation/compiler assert.

Optimize the tick computation in the following ways:
1. Use relative time rather than absolute time
2. Replace multiplication/division pair with just multiplication
    or division:
    "* 1000000ULL / OS_TICK_FREQ"   ->   "* US_IN_TICK"
    "* OS_TICK_FREQ / 1000000"      ->   "/ US_IN_TICK"
Remove unnecessary critical sections from the SysTimer code since
the access should already be serialized.
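
For illustration, a minimal sketch of how the shared constant and its compile-time guard might look (the exact form in the PR may differ); it assumes OS_TICK_FREQ divides 1000000 evenly, which the assert enforces:

#define US_IN_TICK (1000000 / OS_TICK_FREQ)
MBED_STATIC_ASSERT(1000000 % OS_TICK_FREQ == 0, "OS_TICK_FREQ must divide 1000000 evenly");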
@c1728p9 c1728p9 force-pushed the tickless_optimization branch from cd0f5bb to f6ed7ce Compare February 23, 2019 23:45
@janjongboom (Contributor)

@jeromecoutant I don't feel comfortable reviewing this at this level. But I trust @c1728p9

@cmonr (Contributor) commented Feb 25, 2019

CI started whilst waiting on reviews.

@mbed-ci commented Feb 26, 2019

Test run: SUCCESS

Summary: 12 of 12 test jobs passed
Build number: 1
Build artifacts

@cmonr (Contributor) commented Feb 26, 2019

@mbed-os-core @pan- Thoughts?

@cmonr cmonr merged commit 3352b43 into ARMmbed:master Feb 27, 2019