Skip to content

Update idle loop to reduce calls to suspend #6534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 25, 2018

Conversation

c1728p9
Copy link
Contributor

@c1728p9 c1728p9 commented Apr 3, 2018

Change tickless handling to use public RTX calls and to prevent unnecessary calls to suspend/resume by looping until the kernel needs to be resumed.

@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 3, 2018

This change also allows #6536 to work correctly, as the low power timer no longer needs to written to on every wakeup.

@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 3, 2018

CC @JonatanAntoni @bulislaw

Copy link
Contributor

@kjbracey kjbracey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with the SysTimer, but I've looked at the RTX end.

if (!osRtxInfo.kernel.pendSV) { // check devices didn't run any interrupt handlers kicking the OS between suspend and critical section
sleep(); // uses WFI - will wake as soon as any pending interrupts
}
core_util_critical_section_exit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be call for an ISB here - you're wanting pending interrupts to trigger between this line and the next one, and an ISB should make sure that happens.

In theory I think you could redisable interrupts before any have a chance to trigger - formally the sequence "enable IRQs / disable IRQs" needs an ISB somewhere in the middle to guarantee they have time to trigger.

Relevant ARM ARM quote:

If execution of a CPS instruction:

  • Increases the execution priority, the CPS execution serializes that change to the instruction stream.
  • Decreases the execution priority, the architecture guarantees only that the new priority is visible to instructions executed after either executing an ISB, or performing an exception entry or exception return.

(In practice they're likely to trigger before you hit the next disable, which means you'd see pendSV set by the next time you hit line 104.)

os_timer->cancel_tick();
// calculate how long we slept
elapsed_ticks = os_timer->update_tick();
if (osRtxInfo.kernel.pendSV) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be part of the while condition? Could be true on first entry to the while

Change tickless handling to use public RTX calls and to prevent
unnecessary calls to suspend/resume by looping until the kernel
needs to be resumed.
0xc0170
0xc0170 previously approved these changes Apr 5, 2018
@0xc0170 0xc0170 requested a review from bulislaw April 5, 2018 12:09
while (!os_timer->suspend_time_passed() && !event_pending) {

core_util_critical_section_enter();
if (osRtxInfo.kernel.pendSV) {
Copy link
Member

@JonatanAntoni JonatanAntoni Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

osRtxInfo.kernel.pendSV is not public API. Perhaps @RobertRostohar can justify if this solution is solid and a small API change would be valuable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might help with the review, details are here #6273 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that there should be better way - pendSV peeking was the best I could think of with current APIs.

The "simple" API change I would suggest is to permit osKernelSuspend/Resume to be callable from critical section (which is what happens via direct svcKernelSuspend calls at the moment, and this replaces), or to have them or equivalents actually return from the suspend call such that the thread is now in a critical section.

I guess in general one also needs to think about priviliged/unprivileged operation, in case a system is using User mode - system suspension, critical sections and the WFI instruction are privileged operations anyway, so this whole "normal thread code does suspend" only works if the idle thread is privileged (or the suspend call makes you privileged?), or you have a WFE-only version (which I believe is less flexible and can reduce power saving opportunities).

Copy link
Contributor Author

@c1728p9 c1728p9 Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JonatanAntoni when I was initially working on tickless I proposed an extension to the RTOS to allow this, which you might be able to use:
https://github.com/ARMmbed/mbed-os/pull/5031/files#diff-415fd494349b5414064e0cc0d9f0d596R189

As for this PR, how do you want to proceed? Can we move forward using this internal API until RTX has a public API which supports this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, for the time being you can access this internal bit. But as it is not public API it might change in the future without being warned. So we should align the low power operation, soon. We need to address this privileged/unprivileged issue nonetheless talking about MPU protection.

@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 6, 2018

/morph build

@mbed-ci
Copy link

mbed-ci commented Apr 6, 2018

Build : FAILURE

Build number : 1668
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 6, 2018

mbedmicro-rtos-mbed needs an update?

@cmonr
Copy link
Contributor

cmonr commented Apr 9, 2018

@0xc0170 Would appear to be the case.

Update the SysTimer test to match the updated API. Changes are:
- increment_tick() renamed to _increment_tick() and explicitly
    synchronized
-update_tick() replaced with resume() and a call to suspend() was added
    before this
@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 10, 2018

Even with the test update, this PR intermittently fails on test_update_tick on the nrf51. From profiling this test, I found that the call to suspend is taking at times +500us. I suspect that a tick is passing and that is why 1 is returned from suspend rather than 0 - FAIL: Expected 0 Was 1.

If I pull in #5629 suspend takes around 133us to complete, which is still a lot, but the test consistently passes. @fkjagodzinski you were the author of this test right? Do you have any recommendations on how I can get this test to pass consistently?

@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 10, 2018

/morph build

@mbed-ci
Copy link

mbed-ci commented Apr 10, 2018

Build : SUCCESS

Build number : 1714
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci
Copy link

mbed-ci commented Apr 11, 2018

@fkjagodzinski
Copy link
Member

@c1728p9 test_update_tick() is supposed to check just the update_tick() method with, and without a delay in between calls. Since update_tick() functionality was moved to resume() it's enough to update method name in the test. No calls to suspend() needed.

With this patch 5bfaf528d571de6ae67fe50c21ff25af9bb6fecf all tests-mbedmicro-rtos-mbed-systimer consistently pass on NRF51_DK.

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 11, 2018

With this patch 5bfaf52 all tests-mbedmicro-rtos-mbed-systimer consistently pass on NRF51_DK.

Shall this be added here?

@fkjagodzinski
Copy link
Member

My fault -- I tested that patch with GCC toolchain only. This commit fkjagodzinski@b5dc7f4 fixes ARM and IAR compilation errors.

@c1728p9
Copy link
Contributor Author

c1728p9 commented Apr 17, 2018

Thanks for the fix @fkjagodzinski. I pulled it into this PR.

@cmonr
Copy link
Contributor

cmonr commented Apr 17, 2018

/morph build

@mbed-ci
Copy link

mbed-ci commented Apr 17, 2018

Build : SUCCESS

Build number : 1775
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci
Copy link

mbed-ci commented Apr 17, 2018

@mbed-ci
Copy link

mbed-ci commented Apr 18, 2018

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idle loop and systimer changes looks good.

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 18, 2018

Ah too slow to remove the comment to trigger another build :-) I invoked build instead of test to retest 😖

@mbed-ci
Copy link

mbed-ci commented Apr 18, 2018

Build : SUCCESS

Build number : 1789
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci
Copy link

mbed-ci commented Apr 19, 2018

@mbed-ci
Copy link

mbed-ci commented Apr 19, 2018

@fkjagodzinski
Copy link
Member

tests-mbed_drivers-timeout / 'Multiple timeouts running in parallel (attach)'test case compiled with ARM toolchain failed again on NRF52_DK board.

I'm investigating. So far, I can tell that the issue is also present on current master branch. #6515 might help here -- with this PR merged I couldn't reproduce the issue locally any more.

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 19, 2018

I'm investigating. So far, I can tell that the issue is also present on current master branch. #6515 might help here -- with this PR merged I couldn't reproduce the issue locally any more.

Interesting, and that fix is blocked by another failure 😕

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 19, 2018

I'm investigating. So far, I can tell that the issue is also present on current master branch.

@adbridge @cmonr Are you aware of this failure? I don't recall seeing this in CI, I'll look at the logs. @fkjagodzinski If this is present on master, and these tests were added not long time ago, were you aware of this issue?

@fkjagodzinski
Copy link
Member

@0xc0170 No, not until I saw morph failures for this PR. It didn't look related to this PR, that's why I checked master. The issue does not show up very often, but can be reproduced.

@cmonr
Copy link
Contributor

cmonr commented Apr 24, 2018

/morph test

@fkjagodzinski
Copy link
Member

I've created an issue #6732 based on the failing test case tests-mbed_drivers-timeout / 'Multiple timeouts running in parallel (attach)'.

@mbed-ci
Copy link

mbed-ci commented Apr 24, 2018

void SysTimer::cancel_tick()
{
remove();
schedule_tick(ticks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c1728p9 Why is there no remove() call here? I'm seeing issues on my system because of this.

The SysTimer is based on a TimerEvent class, which has a static member for the timer's event data struct. Additionally, the call to suspend may happen before the previously set timer expires (for example when all threads go idle and the idle handler gets activated). At that point, the timer event data is still present in the timer queue. Calling schedule_tick here will insert the same event structure again in the queue. And since the queue is by reference, that'll lead to a recursive timer that breaks the system.

Copy link
Contributor Author

@c1728p9 c1728p9 May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @stevew817 I gave this a try but was unable to reproduce the failure. Do you have example code which reproduces the issue?

From both code analysis and running the code can I cant find a sequence which causes this. The basic flow of the idle loop should unconditionally call remove before each insert as indicated below:

while (true) {
    osKernelSuspend();      // removes event
    os_timer->suspend();    // inserts event
    sleep();
    os_timer->resume();     // removes event
    osKernelResume();       // inserts event
}

With a debugger I can confirm the following sequence is being met by setting a breakpoint in TimerEvent::remove, TimerEvent::insert and TimerEvent::insert_absolute. Below is the callstack from each:

osKernelSuspend()
    svcRtxKernelSuspend()
        KernelBlock()
            OS_Tick_Disable()
                os_timer->cancel_tick()
                    remove()

os_timer->suspend()
    schedule_tick()
        insert_absolute()

os_timer->resume()
    remove()

osKernelResume()
    svcRtxKernelResume()
        KernelUnblock()
            OS_Tick_Enable()
                schedule_tick()
                    insert_absolute();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran across this when running cloud-client-example using Thread as a sleepy node in tickless mode. The code would hang, and attaching a debugger confirmed it was stuck in the tick handler IRQ because somehow the tick event had managed to add itself recursively (the next pointer of the tick event data was pointing to the same event data, causing the remove call to not do anything since the head just always ands up being the tick event data).

I fixed the issue by adding a cancel_tick() call before the schedule_ticks call in SysTimer::suspend. I didn't investigate further after solving my issue, so I'm afraid I don't have a reproducer ready.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevew817 I'm still unable to reproduce this problem, but I created #7027 just to be safe. Could you provide further details on how to replicate this failure (such as git shas used in cloud-client-example and mbed-os along with config, any local mods you made and the hardware you were using) or an example application?

@cmonr
Copy link
Contributor

cmonr commented May 25, 2018

@c1728p9 Please take a look at this and if this needs a fix and/or update, we can take a fix in for RC2.

CC: @ARMmbed/mbed-os-maintainers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants