-
Notifications
You must be signed in to change notification settings - Fork 3k
Update idle loop to reduce calls to suspend #6534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change also allows #6536 to work correctly, as the low power timer no longer needs to written to on every wakeup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the SysTimer, but I've looked at the RTX end.
rtos/TARGET_CORTEX/mbed_rtx_idle.cpp
Outdated
if (!osRtxInfo.kernel.pendSV) { // check devices didn't run any interrupt handlers kicking the OS between suspend and critical section | ||
sleep(); // uses WFI - will wake as soon as any pending interrupts | ||
} | ||
core_util_critical_section_exit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be call for an ISB here - you're wanting pending interrupts to trigger between this line and the next one, and an ISB should make sure that happens.
In theory I think you could redisable interrupts before any have a chance to trigger - formally the sequence "enable IRQs / disable IRQs" needs an ISB somewhere in the middle to guarantee they have time to trigger.
Relevant ARM ARM quote:
If execution of a CPS instruction:
- Increases the execution priority, the CPS execution serializes that change to the instruction stream.
- Decreases the execution priority, the architecture guarantees only that the new priority is visible to instructions executed after either executing an ISB, or performing an exception entry or exception return.
(In practice they're likely to trigger before you hit the next disable, which means you'd see pendSV set by the next time you hit line 104.)
rtos/TARGET_CORTEX/mbed_rtx_idle.cpp
Outdated
os_timer->cancel_tick(); | ||
// calculate how long we slept | ||
elapsed_ticks = os_timer->update_tick(); | ||
if (osRtxInfo.kernel.pendSV) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be part of the while condition? Could be true on first entry to the while
Change tickless handling to use public RTX calls and to prevent unnecessary calls to suspend/resume by looping until the kernel needs to be resumed.
while (!os_timer->suspend_time_passed() && !event_pending) { | ||
|
||
core_util_critical_section_enter(); | ||
if (osRtxInfo.kernel.pendSV) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
osRtxInfo.kernel.pendSV
is not public API. Perhaps @RobertRostohar can justify if this solution is solid and a small API change would be valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might help with the review, details are here #6273 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that there should be better way - pendSV peeking was the best I could think of with current APIs.
The "simple" API change I would suggest is to permit osKernelSuspend/Resume to be callable from critical section (which is what happens via direct svcKernelSuspend calls at the moment, and this replaces), or to have them or equivalents actually return from the suspend call such that the thread is now in a critical section.
I guess in general one also needs to think about priviliged/unprivileged operation, in case a system is using User mode - system suspension, critical sections and the WFI instruction are privileged operations anyway, so this whole "normal thread code does suspend" only works if the idle thread is privileged (or the suspend call makes you privileged?), or you have a WFE-only version (which I believe is less flexible and can reduce power saving opportunities).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JonatanAntoni when I was initially working on tickless I proposed an extension to the RTOS to allow this, which you might be able to use:
https://github.com/ARMmbed/mbed-os/pull/5031/files#diff-415fd494349b5414064e0cc0d9f0d596R189
As for this PR, how do you want to proceed? Can we move forward using this internal API until RTX has a public API which supports this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, for the time being you can access this internal bit. But as it is not public API it might change in the future without being warned. So we should align the low power operation, soon. We need to address this privileged/unprivileged issue nonetheless talking about MPU protection.
/morph build |
Build : FAILUREBuild number : 1668 |
|
@0xc0170 Would appear to be the case. |
Update the SysTimer test to match the updated API. Changes are: - increment_tick() renamed to _increment_tick() and explicitly synchronized -update_tick() replaced with resume() and a call to suspend() was added before this
Even with the test update, this PR intermittently fails on test_update_tick on the nrf51. From profiling this test, I found that the call to suspend is taking at times +500us. I suspect that a tick is passing and that is why 1 is returned from suspend rather than 0 - If I pull in #5629 suspend takes around 133us to complete, which is still a lot, but the test consistently passes. @fkjagodzinski you were the author of this test right? Do you have any recommendations on how I can get this test to pass consistently? |
/morph build |
Build : SUCCESSBuild number : 1714 Triggering tests/morph test |
Exporter Build : SUCCESSBuild number : 1347 |
@c1728p9 With this patch 5bfaf528d571de6ae67fe50c21ff25af9bb6fecf all |
Shall this be added here? |
My fault -- I tested that patch with GCC toolchain only. This commit fkjagodzinski@b5dc7f4 fixes ARM and IAR compilation errors. |
Thanks for the fix @fkjagodzinski. I pulled it into this PR. |
/morph build |
Build : SUCCESSBuild number : 1775 Triggering tests/morph test |
Exporter Build : SUCCESSBuild number : 1416 |
Test : FAILUREBuild number : 1585 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idle loop and systimer changes looks good.
Ah too slow to remove the comment to trigger another build :-) I invoked build instead of test to retest 😖 |
Build : SUCCESSBuild number : 1789 Triggering tests/morph test |
Exporter Build : SUCCESSBuild number : 1437 |
Test : FAILUREBuild number : 1600 |
I'm investigating. So far, I can tell that the issue is also present on current master branch. #6515 might help here -- with this PR merged I couldn't reproduce the issue locally any more. |
Interesting, and that fix is blocked by another failure 😕 |
@adbridge @cmonr Are you aware of this failure? I don't recall seeing this in CI, I'll look at the logs. @fkjagodzinski If this is present on master, and these tests were added not long time ago, were you aware of this issue? |
@0xc0170 No, not until I saw morph failures for this PR. It didn't look related to this PR, that's why I checked master. The issue does not show up very often, but can be reproduced. |
/morph test |
I've created an issue #6732 based on the failing test case |
Test : SUCCESSBuild number : 1646 |
void SysTimer::cancel_tick() | ||
{ | ||
remove(); | ||
schedule_tick(ticks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c1728p9 Why is there no remove() call here? I'm seeing issues on my system because of this.
The SysTimer is based on a TimerEvent class, which has a static member for the timer's event data struct. Additionally, the call to suspend may happen before the previously set timer expires (for example when all threads go idle and the idle handler gets activated). At that point, the timer event data is still present in the timer queue. Calling schedule_tick here will insert the same event structure again in the queue. And since the queue is by reference, that'll lead to a recursive timer that breaks the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @stevew817 I gave this a try but was unable to reproduce the failure. Do you have example code which reproduces the issue?
From both code analysis and running the code can I cant find a sequence which causes this. The basic flow of the idle loop should unconditionally call remove before each insert as indicated below:
while (true) {
osKernelSuspend(); // removes event
os_timer->suspend(); // inserts event
sleep();
os_timer->resume(); // removes event
osKernelResume(); // inserts event
}
With a debugger I can confirm the following sequence is being met by setting a breakpoint in TimerEvent::remove, TimerEvent::insert and TimerEvent::insert_absolute. Below is the callstack from each:
osKernelSuspend()
svcRtxKernelSuspend()
KernelBlock()
OS_Tick_Disable()
os_timer->cancel_tick()
remove()
os_timer->suspend()
schedule_tick()
insert_absolute()
os_timer->resume()
remove()
osKernelResume()
svcRtxKernelResume()
KernelUnblock()
OS_Tick_Enable()
schedule_tick()
insert_absolute();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran across this when running cloud-client-example using Thread as a sleepy node in tickless mode. The code would hang, and attaching a debugger confirmed it was stuck in the tick handler IRQ because somehow the tick event had managed to add itself recursively (the next pointer of the tick event data was pointing to the same event data, causing the remove call to not do anything since the head just always ands up being the tick event data).
I fixed the issue by adding a cancel_tick() call before the schedule_ticks call in SysTimer::suspend. I didn't investigate further after solving my issue, so I'm afraid I don't have a reproducer ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevew817 I'm still unable to reproduce this problem, but I created #7027 just to be safe. Could you provide further details on how to replicate this failure (such as git shas used in cloud-client-example and mbed-os along with config, any local mods you made and the hardware you were using) or an example application?
@c1728p9 Please take a look at this and if this needs a fix and/or update, we can take a fix in for RC2. CC: @ARMmbed/mbed-os-maintainers |
Change tickless handling to use public RTX calls and to prevent unnecessary calls to suspend/resume by looping until the kernel needs to be resumed.