-
Notifications
You must be signed in to change notification settings - Fork 3k
mbed_error.c: Better HW fault exceptions and stack dump #11332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@andrewc-arm, thank you for your changes. |
Please fix astyle errors |
#if defined(__CORTEX_M) | ||
mbed_fault_context_t *mfc = (mbed_fault_context_t *)error_value; | ||
current_error_ctx.error_address = (uint32_t)mfc->PC_reg; | ||
current_error_ctx.thread_current_sp = (uint32_t)mfc->SP_reg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Thread current SP" would actually be PSP
.
SP_reg
will be either thread SP (PSP) if the exception happened in thread mode, or main SP (MSP) if the exception happened in handler mode. Second case is less likely but possible.
I think you want PSP
specifically here, as you are going to show it as part of the "current thread" line, next to the pointer for the thread base, as if intending to compare them.
(Although it doesn't matter that much, as the fault handler should have printed all of SP, MSP and PSP anyway before entering us, right?)
Not sure the warning is the way to go - we have no portable non-M fault handler, so how could any using a Cortex-A "implement a non Cortex-M handler" in this file to silence it? I would just let it use the existing code, and skip the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kjbracey-arm
I think you want
PSP
specifically here,
Please note that there is an adjustment of PSP/MSP on this assembly code. It looks like the assembly code is adjusting the PSP/MSP value offset caused by calling Fault_Handler
. I experimented with gdb and made sure that SP_reg
value is the exact same with SP
register value just before the crash. If you observe teraterm_20190826.1656.log, you can see that SP
and PSP
are different.
[2019-08-26 16:57:01.354] SP : 20002D30 <== correct value just before the crash
[2019-08-26 16:57:01.371] LR : 000017DF
[2019-08-26 16:57:01.389] PC : 000017C4
[2019-08-26 16:57:01.403] xPSR : 21000000
[2019-08-26 16:57:01.421] PSP : 20002CC8 <== a bit tainted value due to Fault_Handler
[2019-08-26 16:57:01.438] MSP : 2002FFC0
(Although it doesn't matter that much, as the fault handler should have printed all of SP, MSP and PSP anyway before entering us, right?)
Yes. We print all the context as shown above. And we print this mode information as well.
[2019-08-26 16:57:01.571] Mode : Thread
[2019-08-26 16:57:01.586] Priv : Privileged
[2019-08-26 16:57:01.604] Stack: PSP
I would just let it use the existing code, and skip the warning.
OK. I removed the warning in bdfef3f.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that complicates a bit. So SP
is a better value for "thread SP" than PSP
if the exception was from thread mode. But if the exception was from handler mode, then SP
is a better value for "handler SP" than MSP
.
Then if you're looking for "thread stack pointer" specifically, then you'd want to take SP
if from exception was from thread mode, or PSP
if from handler mode. So you need to look at the SPSEL
bit of the exception return value.
(I've been around all this stuff working on exception handling for pyOCD, which has to do the same sort of reconstruction logic, trying to cover all cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kjbracey-arm
Okay, that complicates a bit. So
SP
is a better value for "thread SP" thanPSP
if the exception was from thread mode. But if the exception was from handler mode, thenSP
is a better value for "handler SP" thanMSP
.
Yes. I believe SP
is the better value in both cases. If that's what you meant.
In case of thread mode, PSP
is provided as a source to SP
via except.S#L112. In case of handler mode, MSP
is provided as source to SP
via except.S#L107. Our assembly already checks EXC_RETURN
from except.S#L110.
For stack dump feature upcoming, I will check the mode and dump the right PSP/MSP. For this PR, I believe providing SP
as it is a good choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of handler mode,
MSP
is provided as source toSP
via except.S#L107.
So in that case SP
is not a thread stack pointer at all. If you're not able to do adaptive behaviour here, then a fixed choice of SP
is better to get precision for the common case of exception in thread mode, so it's the 1 to go for, even though it's worse for the rare case of exception in handler mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I understand your point. You mean this printout
Current Thread: main Id: 0x20000E50 Entry: 0x3D13 StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x20002D30
should also be changed to correctly show the handler stack instead of the thread stack. I will test with handler mode crashes and see if I can improve today.
@0xc0170 , please hold off merging. 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current thread information is still useful in that case.
With the thread dumping info on (I'm not sure why it isn't by default), it would be the missing (and most important piece) of the full thread dump. The thread information extracted from the OS is only up-to-date for the non-active threads - we have to get the current thread's from the registers.
But if the crash is actually in handler mode, it could either be due to a random interrupt thing, totally unconnected to the current thread, or it could be due to something the current thread did (maybe enabling interrupts at a bad time, or some bug in a supervisor call to RTOS).
So it would make sense to show both current thread and handler mode info when crashing in handler mode. Either or both could be important.
Again - look at pyOCD, and this is basically how it displays info with its thread awareness.
It shows:
- Current backtrace for current stack (by examining SP).
- If in handler mode (main stack), also show thread and backtrace for process stack (by examining
MSPPSP). - If RTOS awareness, also show thread and backtrace for other non-current RTOS threads (by examining RTOS global state structures)
So the handler mode stack appears whenever stopped in handler mode, else you don't see it and only see threads.
This then ties in with pyocd/pyOCD#430, which I really should get merged, which makes it show the aborting thread by default when stopping in the fault handler - at present, it shows you as having "SIGSEGV at fault handler", rather than the more logical "SIGSEGV at main thread location". The first is technically true at a hardware level, because it has entered the fault handler by the time the debugger gets there, but it's not the way that signals are supposed to be reported in GDB. (It's not as if your thing here is going to say hard fault at fault handler, is it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kjbracey-arm
Please review the original description of this request.
- Stack dump is enabled via
"platform.stack-dump-enabled": true
mbed_app.json configuration.- If stack dump is used in conjunction with
"platform.error-all-threads-info": true
, all the threads' stack will be dumped to provide holistic overview of all the threads.
Now with above two configs set to true, Mbed-OS will print the whole stack including the handler mode
's stack.
Also, note how handler mode
's printout is cleaned up like this.
Current Thread: <handler> Id: 0x0 Entry: 0x0 StackSize: 0xE4 StackMem: 0x2002FF18 SP: 0x2002FF18
There is no thread id in handler and there is no entry. Or if you know, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want a handler pseudo-thread (and that's a reasonable way to present it), it should be in addition to the current OS thread, rather than replacing it (like pyOCD). If I read the current code correctly, if I pass a bad pointer to osMutexLock
, then we could enter the SVC handler, then hard fault inside it, and we wouldn't see the current thread that did that. We'd see "handler" and all other threads, and no backtrace showing the mutex lock at all.
(You should be able to test literally that scenario - do something like osMutexLock((osMutexId_t) 0xDEADDEAD)
).
In the handler crash scenario, we don't know whether it's the handler's fault (eg bad interrupt handler) or the current thread's fault, like that mutex lock. So we should try to get both backtraces.
bdfef3f
to
ba46e09
Compare
Let us know once ready |
ba46e09
to
a04f9a3
Compare
@0xc0170 and @kjbracey-arm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to catch interrupt handler crashes, but please consider the inside-RTOS-call crash scenario.
There are actually two possible scenarios there - fault inside an RTOS call as I describe above, or a deliberate error traps when an RTOS handler calls mbed_error, eg on mutex lock failure return.
platform/mbed_lib.json
Outdated
"stack-dump-enabled": { | ||
"macro_name": "MBED_STACK_DUMP_ENABLED", | ||
"help": "Set to 1 or true to enable stack dump.", | ||
"value": null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're following the bad pattern of the things above, but options like this should be true
or false
. null
is undefined. Let's get this new one right, and fix the others later. (They've come up an issue already - I think they actually use ifdef
so interpret false
as "on"?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I changed with 1e601e5.
#if defined(__CORTEX_M) | ||
mbed_fault_context_t *mfc = (mbed_fault_context_t *)error_value; | ||
current_error_ctx.error_address = (uint32_t)mfc->PC_reg; | ||
current_error_ctx.thread_current_sp = (uint32_t)mfc->SP_reg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want a handler pseudo-thread (and that's a reasonable way to present it), it should be in addition to the current OS thread, rather than replacing it (like pyOCD). If I read the current code correctly, if I pass a bad pointer to osMutexLock
, then we could enter the SVC handler, then hard fault inside it, and we wouldn't see the current thread that did that. We'd see "handler" and all other threads, and no backtrace showing the mutex lock at all.
(You should be able to test literally that scenario - do something like osMutexLock((osMutexId_t) 0xDEADDEAD)
).
In the handler crash scenario, we don't know whether it's the handler's fault (eg bad interrupt handler) or the current thread's fault, like that mutex lock. So we should try to get both backtraces.
a04f9a3
to
bde6ed0
Compare
@kjbracey-arm |
Hmm. pyOCD handler failed to get that right for SVC-calling thread - looks like correct PC was extracted but SP must be wrong. ("Info registers" after the thread command will show the SP+PC that pyOCD is telling GDB for that thread, and they're GDB's backtrace key). Existing issue: we shouldn't need to print the "next" thread - it should be a thread that's already referenced by another pointer. (That field is used as an input to the assembly thread switching code - it takes action when it sees next != curr). I note that the Mbed OS thread display isn't showing a "PC" for each thread, only an SP. I guess that is why the user would have to figure that out themselves? You should be able to concoct an SP+PC pair for each, so all can be backtraced. (Again, there are 4 different cases to get the SP+PC - handler, current thread, current thread when we're in a handler, non-current threads). Looking at your examples, I'm not clear where the current thread is in the ticker crash. Sure it's not relevant here, but I'm not sure how the dump is deciding that. Abort in IRQ handler shouldn't necessarily look different from fault in SVC handler - should be possible to see what we interrupted. Oh, is it that we're tickless and the kernel is suspended, so there is no "current" thread at the wake-from-sleep, just a ready one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have general comments, but this is fine as is to merge now, and further refinement can follow.
CI started |
Test run: FAILEDSummary: 3 of 4 test jobs failed Failed test jobs:
|
@kjbracey-arm @0xc0170 |
CI restarted |
Test run: FAILEDSummary: 4 of 4 test jobs failed Failed test jobs:
|
Test run: FAILEDSummary: 3 of 4 test jobs failed Failed test jobs:
|
@0xc0170 |
0195145
to
a769b7d
Compare
CI restarted |
Test run: SUCCESSSummary: 11 of 11 test jobs passed |
@0xc0170 |
@0xc0170 |
CI restarted |
Test run: FAILEDSummary: 1 of 4 test jobs failed Failed test jobs:
|
CI restarted (internal fault) |
Test run: SUCCESSSummary: 11 of 11 test jobs passed |
Summary
error-all-threads-info
.Description
Why required
Wrong
print_error_report()
resultWithout this fix, when
MBED_ERROR_MEMMANAGE_EXCEPTION
,MBED_ERROR_BUSFAULT_EXCEPTION
,MBED_ERROR_USAGEFAULT_EXCEPTION
andMBED_ERROR_HARDFAULT_EXCEPTION
happen,mbed_error()
printout is wrong. Specifically,Location:
andSP:
fields are wrong. Before this fix,Location:
is location of mbed_fault_handler() which is not the crashed point.SP:
is not the correct stack pointer of crashed stack.For your reference, here is an excerpt of wrong fault error report.
After the code change, I verified with gdb - pyOCD that I can get the correct
Location:
andSP:
field values.Original log file: teraterm_20190826.1656.log
Please note how
Location:
value leads to the correctmain.cpp
file instead ofmbed_fault_handler.c
file. Also note howLR:
value leads to the correct caller of crash point.Stack dump feature
arm-none-eabi-addr2line
tool. Please search this thread withaddr2line
for examples.What's the change
mbed_fault_handler()
will callmbed_error()
function with correctmbed_fault_context_t*
argument. [dependency]handle_error()
function will differentiateMBED_ERROR_MEMMANAGE_EXCEPTION
,MBED_ERROR_BUSFAULT_EXCEPTION
,MBED_ERROR_USAGEFAULT_EXCEPTION
andMBED_ERROR_HARDFAULT_EXCEPTION
cases and use the givenmbed_fault_context_t*
parameter to fill in correct values tocurrent_error_ctx
.handle_error()
function will handle thehandler mode
to store the correct stack/SP values tocurrent_error_ctx
."platform.stack-dump-enabled": true
mbed_app.json configuration."platform.error-all-threads-info": true
, all the threads' stack will be dumped to provide holistic overview of all the threads.Test Results
thread mode
HW crash testthread mode
HW crash is made with this main_thread_mode_crash.cpp.txt code.This is the true callstack trace acquired just before the crash.
And this is the stack dump result.
NOTE: If you search
(#_number_)
(which I manually added) you will be able to confirm that the stack dump matches with true callstack information.handler mode
HW crash testhandler mode
HW crash is made with this main_handler_mode_crash.cpp.txt code.This is the true callstack trace acquired just before the crash.
And this is the stack dump result.
NOTE: If you search
(#_number_)
(which I manually added) you will be able to confirm that the stack dump matches with true callstack information.handler mode
RTX error testhandler mode
RTX error is made with this main_rtx_handler_error.cpp.txt code.This is the callstack trace acquired just before the crash.
Please note how even gdb is not helpful much.
And this is the stack dump result after andrewc-arm@0195145.
This time, it's a bit tricky because MSP stack dump will not tell you where the exact failure is. The user will need to decode MSP stack and then try PSP stack decoding like following.
Please note how
main.cpp:34
is found with given main.cpp code listed above.Pull request type
Reviewers
Release Notes