mbed_error.c: Better HW fault exceptions and stack dump #11332


Merged
merged 19 commits into from
Sep 6, 2019

Conversation

andrewc-arm
Contributor

@andrewc-arm andrewc-arm commented Aug 26, 2019

Summary

  • If you enable the following configs in mbed_app.json, your crash report will include stack dumps of all possible threads, including the handler stack.
"platform.stack-dump-enabled": true,
"platform.error-all-threads-info": true
  • The stack dumps can assist crash debugging holistically.
  • The possible downside is a very long crash log. If you want to view only the current stack dump, disable error-all-threads-info.
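
For reference, the two options above sit in mbed_app.json's target_overrides section, as in this minimal sketch:

```json
{
    "target_overrides": {
        "*": {
            "platform.stack-dump-enabled": true,
            "platform.error-all-threads-info": true
        }
    }
}
```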

Description

Why required

Wrong print_error_report() result

Without this fix, when MBED_ERROR_MEMMANAGE_EXCEPTION, MBED_ERROR_BUSFAULT_EXCEPTION, MBED_ERROR_USAGEFAULT_EXCEPTION or MBED_ERROR_HARDFAULT_EXCEPTION occurs, the mbed_error() printout is wrong. Specifically, the Location: and SP: fields are wrong. Before this fix:

  • Location: is the address of mbed_fault_handler(), which is not the crash point.
  • SP: is not the stack pointer of the crashed stack.

For your reference, here is an excerpt of the wrong fault error report.

[2019-08-23 15:35:31.293] ++ MbedOS Error Info ++
[2019-08-23 15:35:31.317] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-08-23 15:35:31.365] Error Message: Fault exception
[2019-08-23 15:35:31.397] Location: 0x22BF
[2019-08-23 15:35:31.416] Error Value: 0x8038
[2019-08-23 15:35:31.435] Current Thread: main  Id: 0x20000E10 Entry: 0x374F StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x2002FF50 

arm-none-eabi-addr2line 0x22BF -e .\BUILD\K64F\GCC_ARM-DEBUG\mbed-os-example-blinky.elf
C:\Andrew\code\mbed-os-example-blinky/.\mbed-os\platform\source\TARGET_CORTEX_M/mbed_fault_handler.c:57

After the code change, I verified with gdb via pyOCD that the Location: and SP: field values are correct.

[2019-08-26 16:57:01.648] ++ MbedOS Error Info ++
[2019-08-26 16:57:01.673] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-08-26 16:57:01.722] Error Message: Fault exception
[2019-08-26 16:57:01.755] Location: 0x17C4
[2019-08-26 16:57:01.774] Error Value: 0x1FFF0400
[2019-08-26 16:57:01.797] Current Thread: main  Id: 0x20000E50 Entry: 0x3D13 StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x20002D30 

C:\Andrew\code\mbed-os-example-blinky>arm-none-eabi-addr2line 0x17C4 -e .\BUILD\K64F\GCC_ARM-DEBUG\mbed-os-example-blinky.elf
C:\Andrew\code\mbed-os-example-blinky/.\main.cpp:21

Original log file: teraterm_20190826.1656.log
Please note how the Location: value now leads to the correct main.cpp file instead of the mbed_fault_handler.c file. Also note how the LR: value leads to the correct caller of the crash point.

Stack dump feature
  • Previously, Mbed users had to reproduce crashes and errors to figure out the root cause of a failure. Now, with this code change, we can look at the crash report log and tell where the root failure is with much higher probability.
  • For example, if the user hits a crash that is very hard to reproduce (say 1 in a million), no one will be able to reproduce the problem to debug it, or it will take considerable effort and time to do so. In such cases, this stack dump feature can help.
  • Of course, you still need to recover the callstack from the stack dump. Right now there is no automated tool/script for this, but we can at least manually acquire the exact code line using the arm-none-eabi-addr2line tool. Please search this thread for addr2line for examples.

What's the change

  • The mbed_fault_handler() will call the mbed_error() function with the correct mbed_fault_context_t* argument. [dependency]
  • The handle_error() function will differentiate the MBED_ERROR_MEMMANAGE_EXCEPTION, MBED_ERROR_BUSFAULT_EXCEPTION, MBED_ERROR_USAGEFAULT_EXCEPTION and MBED_ERROR_HARDFAULT_EXCEPTION cases and use the given mbed_fault_context_t* parameter to fill the correct values into current_error_ctx.
    • The handle_error() function will also handle handler mode so that the correct stack/SP values are stored in current_error_ctx.
  • Stack dump is enabled via the "platform.stack-dump-enabled": true mbed_app.json configuration.
  • If the stack dump is used in conjunction with "platform.error-all-threads-info": true, all threads' stacks will be dumped to provide a holistic overview of all threads.

Test Results

thread mode HW crash test

The thread mode HW crash is triggered with this main_thread_mode_crash.cpp.txt code.
This is the true callstack trace acquired just before the crash.

(gdb) bt
#0  0x00001874 in generate_bus_fault_unaligned_access () at .\main.cpp:24
#1  0x00001892 in func6 (debugP=debugP@entry=1, param2=param2@entry=777) at .\main.cpp:32
#2  0x00001862 in func5 (debugP=debugP@entry=1) at .\main.cpp:37
#3  0x000018a0 in func4 (str=str@entry=0x20002f40 <_main_stack+3808> "\250\v") at .\main.cpp:43
#4  0x000018ac in func3 (f=f@entry=0.333333343) at .\main.cpp:50
#5  0x000018d0 in func2 (a=a@entry=1, b=b@entry=3) at .\main.cpp:55
#6  0x000018dc in func1 () at .\main.cpp:60
#7  0x000019d6 in main () at .\main.cpp:80
(gdb) info registers
r0             0x1      1
r1             0x30a    778
r2             0xe000ed00       -536810240
r3             0xaaa3   43683
r4             0x0      0
r5             0x20000be8       536873960
r6             0x0      0
r7             0x0      0
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x0      0
r12            0xffffffff       -1
sp             0x20002f18       0x20002f18 <_main_stack+3768>
lr             0x1893   6291
pc             0x1874   0x1874 <generate_bus_fault_unaligned_access()+16>
xpsr           0x61070000       1627848704

And this is the stack dump result.

[2019-09-03 17:00:55.287] ++ MbedOS Fault Handler ++
[2019-09-03 17:00:55.315] 
[2019-09-03 17:00:55.332] FaultType: HardFault
[2019-09-03 17:00:55.338] 
[2019-09-03 17:00:55.353] Context:
[2019-09-03 17:00:55.355] R0   : 00000001
[2019-09-03 17:00:55.367] R1   : 0000030A
[2019-09-03 17:00:55.390] R2   : E000ED00
[2019-09-03 17:00:55.399] R3   : 0000AAA3
[2019-09-03 17:00:55.416] R4   : 00000000
[2019-09-03 17:00:55.441] R5   : 20000BE8
[2019-09-03 17:00:55.449] R6   : 00000000
[2019-09-03 17:00:55.475] R7   : 00000000
[2019-09-03 17:00:55.490] R8   : 00000000
[2019-09-03 17:00:55.500] R9   : 00000000
[2019-09-03 17:00:55.515] R10  : 00000000
[2019-09-03 17:00:55.540] R11  : 00000000
[2019-09-03 17:00:55.557] R12  : FFFFFFFF
[2019-09-03 17:00:55.565] SP   : 20002F18
[2019-09-03 17:00:55.585] LR   : 00001893
[2019-09-03 17:00:55.608] PC   : 00001874
[2019-09-03 17:00:55.616] xPSR : 61070000
[2019-09-03 17:00:55.632] PSP  : 20002EB0
[2019-09-03 17:00:55.649] MSP  : 2002FFC0
[2019-09-03 17:00:55.665] CPUID: 410FC241
[2019-09-03 17:00:55.682] HFSR : 40000000
[2019-09-03 17:00:55.708] MMFSR: 00000000
[2019-09-03 17:00:55.714] BFSR : 00000000
[2019-09-03 17:00:55.731] UFSR : 00000100
[2019-09-03 17:00:55.757] DFSR : 00000010
[2019-09-03 17:00:55.767] AFSR : 00000000
[2019-09-03 17:00:55.791] Mode : Thread
[2019-09-03 17:00:55.796] Priv : Privileged
[2019-09-03 17:00:55.825] Stack: PSP
[2019-09-03 17:00:55.827] 
[2019-09-03 17:00:55.843] -- MbedOS Fault Handler --
[2019-09-03 17:00:55.857] 
[2019-09-03 17:00:55.857] 
[2019-09-03 17:00:55.858] 
[2019-09-03 17:00:55.872] ++ MbedOS Error Info ++
[2019-09-03 17:00:55.885] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-09-03 17:00:55.942] Error Message: Fault exception
[2019-09-03 17:00:55.975] Location: 0x1874 (#0)
[2019-09-03 17:00:55.984] Error Value: 0x1FFF0400
[2019-09-03 17:00:56.008] Current Thread: main  Id: 0x20000E90 Entry: 0x40B3 StackSize: 0x1000 StackMem: 0x20002060 SP: 0x20002F18 
[2019-09-03 17:00:56.119] 
[2019-09-03 17:00:56.119] Stack Dump:
[2019-09-03 17:00:56.136] 0x20002F18:0x00000000 0x00001893 (#1) 0x40037000 0x00000000 0x00000000 0x00001863 (#2) 0x0000179D 0x000018A1 (#3)
[2019-09-03 17:00:56.236] 0x20002F38:0x0000179D 0x000018AD (#4) 0x20000BA8 0x20000BA8 0x00000000 0x00001C85 0x00000000 0x00003003 
[2019-09-03 17:00:56.339] 0x20002F58:0x00000000 0x00001CDB 0x00000004 0x000018D1 (#5) 0x0000179D 0x000018DD (#6) 0x0000179D 0x000019D7 (#7)
[2019-09-03 17:00:56.444] 0x20002F78:0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 
[2019-09-03 17:00:56.550] 0x20002F98:0xCCCCCCCC 0xCCCCCCCC 0x0000006C 0x0000630F 0x00000000 0x0000E514 0x20000C30 0x70000000 
[2019-09-03 17:00:56.661] 0x20002FB8:0x000F423F 0x410FC241 0x00000002 0x0000EB8C 0x20000000 0x1FFF0000 0x00000000 0x00000000 
[2019-09-03 17:00:56.759] 0x20002FD8:0x00030000 0x00010000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 17:00:56.861] 0x20002FF8:0x00100000 0x00000000 0x00000000 0x00000000 0x1FFF0510 0x200008A8 0x00002710 0x00001D17 
[2019-09-03 17:00:56.972] 0x20003018:0x000017A1 0x00000000 0x00000000 0x0000E190 0x00000002 0x00000000 0x00000000 0x00000000 
[2019-09-03 17:00:57.070] 0x20003038:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00004053 0x00000001 0x000040C5 
[2019-09-03 17:00:57.176] 0x20003058:0x00000000 0x00005925 0x00008000 
[2019-09-03 17:00:57.220] 
[2019-09-03 17:00:57.235] Next:
[2019-09-03 17:00:57.240] main  State: 0x2 Entry: 0x000040B3 Stack Size: 0x00001000 Mem: 0x20002060 SP: 0x20003020
[2019-09-03 17:00:57.322] 
[2019-09-03 17:00:57.323] Stack Dump:
[2019-09-03 17:00:57.334] 0x20003020:0x00000000 0x0000E190 0x00000002 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 17:00:57.439] 0x20003040:0x00000000 0x00000000 0x00000000 0x00004053 0x00000001 0x000040C5 0x00000000 0x00005925 
[2019-09-03 17:00:57.543] 0x20003060:0x00008000 
[2019-09-03 17:00:57.573] 
[2019-09-03 17:00:57.575] Ready:
[2019-09-03 17:00:57.575] rtx_idle  State: 0x1 Entry: 0x000041E9 Stack Size: 0x00000200 Mem: 0x20001328 SP: 0x200014E8
[2019-09-03 17:00:57.672] 
[2019-09-03 17:00:57.673] Stack Dump:
[2019-09-03 17:00:57.685] 0x200014E8:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 17:00:57.790] 0x20001508:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00005925 0x000041E9 0x01000000 
[2019-09-03 17:00:57.892] 0x20001528:0x00000000 
[2019-09-03 17:00:57.916] 
[2019-09-03 17:00:57.931] Wait:
[2019-09-03 17:00:57.933] rtx_timer  State: 0x83 Entry: 0x00006059 Stack Size: 0x00000300 Mem: 0x20001028 SP: 0x200012C0
[2019-09-03 17:00:58.027] 
[2019-09-03 17:00:58.027] Stack Dump:
[2019-09-03 17:00:58.041] 0x200012C0:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 17:00:58.141] 0x200012E0:0xFFFFFFFE 0x20000F9C 0x20001318 0x00000000 0x00004F11 0x00006081 0x000050EE 0x41000000 
[2019-09-03 17:00:58.248] 0x20001300:0xFFFFFFFF 0x00000000 0x00000000 0x00000000 0x00000000 0x00006081 0x00000000 0x00005925 
[2019-09-03 17:00:58.353] 0x20001320:0x00006059 0x00005925 0xE25A2EA5 
[2019-09-03 17:00:58.406] 
[2019-09-03 17:00:58.406] Delay:
[2019-09-03 17:00:58.409] For more info, visit: https://mbed.com/s/error?error=0x80FF013D&osver=999999&core=0x410FC241&comp=2&ver=60300&tgt=K64F
[2019-09-03 17:00:58.529] -- MbedOS Error Info --

NOTE: If you search for the (#_number_) markers (which I added manually), you can confirm that the stack dump matches the true callstack information.

handler mode HW crash test

The handler mode HW crash is triggered with this main_handler_mode_crash.cpp.txt code.
This is the true callstack trace acquired just before the crash.

(gdb) bt
#0  0x00001858 in generate_bus_fault_unaligned_access () at .\main.cpp:24
#1  0x00001876 in func6 (debugP=debugP@entry=1, param2=param2@entry=777) at .\main.cpp:32
#2  0x00001846 in func5 (debugP=debugP@entry=1) at .\main.cpp:37
#3  0x00001884 in func4 (str=str@entry=0x2002ff40 "") at .\main.cpp:43
#4  0x00001890 in func3 (f=f@entry=0.333333343) at .\main.cpp:50
#5  0x000018b4 in func2 (a=a@entry=1, b=b@entry=3) at .\main.cpp:55
#6  0x000018c0 in func1 () at .\main.cpp:60
#7  0x000018c8 in flip () at .\main.cpp:66
#8  0x00001792 in mbed::Callback<void ()>::function_call<void (*)()>(void const*) (p=<optimized out>)
    at ./mbed-os/platform/Callback.h:655
#9  0x00001b90 in mbed::Callback<void ()>::call() const (this=0x20000bd0 <flipper+40>)
    at ./mbed-os/platform/Callback.h:573
#10 mbed::Callback<void ()>::operator()() const (this=0x20000bd0 <flipper+40>) at ./mbed-os/platform/Callback.h:580
#11 mbed::Ticker::handler (this=0x20000ba8 <flipper>) at .\mbed-os\drivers\source\Ticker.cpp:62
#12 0x00001cda in mbed::TimerEvent::irq (id=<optimized out>) at .\mbed-os\drivers\source\TimerEvent.cpp:37
#13 0x000023a4 in ticker_irq_handler (ticker=0xe4fc <us_data>) at .\mbed-os\hal\mbed_ticker_api.c:314
#14 0x000024b8 in us_ticker_irq_handler () at .\mbed-os\hal\mbed_us_ticker_api.c:101
#15 0x0000746a in pit_isr ()
    at .\mbed-os\targets\TARGET_Freescale\TARGET_MCUXpresso_MCUS\TARGET_MCU_K64F\us_ticker.c:41
#16 <signal handler called>
   0xfffffffd:  Cannot access memory at address 0xfffffffc
#17 osKernelStart () at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_kernel.c:565
#18 0x00004102 in mbed_rtos_start () at .\mbed-os\rtos\source\TARGET_CORTEX\mbed_rtos_rtx.c:116
#19 0x00003fba in software_init_hook ()
    at .\mbed-os\rtos\source\TARGET_CORTEX\TOOLCHAIN_GCC_ARM\mbed_boot_gcc_arm.c:52
#20 0x000004da in _start ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) info registers
r0             0x1      1
r1             0x30a    778
r2             0xe000ed00       -536810240
r3             0xaaa3   43683
r4             0x22f457 2290775
r5             0x0      0
r6             0x20000ba8       536873896
r7             0x42bd   17085
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x0      0
r12            0x3d07   15623
sp             0x2002ff18       0x2002ff18
lr             0x1877   6263
pc             0x1858   0x1858 <generate_bus_fault_unaligned_access()+16>
xpsr           0x61000043       1627390019

And this is the stack dump result.

[2019-09-03 16:43:55.102] ++ MbedOS Fault Handler ++
[2019-09-03 16:43:55.128] 
[2019-09-03 16:43:55.129] FaultType: HardFault
[2019-09-03 16:43:55.151] 
[2019-09-03 16:43:55.152] Context:
[2019-09-03 16:43:55.163] R0   : 00000001
[2019-09-03 16:43:55.179] R1   : 0000030A
[2019-09-03 16:43:55.195] R2   : E000ED00
[2019-09-03 16:43:55.212] R3   : 0000AAA3
[2019-09-03 16:43:55.228] R4   : 0022F457
[2019-09-03 16:43:55.245] R5   : 00000000
[2019-09-03 16:43:55.263] R6   : 20000BA8
[2019-09-03 16:43:55.278] R7   : 000042BD
[2019-09-03 16:43:55.295] R8   : 00000000
[2019-09-03 16:43:55.313] R9   : 00000000
[2019-09-03 16:43:55.328] R10  : 00000000
[2019-09-03 16:43:55.345] R11  : 00000000
[2019-09-03 16:43:55.363] R12  : 00003D07
[2019-09-03 16:43:55.378] SP   : 2002FF18
[2019-09-03 16:43:55.396] LR   : 00001877
[2019-09-03 16:43:55.414] PC   : 00001858
[2019-09-03 16:43:55.428] xPSR : 61000043
[2019-09-03 16:43:55.448] PSP  : 20001498
[2019-09-03 16:43:55.461] MSP  : 2002FEB0
[2019-09-03 16:43:55.477] CPUID: 410FC241
[2019-09-03 16:43:55.497] HFSR : 40000000
[2019-09-03 16:43:55.514] MMFSR: 00000000
[2019-09-03 16:43:55.529] BFSR : 00000000
[2019-09-03 16:43:55.546] UFSR : 00000100
[2019-09-03 16:43:55.562] DFSR : 00000010
[2019-09-03 16:43:55.580] AFSR : 00000000
[2019-09-03 16:43:55.595] Mode : Handler
[2019-09-03 16:43:55.614] Priv : Privileged
[2019-09-03 16:43:55.629] Stack: MSP
[2019-09-03 16:43:55.644] 
[2019-09-03 16:43:55.644] -- MbedOS Fault Handler --
[2019-09-03 16:43:55.670] 
[2019-09-03 16:43:55.671] 
[2019-09-03 16:43:55.671] 
[2019-09-03 16:43:55.675] ++ MbedOS Error Info ++
[2019-09-03 16:43:55.699] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-09-03 16:43:55.747] Error Message: Fault exception
[2019-09-03 16:43:55.781] Location: 0x1858 (#0)
[2019-09-03 16:43:55.798] Error Value: 0x1FFF0400
[2019-09-03 16:43:55.823] Current Thread: <handler>  Id: 0x0 Entry: 0x0 StackSize: 0xE4 StackMem: 0x2002FF18 SP: 0x2002FF18 
[2019-09-03 16:43:55.927] 
[2019-09-03 16:43:55.927] Stack Dump:
[2019-09-03 16:43:55.939] 0x2002FF18:0x00000000 0x00001877 (#1) 0x00000000 0x00000000 0x2002FF30 0x00001847 (#2) 0x000018C3 0x00001885 (#3) 
[2019-09-03 16:43:56.043] 0x2002FF38:0x000018C3 0x00001891 (#4) 0x00000000 0x20000BB0 0x0000E4FC 0x00000000 0x004178D7 0x00000000 
[2019-09-03 16:43:56.151] 0x2002FF58:0x00000001 0x00002417 0x00000000 0x000018B5 (#5) 0x000018C3 0x000018C1 (#6) 0x000018C3 0x000018C9 (#7) 
[2019-09-03 16:43:56.252] 0x2002FF78:0x000018C3 0x00001793 (#8) 0x0000178D 0x00001B91 (#9) 0x0000E4FC 0x20000C30 0x20000BB0 0x00001CDB (#12)
[2019-09-03 16:43:56.354] 0x2002FF98:0x00001CD3 0x000023A5 (#13) 0x200014D4 0x00000000 0x000001F4 0x000024B9 (#14) 0x40037000 0x0000746B (#15)
[2019-09-03 16:43:56.459] 0x2002FFB8:0x00000000 0xFFFFFFFD (#16) 0x00000000 0x00000000 0x20002034 0x00000000 0x00004615 0x00004103 (#18)
[2019-09-03 16:43:56.564] 0x2002FFD8:0x0000483E 0x41000000 0x20002058 0x00004103 0x40000001 0x00004097 0x0002CAA8 0x00003FBB (#19)
[2019-09-03 16:43:56.668] 0x2002FFF8:0x00003F95 0x000004DB (#20)
[2019-09-03 16:43:56.703] 
[2019-09-03 16:43:56.704] Next:
[2019-09-03 16:43:56.712] rtx_idle  State: 0x2 Entry: 0x000041D1 Stack Size: 0x00000200 Mem: 0x20001328 SP: 0x200014C8
[2019-09-03 16:43:56.807] 
[2019-09-03 16:43:56.821] Stack Dump:
[2019-09-03 16:43:56.823] 0x200014C8:0x200014E0 0x00002839 0x00000000 0x00002FEB 0x0000070B 0x000028B3 0x000008FF 0x00000000 
[2019-09-03 16:43:56.926] 0x200014E8:0x000042BD 0x00000000 0x0000070B 0x000028FB 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:57.029] 0x20001508:0x00000000 0x000042DB 0x000042CD 0x000043AD 0x00000000 0x000041D7 0x00000000 0x0000590D 
[2019-09-03 16:43:57.133] 0x20001528:0x00000000 
[2019-09-03 16:43:57.158] 
[2019-09-03 16:43:57.160] Ready:
[2019-09-03 16:43:57.169] Wait:
[2019-09-03 16:43:57.179] rtx_timer  State: 0x83 Entry: 0x00006041 Stack Size: 0x00000300 Mem: 0x20001028 SP: 0x200012C0
[2019-09-03 16:43:57.273] 
[2019-09-03 16:43:57.273] Stack Dump:
[2019-09-03 16:43:57.294] 0x200012C0:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:57.389] 0x200012E0:0xFFFFFFFE 0x20000F9C 0x20001318 0x00000000 0x00004EF9 0x00006069 0x000050D6 0x41000000 
[2019-09-03 16:43:57.497] 0x20001300:0xFFFFFFFF 0x00000000 0x00000000 0x00000000 0x00000000 0x00006069 0x00000000 0x0000590D 
[2019-09-03 16:43:57.598] 0x20001320:0x00006041 0x0000590D 0xE25A2EA5 
[2019-09-03 16:43:57.644] 
[2019-09-03 16:43:57.658] Delay:
[2019-09-03 16:43:57.661] main  State: 0x13 Entry: 0x0000409B Stack Size: 0x00001000 Mem: 0x20002060 SP: 0x20002E88
[2019-09-03 16:43:57.747] 
[2019-09-03 16:43:57.749] Stack Dump:
[2019-09-03 16:43:57.760] 0x20002E88:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:57.864] 0x20002EA8:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:57.968] 0x20002EC8:0x00000000 0x000001F4 0x02DB2B60 0x00000000 0x20000BA0 0x00000003 0x00079DC9 0x00000000 
[2019-09-03 16:43:58.095] 0x20002EE8:0x00000000 0x20000E90 0x00000001 0x00000000 0x000043B5 0x00006087 0x000043EA 0x210F0000 
[2019-09-03 16:43:58.177] 0x20002F08:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:58.280] 0x20002F28:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:58.384] 0x20002F48:0x00000000 0x00000000 0x00000000 0x00006087 0x00000000 0x00003E73 0x00079DC9 0x1FFF0540 
[2019-09-03 16:43:58.489] 0x20002F68:0x00000003 0x00000002 0x20000BE8 0x000019C7 0x000004C0 0x000004C0 0x000004C0 0x0002CAA8 
[2019-09-03 16:43:58.595] 0x20002F88:0x00000002 0x00000000 0x00000020 0xCCCCCCCC 0x000CC550 0x00000000 0x00079DC9 0x00000000 
[2019-09-03 16:43:58.697] 0x20002FA8:0x00079DC9 0x00000000 0x00000000 0x00000000 0x000F423F 0x410FC241 0x00000002 0x0000EB8C 
[2019-09-03 16:43:58.803] 0x20002FC8:0x20000000 0x1FFF0000 0x00000000 0x00000000 0x00030000 0x00010000 0x00000000 0x00000000 
[2019-09-03 16:43:58.906] 0x20002FE8:0x00000000 0x00000000 0x00000000 0x00000000 0x00100000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:59.010] 0x20003008:0x1FFF0510 0x200008A8 0x00002710 0x00001CFF 0x000018C3 0x00000000 0x00000000 0x0000E178 
[2019-09-03 16:43:59.114] 0x20003028:0x00000002 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-03 16:43:59.220] 0x20003048:0x00000000 0x0000403B 0x00000001 0x000040AD 0x00000000 0x0000590D 0x00008000 
[2019-09-03 16:43:59.310] 
[2019-09-03 16:43:59.311] For more info, visit: https://mbed.com/s/error?error=0x80FF013D&osver=999999&core=0x410FC241&comp=2&ver=60300&tgt=K64F
[2019-09-03 16:43:59.436] -- MbedOS Error Info --

NOTE: If you search for the (#_number_) markers (which I added manually), you can confirm that the stack dump matches the true callstack information.

handler mode RTX error test

The handler mode RTX error is triggered with this main_rtx_handler_error.cpp.txt code.
This is the callstack trace acquired just before the crash.

(gdb) bt
#0  0x00005276 in svcRtxMutexAcquire (mutex_id=0xdeaddead, timeout=4294967295)
    at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_mutex.c:187
#1  0x0000060e in SVC_Handler () at irq_cm4f.S:63
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info threads
  Id   Target Id         Frame
* 2    Thread 2 "Handler mode" (SVCall) 0x00005276 in svcRtxMutexAcquire (mutex_id=0xdeaddead, timeout=4294967295)
    at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_mutex.c:187
  3    Thread 536874640 "main" (Running; Priority 24) osMutexAcquire (mutex_id=<optimized out>, timeout=4294967295)
    at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_mutex.c:489
  4    Thread 536874976 "rtx_idle" (Ready; Priority 1) osRtxIdleThread (argument=0x0 <__isr_vector>)
    at .\mbed-os\rtos\source\TARGET_CORTEX\mbed_rtx_handlers.c:50
  5    Thread 536874908 "rtx_timer" (Waiting[MsgGet]; Priority 40) osMessageQueueGet (mq_id=<optimized out>,
    msg_ptr=0x20000f9c <os_timer_thread_cb>, msg_ptr@entry=0x20001318 <os_timer_thread_stack+752>,
    msg_prio=0x20001318 <os_timer_thread_stack+752> "", msg_prio@entry=0x0 <__isr_vector>, timeout=0,
    timeout@entry=4294967295) at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_msgqueue.c:872
(gdb) thread 3
[Switching to thread 3 (Thread 536874640)]
#0  osMutexAcquire (mutex_id=<optimized out>, timeout=4294967295)
    at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_mutex.c:489
489     }
=> 0x000054a2 <osMutexAcquire+34>:      08 bd   pop     {r3, pc}
(gdb) bt
#0  osMutexAcquire (mutex_id=<optimized out>, timeout=4294967295)
    at .\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source\rtx_mutex.c:489
#1  0x21010000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Please note that even gdb is not much help here.

And this is the stack dump result after andrewc-arm@0195145.

[2019-09-04 17:37:27.873] ++ MbedOS Fault Handler ++
[2019-09-04 17:37:27.900] 
[2019-09-04 17:37:27.901] FaultType: HardFault
[2019-09-04 17:37:27.931] 
[2019-09-04 17:37:27.932] Context:
[2019-09-04 17:37:27.941] R0   : DEADDEAD
[2019-09-04 17:37:27.952] R1   : FFFFFFFF
[2019-09-04 17:37:27.969] R2   : 00000001
[2019-09-04 17:37:27.983] R3   : 20000118
[2019-09-04 17:37:27.999] R4   : DEADDEAD
[2019-09-04 17:37:28.016] R5   : 20000E90
[2019-09-04 17:37:28.034] R6   : FFFFFFFF
[2019-09-04 17:37:28.049] R7   : 00000000
[2019-09-04 17:37:28.067] R8   : 00000000
[2019-09-04 17:37:28.085] R9   : 00000000
[2019-09-04 17:37:28.099] R10  : 00000000
[2019-09-04 17:37:28.117] R11  : 00000000
[2019-09-04 17:37:28.135] R12  : 00005269
[2019-09-04 17:37:28.150] SP   : 2002FFA8
[2019-09-04 17:37:28.171] LR   : 0000060F
[2019-09-04 17:37:28.183] PC   : 00005276
[2019-09-04 17:37:28.199] xPSR : 2101000B
[2019-09-04 17:37:28.217] PSP  : 20002EB0
[2019-09-04 17:37:28.233] MSP  : 2002FF88
[2019-09-04 17:37:28.251] CPUID: 410FC241
[2019-09-04 17:37:28.266] HFSR : 40000000
[2019-09-04 17:37:28.285] MMFSR: 00000000
[2019-09-04 17:37:28.300] BFSR : 00000082
[2019-09-04 17:37:28.319] UFSR : 00000000
[2019-09-04 17:37:28.333] DFSR : 00000010
[2019-09-04 17:37:28.350] AFSR : 00000000
[2019-09-04 17:37:28.368] BFAR : DEADDEAD
[2019-09-04 17:37:28.384] Mode : Handler
[2019-09-04 17:37:28.399] Priv : Privileged
[2019-09-04 17:37:28.419] Stack: MSP
[2019-09-04 17:37:28.431] 
[2019-09-04 17:37:28.431] -- MbedOS Fault Handler --
[2019-09-04 17:37:28.459] 
[2019-09-04 17:37:28.474] 
[2019-09-04 17:37:28.474] 
[2019-09-04 17:37:28.475] ++ MbedOS Error Info ++
[2019-09-04 17:37:28.488] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-09-04 17:37:28.540] Error Message: Fault exception
[2019-09-04 17:37:28.568] Location: 0x5276
[2019-09-04 17:37:28.585] Error Value: 0x1FFF0400
[2019-09-04 17:37:28.611] Current Thread: main <handler> Id: 0x20000E90 Entry: 0x413B StackSize: 0x1000 StackMem: 0x20002060 SP: 0x2002FFA8 
[2019-09-04 17:37:28.731] Stack Dump: MSP
[2019-09-04 17:37:28.749] 0x2002FFA8:0x00000000 0x20000BE8 0x00000000 0x0000060F 0x20002EB0 0xFFFFFFED 0x00000000 0x00000000 
[2019-09-04 17:37:28.854] 0x2002FFC8:0x20002034 0x00000000 0x000046B5 0x000041A3 0x000048DE 0x41000000 0x20002058 0x000041A3 
[2019-09-04 17:37:28.955] 0x2002FFE8:0x40000001 0x00004137 0x0002CAA8 0x0000405B 0x00004035 0x000004DB 
[2019-09-04 17:37:29.037] 
[2019-09-04 17:37:29.038] Stack Dump: PSP
[2019-09-04 17:37:29.055] 0x20002EB0:0xDEADDEAD 0xFFFFFFFF 0x00000001 0x00000000 0x00005269 0x00001871 0x000054A2 0x61010000 
[2019-09-04 17:37:29.160] 0x20002ED0:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-04 17:37:29.264] 0x20002EF0:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x3EAAAAAB 0x3F800000 0x40400000 
[2019-09-04 17:37:29.371] 0x20002F10:0x00000010 0x000021C3 0x00000000 0x00001871 0x40037000 0x00000000 0x00000000 0x00001887 
[2019-09-04 17:37:29.470] 0x20002F30:0x0000179D 0x00001891 0x0000179D 0x0000189D 0x20000BA8 0x20000BA8 0x00000000 0x00001C75 
[2019-09-04 17:37:29.575] 0x20002F50:0x00000000 0x00002FF3 0x00000000 0x00001CCB 0x00000004 0x000018C1 0x0000179D 0x000018CD 
[2019-09-04 17:37:29.680] 0x20002F70:0x0000179D 0x000019C7 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 
[2019-09-04 17:37:29.784] 0x20002F90:0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0x0000006C 0x00006397 0x00000000 0x0000E59C 
[2019-09-04 17:37:29.891] 0x20002FB0:0x20000C30 0x70000000 0x000F423F 0x410FC241 0x00000002 0x0000EB8C 0x20000000 0x1FFF0000 
[2019-09-04 17:37:29.992] 0x20002FD0:0x00000000 0x00000000 0x00030000 0x00010000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-04 17:37:30.097] 0x20002FF0:0x00000000 0x00000000 0x00100000 0x00000000 0x00000000 0x00000000 0x1FFF0510 0x200008A8 
[2019-09-04 17:37:30.201] 0x20003010:0x00002710 0x00001D07 0x000017A1 0x00000000 0x00000000 0x0000E218 0x00000002 0x00000000 
[2019-09-04 17:37:30.306] 0x20003030:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x000040DB 
[2019-09-04 17:37:30.410] 0x20003050:0x00000001 0x0000414D 0x00000000 0x000059AD 0x00008000 
[2019-09-04 17:37:30.477] 
[2019-09-04 17:37:30.493] Next:
[2019-09-04 17:37:30.495] main  State: 0x2 Entry: 0x0000413B Stack Size: 0x00001000 Mem: 0x20002060 SP: 0x20003020
[2019-09-04 17:37:30.580] Stack Dump: PSP
[2019-09-04 17:37:30.598] 0x20003020:0x00000000 0x0000E218 0x00000002 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-04 17:37:30.700] 0x20003040:0x00000000 0x00000000 0x00000000 0x000040DB 0x00000001 0x0000414D 0x00000000 0x000059AD 
[2019-09-04 17:37:30.802] 0x20003060:0x00008000 
[2019-09-04 17:37:30.837] 
[2019-09-04 17:37:30.837] Ready:
[2019-09-04 17:37:30.853] rtx_idle  State: 0x1 Entry: 0x00004271 Stack Size: 0x00000200 Mem: 0x20001328 SP: 0x200014E8
[2019-09-04 17:37:30.935] Stack Dump: PSP
[2019-09-04 17:37:30.952] 0x200014E8:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-04 17:37:31.052] 0x20001508:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x000059AD 0x00004271 0x01000000 
[2019-09-04 17:37:31.159] 0x20001528:0x00000000 
[2019-09-04 17:37:31.182] 
[2019-09-04 17:37:31.196] Wait:
[2019-09-04 17:37:31.200] rtx_timer  State: 0x83 Entry: 0x000060E1 Stack Size: 0x00000300 Mem: 0x20001028 SP: 0x200012C0
[2019-09-04 17:37:31.295] Stack Dump: PSP
[2019-09-04 17:37:31.303] 0x200012C0:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-04 17:37:31.410] 0x200012E0:0xFFFFFFFE 0x20000F9C 0x20001318 0x00000000 0x00004F99 0x00006109 0x00005176 0x41000000 
[2019-09-04 17:37:31.513] 0x20001300:0xFFFFFFFF 0x00000000 0x00000000 0x00000000 0x00000000 0x00006109 0x00000000 0x000059AD 
[2019-09-04 17:37:31.617] 0x20001320:0x000060E1 0x000059AD 0xE25A2EA5 
[2019-09-04 17:37:31.668] 
[2019-09-04 17:37:31.668] Delay:
[2019-09-04 17:37:31.688] For more info, visit: https://mbed.com/s/error?error=0x80FF013D&osver=999999&core=0x410FC241&comp=2&ver=60300&tgt=K64F
[2019-09-04 17:37:31.795] -- MbedOS Error Info --

This time it is a bit tricky, because the MSP stack dump alone will not tell you where the exact failure is. The user needs to decode the MSP stack first, and then try decoding the PSP stack, as follows.

arm-none-eabi-addr2line 0x00005268 -e .\BUILD\K64F\GCC_ARM-DEBUG\mbed-os-example-blinky.elf
C:\Andrew\code\mbed-os-example-blinky/.\mbed-os\rtos\source\TARGET_CORTEX\rtx5\RTX\Source/rtx_mutex.c:173
arm-none-eabi-addr2line 0x00001870 -e .\BUILD\K64F\GCC_ARM-DEBUG\mbed-os-example-blinky.elf
	C:\Andrew\code\mbed-os-example-blinky/.\main.cpp:34

Please note how main.cpp:34 is found, given the main.cpp code listed above.

Pull request type

[x] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

Release Notes

@ciarmcom ciarmcom requested review from a team August 26, 2019 09:00
@ciarmcom
Member

@andrewc-arm, thank you for your changes.
@ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

@0xc0170
Contributor

0xc0170 commented Aug 26, 2019

Please fix astyle errors

#if defined(__CORTEX_M)
mbed_fault_context_t *mfc = (mbed_fault_context_t *)error_value;
current_error_ctx.error_address = (uint32_t)mfc->PC_reg;
current_error_ctx.thread_current_sp = (uint32_t)mfc->SP_reg;
Contributor

"Thread current SP" would actually be PSP.

SP_reg will be either thread SP (PSP) if the exception happened in thread mode, or main SP (MSP) if the exception happened in handler mode. Second case is less likely but possible.

I think you want PSP specifically here, as you are going to show it as part of the "current thread" line, next to the pointer for the thread base, as if intending to compare them.

(Although it doesn't matter that much, as the fault handler should have printed all of SP, MSP and PSP anyway before entering us, right?)

Not sure the warning is the way to go - we have no portable non-M fault handler, so how could anyone using a Cortex-A "implement a non Cortex-M handler" in this file to silence it? I would just let it use the existing code, and skip the warning.

Contributor Author

@kjbracey-arm

I think you want PSP specifically here,

Please note that there is an adjustment of PSP/MSP in this assembly code. It looks like the assembly code is adjusting the PSP/MSP value offset caused by calling Fault_Handler. I experimented with gdb and confirmed that the SP_reg value is exactly the same as the SP register value just before the crash. If you observe teraterm_20190826.1656.log, you can see that SP and PSP are different.

[2019-08-26 16:57:01.354] SP   : 20002D30 <== correct value just before the crash
[2019-08-26 16:57:01.371] LR   : 000017DF
[2019-08-26 16:57:01.389] PC   : 000017C4
[2019-08-26 16:57:01.403] xPSR : 21000000
[2019-08-26 16:57:01.421] PSP  : 20002CC8 <== a bit tainted value due to Fault_Handler
[2019-08-26 16:57:01.438] MSP  : 2002FFC0

(Although it doesn't matter that much, as the fault handler should have printed all of SP, MSP and PSP anyway before entering us, right?)

Yes. We print all the context as shown above. And we print this mode information as well.

[2019-08-26 16:57:01.571] Mode : Thread
[2019-08-26 16:57:01.586] Priv : Privileged
[2019-08-26 16:57:01.604] Stack: PSP

I would just let it use the existing code, and skip the warning.

OK. I removed the warning in bdfef3f.

Thanks!

Contributor

Okay, that complicates things a bit. So SP is a better value for "thread SP" than PSP if the exception was from thread mode. But if the exception was from handler mode, then SP is a better value for "handler SP" than MSP.

Then if you're looking for the "thread stack pointer" specifically, you'd want to take SP if the exception was from thread mode, or PSP if it was from handler mode. So you need to look at the SPSEL bit of the exception return value.

(I've been around all this stuff working on exception handling for pyOCD, which has to do the same sort of reconstruction logic, trying to cover all cases).
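The SPSEL check described above can be sketched as follows. This is a minimal, host-testable sketch; `exc_return_used_psp` is a hypothetical helper for illustration, not an Mbed OS API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Mbed OS: given the EXC_RETURN value that
 * was in LR on exception entry, report which stack pointer the pre-exception
 * context was using. Bit 2 of EXC_RETURN (SPSEL) is 1 when the process stack
 * (PSP) was in use, and 0 when the main stack (MSP) was in use. */
static bool exc_return_used_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0u;
}
```

For example, 0xFFFFFFFD (return to Thread mode using PSP) gives true, while 0xFFFFFFF1 (return to Handler mode using MSP) gives false.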

Contributor Author

@kjbracey-arm

Okay, that complicates a bit. So SP is a better value for "thread SP" than PSP if the exception was from thread mode. But if the exception was from handler mode, then SP is a better value for "handler SP" than MSP.

Yes. I believe SP is the better value in both cases. If that's what you meant.

In case of thread mode, PSP is provided as a source to SP via except.S#L112. In case of handler mode, MSP is provided as source to SP via except.S#L107. Our assembly already checks EXC_RETURN from except.S#L110.

For the upcoming stack dump feature, I will check the mode and dump the right PSP/MSP. For this PR, I believe providing SP as-is is a good choice.

Contributor

In case of handler mode, MSP is provided as source to SP via except.S#L107.

So in that case SP is not a thread stack pointer at all. If you're not able to do adaptive behaviour here, then a fixed choice of SP is better for precision in the common case of an exception in thread mode, so it's the one to go for, even though it's worse for the rare case of an exception in handler mode.

Contributor Author

Now I understand your point. You mean this printout

Current Thread: main  Id: 0x20000E50 Entry: 0x3D13 StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x20002D30

should also be changed to correctly show the handler stack instead of the thread stack. I will test with handler mode crashes and see if I can improve today.

@0xc0170 , please hold off merging. 😊

Contributor

@kjbracey kjbracey Sep 2, 2019

The current thread information is still useful in that case.

With the thread dumping info on (I'm not sure why it isn't by default), it would be the missing (and most important) piece of the full thread dump. The thread information extracted from the OS is only up-to-date for the non-active threads - we have to get the current thread's from the registers.

But if the crash is actually in handler mode, it could either be due to a random interrupt thing, totally unconnected to the current thread, or it could be due to something the current thread did (maybe enabling interrupts at a bad time, or some bug in a supervisor call to RTOS).

So it would make sense to show both current thread and handler mode info when crashing in handler mode. Either or both could be important.

Again - look at pyOCD, and this is basically how it displays info with its thread awareness.

It shows:

  • Current backtrace for current stack (by examining SP).
  • If in handler mode (main stack), also show thread and backtrace for process stack (by examining PSP).
  • If RTOS awareness, also show thread and backtrace for other non-current RTOS threads (by examining RTOS global state structures)

So the handler mode stack appears whenever stopped in handler mode, else you don't see it and only see threads.

This then ties in with pyocd/pyOCD#430, which I really should get merged, which makes it show the aborting thread by default when stopping in the fault handler - at present, it shows you as having "SIGSEGV at fault handler", rather than the more logical "SIGSEGV at main thread location". The first is technically true at a hardware level, because it has entered the fault handler by the time the debugger gets there, but it's not the way that signals are supposed to be reported in GDB. (It's not as if your thing here is going to say hard fault at fault handler, is it?)

Contributor Author

@kjbracey-arm
Please review the original description of this request.

  • Stack dump is enabled via "platform.stack-dump-enabled": true mbed_app.json configuration.
  • If stack dump is used in conjunction with "platform.error-all-threads-info": true, all the threads' stack will be dumped to provide holistic overview of all the threads.

Now with above two configs set to true, Mbed-OS will print the whole stack including the handler mode's stack.

Also, note how handler mode's printout is cleaned up like this.

Current Thread: <handler>  Id: 0x0 Entry: 0x0 StackSize: 0xE4 StackMem: 0x2002FF18 SP: 0x2002FF18 

In handler mode there is no thread ID and no entry point. Or, if you know of one, please let me know.

Contributor

If you want a handler pseudo-thread (and that's a reasonable way to present it), it should be in addition to the current OS thread, rather than replacing it (like pyOCD). If I read the current code correctly, if I pass a bad pointer to osMutexLock, then we could enter the SVC handler, then hard fault inside it, and we wouldn't see the current thread that did that. We'd see "handler" and all other threads, and no backtrace showing the mutex lock at all.

(You should be able to test literally that scenario - do something like osMutexLock((osMutexId_t) 0xDEADDEAD)).

In the handler crash scenario, we don't know whether it's the handler's fault (eg bad interrupt handler) or the current thread's fault, like that mutex lock. So we should try to get both backtraces.

@0xc0170
Contributor

0xc0170 commented Sep 2, 2019

@0xc0170 , please hold off merging. 😊

Let us know once ready

@andrewc-arm andrewc-arm changed the title handle_error: fix the wrong handling of HW fault exceptions mbed_error.c: Better HW fault exceptions and stack dump Sep 3, 2019
@andrewc-arm
Contributor Author

@0xc0170 and @kjbracey-arm
I believe the current code is thoroughly tested and ready for review. Please take a look. Thanks.

Contributor

@kjbracey kjbracey left a comment

This looks good to catch interrupt handler crashes, but please consider the inside-RTOS-call crash scenario.

There are actually two possible scenarios there - a fault inside an RTOS call as I describe above, or a deliberate error trap when an RTOS handler calls mbed_error, eg on a mutex lock failure return.

"stack-dump-enabled": {
    "macro_name": "MBED_STACK_DUMP_ENABLED",
    "help": "Set to 1 or true to enable stack dump.",
    "value": null
Contributor

I know you're following the bad pattern of the things above, but options like this should be true or false. null is undefined. Let's get this new one right, and fix the others later. (They've come up in an issue already - I think they actually use #ifdef, so they interpret false as "on"?)

Contributor Author

Got it. I changed with 1e601e5.
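For reference, the corrected option would presumably look something like the following (the surrounding text is taken from the snippet above; only the default value changes, per the review suggestion):

```json
"stack-dump-enabled": {
    "macro_name": "MBED_STACK_DUMP_ENABLED",
    "help": "Set to 1 or true to enable stack dump.",
    "value": false
}
```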


@andrewc-arm
Contributor Author

@kjbracey-arm
Could you please review again? My test result is listed as the handler mode RTX error test (using osMutexAcquire((osMutexId_t)0xDEADDEAD, osWaitForever);) in #11332 (comment), and it got the right error points that even pyOCD + gdb could not find.

@kjbracey
Contributor

kjbracey commented Sep 4, 2019

got the right error points even pyOCD + gdb could not find.

Hmm. The pyOCD handler failed to get that right for the SVC-calling thread - it looks like the correct PC was extracted, but the SP must be wrong. ("info registers" after the thread command will show the SP+PC that pyOCD is reporting to GDB for that thread, and those are the keys GDB uses for its backtrace.)

Existing issue: we shouldn't need to print the "next" thread - it should be a thread that's already referenced by another pointer. (That field is used as an input to the assembly thread switching code - it takes action when it sees next != curr).

I note that the Mbed OS thread display isn't showing a "PC" for each thread, only an SP - I guess the user has to figure that out themselves? You should be able to concoct an SP+PC pair for each, so all can be backtraced. (Again, there are 4 different cases to get the SP+PC: handler, current thread, current thread when we're in a handler, and non-current threads.)

Looking at your examples, I'm not clear where the current thread is in the ticker crash. Sure it's not relevant here, but I'm not sure how the dump is deciding that. Abort in IRQ handler shouldn't necessarily look different from fault in SVC handler - should be possible to see what we interrupted.

Oh, is it that we're tickless and the kernel is suspended, so there is no "current" thread at the wake-from-sleep, just a ready one?
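The case split sketched above can be captured in a small decision helper. This is a host-testable sketch of the classification only; the enum and function names are hypothetical, not Mbed OS APIs, and it folds the handler pseudo-thread into the fault-frame case:

```c
#include <stdbool.h>

/* Hypothetical classification, not Mbed OS code: which saved stack pointer
 * to pair with a PC when backtracing each kind of context after a crash. */
typedef enum {
    SP_FROM_FAULT_FRAME, /* crashed context: SP_reg captured by the fault handler */
    SP_FROM_PSP,         /* current thread, when the crash was in handler mode */
    SP_FROM_TCB          /* non-current threads: sp saved in the RTX TCB on switch-out */
} sp_source_t;

static sp_source_t sp_source(bool is_current_thread, bool crashed_in_handler)
{
    if (!is_current_thread) {
        /* OS-saved state is up to date for threads that were switched out. */
        return SP_FROM_TCB;
    }
    /* For the current thread, the registers, not the TCB, hold the truth. */
    return crashed_in_handler ? SP_FROM_PSP : SP_FROM_FAULT_FRAME;
}
```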

Contributor

@kjbracey kjbracey left a comment

I still have general comments, but this is fine as is to merge now, and further refinement can follow.

@0xc0170
Contributor

0xc0170 commented Sep 4, 2019

CI started

@mbed-ci

mbed-ci commented Sep 4, 2019

Test run: FAILED

Summary: 3 of 4 test jobs failed
Build number : 1
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_build-ARM
  • jenkins-ci/mbed-os-ci_build-GCC_ARM
  • jenkins-ci/mbed-os-ci_build-IAR

@andrewc-arm
Contributor Author

@kjbracey-arm
Thanks for the feedback and approval.
Yes, there is room for improvement, but I agree with you that this is better than before.

@0xc0170
I looked at the current CI test failures and they look like one of the CI hiccups we discussed previously. Or do you think I should fix something?

@0xc0170
Contributor

0xc0170 commented Sep 5, 2019

CI restarted

@mbed-ci

mbed-ci commented Sep 5, 2019

Test run: FAILED

Summary: 4 of 4 test jobs failed
Build number : 2
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_unittests
  • jenkins-ci/mbed-os-ci_build-ARM
  • jenkins-ci/mbed-os-ci_build-GCC_ARM
  • jenkins-ci/mbed-os-ci_build-IAR

@mbed-ci

mbed-ci commented Sep 5, 2019

Test run: FAILED

Summary: 3 of 4 test jobs failed
Build number : 3
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_build-ARM
  • jenkins-ci/mbed-os-ci_build-GCC_ARM
  • jenkins-ci/mbed-os-ci_build-IAR

@andrewc-arm
Contributor Author

@0xc0170
I found why it failed: there were compile errors for those targets. I will get back to you.

@andrewc-arm
Contributor Author

@0xc0170
Sorry about the inconvenience. I just submitted the code change a769b7d, and now it should compile for all targets.

@0xc0170
Contributor

0xc0170 commented Sep 5, 2019

CI restarted

@mbed-ci

mbed-ci commented Sep 5, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 4
Build artifacts

@andrewc-arm
Contributor Author

@0xc0170
Sorry for the big inconvenience. I found a bug in the code and am going to push a fix. Could you please wait for me and rerun the tests later? 🤦‍♂

@andrewc-arm
Contributor Author

@0xc0170
Now the bugs are fixed and this PR is ready for CI test again. Thanks for the support. 😁

@0xc0170
Contributor

0xc0170 commented Sep 6, 2019

CI restarted

@mbed-ci

mbed-ci commented Sep 6, 2019

Test run: FAILED

Summary: 1 of 4 test jobs failed
Build number : 5
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_build-IAR

@0xc0170
Contributor

0xc0170 commented Sep 6, 2019

CI restarted (internal fault)

@mbed-ci

mbed-ci commented Sep 6, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 6
Build artifacts

@0xc0170 0xc0170 merged commit d8c2c6f into ARMmbed:master Sep 6, 2019
@adbridge adbridge added release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0 and removed release-version: 5.15.0-rc1 labels Oct 3, 2019
@0xc0170 0xc0170 added release-version: 5.15.0-rc1 and removed release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0 labels Nov 19, 2019