Skip to content

Enable stack dump #10872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

andrewc-arm
Copy link
Contributor

@andrewc-arm andrewc-arm commented Jun 20, 2019

Description

In the current mbed-os, there is no stack dump feature as far as I know.
With this code change, now we can have a stack dump as shown in the Release Notes.

In the near future we will come up on how to parse the stack and reconstruct the back-trace and call arguments. But even without the parser, the assembly readers can manually decode the stack dump to find the failure.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

Release Notes

For better post-crash debugging with Mbed-OS, you can enable stack dump.
In order to use, the application should add mbed_app.json this line.

            "platform.stack-dump-enabled": true

It is strongly recommended that you should save the *.elf file along with your *.bin file. In the future Mbed-OS debugging tutorial, we are planning to introduce how to debug with Mbed-OS stack dump and *.elf file. Without *.elf file, you will need to disassemble the code and manually decode the meaning of the stack.

This is the example of the stack dump from MBED_ASSERT(false);.

++ MbedOS Error Info ++
Error Status: 0x80FF0144 Code: 324 Module: 255
Error Message: Assertion failed: false
Location: 0x3037
File: ./main.cpp+35
Error Value: 0x0
Current Thread: main  Id: 0x20000E1C Entry: 0x4A5F StackSize: 0x1000 StackMem: 0x20001E08 SP: 0x20002CA4 
For more info, visit: https://mbed.com/s/error?error=0x80FF0144&osver=999999&core=0x410FC241&comp=2&ver=70300&tgt=K64F

Stack Dump:
0x20002998:0xCCCCCCCC 0x200001C8 0x00000008 0x0000000B 0x20002B10 0x00000008 0x00000001 0x20002C48 
0x200029B8:0x00000000 0x00000000 0x00000000 0x00009F53 0x00000080 0x00000000 0x00000000 0x00000000 
0x200029D8:0x00000000 0x00000000 0x00000000 0x0001486C 0x00000001 0x00014840 0x00000000 0x00000000 
0x200029F8:0x00000000 0x00000000 0xFFFFFFFF 0x00009F53 0x00009AB8 0x41010000 0xFFFFFFFF 0x00009ECD 
0x20002A18:0x00000000 0x00000074 0x00000000 0x00000000 0xFFFFFFFF 0x00009F53 0x00009ECC 0x61010000 
0x20002A38:0x20002A44 0x00000000 0x00000000 0x00012A5B 0x00000001 0x0001489C 0x00000003 0x20002AE7 
0x20002A58:0x00000001 0x4006A000 0x0000000A 0xC0000080 0x0000000A 0x20002E68 0x20003030 0xC0000080 
0x20002A78:0x0A003030 0x4006A000 0x0000000A 0x20002E68 0x20003030 0x00003927 0x20003030 0x0000000B 
0x20002A98:0x20003030 0x20000D94 0x20003030 0x0000000B 0x00000016 0x00003D41 0x00000000 0x0000000B 
0x20002AB8:0x20003030 0x00000001 0x0000000B 0x20000D94 0x00000013 0x00003CB1 0x00000015 0x00000013 
0x20002AD8:0x00000000 0x0000000B 0x30303032 0x32333033 0x0000000B 0x00000083 0x200001C8 0x0000FFFF 
0x20002AF8:0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x0001021D 0x00000000 0x00000020 
0x20002B18:0x00000000 0x00000001 0xFFFFFFFF 0x00001F67 0x00001F88 0x210C0000 0x20000000 0x4006A000 
0x20002B38:0x0000003A 0x20002E68 0x20002C3C 0x00003927 0x00000003 0x0000000B 0x20002BA0 0x20000D94 
0x20002B58:0x20002BA0 0x0000000B 0x2000051C 0x00003D41 0x20002C64 0x0000000B 0x20002BA0 0x00000001 
0x20002B78:0x00000000 0x000031E5 0x0000000B 0x0000314F 0x20002C44 0x20002BA0 0x00000002 0x00003119 
0x20002B98:0x20002C44 0x00012A54 0x30307830 0x38373033 0x00203333 0x202C0000 0x69736976 0x68203A74 
0x20002BB8:0x73707474 0x6D2F2F3A 0x2E646562 0x2F6D6F63 0x72652F73 0x3F726F72 0x6F727265 0x78303D72 
0x20002BD8:0x46463038 0x34343130 0x76736F26 0x393D7265 0x39393939 0x6F632639 0x303D6572 0x30313478 
0x20002BF8:0x34324346 0x6F632631 0x323D706D 0x72657626 0x3330373D 0x74263030 0x4B3D7467 0x00463436 
0x20002C18:0x0000000A 0x00003D41 0x0000646D 0x0000000B 0x00011FD0 0x000030D3 0x00004A5F 0x20002C44 
0x20002C38:0x00000000 0x00003797 0x00012A54 0x00012A54 0x20002C44 0x20002C44 0x00000002 0x0001129C 
0x20002C58:0x20002CA4 0x00011FCA 0x00000023 0x00011FD0 0x00011FDC 0x20000CC4 0x00000000 0x000F423F 
0x20002C78:0x410FC241 0x00000002 0x0001129C 0x20000000 0x1FFF0000 0x00000000 0x00000000 0x00030000 
0x20002C98:0x00010000 0x00000000 0x00000000 0x00000000 

-- MbedOS Error Info --

This dump routine is smart and only dumps the critical area containing following information.

  • backtrace of callers
  • call arguments of callers

NOTE: Otherwise, if we dump too much, it will choke up the crash log.

Also, note that this does not require the application to build with -funwind-tables option or have GCC dependency.

Reference

(CC: @ohadhawk, @JanneKiiskila , @kjbracey-arm , @TeroJaasko , @bulislaw , @kivaisan )

@ciarmcom
Copy link
Member

@andrewc-arm, thank you for your changes.
@bulislaw @ohadhawk @kivaisan @TeroJaasko @kjbracey-arm @JanneKiiskila @ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

@40Grit
Copy link

40Grit commented Jun 20, 2019

Thank you!

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Could you add short description in our handbook in debugging section?

@adbridge
Copy link
Contributor

@andrewc-arm could you move the 'how to use this' to the release notes section please along with any additional information on how to use it that can go into the next release note. Thanks

@andrewc-arm
Copy link
Contributor Author

andrewc-arm commented Jun 24, 2019

@adbridge

could you move the 'how to use this' to the release notes section please along with any additional information on how to use it that can go into the next release note.

Done following items.

  • I moved the Release Note information to Release Notes section and edited it a little to be more formal and brief.
  • Code is rebased to the latest mbed-os master's tip.
  • astyle --style=allman applied to pass astyle test

@@ -74,14 +74,19 @@ static unsigned int compute_crc32(const void *data, int datalen)
unsigned int crc = 0; /* CRC value is 32bit */
unsigned char *buf = (unsigned char *)data;//use a temp variable to make code readable and to avoid typecasting issues.

for (; datalen > 0; datalen--) {
for (; datalen > 0; datalen--)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style requires 1 TBS for if, else, while, for statements, see https://os.mbed.com/teams/SDK-Development/wiki/mbed-sdk-coding-style
Please revert this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

astyle --style=allman applied to pass astyle test

Use astylerc file in the root, defines all rules needed - just run it on your changes

Copy link
Contributor Author

@andrewc-arm andrewc-arm Jun 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@0xc0170
I ran atstyle based on the rc file and this is what I got.

andcho01@bc-a1-1-3:~/code/mbed-os$ astyle -n --options=.astylerc platform/mbed_error.c
Invalid option file options:
max-continuation-indent=120

For help on options type 'astyle -h'

Artistic Style has terminated

So I commented out max-continuation-indent and got the astyle fixed. Please review d0ca786.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I saw this error earlier - what version of astyle you use? Update if older, they do not provide this option

@@ -556,6 +590,51 @@ static void print_error_report(const mbed_error_ctx *ctx, const char *error_msg,
mbed_stats_sys_get(&sys_stats);
mbed_error_printf("\nFor more info, visit: https://mbed.com/s/error?error=0x%08X&osver=%" PRId32 "&core=0x%08" PRIX32 "&comp=%d&ver=%" PRIu32 "&tgt=" GET_TARGET_NAME(TARGET_NAME), ctx->error_status, sys_stats.os_version, sys_stats.cpu_id, sys_stats.compiler_id, sys_stats.compiler_version);
#endif

#if MBED_STACK_DUMP_ENABLED && defined(MBED_CONF_RTOS_PRESENT)
#define STACK_END_MARK 0xCCCCCCCC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we use already RTX symbols (they are public) but not yet defined in CMSIS RTOS, lets use these here not redefine again. Use osRtxStackFillPattern, same as we use in Thread.cpp file (add #if defined(MBED_OS_BACKEND_RTX5) to be able to find these easy).

We do not care about watermark (if it is enabled)? There was an issue previously for this: #7545, we use both symbols in Thread object when checking stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used osRtxStackFillPattern in the new commit d0ca786.
We do not care about watermark. Without the watermark, osRtxStackMagicWord was not visible and this code does not depend on it.


#if MBED_STACK_DUMP_ENABLED && defined(MBED_CONF_RTOS_PRESENT)
#define STACK_END_MARK 0xCCCCCCCC
#define STACK_END_MARK_CNT 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this number do (why is it 3) ? line 613 is not that clear to me - might be worth adding a comment there what we do there ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@0xc0170 0xc0170 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the update

@0xc0170 0xc0170 requested a review from evedon June 25, 2019 12:03
// Find the stack end.
int stack_end_cnt = 0;
uint32_t st_end = ctx->thread_current_sp;
for (; st_end >= ctx->thread_stack_mem; st_end -= sizeof(int)) {
Copy link
Contributor

@TeroJaasko TeroJaasko Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the current stack pointer verified somewhere before stored to context? It might be worth checking that it is within RAM at least to avoid faulting again here. Also the value should be checked for misalingment to avoid accessing data from non-32bit aligned addresses. A bit of caution on the data collected from crashed thread will not hurt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Good point. I will add this feature in the next candidate code for this PR.

if (stack_end_cnt >= STACK_END_MARK_CNT) {
st_end += (STACK_END_MARK_CNT - 1) * sizeof(int);
break;
}
Copy link
Contributor

@TeroJaasko TeroJaasko Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack size calculation logic differs from RTX one: https://github.com/ARMmbed/mbed-os/blob/master/rtos/TARGET_CORTEX/rtx5/RTX/Source/rtx_thread.c#L901-L908. Why not just stop at the first non-osRtxStackFillPattern value? Nevermind, this makes more sense.

@kjbracey
Copy link
Contributor

kjbracey commented Jun 26, 2019

Not following the logic on the area you're dumping. If I follow correctly, you're dumping all ever-used stack space up to the current stack pointer. It's all stale data, and nothing from the current callstack.

I would be inclined to just dump starting from the current stack pointer (thread_current_sp) up to the top of the stack (thread_stack_mem + thread_stack_size).

Possibly you could still choose to show used stuff below thread_current_sp too (although it's rarely going to be useful), but you do need to go all the way up to thread_stack_mem + thread_stack_size

Although there's a further complication in that thread_current_sp isn't actually the thread stack pointer in the case of an fault; it will be on the wrong stack - you would need to extract SP_reg from mbed_fault_context.

You should test this out with a hard fault, as well as the easier MBED_ASSERT(false).

@kjbracey
Copy link
Contributor

Another thing to consider to make any analysis tool's life easier is ensuring that we have a good (PC,SP) pair for it to lock on to. A precise backtrace uses PC->SP-offset mappings from the ELF/DWARF debug info, and getting that to work correctly means we need to start from a matched (PC,SP) readout.

In the fault case we can get a precisely matched pair from mbed_fault_context, but for the MBED_ASSERT and MBED_ERROR cases, we currently don't - our SP is just the address of a local variable. We'd need to check what's possible in each toolchain; I believe for ARM+GCC a good pair might be (__builtin_return_address(), __builtin_frame_address(0)+4) .

@0xc0170
Copy link
Contributor

0xc0170 commented Jun 26, 2019

Good feedback @kjbracey-arm 👍

@andrewc-arm
Copy link
Contributor Author

andrewc-arm commented Jun 27, 2019

Hi, @kjbracey-arm

I think you are right! Thanks for pointing it out.

I was dumping the other way around. You are right about that we need the region of thread_current_sp up to the top of the stack thread_stack_mem + thread_stack_size.

The information I acquired was this one from my experimentation.

(gdb) bt
#0  func5 (debugP=1) at .\main.cpp:18
#1  0x00001970 in func4 (str=0x20002f58 <_main_stack+3840> "0.333333") at .\main.cpp:24
#2  0x000019a0 in func3 (f=0.333333343) at .\main.cpp:31
#3  0x000019d4 in func2 (a=1, b=3) at .\main.cpp:36
#4  0x000019e6 in func1 () at .\main.cpp:41
#5  0x00001a32 in main () at .\main.cpp:56
(gdb) display /x *(int*)($sp)
1: /x *(int*)($sp) = 0x10958
4: /x *(int*)($sp+12) = 0x1 (debugP = 1) func5
6: /x *(int*)($sp+20) = 0x20002f58 (“0.33333”) func4
8: /x *(int*)($sp+28) = 0x19a1 (func3)
20: /x *(int*)($sp+76) = 0x19d5 (func2)
24: /x *(int*)($sp+92) = 0x19e7 (func1)
26: /x *(int*)($sp+100) = 0x1a33 (main)

I will amend this PR and come back to you after the full verification.


Although there's a further complication in that thread_current_sp isn't actually the thread stack pointer in the case of an fault; it will be on the wrong stack - you would need to extract SP_reg from mbed_fault_context.

I think the stack pointer of the failure is correct one since we are extracting it from the ctx (reference: https://github.com/ARMmbed/mbed-os/pull/10872/files#diff-ae57d671f3af44d01b722ed277444e27R570). But do you think it's not the right one?

@kjbracey
Copy link
Contributor

I think the stack pointer of the failure is correct one since we are extracting it from the ctx (reference: https://github.com/ARMmbed/mbed-os/pull/10872/files#diff-ae57d671f3af44d01b722ed277444e27R570). But do you think it's not the right one?

As far as I can see, handle_error just sets that to the address of a local variable, which is only fine when coming in from MBED_ASSERT or mbed_error. It should probably have a special case for getting it from the fault handler register dump when coming from a fault handler.

@andrewc-arm
Copy link
Contributor Author

Hi,
FYI. I have a candidate code prepared on https://github.com/andrewc-arm/mbed-os/commits/mbed_crash_dump2 for this PR. I verified that it works.
I am waiting for dependent PR of #11332 (This PR fixes the ctx->thread_current_sp to be the correct value). After the depending PR gets approved, this PR will be reopened.

BTW, I cannot reopen this PR. It's grayed out and it says The mbed_crash_dump2 branch was force-pushed or recreated. How can I reopen this PR later?

(CC: @ohadhawk )

@andrewc-arm
Copy link
Contributor Author

@kjbracey-arm

Regarding your comment, I experimented and found out that our MBED_ASSERT()'s PC and SP values are wrong.

This is true value from gdb.

(gdb) n
29              MBED_ASSERT(false);
(gdb) bt
#0  func6 (debugP=debugP@entry=1, param2=param2@entry=777) at .\main.cpp:29
#1  0x000017de in func5 (debugP=debugP@entry=1) at .\main.cpp:35
#2  0x000017f0 in func4 (str=str@entry=0x20002d70 <_main_stack+3856> "0.333333") at .\main.cpp:41
#3  0x0000180e in func3 (f=f@entry=0.333333343) at .\main.cpp:48
#4  0x00001832 in func2 (a=a@entry=1, b=b@entry=3) at .\main.cpp:53
#5  0x0000183c in func1 () at .\main.cpp:58
#6  0x0000190a in main () at .\main.cpp:73
(gdb) info registers
r0             0x1      1
r1             0x30a    778
r2             0x22400000       574619648
r3             0x0      0
r4             0x20000b9c       536873884
r5             0x0      0
r6             0x0      0
r7             0x0      0
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x0      0
r12            0xffffffff       -1
sp             0x20002d50       0x20002d50 <_main_stack+3824>
lr             0x17df   6111
pc             0x17c2   0x17c2 <func6(int, int)+14>
xpsr           0x61000000       1627389952
(gdb)

Please note that true PC value is 0x17c2 while true SP value is 0x20002d50.

This is acquired value from print_error_report().

[2019-08-27 18:06:47.384] Error Message: Assertion failed: false
[2019-08-27 18:06:47.428] Location: 0x29DF
[2019-08-27 18:06:47.442] File: .\main.cpp+29
[2019-08-27 18:06:47.463] Error Value: 0x0
[2019-08-27 18:06:47.482] Current Thread: main  Id: 0x20000E48 Entry: 0x39CF StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x20002CE0 

Log: teraterm_20190827.1806.log
You can observe that Location: value (0x29DF) is far off from true value of 0x17c2. Well it's OK since we provide File:. However, the acquired SP: value of 0x20002CE0 is also different with true value of 0x20002d50.

Looks like we need some attention to MBED_ASSERT() failure printout as well.

@kjbracey
Copy link
Contributor

I don't think you're going to get a solid answer from assert - there's an extra wrapper around the mbed_error. But do you really need to? It's giving you the message and the filename and line number!

Register contents aren't as relevant when it's a high-level error. The SP will at least be good enough to indicate overflow.

@andrewc-arm
Copy link
Contributor Author

@kjbracey-arm
Thanks for the feedback and please refer to the separate JIRA feedback about MBED_ASSERT.


I would like to share my recent experiment to double check the correctness of the current stack dump code. And it's exactly correct for thread-mode fault crashes.

The true callstack acquired by gdb

#0  0x000017c4 in generate_bus_fault_unaligned_access () at .\main.cpp:21
#1  0x000017ea in func6 (debugP=debugP@entry=1, param2=param2@entry=777) at .\main.cpp:30
#2  0x000017fa in func5 (debugP=debugP@entry=1) at .\main.cpp:35
#3  0x0000180c in func4 (str=str@entry=0x20002d50 <_main_stack+3824> "0.333333") at .\main.cpp:41
#4  0x0000182e in func3 (f=f@entry=0.333333343) at .\main.cpp:48
#5  0x00001856 in func2 (a=a@entry=1, b=b@entry=3) at .\main.cpp:53
#6  0x00001862 in func1 () at .\main.cpp:58
#7  0x00001944 in main () at .\main.cpp:73

The stack dump

[2019-09-02 16:31:34.932] LR   : 000017EB (#1)
[2019-09-02 16:31:35.211] ++ MbedOS Error Info ++
[2019-09-02 16:31:35.233] Error Status: 0x80FF013D Code: 317 Module: 255
[2019-09-02 16:31:35.283] Error Message: Fault exception
[2019-09-02 16:31:35.314] Location: 0x17C4 (#0)
[2019-09-02 16:31:35.338] Error Value: 0x1FFF0400
[2019-09-02 16:31:35.359] Current Thread: main  Id: 0x20000E50 Entry: 0x3D7B StackSize: 0x1000 StackMem: 0x20001E60 SP: 0x20002D28 
[2019-09-02 16:31:35.468] For more info, visit: https://mbed.com/s/error?error=0x80FF013D&osver=999999&core=0x410FC241&comp=2&ver=60300&tgt=K64F
[2019-09-02 16:31:35.594] 
[2019-09-02 16:31:35.594] Stack Dump:
[2019-09-02 16:31:35.606] 0x20002D28:0x00000000 0x000017EB (#1) 0x00000000 0x00000000 0x0000DEB8 0x000017FB (#2) 0x12000000 0x0000180D (#3)
[2019-09-02 16:31:35.710] 0x20002D48:0x20002D58 0x0000182F (#4) 0x33332E30 0x33333333 0x00000000 0x00000000 0x00000000 0x00005D67 
[2019-09-02 16:31:35.819] 0x20002D68:0x00000000 0x00003B53 0x200001C8 0x00001857 (#5) 0x00000000 0x00001863 (#6) 0x00000000 0x00001945 (#7)
[2019-09-02 16:31:35.918] 0x20002D88:0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 
[2019-09-02 16:31:36.022] 0x20002DA8:0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 
[2019-09-02 16:31:36.127] 0x20002DC8:0x000F423F 0x410FC241 0x00000002 0x0000EB8C 0x20000000 0x1FFF0000 0x00000000 0x00000000 
[2019-09-02 16:31:36.231] 0x20002DE8:0x00030000 0x00010000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-02 16:31:36.335] 0x20002E08:0x00100000 0x00000000 0x00000000 0x00000000 0x1FFF0510 0x00000827 0x00002710 0x000017AD 
[2019-09-02 16:31:36.441] 0x20002E28:0x00001A81 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 
[2019-09-02 16:31:36.543] 0x20002E48:0x00000000 0x00003D1B 0x00000001 0x00003D8D 0x00000000 0x000055ED 0x00008000 

I marked the important callback points with (#_number) for easier reading.

(CC: @ohadhawk )

@andrewc-arm
Copy link
Contributor Author

Hi,
Since I cannot reopen this case, I merged this dump feature to this ongoing PR (#11332). Let's discuss there.
Thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants