Design document for Crash Reporting feature in MbedOS #8561

SenRamakri · 2018-10-26T18:13:32Z

Description

Design document for the Crash Reporting feature in MbedOS. The idea is to auto-reboot the system after a fatal error to bring it back to a stable state without losing the RAM contents that hold the collected error information. That information can then be saved reliably, for example to a file system, or sent to the ARM Pelion cloud.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

kegilbert · 2018-10-26T21:57:22Z

Nice! Glad to see this taking shape. A few small questions from my end:

The config option to prevent an endless reboot cycle is a solid idea, but it would be nice if there were a time-based mechanism as well.

The current design works well for a target that reboots every 30 seconds: the limit keeps it from clogging whatever delivery mechanism is used to retrieve the error state. It is weaker for someone who wants a deployed device to always reboot on failure, where a crash might occur every few weeks, but who does not want the network spammed with crash logs if an update leaves the device stuck in a reboot loop.

I don't have a particular implementation in mind, but something like a watchdog that clears the reset counter after some configurable time period could be helpful.

I'm not a huge fan of having the reboot error detection logic be something a user places in their main program; it feels a little intrusive. Was there a reason the logic in the main program wasn't handled in the reboot callback?

SenRamakri · 2018-10-29T16:24:15Z

@kegilbert - Thanks for your review and please see my comments below.

Yes, your point is valid; there may be a requirement to reset the reboot count. The current implementation provides an API, mbed_reset_reboot_error_info(), that resets the count along with the old crash info.
The idea is to provide a mechanism for the reset, while the policy for when to reset is application dependent. The application can clear the info as required, whether after a day, a week, a month, or on some other basis; that decision is outside the scope of MbedOS error handling. One thing to note is that there is no specific API to reset the reboot count alone. Do you think that would be valuable?
Actually, the reboot error logic is handled entirely by the MbedOS error handling implementation. The only thing the user gets is a callback with the error info during reboot (please see the flowchart). This gives the application an opportunity to record that a reboot happened due to a fatal error, and to record the context as well. No reboot handling mechanism is part of the application. Once the callback completes, the MbedOS error handling implementation continues to process the captured error according to the flowchart.

deepikabhavnani · 2018-10-29T22:23:53Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+The function should return MBED_ERROR_NOT_FOUND if there is no fault context currently stored.
+```C
+//Call this function to retrieve the last reboot fault context
+mbed_error_status_t mbed_get_reboot_fault_context (mbed_fault_context_t *fault_context);


Why a specific enum as the return type? We could return int32_t like other APIs, with 0 for success (error found) and -1 for error not found. Will there be any other special cases in mbed_error_status_t?

That's correct, there could be: invalid argument (fault == NULL), item not found because there is no context, or MBED_SUCCESS. In addition, mbed_error_status_t lets us capture the module reporting the error (in this case MBED_MODULE_PLATFORM).

Even when we return error codes, it is good to have int as the return type to avoid type casting and keep the API flexible. Since error codes are used extensively in storage, I picked one example from there:
https://github.com/ARMmbed/mbed-os/blob/master/features/storage/filesystem/littlefs/LittleFileSystem.cpp#L287
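
To illustrate the trade-off discussed in this thread, here is a minimal sketch of how one signed 32-bit status value can carry both a module identifier and an error code while still being testable as a plain negative integer. The bit layout, the `demo_` names, and the numeric values are assumptions made for illustration only; they are not the actual `mbed_error_status_t` encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status type: a plain signed 32-bit integer. */
typedef int32_t demo_status_t;

#define DEMO_SUCCESS           0
#define DEMO_MODULE_PLATFORM   1u   /* illustrative module id (assumption) */
#define DEMO_CODE_INVALID_ARG  1u   /* illustrative error codes (assumption) */
#define DEMO_CODE_NOT_FOUND    2u

/* Pack module (7 bits) and non-zero code (16 bits) into a negative value. */
static demo_status_t demo_make_error(uint32_t module, uint32_t code)
{
    uint32_t packed = ((module & 0x7Fu) << 16) | (code & 0xFFFFu);
    return -(demo_status_t)packed;
}

/* Recover the module id from an error status (status must be negative). */
static uint32_t demo_get_module(demo_status_t status)
{
    return ((uint32_t)(-status) >> 16) & 0x7Fu;
}

/* Recover the error code from an error status (status must be negative). */
static uint32_t demo_get_code(demo_status_t status)
{
    return (uint32_t)(-status) & 0xFFFFu;
}

/* Stand-in for a "get reboot fault context" style API with three outcomes. */
static demo_status_t demo_get_context(int *out, int context_stored)
{
    if (out == NULL) {
        return demo_make_error(DEMO_MODULE_PLATFORM, DEMO_CODE_INVALID_ARG);
    }
    if (!context_stored) {
        return demo_make_error(DEMO_MODULE_PLATFORM, DEMO_CODE_NOT_FOUND);
    }
    *out = 42;  /* placeholder payload */
    return DEMO_SUCCESS;
}
```

A caller that only cares about success can still write `if (demo_get_context(&v, 1) < 0)`, while a caller that wants detail can decode the module and code.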

0xc0170

Just styling issues

0xc0170 · 2018-10-30T13:38:31Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+The below diagram shows overall architecture of crash-reporting implementation.
+
+![System architecture and component interaction](./diagrams/crash-report-seq.jpg)


Excetion Handler -> Exception typo (second blue rectangle from the top here)

Exmaple -> example - yellow rectangle about WEAK attribute

Would specify that reboot is specifically a warm-reboot.

0xc0170 · 2018-10-30T13:42:12Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+// main() runs in its own thread in the OS
+int main() {
+
+    if(reboot_error_detected == 1) {


coding style update (run it through astyle) - same for other code examples here

kjbracey · 2018-10-30T15:35:50Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+The error handing system in MbedOS will call this callback function if it detects that the current reboot has been caused by a fatal error. This function will be defined with MBED_WEAK attribute by default and applications wanting to process the error report should override this function in application implementation.
+```CS
+void mbed_error_reboot_callback(mbed_error_ctx *error_context);


Slightly lost on the twin callback + read APIs here. Not seeing why this callback is necessary. Can't the application just call mbed_get_reboot_fault_context if they want to find out if it was a reboot-due-to-error? One example suggests that, but another example shows it taking a copy in the callback, as if it's going to be lost.

The callback is necessary for a few reasons. It gives the application an opportunity to clear any error situation, based on error codes, before main() starts. It is also required when a maximum reboot count is configured: the system halts once that count is reached, but the callback is still invoked so the application can learn the cause of the error. Retrieving the error context and fault context can also be expensive, since the calls take time during main() initialization and the caller needs to allocate memory, so having the callback set a flag (or something similar) allows some optimization. And if the application design doesn't require the callback, it doesn't have to override it, so the approach is flexible.

I think the callback is a good idea as it disconnects error reporting from the main application flow.
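
The "set a flag in the callback, act on it later" pattern described in this thread could look roughly like the sketch below. The `demo_error_ctx` structure and the `demo_` function names are hypothetical stand-ins so the example compiles on its own; in a real application the override would be the weak `mbed_error_reboot_callback()` with the real context type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the error context handed to the callback. */
typedef struct {
    int32_t  error_status;
    uint32_t error_address;
    uint32_t error_value;
} demo_error_ctx;

/* Flag checked later by the application, plus an application-owned copy. */
static volatile int reboot_error_detected = 0;
static demo_error_ctx saved_ctx;

/* Application override of the reboot callback: copy the context, set a flag,
 * and return quickly so the boot sequence can continue. */
void demo_error_reboot_callback(const demo_error_ctx *ctx)
{
    if (ctx == NULL) {
        return;
    }
    memcpy(&saved_ctx, ctx, sizeof(saved_ctx));
    reboot_error_detected = 1;
}

/* What main() might do once the system is up: returns 1 if a crash report
 * was pending and has now been handled, 0 otherwise. */
int demo_app_start(void)
{
    if (reboot_error_detected == 1) {
        /* e.g. write saved_ctx to a file system or queue it for upload */
        reboot_error_detected = 0;
        return 1;
    }
    return 0;
}
```

Keeping the callback this small means nothing expensive runs before main(), and an application that never overrides it pays no cost.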

SenRamakri · 2018-10-30T22:31:58Z

@0xc0170 - I have fixed the review comments, please review.

Will review

bulislaw

Looks good! Some queries below.

bulislaw · 2018-10-31T16:54:14Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+### Requirements and assumptions
+
+This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.


How are we going to achieve that? I would say that modifying all the linker scripts is not feasible, and asking all the HW vendors to make the change even less so.

bulislaw · 2018-11-02T11:54:44Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+The error handing system in MbedOS will call this callback function if it detects that the current reboot has been caused by a fatal error. This function will be defined with MBED_WEAK attribute by default and applications wanting to process the error report should override this function in application implementation.
+```CS
+void mbed_error_reboot_callback(mbed_error_ctx *error_context);


I think the callback is a good idea as it disconnects error reporting from the main application flow.

bulislaw · 2018-11-02T11:59:07Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+void mbed_error_reboot_callback(mbed_error_ctx *error_context);
+```
+
+### System should implement mechanism to track number of times the system is auto-rebooted and be able to stop auto-reboot when a configurable limit is reached


How will we handle a mix of errors and successful reboots over time? E.g. the device crashes once a day due to a memory leak, but between crashes it works fine, and the crashes are not fatal from a functionality point of view. As a developer I'd like to avoid a tight loop of consecutive reboots, but an occasional crash is not something that should eventually trigger a device halt.

@bulislaw If I'm understanding your point correctly, it sounds like my first concern (#8561 (comment)). A method to reset the reboot count that could be called periodically would prevent an eventual device halt, unless the device is rebooting faster than the reset rate, which could clog whatever delivery mechanism is being used to report the error logs.

Who will be "periodically calling" this reset?

@SenRamakri ?

I was thinking we can either leave the implementation to the user and have them call the reset function at whatever frequency they'd like, or have the reboot handler have its own timer/watchdog that is configured through the config settings.

@bulislaw and @kegilbert - These are great questions. I initially considered adding functionality for a periodic reset, but each application might have different policies and rules about when to do it, and we would also have to lock out resources such as memory and CPU if we used evt_q, a timer, etc. (we should also think about sleep behavior). So I think it's better for the application to implement it if needed, since it knows what resources it has at its disposal.
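
As a rough illustration of the application-owned policy proposed in this thread, the sketch below clears the stored reboot information once the device has stayed up for a configurable period. The one-day threshold, the `demo_` names, and the stub that stands in for `mbed_reset_reboot_error_info()` are all assumptions; the application would call the check from whatever periodic context it already has.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed policy: one day of uninterrupted uptime counts as "stable". */
#define DEMO_STABLE_UPTIME_S (24u * 60u * 60u)

/* Pretend three crash-induced reboots were recorded before this boot. */
static uint32_t demo_reboot_count = 3u;
static bool demo_info_cleared = false;

/* Stub standing in for mbed_reset_reboot_error_info(). */
static void demo_reset_reboot_error_info(void)
{
    demo_reboot_count = 0u;
}

/* Called periodically by the application with the current uptime in seconds.
 * Returns true only on the call that actually cleared the reboot info. */
static bool demo_check_uptime(uint32_t uptime_s)
{
    if (!demo_info_cleared && uptime_s >= DEMO_STABLE_UPTIME_S) {
        demo_reset_reboot_error_info();
        demo_info_cleared = true;
        return true;
    }
    return false;
}
```

With a policy like this, a device that crashes once every few weeks never reaches the reboot limit, while a device stuck in a tight reboot loop still does, because it never stays up long enough to clear the count.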

SenRamakri · 2018-11-09T00:15:53Z

@0xc0170 @kegilbert @bulislaw @deepikabhavnani @kjbracey-arm - I think I have addressed/answered all of the queries. Please review/approve if you are ok with this.

cmonr · 2018-11-10T00:23:48Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+### Requirements and assumptions
+
+This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.


Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.

@cmonr - This is captured in detail in the Detailed Design section.

cmonr · 2018-11-10T00:26:33Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+**Implementation should provide a mechanism to prevent constant reboot loop by limiting the number of auto-reboots**
+
+System should implement mechanism to track number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.


Presumably by stopping the invocation of the main application firmware?

Yes, that's correct. Once the limit is reached, we stop the system from entering main(). Let me clarify that in the doc.
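
The boot-time decision described here, together with the earlier remark that the callback is still invoked when the limit is reached, can be sketched as follows. The structure, the names, and the choice of `>=` for the limit comparison are assumptions for illustration, not the actual implementation.

```c
#include <stdint.h>

typedef enum {
    DEMO_BOOT_ENTER_MAIN,   /* continue to main() as usual */
    DEMO_BOOT_HALT          /* stop before main() is entered */
} demo_boot_action;

/* Hypothetical summary of what survived in the crash-report RAM. */
typedef struct {
    int      valid;         /* non-zero if the previous reboot was a crash */
    uint32_t reboot_count;  /* crash-induced reboots recorded so far */
} demo_crash_report;

static int demo_callback_calls = 0;

/* Stand-in for the application's reboot callback. */
static void demo_reboot_callback(const demo_crash_report *report)
{
    (void)report;
    demo_callback_calls++;
}

/* Decide what the boot sequence does just before main(). */
static demo_boot_action demo_boot_decide(const demo_crash_report *report,
                                         uint32_t max_reboots)
{
    if (!report->valid) {
        return DEMO_BOOT_ENTER_MAIN;        /* normal boot, nothing to report */
    }
    demo_reboot_callback(report);           /* always report, even when halting */
    if (report->reboot_count >= max_reboots) {
        return DEMO_BOOT_HALT;              /* limit reached: do not enter main() */
    }
    return DEMO_BOOT_ENTER_MAIN;
}
```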

cmonr · 2018-11-10T00:40:42Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+System should implement mechanism to track number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.
+
+**Implementation should provide following configuration options**


Open-ended question. How do we expect this to be tested?

Good question. We are going to test this using Greentea, similar to how the system_test feature is tested. Note, however, that it is still not possible to test the reboot limit mechanism using Greentea; for that I have a test application that I'm using for bench testing. The app is at https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.

cmonr · 2018-11-10T00:41:46Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+The below diagram shows overall architecture of crash-reporting implementation.
+
+![System architecture and component interaction](./diagrams/crash-report-seq.jpg)


Would specify that reboot is specifically a warm-reboot.

cmonr · 2018-11-10T00:43:32Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+![System architecture and component interaction](./diagrams/crash-report-seq.jpg)
+
+As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report space in RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or can just capture the error information in application space to be acted upon later.


At what point is the section of RAM zero'd?

It can be explicitly zeroed by calling the reboot-error reset APIs described further down in the document. If the system goes through a cold reset, the region is left in an uninitialized state. That's why a CRC is stored as part of the data: to check its integrity.
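
A minimal sketch of this integrity check, assuming a small report structure whose last field holds a CRC-32 over the preceding fields. The field layout and `demo_` names are hypothetical, and the bitwise CRC-32 here is a software stand-in for whatever CRC the implementation actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical crash-report layout: payload fields followed by their CRC. */
typedef struct {
    uint32_t error_status;
    uint32_t reboot_count;
    uint32_t crc;           /* CRC-32 over all fields before this one */
} demo_crash_report;

/* Bitwise CRC-32 (reflected form, polynomial 0xEDB88320). */
static uint32_t demo_crc32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Error path: stamp the report with a CRC just before the warm reboot. */
static void demo_report_seal(demo_crash_report *report)
{
    report->crc = demo_crc32(report, offsetof(demo_crash_report, crc));
}

/* Boot path: the report is trusted only if the stored CRC matches. */
static int demo_report_is_valid(const demo_crash_report *report)
{
    return report->crc == demo_crc32(report, offsetof(demo_crash_report, crc));
}
```

A region that was explicitly zeroed, or that holds leftover contents after a cold reset, is very unlikely to contain a matching CRC, so the boot path treats it as "no crash report" without the RAM ever needing to be initialized.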

cmonr · 2018-11-10T00:44:03Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+![System architecture and component interaction](./diagrams/crash-report-seq.jpg)
+
+As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report space in RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or can just capture the error information in application space to be acted upon later.


Another silly question. Is this able to live side by side with mbed-trace?

Yes, of course, this doesn't have any impact or conflict with mbed-trace.

cmonr · 2018-11-10T00:48:10Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+The current mbed_error() implementation should be modified to cause an auto-reboot at the end of error handling if this feature is enabled. The mechanism used for rebooting should make sure it doesn't cause a reset of RAM contents. This can be done by calling system_reset() function already implemented by MbedOS which cause the system to reboot without resetting the RAM. The mbed_error() implementation also should make sure it updates the error context stored in Crash-Report RAM with the right CRC value and it should also implement mechanism to track the reboot count caused by fatal errors. The below pueudo-code shows how the mbed_error() implementation should be modified.
+
+```


cmonr · 2018-11-10T00:51:39Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+    //Handle the error just as we do now and then do the following to save the context into Crash-Report RAM and reset
+
+    Read the current Crash Report and calculate CRC
+	If CRC matches what's in Crash-Report RAM: 


Stored CRC? Of what? Generated by what?

Yes, there is a CRC stored as part of the error context. It is calculated in the mbed_error() function as explained above, but I'll update the doc to capture more details.

cmonr · 2018-11-10T00:57:03Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+Below is the list of new configuration options needed to configure error reporting functionality. All of these configuration options should be captured in mbed_lib.json file in platform directory.
+
+**crash-capture-enabled**


General question. Is there a specific reason that this feature is being referred to as "crash reporting"?
Imo, "crash" is a harsh word. Mainly wondering why something like "exception reporting" wouldn't also work.

That's a good question. The reason is that "exception" is ARM terminology used specifically for processor exceptions, and we also have fatal errors (which are different from fatal exceptions). The current implementation works for both, which is why we use the word "crash", which covers both. The original requirement is also written with crash terminology.

cmonr · 2018-11-10T00:58:37Z

docs/design-documents/platform/crash-reporting/crash_reporting.md

+
+Enables crash context capture when the system enters a fatal error/crash. When this is disabled it should also disable other dependent options.
+
+**fatal-error-auto-reboot-enabled**


Maybe it's just me, but across the document, exception, crash, and error are used interchangeably. I would prefer that only one be used; otherwise, when we look back at the config options outside the context of this PR, they will appear disparate.

I see the point. The word "crash" comes from the requirement and covers both fatal exceptions and fatal errors. Exceptions refer specifically to processor exceptions, and errors refer to fatal errors in the system that end up calling the mbed_error() interface. Let me clarify this terminology in the assumptions section. Hope that helps.

SenRamakri · 2018-11-12T17:21:14Z

@cmonr - I have updated the doc with fixes for your review comments, please review.

0xc0170 · 2018-11-19T10:12:37Z

@cmonr Please review. Once approved, this shall go to rollup (I'll label it now)

0xc0170 · 2018-11-19T14:09:32Z

Entering CI (rollup inclusion)

Info: This PR has been re-bundled into a new rollup PR (#8800).

No further work is needed here, as once that PR is merged, this PR will also be closed and marked as merged.
If any more commits are made in this PR, this PR will remain open and have to go through CI on its own.

cmonr · 2018-11-22T02:02:04Z

CI started.

SenRamakri requested review from kjbracey and a team October 26, 2018 18:14

cmonr added the needs: review label Oct 27, 2018

deepikabhavnani reviewed Oct 29, 2018

View reviewed changes

0xc0170 previously requested changes Oct 30, 2018

View reviewed changes

0xc0170 requested a review from bulislaw October 30, 2018 13:54

kjbracey reviewed Oct 30, 2018

View reviewed changes

SenRamakri force-pushed the sen_CrashReportingDesign branch from 4baeb93 to 2682904 Compare October 30, 2018 22:31

bulislaw reviewed Nov 2, 2018

View reviewed changes

SenRamakri force-pushed the sen_CrashReportingDesign branch from 476bd50 to 60677d2 Compare November 7, 2018 16:30

kegilbert approved these changes Nov 9, 2018

View reviewed changes

0xc0170 approved these changes Nov 9, 2018

View reviewed changes

deepikabhavnani approved these changes Nov 9, 2018

View reviewed changes

cmonr reviewed Nov 10, 2018

View reviewed changes

SenRamakri mentioned this pull request Nov 10, 2018

Crash Reporting implementation #8702

Merged

SenRamakri force-pushed the sen_CrashReportingDesign branch from 60677d2 to 12f1894 Compare November 12, 2018 17:19

SenRamakri added 8 commits November 18, 2018 20:40

Crash Reporting Design

d4fc8fe

Crash reporting design doc

5489eac

Updated design doc

28a0b45

Adding usage scenarios

a6e7604

Fixed sentences and context

c3d2c44

Boot sequence diagram added

a721158

Updated TOC

108483d

Fix tab issues in TOC

8c48a24

SenRamakri added 6 commits November 18, 2018 20:40

Updated with crash report region info and new diagrams added

0b9cd60

Fix code style issues and fix typos in diagrams

340099c

Change phrasing and tense

3ffa78e

Add function to reset the reboot count

a87043f

Added more details around crc field and updated document with termino…

2d58f23

…logies used

Removing config option to print report to terminal on reboot, as it c…

a0e42fa

…onflicts with current implementation of mbed_error_printf

SenRamakri force-pushed the sen_CrashReportingDesign branch from 12f1894 to a0e42fa Compare November 19, 2018 02:46

0xc0170 added release-version: 5.11.0-rc1 rollup PR labels Nov 19, 2018

0xc0170 mentioned this pull request Nov 19, 2018

Rollup PR: Rerun falsely-failed PRs for 5.11-RC1 #8800

Closed

bulislaw approved these changes Nov 21, 2018

View reviewed changes

cmonr added needs: CI and removed needs: review labels Nov 21, 2018

cmonr added ready for merge and removed needs: CI rollup PR labels Nov 22, 2018

0xc0170 merged commit fadaa65 into ARMmbed:master Nov 22, 2018

0xc0170 removed the ready for merge label Nov 22, 2018


