Skip to content

Design document for Crash Reporting feature in MbedOS #8561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Nov 22, 2018

Conversation

SenRamakri
Copy link
Contributor

Description

Design document for Crash Reporting feature in MbedOS. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back in stable state, without losing the RAM contents where we have the error information collected, and we can then save this information reliably for example to file system or to be send to ARM Pelion cloud.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

@SenRamakri SenRamakri requested review from kjbracey and a team October 26, 2018 18:14
@kegilbert
Copy link
Contributor

kegilbert commented Oct 26, 2018

Nice! Glad to see this taking shape. Few small questions from my end:

The config option to prevent an endless reboot cycle is a solid idea, but would be nice if there was a time based mechanism to it as well.

The current design is great for a target that is rebooting every 30 seconds to not clog whatever delivery mechanism they have for retrieving the error state, but is weaker if someone wants a deployed device to always reboot on a failure in the case of a potential crash every few weeks but not spam their network with crash logs if an update causes the device to get stuck in a reboot loop.

I don't have a particular implementation in mind, but something like a watchdog that would clear the reset counter at some configurable time period could be helpful.

I'm not a huge fan of having the reboot error detection logic be something a user places in their main program, it feels a little intrusive. Was there a reason the logic in the main program wasn't handled in the reboot callback?

@SenRamakri
Copy link
Contributor Author

@kegilbert - Thanks for your review and please see my comments below.

  1. Yes your opinion is valid, there may be a requirement to reset the reboot count. So the current implementation provides this API to reset the count, old crash info using mbed_reset_reboot_error_info().
    So, the idea here is to provide a mechanism to reset that, but the actual policy on when to reset that is application dependent. So we provide that API and application implementation can clear it as required, this may be a day, week or month(or based on something else) and that would depend on application and is outside scope of MbedOS error handling. One thing to note here is there is no specific API to reset the reboot count alone. Do you think that's valuable?

  2. Actually the reboot error logic is completely done by Mbed-OS error handling implementation. The only thing the user get is a callback with the error-info during reboot(please see the flowchart). This is to provide the application an opportunity to record the fact that a reboot happened due to fatal error and the app may record the context as well. No reboot handling mechanism is part of application. Once the callback is completed the MbedOS error handling implementation will continue to process the captured error according to the flowchart.

The function should return MBED_ERROR_NOT_FOUND if there is no fault context currently stored.
```C
//Call this function to retrieve the last reboot fault context
mbed_error_status_t mbed_get_reboot_fault_context (mbed_fault_context_t *fault_context);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why specific enum in return ? We can have int32_t as return like other API's with 0 - Success (Error found) -1 (Error not found). Will there be any other special case in mbed_error_status_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, there could be Invalid argument(fault == NULL), Item not found because there is no context or MBED_SUCCESS. In addition we can capture the module reporting the error using mbed_error_status_t(in this case its MBED_MODULE_PLATFORM).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when we return error codes, it is good to have return type as int to avoid type casting have the flexible API. Since error codes are used extensively in storage picked up one example here
https://github.com/ARMmbed/mbed-os/blob/master/features/storage/filesystem/littlefs/LittleFileSystem.cpp#L287

Copy link
Contributor

@0xc0170 0xc0170 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just styling issues


The below diagram shows overall architecture of crash-reporting implementation.

![System architecture and component interaction](./diagrams/crash-report-seq.jpg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excetion Handler -> Exception typo (second blue rectangle from the top here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exmaple -> example - yellow rectangle about WEAK attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would specify that reboot is specifically a warm-reboot.

// main() runs in its own thread in the OS
int main() {

if(reboot_error_detected == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding style update (run it through astyle) - same for other code examples here

@0xc0170 0xc0170 requested a review from bulislaw October 30, 2018 13:54

The error handing system in MbedOS will call this callback function if it detects that the current reboot has been caused by a fatal error. This function will be defined with MBED_WEAK attribute by default and applications wanting to process the error report should override this function in application implementation.
```CS
void mbed_error_reboot_callback(mbed_error_ctx *error_context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly lost on the twin callback + read APIs here. Not seeing why this callback is necessary. Can't the application just call mbed_get_reboot_fault_context if they want to find out if it was a reboot-due-to-error? One example suggests that, but another example shows it taking a copy in the callback, as if it's going to be lost.

Copy link
Contributor Author

@SenRamakri SenRamakri Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callback is necessary for few reasons. It provides opportunity for the application to clear any error situation based on error codes before the main starts. Its also required in the case where a reboot-max is configured where in the system will halt at the maximum reboot count, but even in this case callback will be invoked, so the application side can know the cause of error. Also making calls to retrieve the error context and fault context could be expensive as well as they take time during main() initialization and also the caller needs to allocate memory. So having this callback to set a flag(or something like that) can bring some optimization as well. And, if the application design doesn't require the callback, they don't have to override that, so its flexible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback is a good idea as it's disconnect error reporting from main application flow.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 4baeb93 to 2682904 Compare October 30, 2018 22:31
@SenRamakri
Copy link
Contributor Author

@0xc0170 - I have fixed the review comments, please review.

@0xc0170 0xc0170 dismissed their stale review October 31, 2018 12:48

Will review

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some queries below.


### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to achieve that? I would say that modifying all the linkerscripts is not feasible and asking all the HW vendors to the the change even less.


The error handing system in MbedOS will call this callback function if it detects that the current reboot has been caused by a fatal error. This function will be defined with MBED_WEAK attribute by default and applications wanting to process the error report should override this function in application implementation.
```CS
void mbed_error_reboot_callback(mbed_error_ctx *error_context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback is a good idea as it's disconnect error reporting from main application flow.

void mbed_error_reboot_callback(mbed_error_ctx *error_context);
```

### System should implement mechanism to track number of times the system is auto-rebooted and be able to stop auto-reboot when a configurable limit is reached
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will we handle mix of errors and successfull reboots over a time? Eg. The device is crashing once a day due to memory leak, but between the runs it works fine and the crashes are not fatal from the functionality point of view. As a developer I'd like to avoid tight loop of consecutive reboots, but once in a while crash is not something that should, eventually, trigger device halt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bulislaw If I'm understanding your point correctly, it sounds like my first concern (#8561 (comment)). A method to reset the reboot count that could be periodically called would prevent an eventual device halt unless the device is rebooting faster than the reset rate which could clog whatever delivery mechanism is being used to report the errors logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who will be "periodically calling" this reset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SenRamakri ?

I was thinking we can either leave the implementation to the user and have them call the reset function at whatever frequency they'd like, or have the reboot handler have its own timer/watchdog that is configured through the config settings.

Copy link
Contributor Author

@SenRamakri SenRamakri Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bulislaw and @kegilbert - These are great questions and I was thinking about adding the functionality for periodic reset initially but I also thought each application might have different policies and rules around when to do this periodic reset and we also have to lock out resources like memory, CPU if we are use evt_q, timer etc(we should also think about sleep behavior). So, I think its better for application design to implement it if they need as they know what resources they have at their disposal.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 476bd50 to 60677d2 Compare November 7, 2018 16:30
@SenRamakri
Copy link
Contributor Author

@0xc0170 @kegilbert @bulislaw @deepikabhavnani @kjbracey-arm - I think I have addressed/answered all of the queries. Please review/approve if you are ok with this.


### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmonr - This is captured in detail in Detailed Design section.


**Implementation should provide a mechanism to prevent constant reboot loop by limiting the number of auto-reboots**

System should implement mechanism to track number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumedly by stopping the invocation of the main application firmware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats correct, it. Once the limit is reached we would be stopping the system from entering main(). Let me clarify that in the doc.


System should implement mechanism to track number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.

**Implementation should provide following configuration options**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open-ended question. How do we expect this to be tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, we are going to test this using Greentea similar to how system_test feature is tested. But note that, its still not possible to test the reboot limit mechanism using Greentea, for that I have a test application which I'm using and is bench tested - The app is at - https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.


The below diagram shows overall architecture of crash-reporting implementation.

![System architecture and component interaction](./diagrams/crash-report-seq.jpg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would specify that reboot is specifically a warm-reboot.


![System architecture and component interaction](./diagrams/crash-report-seq.jpg)

As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report space in RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or can just capture the error information in application space to be acted upon later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At what point is the section of RAM zero'd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be explicitly zero-ed by calling the reboot-error reset APIs described further down in the document or if the system goes through a cold-reset it will be left in un-initialized state. That's why we have the crc as part of stored data to find the integrity of the data.


![System architecture and component interaction](./diagrams/crash-report-seq.jpg)

As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report space in RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or can just capture the error information in application space to be acted upon later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another silly question. Is this able to live side by side with mbed-trace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course, this doesn't have any impact or conflict with mbed-trace.


The current mbed_error() implementation should be modified to cause an auto-reboot at the end of error handling if this feature is enabled. The mechanism used for rebooting should make sure it doesn't cause a reset of RAM contents. This can be done by calling system_reset() function already implemented by MbedOS which cause the system to reboot without resetting the RAM. The mbed_error() implementation also should make sure it updates the error context stored in Crash-Report RAM with the right CRC value and it should also implement mechanism to track the reboot count caused by fatal errors. The below pueudo-code shows how the mbed_error() implementation should be modified.

```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

//Handle the error just as we do now and then do the following to save the context into Crash-Report RAM and reset

Read the current Crash Report and calculate CRC
If CRC matches what's in Crash-Report RAM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stored CRC? Of what? Generated by what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a CRC stored as part of error context. This is calculated in mbed_error() function as explained above, but I'll update it to capture more details.


Below is the list of new configuration options needed to configure error reporting functionality. All of these configuration options should be captured in mbed_lib.json file in platform directory.

**crash-capture-enabled**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question. Is there a specific reason that this feature is being referred to as "crash reporting"?
Imo, "crash" is a harsh word. Mainly wondering why something liike "exception reporting" wouldn't also work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good question, the reason is "exception" is an ARM terminology specifically used for processor exceptions and we also have fatal errors(which are different from fatal exceptions) and the current implementation works for both, thats why we have the word "crash" which comprises both. Also the original requirement is also written with crash terminology.


Enables crash context capture when the system enters a fatal error/crash. When this is disabled it should also disable other dependent options.

**fatal-error-auto-reboot-enabled**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but across the document, exception, crash, and error are used interchangably. Would prefer if only one were used, otherwise when we look back at the config options outside of the context of this PR, the config options will appear disparate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point, the crash word comes from requirement which comprises both fatal exceptions and fatal errors. Exceptions are specifically referring to processor exceptions and errors refer to fatal errors in the system which ends up calling mbed_error() interface. Let me clarify these terminologies in assumption section. Hope that helps.

@SenRamakri
Copy link
Contributor Author

@cmonr - I have updated the doc with your review comments fixes, please review.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 12f1894 to a0e42fa Compare November 19, 2018 02:46
@0xc0170
Copy link
Contributor

0xc0170 commented Nov 19, 2018

@cmonr Please review. Once approved, this shall go to rollup (I'll label it now)

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 19, 2018

Entering CI (rollup inclusion)

Info: This PR has been re-bundled into a new rollup PR (#8800).

No further work is needed here, as once that PR is merged, this PR will also be closed and marked as merged.
If any more commits are made in this PR, this PR will remain open and have to go through CI on its own.

@cmonr
Copy link
Contributor

cmonr commented Nov 22, 2018

CI started.

@0xc0170 0xc0170 merged commit fadaa65 into ARMmbed:master Nov 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants