Design document for Crash Reporting feature in MbedOS #8561
Conversation
Nice! Glad to see this taking shape. A few small questions from my end:

The config option to prevent an endless reboot cycle is a solid idea, but it would be nice if there were a time-based mechanism as well. The current design is great for a target that is rebooting every 30 seconds, so it doesn't clog whatever delivery mechanism is used for retrieving the error state, but it is weaker if someone wants a deployed device to always reboot on a failure (say, a potential crash every few weeks) without spamming their network with crash logs if an update causes the device to get stuck in a reboot loop. I don't have a particular implementation in mind, but something like a watchdog that clears the reset counter after a configurable time period could be helpful.

I'm not a huge fan of having the reboot error detection logic be something a user places in their main program; it feels a little intrusive. Was there a reason this logic wasn't handled in the reboot callback instead?
@kegilbert - Thanks for your review and please see my comments below.
The function should return MBED_ERROR_NOT_FOUND if there is no fault context currently stored.
```C
//Call this function to retrieve the last reboot fault context
mbed_error_status_t mbed_get_reboot_fault_context (mbed_fault_context_t *fault_context);
```
Why a specific enum as the return type? We could have int32_t as the return type like other APIs, with 0 for success (error found) and -1 for not found. Will there be any other special cases in mbed_error_status_t?
That's correct. There could be an invalid argument (fault_context == NULL), item not found because there is no context, or MBED_SUCCESS. In addition, we can capture the module reporting the error using mbed_error_status_t (in this case it's MBED_MODULE_PLATFORM).
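For illustration, a minimal sketch of how a caller might branch on those status values, assuming the MBED_SUCCESS / MBED_ERROR_INVALID_ARGUMENT / MBED_ERROR_ITEM_NOT_FOUND codes from mbed_error.h and the prototype quoted above (the design text spells the not-found code MBED_ERROR_NOT_FOUND; the exact name is an assumption here, and the printf strings are purely illustrative):

```C
#include "mbed.h"
#include <stdio.h>

void report_last_fault(void)
{
    mbed_fault_context_t fault_ctx;
    mbed_error_status_t status = mbed_get_reboot_fault_context(&fault_ctx);

    if (status == MBED_SUCCESS) {
        // A fault context from the previous boot is available for reporting
        printf("Previous reboot was caused by a fault exception\n");
    } else if (status == MBED_ERROR_ITEM_NOT_FOUND) {
        // Nothing stored - the previous reboot was not fault related
        printf("No fault context stored\n");
    } else {
        // e.g. MBED_ERROR_INVALID_ARGUMENT if a NULL pointer were passed
        printf("Failed to read fault context\n");
    }
}
```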
Even when we return error codes, it is good to have the return type as int to avoid type casting and keep the API flexible. Since error codes are used extensively in storage, I picked one example here:
https://github.com/ARMmbed/mbed-os/blob/master/features/storage/filesystem/littlefs/LittleFileSystem.cpp#L287
Just styling issues
The diagram below shows the overall architecture of the crash-reporting implementation.

![SystemArchitecture](diagrams/crash-reporting-arch.jpg)
"Excetion Handler" -> "Exception Handler" typo (second blue rectangle from the top here).
"Exmaple" -> "Example" typo - yellow rectangle about the WEAK attribute.
Would specify that reboot is specifically a warm-reboot.
// main() runs in its own thread in the OS
int main() {

if(reboot_error_detected == 1) {
coding style update (run it through astyle) - same for other code examples here
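For reference, an astyle-tidied sketch of what that main() example might look like; the reboot_error_detected flag and the error-context copy are illustrative application-defined names (set by the reboot callback discussed below), not a fixed API:

```C
#include "mbed.h"
#include <stdio.h>

// Illustrative application-defined state, filled in by the reboot callback
static volatile int reboot_error_detected = 0;
static mbed_error_ctx error_ctx_copy;

// main() runs in its own thread in the OS
int main()
{
    if (reboot_error_detected == 1) {
        // The previous boot ended in a fatal error: report the saved context,
        // e.g. print it, write it to a file system or push it to the cloud
        printf("Rebooted after fatal error, status: %d\n",
               (int)error_ctx_copy.error_status);
    }

    // Normal application flow continues here
    return 0;
}
```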
The error handling system in MbedOS will call this callback function if it detects that the current reboot was caused by a fatal error. This function is defined with the MBED_WEAK attribute by default, and applications wanting to process the error report should override it in their application implementation.
```C
void mbed_error_reboot_callback(mbed_error_ctx *error_context);
```
Slightly lost on the twin callback + read APIs here. I'm not seeing why this callback is necessary. Can't the application just call mbed_get_reboot_fault_context if it wants to find out whether it was a reboot-due-to-error? One example suggests that, but another shows taking a copy in the callback, as if the context is going to be lost.
The callback is necessary for a few reasons. It gives the application an opportunity to clear any error situation, based on the error codes, before main() starts. It's also required in the case where a reboot-max is configured: the system will halt at the maximum reboot count, but even then the callback is invoked, so the application side can know the cause of the error. Also, making calls to retrieve the error context and fault context could be expensive, as they take time during main() initialization and the caller needs to allocate memory, so having this callback just set a flag (or something like that) can bring some optimization as well. And if the application design doesn't require the callback, it doesn't have to override it, so it's flexible.
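A minimal sketch of such an override, assuming the MBED_WEAK default and the mbed_error_ctx type described in this document; the flag and the copy are application-defined, illustrative names:

```C
#include "mbed.h"

// Application-defined state, inspected later from main()
static volatile int reboot_error_detected = 0;
static mbed_error_ctx error_ctx_copy;

// Overrides the MBED_WEAK default provided by the error handling system.
// Runs just before main(), so keep it short: record what happened and return.
void mbed_error_reboot_callback(mbed_error_ctx *error_context)
{
    error_ctx_copy = *error_context;
    reboot_error_detected = 1;
}
```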
I think the callback is a good idea, as it disconnects error reporting from the main application flow.
Force-pushed 4baeb93 to 2682904

@0xc0170 - I have fixed the review comments, please review.
Looks good! Some queries below.
### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.
How are we going to achieve that? I would say that modifying all the linker scripts is not feasible, and asking all the HW vendors to make the change even less so.
### System should implement a mechanism to track the number of times the system is auto-rebooted and be able to stop auto-reboot when a configurable limit is reached
How will we handle a mix of errors and successful reboots over time? E.g. the device is crashing once a day due to a memory leak, but between runs it works fine and the crashes are not fatal from a functionality point of view. As a developer I'd like to avoid a tight loop of consecutive reboots, but an occasional crash is not something that should eventually trigger a device halt.
@bulislaw If I'm understanding your point correctly, it sounds like my first concern (#8561 (comment)). A method to reset the reboot count that could be called periodically would prevent an eventual device halt, unless the device is rebooting faster than the reset rate, which could clog whatever delivery mechanism is being used to report the error logs.
Who will be "periodically calling" this reset?
I was thinking we could either leave the implementation to the user and have them call the reset function at whatever frequency they'd like, or have the reboot handler run its own timer/watchdog that is configured through the config settings.
@bulislaw and @kegilbert - These are great questions. I was initially thinking about adding functionality for a periodic reset, but each application might have different policies and rules around when to do this periodic reset, and we would also have to tie up resources like memory and CPU if we use an event queue, timer etc. (we should also think about sleep behavior). So I think it's better for the application to implement it if needed, as it knows what resources it has at its disposal.
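To illustrate the application-side approach, a sketch that clears the auto-reboot counter once a day from an event queue. mbed_reset_reboot_count() is the reset API assumed from this design, and the 24-hour period is an arbitrary application policy choice:

```C
#include "mbed.h"
#include "mbed_events.h"

static void clear_reboot_count()
{
    // Assumed reset API from this design: wipes the auto-reboot counter so
    // occasional, widely spaced crashes never accumulate to the reboot-max halt
    mbed_reset_reboot_count();
}

int main()
{
    EventQueue queue;
    // The reset period is an application policy decision (here: once a day)
    queue.call_every(24 * 60 * 60 * 1000, clear_reboot_count);
    queue.dispatch_forever();
}
```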
Force-pushed 476bd50 to 60677d2

@0xc0170 @kegilbert @bulislaw @deepikabhavnani @kjbracey-arm - I think I have addressed/answered all of the queries. Please review/approve if you are OK with this.
### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.
Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.
@cmonr - This is captured in detail in the Detailed Design section.
**Implementation should provide a mechanism to prevent a constant reboot loop by limiting the number of auto-reboots**

The system should implement a mechanism to track the number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.
Presumably by stopping the invocation of the main application firmware?
Yes, that's correct. Once the limit is reached we would stop the system from entering main(). Let me clarify that in the doc.
The system should implement a mechanism to track the number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.

**Implementation should provide the following configuration options**
Open-ended question. How do we expect this to be tested?
Good question. We are going to test this using Greentea, similar to how the system_test feature is tested. Note that it's still not possible to test the reboot-limit mechanism using Greentea; for that I have a test application which I'm using and which has been bench tested. The app is at https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.
![CrashReportingFlow](diagrams/crash-reporting-flow.jpg)

As depicted in the above diagram, when the system gets into a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for the Crash-Report. This is followed by an auto-reboot triggered from the error handler. On reboot, the initialization routine validates the contents of the Crash-Report space in RAM. This validation serves two purposes: it validates the captured content itself, and it tells the system whether the previous reboot was caused by a fatal error. The system then reads this information and calls an application-defined callback function, passing it the crash-report information. The callback is invoked just before the entry to main(), so the callback implementation may access libraries and other resources, as other parts of the system (like the SDK, HAL etc.) have already been initialized, or it can simply capture the error information in application space to be acted upon later.
At what point is this section of RAM zeroed?
It can be explicitly zeroed by calling the reboot-error reset APIs described further down in the document; if the system goes through a cold reset it will be left in an uninitialized state. That's why we have the CRC as part of the stored data, to check its integrity.
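For illustration, once the application has consumed the report it could clear the stored state explicitly; mbed_reset_reboot_error_info() and mbed_reset_reboot_count() are the reset API names assumed from this design (described further down in the document):

```C
#include "mbed.h"

// Sketch: call this after the crash report has been persisted or forwarded,
// so the Crash-Report RAM no longer validates on the next boot.
static void discard_crash_report()
{
    mbed_reset_reboot_error_info();   // assumed API: clears the stored error/fault context
    mbed_reset_reboot_count();        // assumed API: clears the auto-reboot counter
}
```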
![CrashReportingFlow](diagrams/crash-reporting-flow.jpg)

As depicted in the above diagram, when the system gets into a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for the Crash-Report. This is followed by an auto-reboot triggered from the error handler. On reboot, the initialization routine validates the contents of the Crash-Report space in RAM. This validation serves two purposes: it validates the captured content itself, and it tells the system whether the previous reboot was caused by a fatal error. The system then reads this information and calls an application-defined callback function, passing it the crash-report information. The callback is invoked just before the entry to main(), so the callback implementation may access libraries and other resources, as other parts of the system (like the SDK, HAL etc.) have already been initialized, or it can simply capture the error information in application space to be acted upon later.
Another silly question. Is this able to live side by side with mbed-trace?
Yes, of course, this doesn't have any impact or conflict with mbed-trace.
The current mbed_error() implementation should be modified to cause an auto-reboot at the end of error handling if this feature is enabled. The mechanism used for rebooting should make sure it doesn't cause a reset of the RAM contents. This can be done by calling the system_reset() function already implemented by MbedOS, which causes the system to reboot without resetting the RAM. The mbed_error() implementation should also make sure it updates the error context stored in Crash-Report RAM with the right CRC value, and it should implement a mechanism to track the reboot count caused by fatal errors. The pseudo-code below shows how the mbed_error() implementation should be modified.

👍

```
//Handle the error just as we do now and then do the following to save the context into Crash-Report RAM and reset

Read the current Crash Report and calculate CRC
If CRC matches what's in Crash-Report RAM:
```
Stored CRC? Of what? Generated by what?
Yes, there is a CRC stored as part of the error context. It is calculated in the mbed_error() function as explained above, but I'll update the doc to capture more details.
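As a rough sketch of that flow, with the report layout and the checksum routine below as illustrative stand-ins (not the actual mbed_error_ctx definition or the CRC algorithm used by mbed_error()):

```C
#include <stddef.h>
#include <stdint.h>

// Illustrative report layout: the CRC field sits at the end and covers
// everything before it, mirroring the design described above.
typedef struct {
    uint32_t error_status;
    uint32_t error_reboot_count;
    /* ...remaining error/fault context fields... */
    uint32_t crc_error_ctx;
} crash_report_t;

// Stand-in for the real CRC32 routine; any CRC implementation would do.
static uint32_t report_crc(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc = (crc << 1) ^ *p++;
    }
    return crc;
}

// Called from mbed_error() before the warm reboot is triggered.
static void update_crash_report(crash_report_t *report)
{
    size_t payload_len = offsetof(crash_report_t, crc_error_ctx);

    // "Read the current Crash Report and calculate CRC": a matching CRC means
    // the previous contents are valid, so this is a repeat crash and the
    // reboot count carries forward; otherwise start counting from 1.
    if (report_crc(report, payload_len) == report->crc_error_ctx) {
        report->error_reboot_count++;
    } else {
        report->error_reboot_count = 1;
    }

    /* ...fill in the fresh error and fault context fields here... */

    // Seal the report so the boot-time validation can check its integrity.
    report->crc_error_ctx = report_crc(report, payload_len);
}
```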
Below is the list of new configuration options needed to configure the error reporting functionality. All of these configuration options should be captured in the mbed_lib.json file in the platform directory.

**crash-capture-enabled**
General question. Is there a specific reason that this feature is being referred to as "crash reporting"? Imo, "crash" is a harsh word. Mainly wondering why something like "exception reporting" wouldn't also work.
That's a good question. The reason is that "exception" is ARM terminology specifically used for processor exceptions, and we also have fatal errors (which are different from fatal exceptions); the current implementation works for both, which is why we use the word "crash", covering both. The original requirement is also written with the crash terminology.
Enables crash context capture when the system enters a fatal error/crash. When this is disabled it should also disable other dependent options.

**fatal-error-auto-reboot-enabled**
Maybe it's just me, but across the document exception, crash, and error are used interchangeably. I would prefer if only one were used; otherwise, when we look back at the config options outside of the context of this PR, they will appear disparate.
I see the point. The word "crash" comes from the requirement, which covers both fatal exceptions and fatal errors. "Exception" refers specifically to processor exceptions, and "error" refers to fatal errors in the system which end up calling the mbed_error() interface. Let me clarify this terminology in the assumptions section. Hope that helps.
Force-pushed 60677d2 to 12f1894

@cmonr - I have updated the doc with fixes for your review comments, please review.
…onflicts with current implementation of mbed_error_printf
Force-pushed 12f1894 to a0e42fa

@cmonr Please review. Once approved, this shall go to rollup (I'll label it now).
Entering CI (rollup inclusion) Info: This PR has been re-bundled into a new rollup PR (#8800). No further work is needed here, as once that PR is merged, this PR will also be closed and marked as merged.

CI started.
Description
Design document for Crash Reporting feature in MbedOS. The idea here is to auto-reboot the system after a fatal error has occurred, bringing the system back to a stable state without losing the RAM contents where the error information has been collected; we can then save this information reliably, for example to a file system, or send it to the ARM Pelion cloud.
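As an illustration of that last point, a hedged sketch that persists the captured context to a file after the warm reboot; the block device and filesystem choices are arbitrary, and mbed_get_reboot_error_info() is the read API assumed from this design:

```C
#include "mbed.h"
#include "HeapBlockDevice.h"
#include "LittleFileSystem.h"
#include <stdio.h>

static HeapBlockDevice bd(8 * 1024, 512);   // illustrative storage only
static LittleFileSystem fs("fs");

int main()
{
    // Mount the filesystem, formatting it on first use; error handling kept minimal
    if (fs.mount(&bd) != 0) {
        fs.reformat(&bd);
    }

    mbed_error_ctx error_ctx;
    // Assumed read API from this design: returns the error context captured
    // before the auto-reboot, if the previous boot ended in a fatal error.
    if (mbed_get_reboot_error_info(&error_ctx) == MBED_SUCCESS) {
        FILE *f = fopen("/fs/crash_report.bin", "wb");
        if (f) {
            fwrite(&error_ctx, sizeof(error_ctx), 1, f);
            fclose(f);
        }
    }

    // ...normal application / cloud client flow continues here...
    return 0;
}
```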
Pull request type