Skip to content

Halt to enforce reboot limit once only #9544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 30, 2019

Conversation

kjbracey
Copy link
Contributor

@kjbracey kjbracey commented Jan 30, 2019

Description

The pre-main check of error_reboot_count was applied repeatedly on every boot, meaning that once the reboot limit was hit, every subsequent reset would halt before main, until something managed to clear or corrupt the error context.

Set the is_error_processed flag before halting, so that when an external agent resets us while we're halted, we do not report the error and halt again.

Pull request type

[X] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@SenRamakri, @TeemuKultala

The pre-main check of error_reboot_count was applied repeatedly on
every boot, meaning that once the reboot limit was hit, every subsequent
reset would halt before main, until something managed to clear
or corrupt the error context.

Set the is_error_processed flag before halting, so that when an external
agent resets us while we're halted, we do not report the error and halt
again.
@kjbracey kjbracey mentioned this pull request Jan 30, 2019
@0xc0170 0xc0170 requested review from SenRamakri and a user January 30, 2019 11:26
@0xc0170
Copy link
Contributor

0xc0170 commented Jan 30, 2019

CI started while the reviews are ongoing (critical to get the fix in)

@kjbracey
Copy link
Contributor Author

I think this still might not quite be perfect - there could be issues when changing configuration or when platform.error-reboot-max is 0.

It's possible that mbed_error chooses not to reset itself (because max is 0 or reboot is disabled altogether), but does still increment error_reboot_count. Then when someone causes a manual reset (possibly with a new image that now has reboot enabled), it may see reboot_count higher than max, and halt on entry. We don't want to halt on manual resets, only automatic ones.

I think error_reboot_count should only be being incremented or set to 1 if we are actually going to do a system_reset. We don't want any of the limit checks being activated if the reset was manual, so we should set it to 0 if just halting instead of rebooting.

But that interacts with the way error_reboot_count is used as a sort of flag elsewhere, so bit wary about changing anything further. I think we're probably okay in a setup where the config isn't changing and max > 0.

@mbed-ci
Copy link

mbed-ci commented Jan 30, 2019

Test run: SUCCESS

Summary: 12 of 12 test jobs passed
Build number : 1
Build artifacts

@cmonr
Copy link
Contributor

cmonr commented Jan 30, 2019

@SenRamakri @deepikabhavnani Please review.

As a note, only #9544 or #9545 should come in.

@deepikabhavnani
Copy link

CI started while the reviews are ongoing (critical to get the fix in)

I understand the fix is critical but I have few queries related to implementation and fix

  1. Config option - fatal-error-auto-reboot-enabled

        "fatal-error-auto-reboot-enabled": {
            "help": "Setting this to true enables auto-reboot on a fatal error.",
            "value": false
        },

* NOTE: If MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED is enabled and if the current reboot count exceeds MBED_CONF_PLATFORM_ERROR_REBOOT_MAX the system will halt when this function is called and in such cases the caller will not get the control back.

  • As per my understanding this config option should disable auto reboot completely when false, i.e. always halt if false for critical errors (default mbed_error behavior).
    In case of true - It should allow auto reboot only for MBED_CONF_PLATFORM_ERROR_REBOOT_MAX times.

Please correct me if this is wrong understanding

@deepikabhavnani
Copy link

if (report_error_ctx->error_reboot_count >= MBED_CONF_PLATFORM_ERROR_REBOOT_MAX) {


#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED && (MBED_CONF_PLATFORM_ERROR_REBOOT_MAX > 0)

-- Reboot enabled and max should be checked

#ifndef NDEBUG
    mbed_error_printf("\n= System will be rebooted due to a fatal error =\n");
    if (report_error_ctx->error_reboot_count >= MBED_CONF_PLATFORM_ERROR_REBOOT_MAX) {
    
    -- Max checked for print
        //We have rebooted more than enough, hold the system here.
        mbed_error_printf("= Reboot count(=%ld) reached maximum, system will halt after rebooting =\n", report_error_ctx->error_reboot_count);
    }
#endif

    -- Reset done always???? Shouldnt it be if (report_error_ctx->error_reboot_count >= MBED_CONF_PLATFORM_ERROR_REBOOT_MAX) { system_reset(); }
    system_reset();//do a system reset to get the system rebooted
#endif
#endif
    mbed_halt_system();
}

@deepikabhavnani
Copy link

OK, Regarding the comment #9544 (comment)

Reset always is intended if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED is true. Don't bother to read my previous comments..

@deepikabhavnani
Copy link

It's possible that mbed_error chooses not to reset itself (because max is 0 or reboot is disabled altogether), but does still increment error_reboot_count. Then when someone causes a manual reset (possibly with a new image that now has reboot enabled), it may see reboot_count higher than max, and halt on entry. We don't want to halt on manual resets, only automatic ones.

All the information is in RAM, so manual reset with new image will not have old RAM data. Does this scenario stand true in that case?

report_error_ctx->is_error_processed = 1;//Set the flag that we already processed this error
crc_val = compute_crc32(report_error_ctx, offsetof(mbed_error_ctx, crc_error_ctx));
report_error_ctx->crc_error_ctx = crc_val;

//Enforce max-reboot only if auto reboot is enabled
#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED
if (report_error_ctx->error_reboot_count >= MBED_CONF_PLATFORM_ERROR_REBOOT_MAX) {
mbed_halt_system();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clear all RAM data here before halting? So next reset - which has to be manual since system is in halt will not have impact of this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to preserve the contents of the error RAM, in case someone wants to attach a debugger or something. Setting the "processed" flag should be sufficient to stop anything else happening on next reset.

@SenRamakri
Copy link
Contributor

SenRamakri commented Jan 30, 2019

@kjbracey-arm - Thanks for making this change, I think it should fix the case where a manual (warm)reset is intended to get the system executing the main (cold reset will reset RAM contents anyway), again. I was also thinking if we should reset the error_reboot_count as well so that the system can go through as many configured auto-reboots after a manual (warm)reset. That's something to think about.
I also liked the solution @deepikabhavnani mentioned to clear the Crash-Data-RAM area before halting the system that would clear both is_error_processed and error_reboot_count. But with just setting is_error_processed to 1 we can still retrieve the reboot context just in case. So, there is pros/cons for both. I'll go ahead approve this now. By the way I don't think we should do the change in #9545.

@cmonr cmonr merged commit cc94690 into ARMmbed:master Jan 30, 2019
cmonr pushed a commit that referenced this pull request Jan 30, 2019
Halt to enforce reboot limit once only
@cmonr
Copy link
Contributor

cmonr commented Jan 30, 2019

#9508 (comment)

@kjbracey
Copy link
Contributor Author

kjbracey commented Jan 31, 2019

I was also thinking if we should reset the error_reboot_count as well so that the system can go through as many configured auto-reboots after a manual (warm)reset

If you've cleared error_processed, nothing more should be required. But I think it's important that mbed_error set reboot_count to 0 in the case where it prepares a new error context and is not calling system_reset itself. We only want the reboot count set if it actually was a reboot, not a halt.

One possible failure scenario as it is would be running an image without reboot enabled - that would set or increment the count on each run, then halt, but not check it itself. Next time you flash an image with reboot enabled and max 1 - it would not boot the first time.

But you have some other code looking at the count as a sort of "valid" flag from the callback - that would need to be checked and adjusted. I'm not quite clear on whether the "reboot callback" is supposed to be called only if it is a reboot, or also if it is a "halt followed by manual reset".

@kjbracey
Copy link
Contributor Author

kjbracey commented Jan 31, 2019

All the information is in RAM, so manual reset with new image will not have old RAM data.

Manual resets (button or JTAG or serial break or whatever) do not reset the RAM any more than a system_reset call does.

The error block is preserved until power is off long enough for the RAM to lose data or some code deliberately or accidentally wipes it.

Flashing the device might wipe it. If the error block is at or near the bottom of RAM, it's would be quite possible that the pyOCD or built-in flash programming algorithm uses the area as workspace, so it might get trashed. But it might not. Flashing algorithms wouldn't usually deliberately clear the RAM.

@kjbracey kjbracey deleted the reboot_limit_fix branch January 31, 2019 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants