-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash Reporting implementation #8702
Conversation
...RGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_K66F/device/TOOLCHAIN_GCC_ARM/MK66FN2M0xxx18.ld
Outdated
Show resolved
Hide resolved
..._Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_GCC_ARM/MK64FN1M0xxx12.ld
Outdated
Show resolved
Hide resolved
..._Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_GCC_ARM/MK64FN1M0xxx12.ld
Outdated
Show resolved
Hide resolved
..._Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_GCC_ARM/MK64FN1M0xxx12.ld
Outdated
Show resolved
Hide resolved
...Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_ARM_STD/MK64FN1M0xxx12.sct
Outdated
Show resolved
Hide resolved
...Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_ARM_STD/MK64FN1M0xxx12.sct
Show resolved
Hide resolved
...GET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_K66F/device/TOOLCHAIN_ARM_STD/MK66FN2M0xxx18.sct
Show resolved
Hide resolved
...Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_ARM_STD/MK64FN1M0xxx12.sct
Outdated
Show resolved
Hide resolved
...Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_ARM_STD/MK64FN1M0xxx12.sct
Outdated
Show resolved
Hide resolved
...GET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_IAR/MK64FN1M0xxx12.icf
Outdated
Show resolved
Hide resolved
26d4643
to
d4a0239
Compare
@deepikabhavnani and @kegilbert - I have cleaned up the PR, conflict issues you pointed out have been removed. Please review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, I have one query related to public API's - mbed_reset_reboot_error_info
and mbed_reset_reboot_count
- Do we have any limitation on when and where they can be used? Else we need to consider the thread safety here.
Example: Clearing an error count when error is in progress, during initialization of multiple threads / modules in system could cause issue.
platform/mbed_error.c
Outdated
const unsigned int polynomial = 0x04C11DB7; /* divisor is 32bit */ | ||
unsigned int crc = 0; /* CRC value is 32bit */ | ||
|
||
for( ;datalen>=0; datalen-- ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between for and '('
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
platform/mbed_error.c
Outdated
uint32_t crc_val = 0; | ||
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) ); | ||
//Read report_error_ctx and check if CRC is correct for report_error_ctx | ||
if(report_error_ctx->crc_error_ctx == crc_val) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between if
and (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
platform/mbed_error.c
Outdated
mbed_error_status_t mbed_reset_reboot_count() | ||
{ | ||
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED | ||
if(is_reboot_error_valid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between if
and (
platform/mbed_error.c
Outdated
mbed_error_status_t status = MBED_ERROR_ITEM_NOT_FOUND; | ||
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED | ||
if (is_reboot_error_valid) { | ||
if(error_info != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between if
and (
platform/mbed_error.c
Outdated
|
||
//Enforce max-reboot only if auto reboot is enabled | ||
#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED | ||
if( report_error_ctx->error_reboot_count > MBED_CONF_PLATFORM_ERROR_REBOOT_MAX ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between if
and (
platform/mbed_error.c
Outdated
uint32_t crc_val = 0; | ||
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) ); | ||
//Read report_error_ctx and check if CRC is correct for report_error_ctx | ||
if((report_error_ctx->crc_error_ctx == crc_val) && (report_error_ctx->is_error_processed == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Style: Space between if
and (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Platform changes LGTM except for some very minor style nits above (didn't check out the linker script changes much).
mbed_fault_context_t mbed_fault_context; | ||
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED | ||
//Global for populating the context in exception handler | ||
mbed_fault_context_t *mbed_fault_context=(mbed_fault_context_t *)((uint32_t)FAULT_CONTEXT_LOCATION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointers should be const to save RAM (mbed_fault_context * const mbed_fault_context
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
* @param fault_context Pointer to mbed_fault_context_t struct allocated by the caller. This is the mbed_fault_context_t info captured as part of the fatal exception which triggered the reboot. | ||
* @return 0 or MBED_SUCCESS on success. | ||
* MBED_ERROR_INVALID_ARGUMENT in case of invalid error_info pointer | ||
* MBED_ERROR_ITEM_NOT_FOUND if no reboot context is currently captured by teh system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the
platform/mbed_error.c
Outdated
|
||
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED | ||
//Global for populating the context in exception handler | ||
static mbed_error_ctx *report_error_ctx=(mbed_error_ctx *)((uint32_t)ERROR_CONTEXT_LOCATION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(uint32_t) cast isn't doing anything I can see?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
platform/mbed_error.c
Outdated
{ | ||
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED | ||
uint32_t crc_val = 0; | ||
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The complicated bit here is offsetof(mbed_error_ctx, crc_error_ctx)
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, thanks for pointing that out and offsetof is much easier for readability.
platform/mbed_error.c
Outdated
//we dont have many uses cases to create a C wrapper for MbedCRC and the data | ||
//we calculate CRC on in this context is very less we will use a local | ||
//implementation here. | ||
static unsigned int compute_crc32(unsigned char *data, int datalen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this took const void *
you'd only need one cast inside here, not 1 at every callsite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
platform/mbed_error.c
Outdated
//We need not call delete_mbed_crc(crc_obj) here as we are going to reset the system anyway, and calling delete while handling a fatal error may cause nested exception | ||
#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED | ||
system_reset();//do a system reset to get the system rebooted | ||
while(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while(1) is dead here. system_reset is MBED_NORETURN.
I would be strongly inclined to have an exponential backoff delay on each reboot. 5 seconds first time, say, and double each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while(1) was an attempt to stop the system in case system_reset() ever changed behavior. But I think its confusing, so will remove it.
I initially thought of having delay before reboot, but later I decided it wont help much, unless we want to provide opportunity to attach a debugger or something, is that why we should add the backoff delay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more a stability principle - each time the device boots, it may well be using extra network resources to reinitialise itself. The problem leading to crash may even be exacerbated by overall system load caused by other rebooting devices. A network could collapse as a result.
A general networking principle is that retries should be exponentially backed off to guarantee that the network is stable, and doesn't undergo congestion collapse. (Many protocols and standards use parameters based on chapter 14 of RFC 3315 - initial time, maximum time, maximum count...)
Obviously we would require someone at some point to reset the count, and hence the backoff, but you've left that application dependent. Were that to be reset too readily, there might be a problem.
In this case, maybe you don't want to literally double the "delay after crashing before rebooting", but rather double the "minimum time since we last rebooted". Not sure what your best timer for that would be - can't safely use Kernel::get_ms_count
, so might need to have a boot-started `LowPowerTimer'?
If you're initially configuring the count very low though, this is less necessary - you'd want this with high or unlimited counts. So it's maybe an extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation and pointers, currently I'm setting the count to 1 for now, so I would leave things as it is. But I would capture this as a note in my documentation about how to enable/configure crash-reporting. Hope that helps. And in future if we are increasing the default count, I'll add something to delay the reboot on every subsequent reboot.
"DISCO_L475VG_IOT01A": { | ||
"crash-capture-enabled": true, | ||
"reboot-crash-report-enabled": true, | ||
"fatal-error-auto-reboot-enabled": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reboot is a pretty big default behaviour functional change. Are we sure on that one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah its a big behavior change. But since the default config is to reboot only once(error_reboot_max = 1) it will reboot and halt, that way it shouldn't impact any tests or other tools. Do you still think we should keep the auto-reboot disabled, any issues you are seeing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing any particular issues myself, just seems like it might be a bit unexpected to end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that, I would leave that as it is for now unless I get a request to change it explicitly, hope that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure that gets into the release notes, talk to Mohit.
platform/mbed_error.c
Outdated
@@ -369,17 +519,19 @@ static void print_error_report(const mbed_error_ctx *ctx, const char *error_msg, | |||
#endif | |||
|
|||
#if MBED_CONF_PLATFORM_ERROR_ALL_THREADS_INFO && defined(MBED_CONF_RTOS_PRESENT) | |||
mbed_error_printf("\nNext:"); | |||
print_thread(osRtxInfo.thread.run.next); | |||
if(print_thread_info == true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use if (boolean == true)
. Just if (boolean)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
"help": "Enables crash context capture when the system enters a fatal error/crash.", | ||
"value": false | ||
}, | ||
"error-reboot-max": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what 0 means here. Sounds as if it means don't reboot, but it's actually reboot once? 1 would mean reboot twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, 0 means it still causes 1 reboot. May be its confusing as you see it, its also tricky to document that I guess. I'm going to change that to - 0 means no reboot, 1 means 1 reboot etc. Hope that helps.
@bulislaw size change is reported by part of ci-morph-test which is not executed yet |
@bulislaw The size change report as it exists now is run after the Test CI job completes. |
@bulislaw In addition to the morph test comment, the benchmark tests are not being run as of now due to the office migration (new office network is not fully setup). Everything else should be running as expected. Will have the benchmark tests running ASAP. |
While we are waiting for finalizing the review (the benchmark would be nice to have here!), we run the build/exporters stage /morph build |
Build : SUCCESSBuild number : 3681 Triggering tests/morph test |
Test : FAILUREBuild number : 3457 |
Exporter Build : SUCCESSBuild number : 3284 |
CI started. |
Test run: SUCCESSSummary: 4 of 4 test jobs passed |
From the latest job, these are the numbers:
|
@bulislaw Ready for final approval? |
Are the numbers above bytes? |
@OPpuolitaival percentage ? The important bit I missed is the status reports the numbers :
|
well if it's 17% then we can't enable it by default... |
@ARMmbed/mbed-os-test Can you answer above question ^^ @SenRamakri Please review the logs (it tests dynamic usage - running some apps). From the docs: -
@ARMmbed/mbed-os-test Please update the docs to describe the numbers more in detail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, but I'd like us to get to the bottom of the memory reporting before approving.
I marked this as "ready for merge" but not merging yet. @OPpuolitaival Can you provide some numbers to backup earlier reports? We experienced issues with dynamic reports, and it's currently disabled in the CI pipeline. Therefore I question this numbers as well. @bulislaw Please advise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved as the number provided by Senthil are sensible
Description
This PR implements Crash-reporting feature. Please see the design-document for this feature at #8561
There is also example application for this feature at - https://github.com/ARMmbed/mbed-os-example-crash-reporting. The example application(mbed-os.lib) will be updated to point to right version once this PR is merged.
This PR is targeted for 5.11 release.
Pull request type