Crash Reporting implementation #8702

SenRamakri · 2018-11-10T15:48:10Z

Description

This PR implements the crash-reporting feature. See the design document for this feature at #8561.

There is also an example application for this feature at https://github.com/ARMmbed/mbed-os-example-crash-reporting. The example application's mbed-os.lib will be updated to point to the right version once this PR is merged.

This PR is targeted at the 5.11 release.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

...RGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_K66F/device/TOOLCHAIN_GCC_ARM/MK66FN2M0xxx18.ld

..._Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_GCC_ARM/MK64FN1M0xxx12.ld

cmsis/TARGET_CORTEX_M/mbed_fault_handler.c

platform/mbed_error.c

...Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_ARM_STD/MK64FN1M0xxx12.sct

...GET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_K66F/device/TOOLCHAIN_ARM_STD/MK66FN2M0xxx18.sct

...GET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/device/TOOLCHAIN_IAR/MK64FN1M0xxx12.icf

SenRamakri · 2018-11-13T04:44:01Z

@deepikabhavnani and @kegilbert - I have cleaned up the PR and resolved the conflict issues you pointed out. Please review.

deepikabhavnani

Changes look good to me. I have one query about the public APIs mbed_reset_reboot_error_info and mbed_reset_reboot_count: are there any limitations on when and where they can be used? If not, we need to consider thread safety here.

Example: Clearing an error count while an error is in progress, or during initialization of multiple threads/modules in the system, could cause issues.

kegilbert · 2018-11-14T00:43:13Z

platform/mbed_error.c

+ const unsigned int polynomial = 0x04C11DB7; /* divisor is 32bit */
+ unsigned int crc = 0; /* CRC value is 32bit */
+
+ for( ;datalen>=0; datalen-- ) {


Minor Style: Space between for and '('

kegilbert · 2018-11-14T00:43:41Z

platform/mbed_error.c

+ uint32_t crc_val = 0;
+ crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );
+ //Read report_error_ctx and check if CRC is correct for report_error_ctx
+ if(report_error_ctx->crc_error_ctx == crc_val) {


Minor Style: Space between if and (

kegilbert · 2018-11-14T00:43:49Z

platform/mbed_error.c

+mbed_error_status_t mbed_reset_reboot_count()
+{
+#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
+ if(is_reboot_error_valid) {


Minor Style: Space between if and (

kegilbert · 2018-11-14T00:43:57Z

platform/mbed_error.c

+ mbed_error_status_t status = MBED_ERROR_ITEM_NOT_FOUND;
+#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
+ if (is_reboot_error_valid) {
+ if(error_info != NULL) {


Minor Style: Space between if and (

kegilbert · 2018-11-14T00:44:24Z

platform/mbed_error.c

+
+ //Enforce max-reboot only if auto reboot is enabled
+#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED 
+ if( report_error_ctx->error_reboot_count > MBED_CONF_PLATFORM_ERROR_REBOOT_MAX ) {


Minor Style: Space between if and (

kegilbert · 2018-11-14T00:44:26Z

platform/mbed_error.c

+ uint32_t crc_val = 0;
+ crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );
+ //Read report_error_ctx and check if CRC is correct for report_error_ctx
+ if((report_error_ctx->crc_error_ctx == crc_val) && (report_error_ctx->is_error_processed == 0)) {


Minor Style: Space between if and (

kegilbert

Platform changes LGTM except for some very minor style nits above (didn't check out the linker script changes much).

kjbracey · 2018-11-14T11:23:39Z

cmsis/TARGET_CORTEX_M/mbed_fault_handler.c

-mbed_fault_context_t mbed_fault_context;
+#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
+ //Global for populating the context in exception handler
+ mbed_fault_context_t *mbed_fault_context=(mbed_fault_context_t *)((uint32_t)FAULT_CONTEXT_LOCATION);


Pointers should be const to save RAM (mbed_fault_context_t * const mbed_fault_context)
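
To illustrate the const suggestion, here is a minimal sketch. The `fault_ctx_t` struct, the `reserved_region` buffer and the two helper functions are hypothetical stand-ins; on a target the pointer would be initialised from a fixed address such as `FAULT_CONTEXT_LOCATION` rather than from a static buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for mbed_fault_context_t. */
typedef struct {
    uint32_t r0;
    uint32_t pc;
} fault_ctx_t;

/* The handler and the reader must agree on the layout of the reserved region. */
static_assert(sizeof(fault_ctx_t) == 8, "unexpected fault context layout");

/* Stand-in for the reserved crash-data RAM; assumed to be a fixed
 * linker-provided address on real hardware. */
static fault_ctx_t reserved_region;

/* The pointer itself is const: it can never be re-pointed, so the toolchain
 * is free to keep it in read-only memory or fold it into the code.
 * The data it points to stays writable. */
static fault_ctx_t *const fault_ctx = &reserved_region;

/* Example writer, roughly what a fault handler would do. */
void capture_pc(uint32_t pc)
{
    fault_ctx->pc = pc;
}

/* Example reader, used after reboot to inspect the captured value. */
uint32_t captured_pc(void)
{
    return fault_ctx->pc;
}
```

Without the `const` after the `*`, the pointer is an ordinary initialised variable and normally occupies RAM of its own.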

kjbracey · 2018-11-14T11:23:58Z

cmsis/TARGET_CORTEX_M/mbed_fault_handler.h

+ * @param fault_context Pointer to mbed_fault_context_t struct allocated by the caller. This is the mbed_fault_context_t info captured as part of the fatal exception which triggered the reboot.
+ * @return 0 or MBED_SUCCESS on success.
+ * MBED_ERROR_INVALID_ARGUMENT in case of invalid error_info pointer
+ * MBED_ERROR_ITEM_NOT_FOUND if no reboot context is currently captured by teh system 


kjbracey · 2018-11-14T11:25:06Z

platform/mbed_error.c

+
+#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
+ //Global for populating the context in exception handler
+ static mbed_error_ctx *report_error_ctx=(mbed_error_ctx *)((uint32_t)ERROR_CONTEXT_LOCATION);


(uint32_t) cast isn't doing anything I can see?

kjbracey · 2018-11-14T11:27:15Z

platform/mbed_error.c

+{
+#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
+ uint32_t crc_val = 0;
+ crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );


The complicated bit here is offsetof(mbed_error_ctx, crc_error_ctx), no?

Ah yes, thanks for pointing that out. offsetof is much better for readability.
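
For reference, a small sketch of the two forms side by side. The struct below is a hypothetical layout used only for illustration (the real `mbed_error_ctx` has different fields); the only assumption carried over from the PR is that `crc_error_ctx` comes after the data it protects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the CRC field is last, so everything before it is covered. */
typedef struct {
    uint32_t error_status;
    uint32_t error_reboot_count;
    uint32_t is_error_processed;
    uint32_t crc_error_ctx;
} example_error_ctx;

/* Guard the assumption that the CRC really is the final member. */
static_assert(offsetof(example_error_ctx, crc_error_ctx) + sizeof(uint32_t)
              == sizeof(example_error_ctx), "CRC must be the last field");

/* Number of bytes protected by the CRC, written with offsetof. */
size_t crc_covered_len(void)
{
    return offsetof(example_error_ctx, crc_error_ctx);
}

/* The pointer-subtraction form from the diff, for comparison. */
size_t crc_covered_len_manual(const example_error_ctx *ctx)
{
    return (size_t)((uintptr_t)&ctx->crc_error_ctx - (uintptr_t)ctx);
}
```

Both functions return the same value; the offsetof form needs no object and no integer casts.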

kjbracey · 2018-11-14T11:28:32Z

platform/mbed_error.c

+//we dont have many uses cases to create a C wrapper for MbedCRC and the data
+//we calculate CRC on in this context is very less we will use a local 
+//implementation here.
+static unsigned int compute_crc32(unsigned char *data, int datalen)


If this took const void * you'd only need one cast inside here, not 1 at every callsite.
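
A sketch of what that signature could look like. Only the polynomial and the zero initial value are taken from the quoted diff; the MSB-first bit order and the absence of a final XOR are assumptions, since the loop body is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: polynomial 0x04C11DB7, initial value 0,
 * MSB-first, no final XOR (the last two are assumed). */
uint32_t compute_crc32(const void *data, size_t datalen)
{
    const uint32_t polynomial = 0x04C11DB7u;
    const unsigned char *bytes = data;  /* implicit conversion, no cast needed in C */
    uint32_t crc = 0;

    assert(data != NULL || datalen == 0);

    for (size_t i = 0; i < datalen; i++) {
        crc ^= (uint32_t)bytes[i] << 24;        /* feed next byte into the top bits */
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80000000u) ? (crc << 1) ^ polynomial : (crc << 1);
        }
    }
    return crc;
}
```

A call site then becomes `compute_crc32(report_error_ctx, len)` with no cast. Using an unsigned `size_t` length with `i < datalen` also avoids any off-by-one question around a `datalen >= 0` loop condition.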

kjbracey · 2018-11-14T11:30:33Z

platform/mbed_error.c

+ //We need not call delete_mbed_crc(crc_obj) here as we are going to reset the system anyway, and calling delete while handling a fatal error may cause nested exception
+#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED
+ system_reset();//do a system reset to get the system rebooted
+ while(1);


while(1) is dead here. system_reset is MBED_NORETURN.

I would be strongly inclined to have an exponential backoff delay on each reboot. 5 seconds first time, say, and double each time.

while(1) was an attempt to stop the system in case system_reset() ever changed behavior. But I think it's confusing, so I will remove it.

I initially thought of having a delay before reboot, but later decided it won't help much unless we want to provide an opportunity to attach a debugger or something. Is that why we should add the backoff delay?

It's more a stability principle - each time the device boots, it may well be using extra network resources to reinitialise itself. The problem leading to crash may even be exacerbated by overall system load caused by other rebooting devices. A network could collapse as a result.

A general networking principle is that retries should be exponentially backed off to guarantee that the network is stable, and doesn't undergo congestion collapse. (Many protocols and standards use parameters based on section 14 of RFC 3315 - initial time, maximum time, maximum count...)

Obviously we would require someone at some point to reset the count, and hence the backoff, but you've left that application dependent. Were that to be reset too readily, there might be a problem.

In this case, maybe you don't want to literally double the "delay after crashing before rebooting", but rather double the "minimum time since we last rebooted". Not sure what your best timer for that would be - can't safely use Kernel::get_ms_count, so might need to have a boot-started `LowPowerTimer`?

If you're initially configuring the count very low though, this is less necessary - you'd want this with high or unlimited counts. So it's maybe an extension.

Thanks for the explanation and pointers. I'm currently setting the count to 1, so I will leave things as they are, but I will capture this as a note in my documentation on how to enable/configure crash-reporting. Hope that helps. If we increase the default count in future, I'll add something to delay the reboot on every subsequent reboot.
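
For whoever picks up that extension later, a minimal sketch of the backoff calculation discussed above: 5 seconds the first time, doubling on each reboot. The function name, the macro names and the one-hour cap are hypothetical; nothing like this exists in the PR.

```c
#include <assert.h>
#include <stdint.h>

#define BACKOFF_INITIAL_MS 5000u               /* 5 s before the first reboot */
#define BACKOFF_MAX_MS     (60u * 60u * 1000u) /* assumed cap: 1 hour */

static_assert(BACKOFF_INITIAL_MS <= BACKOFF_MAX_MS, "initial delay exceeds the cap");

/* Delay to apply before the next reboot, given how many automatic
 * reboots have already happened for this error. */
uint32_t reboot_backoff_ms(uint32_t reboot_count)
{
    uint32_t delay = BACKOFF_INITIAL_MS;

    /* Double once per previous reboot; stop early once the cap is reached
     * so the multiplication can never overflow. */
    for (uint32_t i = 0; i < reboot_count && delay < BACKOFF_MAX_MS; i++) {
        delay *= 2;
    }
    return delay < BACKOFF_MAX_MS ? delay : BACKOFF_MAX_MS;
}
```

The same value could serve as the "minimum time since we last rebooted" variant: compare it against a boot-started timer instead of sleeping for it.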

kjbracey · 2018-11-14T11:34:20Z

platform/mbed_lib.json

+ "DISCO_L475VG_IOT01A": {
+ "crash-capture-enabled": true,
+ "reboot-crash-report-enabled": true,
+ "fatal-error-auto-reboot-enabled": true


The reboot is a pretty big default behaviour functional change. Are we sure on that one?

Yeah, it's a big behavior change. But since the default config is to reboot only once (error_reboot_max = 1), it will reboot and halt, so it shouldn't impact any tests or other tools. Do you still think we should keep auto-reboot disabled? Are you seeing any issues?

I'm not seeing any particular issues myself, just seems like it might be a bit unexpected to end users.

I see that. I would leave it as it is for now unless I get a request to change it explicitly; hope that's ok.

Can we make sure that gets into the release notes? Talk to Mohit.

kjbracey · 2018-11-14T11:35:57Z

platform/mbed_error.c

@@ -369,17 +519,19 @@ static void print_error_report(const mbed_error_ctx *ctx, const char *error_msg,
 #endif

 #if MBED_CONF_PLATFORM_ERROR_ALL_THREADS_INFO && defined(MBED_CONF_RTOS_PRESENT)
- mbed_error_printf("\nNext:");
- print_thread(osRtxInfo.thread.run.next);
+ if(print_thread_info == true) {


Please don't use if (boolean == true). Just if (boolean).

kjbracey · 2018-11-14T11:43:57Z

platform/mbed_lib.json

+ "help": "Enables crash context capture when the system enters a fatal error/crash.",
+ "value": false
+ },
+ "error-reboot-max": {


It's not clear what 0 means here. Sounds as if it means don't reboot, but it's actually reboot once? 1 would mean reboot twice?

That's correct, 0 means it still causes 1 reboot. Maybe it's confusing as you see it, and it's also tricky to document, I guess. I'm going to change that so 0 means no reboot, 1 means 1 reboot, etc. Hope that helps.
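
The revised meaning can be written as a single comparison. This is a sketch of the intended semantics only; `auto_reboot_allowed` and its parameter names are hypothetical and not taken from the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if another automatic reboot is allowed.
 * reboots_so_far: automatic reboots already performed for this error.
 * reboot_max:     configured limit; 0 means never reboot, 1 means reboot once. */
bool auto_reboot_allowed(uint32_t reboots_so_far, uint32_t reboot_max)
{
    return reboots_so_far < reboot_max;
}
```

With a strict `<`, a limit of 0 never reboots and a limit of N reboots exactly N times, which matches "0 means no reboot, 1 means 1 reboot".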

OPpuolitaival · 2018-11-19T13:13:48Z

@bulislaw size change is reported by part of ci-morph-test which is not executed yet

cmonr · 2018-11-19T19:59:45Z

@bulislaw The size change report as it exists now is run after the Test CI job completes.
@kegilbert @studavekar can give more information.

kegilbert · 2018-11-19T22:30:24Z

@bulislaw In addition to the morph test comment, the benchmark tests are not being run as of now due to the office migration (new office network is not fully setup). Everything else should be running as expected. Will have the benchmark tests running ASAP.

0xc0170 · 2018-11-20T09:31:34Z

While we are waiting for the review to be finalized (the benchmark would be nice to have here!), we will run the build/exporters stage

/morph build

mbed-ci · 2018-11-20T10:14:02Z

Build : SUCCESS

Build number : 3681
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/8702/

Triggering tests

/morph test
/morph export-build
/morph mbed2-build

mbed-ci · 2018-11-20T11:05:06Z

Test : FAILURE

Build number : 3457
Test logs : http://mbed-os-logs.s3-website-us-west-1.amazonaws.com/?prefix=logs/8702/3457

mbed-ci · 2018-11-20T11:46:05Z

Exporter Build : SUCCESS

Build number : 3284
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/exporter/8702/

cmonr · 2018-11-22T02:09:25Z

CI started.

mbed-ci · 2018-11-22T08:17:23Z

Test run: SUCCESS

Summary: 4 of 4 test jobs passed
Build number : 13
Build artifacts
Build logs

0xc0170 · 2018-11-22T08:30:20Z

From the latest job, these are the numbers:

INFO 11/22/2018 03:46:47 AM 
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|
| APPLICATION |  TEXT | DATA |  BSS  | HEAP | STACK | RESERVED_HEAP | RESERVED_STACK |  ROM  |  RAM  |
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|
|   ethernet  |  7.61 | 0.0  |  -0.1 | 0.0  |  0.0  |      0.04     |      0.0       |  7.4  | -0.09 |
|  filesystem | 10.52 | 0.0  | -0.77 | 0.0  |  0.0  |      0.03     |      0.0       | 10.12 | -0.14 |
|     wifi    |  9.73 | 0.0  | -0.53 | 0.0  |  0.0  |      0.03     |      0.0       |  9.38 | -0.33 |
|     rtos    | 17.34 | 0.0  | -0.49 | 0.0  |  0.0  |      0.03     |      0.0       | 16.31 |  -0.2 |
|   baseline  | 19.16 | 0.0  |  -0.8 | 0.0  |  0.0  |      0.03     |      0.0       | 17.91 | -0.52 |
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|

0xc0170 · 2018-11-22T08:31:29Z

@bulislaw Ready for final approval?

bulislaw · 2018-11-22T12:48:16Z

Are the numbers above bytes?

0xc0170 · 2018-11-22T13:17:18Z

@OPpuolitaival percentage ?

The important bit I missed is that the status reports the numbers:

jenkins-ci/dynamic-memory-usage — Success, RTOS ROM(+8106 bytes) RAM(-56 bytes)

bulislaw · 2018-11-22T13:59:18Z

well if it's 17% then we can't enable it by default...

SenRamakri · 2018-11-22T14:57:44Z

@0xc0170 @bulislaw - I'm a little bit confused about those numbers. When I did local builds for GCC-RELEASE, I got totally different results. The +8K bytes shown here is inconceivable. Was this ever part of a roll-up?

0xc0170 · 2018-11-22T17:16:16Z

@ARMmbed/mbed-os-test Can you answer the above question ^^

@SenRamakri Please review the logs (it tests dynamic usage - running some apps).

From the docs: -

jenkins-ci/dynamic-memory-usage - Report dynamic memory use compared to the master branch.

@ARMmbed/mbed-os-test Please update the docs to describe the numbers in more detail

bulislaw

Code looks good, but I'd like us to get to the bottom of the memory reporting before approving.

0xc0170 · 2018-11-23T19:18:12Z

I marked this as "ready for merge" but am not merging yet. @OPpuolitaival Can you provide some numbers to back up the earlier reports?

We experienced issues with dynamic reports, and they are currently disabled in the CI pipeline. Therefore I question these numbers as well. @bulislaw Please advise

bulislaw

Approved, as the numbers provided by Senthil are sensible

SenRamakri requested review from kjbracey, deepikabhavnani and bulislaw November 10, 2018 15:48