Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash Reporting implementation #8702

Merged
merged 9 commits into from
Nov 23, 2018

Conversation

SenRamakri
Copy link
Contributor

@SenRamakri SenRamakri commented Nov 10, 2018

Description

This PR implements Crash-reporting feature. Please see the design-document for this feature at #8561

There is also example application for this feature at - https://github.com/ARMmbed/mbed-os-example-crash-reporting. The example application(mbed-os.lib) will be updated to point to right version once this PR is merged.

This PR is targeted for 5.11 release.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

@SenRamakri
Copy link
Contributor Author

@deepikabhavnani and @kegilbert - I have cleaned up the PR, conflict issues you pointed out have been removed. Please review.

Copy link

@deepikabhavnani deepikabhavnani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, I have one query related to public API's - mbed_reset_reboot_error_info and mbed_reset_reboot_count - Do we have any limitation on when and where they can be used? Else we need to consider the thread safety here.

Example: Clearing an error count when error is in progress, during initialization of multiple threads / modules in system could cause issue.

const unsigned int polynomial = 0x04C11DB7; /* divisor is 32bit */
unsigned int crc = 0; /* CRC value is 32bit */

for( ;datalen>=0; datalen-- ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between for and '('

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

uint32_t crc_val = 0;
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );
//Read report_error_ctx and check if CRC is correct for report_error_ctx
if(report_error_ctx->crc_error_ctx == crc_val) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between if and (

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

mbed_error_status_t mbed_reset_reboot_count()
{
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
if(is_reboot_error_valid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between if and (

mbed_error_status_t status = MBED_ERROR_ITEM_NOT_FOUND;
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
if (is_reboot_error_valid) {
if(error_info != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between if and (


//Enforce max-reboot only if auto reboot is enabled
#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED
if( report_error_ctx->error_reboot_count > MBED_CONF_PLATFORM_ERROR_REBOOT_MAX ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between if and (

uint32_t crc_val = 0;
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );
//Read report_error_ctx and check if CRC is correct for report_error_ctx
if((report_error_ctx->crc_error_ctx == crc_val) && (report_error_ctx->is_error_processed == 0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Style: Space between if and (

Copy link
Contributor

@kegilbert kegilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Platform changes LGTM except for some very minor style nits above (didn't check out the linker script changes much).

mbed_fault_context_t mbed_fault_context;
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
//Global for populating the context in exception handler
mbed_fault_context_t *mbed_fault_context=(mbed_fault_context_t *)((uint32_t)FAULT_CONTEXT_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointers should be const to save RAM (mbed_fault_context * const mbed_fault_context)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

* @param fault_context Pointer to mbed_fault_context_t struct allocated by the caller. This is the mbed_fault_context_t info captured as part of the fatal exception which triggered the reboot.
* @return 0 or MBED_SUCCESS on success.
* MBED_ERROR_INVALID_ARGUMENT in case of invalid error_info pointer
* MBED_ERROR_ITEM_NOT_FOUND if no reboot context is currently captured by teh system
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the


#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
//Global for populating the context in exception handler
static mbed_error_ctx *report_error_ctx=(mbed_error_ctx *)((uint32_t)ERROR_CONTEXT_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(uint32_t) cast isn't doing anything I can see?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

{
#if MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
uint32_t crc_val = 0;
crc_val = compute_crc32( (unsigned char *)report_error_ctx, ((uint32_t)&(report_error_ctx->crc_error_ctx) - (uint32_t)report_error_ctx) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complicated bit here is offsetof(mbed_error_ctx, crc_error_ctx), no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, thanks for pointing that out and offsetof is much easier for readability.

//we dont have many uses cases to create a C wrapper for MbedCRC and the data
//we calculate CRC on in this context is very less we will use a local
//implementation here.
static unsigned int compute_crc32(unsigned char *data, int datalen)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this took const void * you'd only need one cast inside here, not 1 at every callsite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

//We need not call delete_mbed_crc(crc_obj) here as we are going to reset the system anyway, and calling delete while handling a fatal error may cause nested exception
#if MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED
system_reset();//do a system reset to get the system rebooted
while(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while(1) is dead here. system_reset is MBED_NORETURN.

I would be strongly inclined to have an exponential backoff delay on each reboot. 5 seconds first time, say, and double each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while(1) was an attempt to stop the system in case system_reset() ever changed behavior. But I think its confusing, so will remove it.

I initially thought of having delay before reboot, but later I decided it wont help much, unless we want to provide opportunity to attach a debugger or something, is that why we should add the backoff delay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more a stability principle - each time the device boots, it may well be using extra network resources to reinitialise itself. The problem leading to crash may even be exacerbated by overall system load caused by other rebooting devices. A network could collapse as a result.

A general networking principle is that retries should be exponentially backed off to guarantee that the network is stable, and doesn't undergo congestion collapse. (Many protocols and standards use parameters based on chapter 14 of RFC 3315 - initial time, maximum time, maximum count...)

Obviously we would require someone at some point to reset the count, and hence the backoff, but you've left that application dependent. Were that to be reset too readily, there might be a problem.

In this case, maybe you don't want to literally double the "delay after crashing before rebooting", but rather double the "minimum time since we last rebooted". Not sure what your best timer for that would be - can't safely use Kernel::get_ms_count, so might need to have a boot-started `LowPowerTimer'?

If you're initially configuring the count very low though, this is less necessary - you'd want this with high or unlimited counts. So it's maybe an extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation and pointers, currently I'm setting the count to 1 for now, so I would leave things as it is. But I would capture this as a note in my documentation about how to enable/configure crash-reporting. Hope that helps. And in future if we are increasing the default count, I'll add something to delay the reboot on every subsequent reboot.

"DISCO_L475VG_IOT01A": {
"crash-capture-enabled": true,
"reboot-crash-report-enabled": true,
"fatal-error-auto-reboot-enabled": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reboot is a pretty big default behaviour functional change. Are we sure on that one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah its a big behavior change. But since the default config is to reboot only once(error_reboot_max = 1) it will reboot and halt, that way it shouldn't impact any tests or other tools. Do you still think we should keep the auto-reboot disabled, any issues you are seeing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any particular issues myself, just seems like it might be a bit unexpected to end users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that, I would leave that as it is for now unless I get a request to change it explicitly, hope that's ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure that gets into the release notes, talk to Mohit.

@@ -369,17 +519,19 @@ static void print_error_report(const mbed_error_ctx *ctx, const char *error_msg,
#endif

#if MBED_CONF_PLATFORM_ERROR_ALL_THREADS_INFO && defined(MBED_CONF_RTOS_PRESENT)
mbed_error_printf("\nNext:");
print_thread(osRtxInfo.thread.run.next);
if(print_thread_info == true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use if (boolean == true). Just if (boolean).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"help": "Enables crash context capture when the system enters a fatal error/crash.",
"value": false
},
"error-reboot-max": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear what 0 means here. Sounds as if it means don't reboot, but it's actually reboot once? 1 would mean reboot twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, 0 means it still causes 1 reboot. May be its confusing as you see it, its also tricky to document that I guess. I'm going to change that to - 0 means no reboot, 1 means 1 reboot etc. Hope that helps.

@OPpuolitaival
Copy link
Contributor

@bulislaw size change is reported by part of ci-morph-test which is not executed yet

@cmonr
Copy link
Contributor

cmonr commented Nov 19, 2018

@bulislaw The size change report as it exists now is run after the Test CI job completes.
@kegilbert @studavekar can give more information.

@kegilbert
Copy link
Contributor

@bulislaw In addition to the morph test comment, the benchmark tests are not being run as of now due to the office migration (new office network is not fully setup). Everything else should be running as expected. Will have the benchmark tests running ASAP.

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 20, 2018

While we are waiting for finalizing the review (the benchmark would be nice to have here!), we run the build/exporters stage

/morph build

@mbed-ci
Copy link

mbed-ci commented Nov 20, 2018

Build : SUCCESS

Build number : 3681
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/8702/

Triggering tests

/morph test
/morph export-build
/morph mbed2-build

@mbed-ci
Copy link

mbed-ci commented Nov 20, 2018

@mbed-ci
Copy link

mbed-ci commented Nov 20, 2018

@cmonr
Copy link
Contributor

cmonr commented Nov 22, 2018

CI started.

@mbed-ci
Copy link

mbed-ci commented Nov 22, 2018

Test run: SUCCESS

Summary: 4 of 4 test jobs passed
Build number : 13
Build artifacts
Build logs

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 22, 2018

From the latest job, these are the numbers:

INFO 11/22/2018 03:46:47 AM 
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|
| APPLICATION |  TEXT | DATA |  BSS  | HEAP | STACK | RESERVED_HEAP | RESERVED_STACK |  ROM  |  RAM  |
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|
|   ethernet  |  7.61 | 0.0  |  -0.1 | 0.0  |  0.0  |      0.04     |      0.0       |  7.4  | -0.09 |
|  filesystem | 10.52 | 0.0  | -0.77 | 0.0  |  0.0  |      0.03     |      0.0       | 10.12 | -0.14 |
|     wifi    |  9.73 | 0.0  | -0.53 | 0.0  |  0.0  |      0.03     |      0.0       |  9.38 | -0.33 |
|     rtos    | 17.34 | 0.0  | -0.49 | 0.0  |  0.0  |      0.03     |      0.0       | 16.31 |  -0.2 |
|   baseline  | 19.16 | 0.0  |  -0.8 | 0.0  |  0.0  |      0.03     |      0.0       | 17.91 | -0.52 |
|-------------|-------|------|-------|------|-------|---------------|----------------|-------|-------|

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 22, 2018

@bulislaw Ready for final approval?

@bulislaw
Copy link
Member

Are the numbers above bytes?

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 22, 2018

@OPpuolitaival percentage ?

The important bit I missed is the status reports the numbers :

jenkins-ci/dynamic-memory-usage — Success, RTOS ROM(+8106 bytes) RAM(-56 bytes)

@bulislaw
Copy link
Member

well if it's 17% then we can't enable it by default...

@SenRamakri
Copy link
Contributor Author

@0xc0170 @bulislaw - I'm little bit confused about those numbers. When I did local builds for GCC-RELEASE, I get totally different results. +8K bytes shown here is inconceivable. was this ever part of roll-up?

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 22, 2018

@ARMmbed/mbed-os-test Can you answer above question ^^

@SenRamakri Please review the logs (it tests dynamic usage - running some apps).

From the docs: -

jenkins-ci/dynamic-memory-usage - Report dynamic memory use compared to the master branch.

@ARMmbed/mbed-os-test Please update the docs to describe the numbers more in detail

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, but I'd like us to get to the bottom of the memory reporting before approving.

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 23, 2018

I marked this as "ready for merge" but not merging yet. @OPpuolitaival Can you provide some numbers to backup earlier reports?

We experienced issues with dynamic reports, and it's currently disabled in the CI pipeline. Therefore I question this numbers as well. @bulislaw Please advise

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved as the number provided by Senthil are sensible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.