Consistent error handling #4008

cepetr · 2024-07-09T08:14:21Z

This PR proposes a consistent approach to error handling in the C codebase. The proposal includes the following:

Introduction of system-wide error codes (ts_t type) with predefined errors prefixed with TS_ (Trezor status) and built-in fault injection hardening.
A straightforward mechanism/framework for error handling within functions, utilizing goto statements to facilitate cleanup at the end of the function.
Addition of new ensure() macro types to simplify top-level code.
Simplification of STM32 HAL error handling with the hal_status_to_ts() mapping function and the TS_CHECK_HAL_OK() macro.

This PR consists of three commits. Two of them serve as examples: a refactored haptic driver and a refactored SD card driver.

Error handling pattern:

ts_t module_function(int arg1, int arg2) {
  // Initialize the `__status` variable and set it to `TS_OK`.
  // This should be on the first line inside the function (rule 1).
  TS_INIT; 

  // Check arguments to ensure they are greater than 0.
  TS_CHECK_ARG(arg1 > 0);
  TS_CHECK_ARG(arg2 > 0);
  
  // Check the result of another function with a `ts_t` return value.
  // The `status` value is properly propagated.
  // Splitting it into two lines makes debugging easier (rule 2).
  ts_t status = another_function();
  TS_CHECK_OK(status);
  
  // Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`.
  // The HAL_xxx error is properly mapped to TS_xxx and propagated.
  HAL_StatusTypeDef hal_status = HAL_function();
  TS_CHECK_HAL_OK(hal_status);
  
  // Check a generic condition inside the function and propagate the status code
  // if the condition is not fulfilled.
  TS_CHECK(arg1 * arg2 < 1000, TS_ERROR);

  // Alternatively, use the secure variant working with `secbool`.
  secbool ready = usb_ready();
  TS_CHECK_SEC(ready);
  
  // In some cases it's useful to return before the cleanup code, (rule 3: never use `return ...` directly)
  // TS_RETURN;
  
cleanup:
  // Potential cleanup code goes here.
  // ...

  // Return the content of the `__status` variable.
  // In case of no error, it's `TS_OK`.
  TS_RETURN;
}

We can now use the ensure_xxx() set of macros similar to verify_xxx() for handling fatal errors. The ensure_xxx() macros display an error message on the screen and shut down the device as before. (Note: This feature is not yet complete in this PR.)

  // The generic condition must be `true`
  //  TS_CHECK(condition, ts_t) ->  
  ensure(condition, message);

  // Status must be `TS_OK`
  // TS_CHECK_OK(status) -> 
  ensure_ok(status, message);

  // The generic condition (secbool) must be `sectrue`
  // TS_CHECK_SEC(secbool, status) -> 
  ensure_sec(secbool, message);

I understand that this is a completely new approach, which is not very common and requires some time to get used to. However, it is very simple and quickly understandable. I will not merge this until we reach a consensus that this is the right approach. Please feel free to provide any comments or suggestions.

[no changelog]

github-actions · 2024-07-09T08:39:58Z

core UI changes	device test	click test	persistence test
T2T1 Model T	test(screens) main(screens)	test(screens) main(screens)	test(screens) main(screens)
T2B1 Safe 3	test(screens) main(screens)	test(screens) main(screens)	2724
T3T1	test(screens) main(screens)	test(screens) main(screens)	test(screens) main(screens)
All	main(screens)

hiviah · 2024-07-09T13:00:24Z

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function.
In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

andrewkozlik · 2024-07-09T15:10:41Z

core/embed/lib/error_handling.h

+ ts_t _status = status; \
+ if (ts_error(_status)) { \
+ __status = _status; \
+ goto error; \


I don't like the idea of using macros that perform jumps or return. It looks like a function, but has an unexpected side effect for someone who is not intimately familiar with it.

I understand, it's a lot about getting used to it. People are always confused when they see it for the first time. However, this way of coding is very compact, and my experience is that it ends up being very easy to read.

In past I've used more weird name instead of verify_xxx() so the side effect was expected;-) Maybe different name could help.

Another perspective is that using this macro enables us to log all errors in the debug or emulator build.

I don't doubt that it is very compact and I am sure that if all devs in the team would work with it daily, it would be very easy to read and they would find nothing unexpected about the side effects. However, I think that most of the development now gets done in Python and in Rust. So if a dev needs to work with the C code only occasionally, then they have to relearn the idioms again each time. Furthermore, since we are open-source, anyone who wants to look at our code would have to familiarize themselves with this idiom before being able to understand what's going on.

One example of macros with side effects that we have in our code are those in legacy/firmware/fsm.c: CHECK_PIN, CHECK_PIN_UNCACHED etc. I feel it's always a real pain to touch the code that uses these macros. They make the code compact, but it would be easier for me if they were inlined.

That being said I am not strictly against this idiom. I am just very hesitant.

I understand your point of view. Unexpected side effects are almost always inappropriate - CHECK_PIN() and its counterparts are good examples. But I don't believe that verify_xxx() falls into this category since it's a cross-cutting concept used throughout the entire codebase (hopefully in the future:-), and the side effect is quite apparent and local.

Our code is open source and should be readable for anybody without special 'training'. However, I think it already contains a lot of complexity, so it's not true that it's accessible to everybody for a long time.

But yes, I understand your concerns.

If we do decide to go ahead with this idiom, then we should at least use uppercase for the macros that have side effects to emphasize to the reader that they are macros. verify_ok() -> TS_CHECK()

core/embed/lib/error_handling.h

cepetr · 2024-07-09T15:19:02Z

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

Yes, good point. Hamming distance between TS_OK and TS_ERROR is low. There are certainly several ways to improve it. I'll think about it and try to suggest something.

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function. In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

I'm not sure that the __attribute__((cleanup(func))) approach is what we are looking for, although it's interesting. Cleanup code is not just about freeing resources, but also about performing some reverting actions, setting output arguments, etc. This might not be easy to do in a separate function. In some situations, you don't want to call any cleanup code if the function succeeds.

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

Yes, it's a bit unusual and requires some getting used to.

error or cleanup... I have no problem with either. Both work well, but neither is perfect (finally changed in 7b70f0a).

andrewkozlik · 2024-07-09T15:25:15Z

core/embed/lib/error_handling.h

+// Status code is hardened against fault injections
+// by storing the negated value in the upper 16 bits.
+typedef enum {
+ TS_OK = TS_BUILD(0),
+ TS_ERROR = TS_BUILD(1), // Generic error
+ TS_ERROR_BUSY = TS_BUILD(2), // Busy
+ TS_ERROR_TIMEOUT = TS_BUILD(3), // Timeout
+ TS_ERROR_NOTINIT = TS_BUILD(4), // Not initialized
+ TS_ERROR_ARG = TS_BUILD(4), // Invalid argument
+ TS_ERROR_IO = TS_BUILD(5), // I/O error
+} ts_t;


I wonder what the set of error codes should be here. This particular set seems very specific to trezorhal. On the other hand trezor-crypto, for example, would use a completely different set of codes, like TS_ERROR_SIGNATURE_INVALID. Would this enum just be a large pile of error codes from various areas?

My impression is that if it is supposed to be a universal error type, then the set should be either very trivial (sectrue, secfalse) or maybe it should be something abstract like the protobuf FailureType.

This approach is similar to the errno codes used in the libc - all codes are defined in one place. If you need a new error code, define it here, but try to come up with the most generic name so that someone else can use it as well.

However, we can change it. The error type does not need to be an enum; we can use #define and allow modules to define their own error codes. For example, the storage or optiga driver can then have specific error codes defined elsewhere.

Not sure what's better. Both ways have their own pros and cons.

Also, ts_t shouldn't be an enum. All the values are non-zero, so if it is accidentally used as

if (module_function()) { // ... }

instead of

if (verify_ok(module_function())) { // ... }

then it will always evaluate to true without warning. We need to do something like:

typedef struct { int status_code; } ts_t; #define TS_OK ((const ts_t) {TS_BUILD(0)}) // ...

Yes, I agree. That's a good point. This approach eliminates a lot of bugs.

I've changed ts_t to a struct in this commit cf075b5. This change helped me to find some bugs that were introduced during the sdcard driver refactoring...

By changing the enum to an integer, we have lost the debugger's ability to show error codes in a readable form. However, we can easily fix this by using Natvis.

I've also thought about module-specific error codes. Look at the notes below. In the end, we only need to add two new codes that are not so specific and can be used elsewhere:

One common set of error codes also allows us to create a function that returns an error name string.

typedef enum _optiga_pin_result { OPTIGA_PIN_SUCCESS = 0, // The operation completed successfully. OPTIGA_PIN_INVALID, // The PIN is invalid. OPTIGA_PIN_COUNTER_EXCEEDED, // The PIN try counter limit was exceeded. OPTIGA_PIN_ERROR, // Optiga processing or communication error. } optiga_pin_result; OPTIGA_PIN_SUCCESS => TS_OK OPTIGA_PIN_INVALID => TS_ERROR_ARG OPTIGA_PIN_COUNTER_EXCEEDED => new error code: TS_ERROR_TRY_LIMIT OPTIGA_PIN_ERROR => TS_ERROR_IO typedef enum _optiga_sign_result { OPTIGA_SIGN_SUCCESS = 0, // The operation completed successfully. OPTIGA_SIGN_INACCESSIBLE, // The signing key is inaccessible. OPTIGA_SIGN_ERROR, // Invalid parameters or Optiga processing or // communication error. } optiga_sign_result; OPTIGA_SIGN_SUCCESS => TS_OK OPTIGA_SIGN_INCACESSIBLE => new error code: TS_ERROR_SIGNING_KEY OPTIGA_SIGN_ERROR => TS_ERROR_IO

I am not a fan of mapping the codes like OPTIGA_PIN_INVALID -> TS_ERROR_ARG. That just seems error-prone, e.g. if some sub-function returns TS_ERROR_ARG, then we have to watch out to translate that to OPTIGA_PIN_ERROR or TS_ERROR_IO, to distinguish it from an invalid PIN. It's a pretty important distinction. If the PIN is invalid, then the user should be offered to retry, but if there is a runtime error, then we probably want the user to restart their device, rather than panic and deplete their PIN attempts.

Ok, I understand. Maybe this case requires a more specific error code. We should also consider whether error codes should be used for situations like 'invalid PIN.' Is it really an error, or is it just invalid user input? Should be treated the same way as i/o error or invalid argument?

andrewkozlik · 2024-07-09T16:37:53Z

core/embed/lib/error_handling.h

+// Jumps to `error` label if the condition is not `sectrue`.
+#define verify_sec(seccond, status) \
+ do { \
+ if ((seccond) != sectrue) { \
+ __status = status; \
+ goto error; \
+ } \
+ } while (0)


Looking at the macros defined here, it looks like we are not getting rid of secbool. Our code currently uses bool, secbool, a custom enum (e.g. optiga code) or a custom set of integers (e.g. trezor-crypto) to indicate the return status. The most familiar, compact and easy to use is bool. A more secure variant is secbool, which should have better resilience against fault injection, but it can be cumbersome to use. The enums are also fine, but they lack type safety. Finally, the custom integers are an abomination. I think that we should be aiming to narrow down and consolidate these different approaches. That could mean for example making secbool less cumbersome to use and possibly building on top of it, but not building yet another approach alongside it.

I've been thinking about it too. However, I still don't know how to reasonably connect it. Using two different types to represent boolean values isn't ideal.

Maybe after the this change cf075b5, 'secfalse' is more like TS_OK and sectrue is more like TS_BUILD(0xAAAA)... I will think about it.

andrewkozlik · 2024-07-10T08:41:11Z

core/embed/lib/error_handling.h

+#define ts_ok(status) (_ts_checked(status).code == TS_OK.code)
+
+// Returns `true` if status code is NOT `TS_OK`
+#define ts_error(status) (_ts_checked(status).code != TS_OK.code)


If there will be multiple error states, then we should also have ts_eq() and possibly ts_neq(), although that's equivalent to !ts_eq(), but might be more easy to read. Similarly ts_error() vs. !ts_ok().

It might be also nice to introduce a ts_switch(status) macro which expands to switch(_ts_checked(status).code).

I think the mechanism needs to allow easy translation of error codes and the ability to suppress certain errors or treat them as success. This should make that possible.

I've added ts_eq() and ts_string(), see 8b027fb.

The following code cannot be compiled since TS_xxx is a structure. For some reason, the compiler cannot deduce that TS_OK.code is a constant expression... :-(.

switch (status.code) { case TS_OK.code: case TS_ERROR.code: }

Maybe we can give up using case for error codes and always use if. But I'll try to find some solution.

#define ts_case(status) case status.code ts_switch(f()) { ts_case(TS_OK): // ... break; default: // ... }

You mentioned IRL some issue with the compiler not being able to treat the case value as a constant. In C++ defining TS_OK etc. using constexpr would solve this, but that probably isn't an option in C. Another idea is something like this:

#define ts_case(status) case status##__value #define TS_OK__value TS_BUILD(0) #define TS_OK ((const ts_t){TS_OK__value})

core/embed/lib/error_handling.h

andrewkozlik · 2024-07-10T08:54:30Z

core/embed/lib/error_handling.h

+#define _ts_checked(status) \
+ ({ \
+ ts_t _checked = (status); \
+ if ((_checked.code & 0xFFFF) != (_checked.code >> 16)) { \
+ __fatal_error("ts_check() error", __FILE_NAME__, __LINE__); \
+ } \
+ _checked; \
+ })


I wonder how much size and time overhead this will add throughout the entire code base and whether it's worth it. It could also be a barrier to optimization.

I haven't looked at the assembly code yet, so I can't tell what the cost is. That's just an idea and we can omit this code if we think it's not worth it. I think this could be also optional.

andrewkozlik · 2024-07-10T09:22:59Z

core/embed/trezorhal/stm32u5/haptic/drv2625/drv2625.c

+ status = drv2625_set_reg(DRV2625_REG_LIBRARY,
+ LIB_SEL | DRV2625_REG_LIBRARY_GAIN_25);
+ verify_ok(status);


What's the use of the status variable here?

If you split this into two lines, you can place a breakpoint at the line containing verify_ok() to observe the operation status. This pattern is very useful.

andrewkozlik · 2024-07-10T09:29:57Z

core/embed/lib/error_handling.h

+#define verify_init() __attribute__((unused)) ts_t __status = TS_OK;
+
+// Returns the current status.
+#define verify_status() (__status)


The naming of the macros above is confusing. verify_*() usually have a flow control side effect, but these don't. Maybe consider #define TS_RETURN return __status and not allow accessing __status directly at all.

Ok, what about this:

ts_t module_function(int arg1, int arg2) { // Initialize the `__status` variable and set it to `TS_OK`. // This should be on the first line inside the function (rule 1). TS_INIT; // Check arguments to ensure they are greater than 0. TS_CHECK_ARG(arg1 > 0); TS_CHECK_ARG(arg2 > 0); // Check the result of another function with a `ts_t` return value. // The `status` value is properly propagated. // Splitting it into two lines makes debugging easier (rule 2). ts_t status = another_function(); TS_CHECK_OK(status); // Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`. // The HAL_xxx error is properly mapped to TS_xxx and propagated. HAL_StatusTypeDef hal_status = HAL_function(); TS_CHECK_HAL_OK(hal_status); // Check a generic condition inside the function and propagate the status code // if the condition is not fulfilled. TS_CHECK(arg1 * arg2 < 1000, TS_ERROR); // Alternatively, use the secure variant working with `secbool`. secbool ready = usb_ready(); TS_CHECK_SEC(ready); // In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`) TS_RETURN; error: // Potential cleanup code goes here. // ... // Return the content of the `__status` variable. // In case of no error, it's `TS_OK`. TS_RETURN; }

But I personally prefer:

// In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`) return TS_STATUS; error: // Potential cleanup code goes here. // ... // Return the content of the `__status` variable. // In case of no error, it's `TS_OK`. return TS_STATUS; }

But I'm getting used to it. Maybe I like TS_RETURN more... :-)

I made these changes here 7b70f0a

andrewkozlik · 2024-07-10T10:43:23Z

I created a document in notion summarizing my concerns.
https://www.notion.so/satoshilabs/Unify-error-handling-in-C-aebe0fe5afa1417996329350869835c4

cepetr added 3 commits July 9, 2024 09:31

feat(core/embed): introduce a new error handling mechanism

76c1dcc

[no changelog]

refactor(core/embed): improve haptic driver error handling

a3d6e1c

[no changelog]

refactor(core/embed): improve sdcard driver error handling

1e8b421

[no changelog]

cepetr self-assigned this Jul 9, 2024

cepetr requested a review from TychoVrahe July 9, 2024 08:21

andrewkozlik reviewed Jul 9, 2024

View reviewed changes

core/embed/lib/error_handling.h Outdated Show resolved Hide resolved

andrewkozlik reviewed Jul 9, 2024

View reviewed changes

cepetr added 2 commits July 10, 2024 09:50

fix: improved hamming distance, ts_t is struct now

cf075b5

fix: sdcard driver error handling

96048d6

cepetr force-pushed the cepetr/error-handling branch from 6ddf428 to 96048d6 Compare July 10, 2024 08:17

fix: haptic driver error handling

7039437

andrewkozlik reviewed Jul 10, 2024

View reviewed changes

core/embed/lib/error_handling.h Outdated Show resolved Hide resolved

andrewkozlik reviewed Jul 10, 2024

View reviewed changes

cepetr added 2 commits July 10, 2024 11:54

add: ts_eq and ts_string helpers

8b027fb

fix: formatting

8b6de53

cepetr added 3 commits July 11, 2024 09:30

fix: use uppercase in macro names

7b70f0a

fix: fix sdcard driver to use upper case macro names

3f50f36

fix: fix haptic drive to use upper case macro names

651a696

Consistent error handling #4008

Are you sure you want to change the base?

Consistent error handling #4008

Conversation

cepetr commented Jul 9, 2024 • edited Loading

github-actions bot commented Jul 9, 2024

hiviah commented Jul 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewkozlik Jul 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cepetr commented Jul 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewkozlik Jul 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cepetr Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

andrewkozlik Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewkozlik Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewkozlik commented Jul 10, 2024

cepetr commented Jul 9, 2024 •

edited

Loading

andrewkozlik Jul 9, 2024 •

edited

Loading

cepetr commented Jul 9, 2024 •

edited

Loading

andrewkozlik Jul 9, 2024 •

edited

Loading

cepetr Jul 10, 2024 •

edited

Loading

andrewkozlik Jul 10, 2024 •

edited

Loading

andrewkozlik Jul 11, 2024 •

edited

Loading