Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent error handling #4008

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from
Draft

Consistent error handling #4008

wants to merge 11 commits into from

Conversation

cepetr
Copy link
Contributor

@cepetr cepetr commented Jul 9, 2024

This PR proposes a consistent approach to error handling in the C codebase. The proposal includes the following:

  1. Introduction of system-wide error codes (ts_t type) with predefined errors prefixed with TS_ (Trezor status) and built-in fault injection hardening.
  2. A straightforward mechanism/framework for error handling within functions, utilizing goto statements to facilitate cleanup at the end of the function.
  3. Addition of new ensure() macro types to simplify top-level code.
  4. Simplification of STM32 HAL error handling with the hal_status_to_ts() mapping function and the TS_CHECK_HAL_OK() macro.

This PR consists of three commits. Two of them serve as examples: a refactored haptic driver and a refactored SD card driver.

Error handling pattern:

ts_t module_function(int arg1, int arg2) {
  // Initialize the `__status` variable and set it to `TS_OK`.
  // This should be on the first line inside the function (rule 1).
  TS_INIT; 

  // Check arguments to ensure they are greater than 0.
  TS_CHECK_ARG(arg1 > 0);
  TS_CHECK_ARG(arg2 > 0);
  
  // Check the result of another function with a `ts_t` return value.
  // The `status` value is properly propagated.
  // Splitting it into two lines makes debugging easier (rule 2).
  ts_t status = another_function();
  TS_CHECK_OK(status);
  
  // Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`.
  // The HAL_xxx error is properly mapped to TS_xxx and propagated.
  HAL_StatusTypeDef hal_status = HAL_function();
  TS_CHECK_HAL_OK(hal_status);
  
  // Check a generic condition inside the function and propagate the status code
  // if the condition is not fulfilled.
  TS_CHECK(arg1 * arg2 < 1000, TS_ERROR);

  // Alternatively, use the secure variant working with `secbool`.
  secbool ready = usb_ready();
  TS_CHECK_SEC(ready);
  
  // In some cases it's useful to return before the cleanup code, (rule 3: never use `return ...` directly)
  // TS_RETURN;
  
cleanup:
  // Potential cleanup code goes here.
  // ...

  // Return the content of the `__status` variable.
  // In case of no error, it's `TS_OK`.
  TS_RETURN;
}

We can now use the ensure_xxx() set of macros similar to verify_xxx() for handling fatal errors. The ensure_xxx() macros display an error message on the screen and shut down the device as before. (Note: This feature is not yet complete in this PR.)

  // The generic condition must be `true`
  //  TS_CHECK(condition, ts_t) ->  
  ensure(condition, message);

  // Status must be `TS_OK`
  // TS_CHECK_OK(status) -> 
  ensure_ok(status, message);

  // The generic condition (secbool) must be `sectrue`
  // TS_CHECK_SEC(secbool, status) -> 
  ensure_sec(secbool, message);

I understand that this is a completely new approach, which is not very common and requires some time to get used to. However, it is very simple and quickly understandable. I will not merge this until we reach a consensus that this is the right approach. Please feel free to provide any comments or suggestions.

@cepetr cepetr self-assigned this Jul 9, 2024
@cepetr cepetr requested a review from TychoVrahe July 9, 2024 08:21
Copy link

github-actions bot commented Jul 9, 2024

core UI changes device test click test persistence test
T2T1 Model T test(screens) main(screens) test(screens) main(screens) test(screens) main(screens)
T2B1 Safe 3 test(screens) main(screens) test(screens) main(screens) 2724
T3T1 test(screens) main(screens) test(screens) main(screens) test(screens) main(screens)
All main(screens)

@hiviah
Copy link
Contributor

hiviah commented Jul 9, 2024

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function.
In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

ts_t _status = status; \
if (ts_error(_status)) { \
__status = _status; \
goto error; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of using macros that perform jumps or return. It looks like a function, but has an unexpected side effect for someone who is not intimately familiar with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, it's a lot about getting used to it. People are always confused when they see it for the first time. However, this way of coding is very compact, and my experience is that it ends up being very easy to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In past I've used more weird name instead of verify_xxx() so the side effect was expected;-) Maybe different name could help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another perspective is that using this macro enables us to log all errors in the debug or emulator build.

Copy link
Contributor

@andrewkozlik andrewkozlik Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt that it is very compact and I am sure that if all devs in the team would work with it daily, it would be very easy to read and they would find nothing unexpected about the side effects. However, I think that most of the development now gets done in Python and in Rust. So if a dev needs to work with the C code only occasionally, then they have to relearn the idioms again each time. Furthermore, since we are open-source, anyone who wants to look at our code would have to familiarize themselves with this idiom before being able to understand what's going on.

One example of macros with side effects that we have in our code are those in legacy/firmware/fsm.c: CHECK_PIN, CHECK_PIN_UNCACHED etc. I feel it's always a real pain to touch the code that uses these macros. They make the code compact, but it would be easier for me if they were inlined.

That being said I am not strictly against this idiom. I am just very hesitant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point of view. Unexpected side effects are almost always inappropriate - CHECK_PIN() and its counterparts are good examples. But I don't believe that verify_xxx() falls into this category since it's a cross-cutting concept used throughout the entire codebase (hopefully in the future:-), and the side effect is quite apparent and local.

Our code is open source and should be readable for anybody without special 'training'. However, I think it already contains a lot of complexity, so it's not true that it's accessible to everybody for a long time.

But yes, I understand your concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do decide to go ahead with this idiom, then we should at least use uppercase for the macros that have side effects to emphasize to the reader that they are macros. verify_ok() -> TS_CHECK()

@cepetr
Copy link
Contributor Author

cepetr commented Jul 9, 2024

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

Yes, good point. Hamming distance between TS_OK and TS_ERROR is low. There are certainly several ways to improve it. I'll think about it and try to suggest something.

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function. In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

I'm not sure that the __attribute__((cleanup(func))) approach is what we are looking for, although it's interesting. Cleanup code is not just about freeing resources, but also about performing some reverting actions, setting output arguments, etc. This might not be easy to do in a separate function. In some situations, you don't want to call any cleanup code if the function succeeds.

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

Yes, it's a bit unusual and requires some getting used to.

error or cleanup... I have no problem with either. Both work well, but neither is perfect (finally changed in 7b70f0a).

Comment on lines 34 to 44
// Status code is hardened against fault injections
// by storing the negated value in the upper 16 bits.
typedef enum {
TS_OK = TS_BUILD(0),
TS_ERROR = TS_BUILD(1), // Generic error
TS_ERROR_BUSY = TS_BUILD(2), // Busy
TS_ERROR_TIMEOUT = TS_BUILD(3), // Timeout
TS_ERROR_NOTINIT = TS_BUILD(4), // Not initialized
TS_ERROR_ARG = TS_BUILD(4), // Invalid argument
TS_ERROR_IO = TS_BUILD(5), // I/O error
} ts_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the set of error codes should be here. This particular set seems very specific to trezorhal. On the other hand trezor-crypto, for example, would use a completely different set of codes, like TS_ERROR_SIGNATURE_INVALID. Would this enum just be a large pile of error codes from various areas?

My impression is that if it is supposed to be a universal error type, then the set should be either very trivial (sectrue, secfalse) or maybe it should be something abstract like the protobuf FailureType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is similar to the errno codes used in the libc - all codes are defined in one place. If you need a new error code, define it here, but try to come up with the most generic name so that someone else can use it as well.

However, we can change it. The error type does not need to be an enum; we can use #define and allow modules to define their own error codes. For example, the storage or optiga driver can then have specific error codes defined elsewhere.

Not sure what's better. Both ways have their own pros and cons.

Copy link
Contributor

@andrewkozlik andrewkozlik Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, ts_t shouldn't be an enum. All the values are non-zero, so if it is accidentally used as

if (module_function()) {
  // ...
}

instead of

if (verify_ok(module_function())) {
  // ...
}

then it will always evaluate to true without warning. We need to do something like:

typedef struct {
  int status_code;
} ts_t;

#define TS_OK ((const ts_t) {TS_BUILD(0)})
// ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. That's a good point. This approach eliminates a lot of bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed ts_t to a struct in this commit cf075b5. This change helped me to find some bugs that were introduced during the sdcard driver refactoring...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing the enum to an integer, we have lost the debugger's ability to show error codes in a readable form. However, we can easily fix this by using Natvis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also thought about module-specific error codes. Look at the notes below. In the end, we only need to add two new codes that are not so specific and can be used elsewhere:

One common set of error codes also allows us to create a function that returns an error name string.

typedef enum _optiga_pin_result {
  OPTIGA_PIN_SUCCESS = 0,       // The operation completed successfully.
  OPTIGA_PIN_INVALID,           // The PIN is invalid.
  OPTIGA_PIN_COUNTER_EXCEEDED,  // The PIN try counter limit was exceeded.
  OPTIGA_PIN_ERROR,             // Optiga processing or communication error.
} optiga_pin_result;

OPTIGA_PIN_SUCCESS => TS_OK
OPTIGA_PIN_INVALID => TS_ERROR_ARG
OPTIGA_PIN_COUNTER_EXCEEDED => new error code: TS_ERROR_TRY_LIMIT
OPTIGA_PIN_ERROR => TS_ERROR_IO


typedef enum _optiga_sign_result {
  OPTIGA_SIGN_SUCCESS = 0,   // The operation completed successfully.
  OPTIGA_SIGN_INACCESSIBLE,  // The signing key is inaccessible.
  OPTIGA_SIGN_ERROR,         // Invalid parameters or Optiga processing or
                             // communication error.
} optiga_sign_result;

OPTIGA_SIGN_SUCCESS => TS_OK
OPTIGA_SIGN_INCACESSIBLE => new error code: TS_ERROR_SIGNING_KEY
OPTIGA_SIGN_ERROR => TS_ERROR_IO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of mapping the codes like OPTIGA_PIN_INVALID -> TS_ERROR_ARG. That just seems error-prone, e.g. if some sub-function returns TS_ERROR_ARG, then we have to watch out to translate that to OPTIGA_PIN_ERROR or TS_ERROR_IO, to distinguish it from an invalid PIN. It's a pretty important distinction. If the PIN is invalid, then the user should be offered to retry, but if there is a runtime error, then we probably want the user to restart their device, rather than panic and deplete their PIN attempts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand. Maybe this case requires a more specific error code. We should also consider whether error codes should be used for situations like 'invalid PIN.' Is it really an error, or is it just invalid user input? Should be treated the same way as i/o error or invalid argument?

Comment on lines 128 to 135
// Jumps to `error` label if the condition is not `sectrue`.
#define verify_sec(seccond, status) \
do { \
if ((seccond) != sectrue) { \
__status = status; \
goto error; \
} \
} while (0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the macros defined here, it looks like we are not getting rid of secbool. Our code currently uses bool, secbool, a custom enum (e.g. optiga code) or a custom set of integers (e.g. trezor-crypto) to indicate the return status. The most familiar, compact and easy to use is bool. A more secure variant is secbool, which should have better resilience against fault injection, but it can be cumbersome to use. The enums are also fine, but they lack type safety. Finally, the custom integers are an abomination. I think that we should be aiming to narrow down and consolidate these different approaches. That could mean for example making secbool less cumbersome to use and possibly building on top of it, but not building yet another approach alongside it.

Copy link
Contributor Author

@cepetr cepetr Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about it too. However, I still don't know how to reasonably connect it. Using two different types to represent boolean values isn't ideal.

Maybe after the this change cf075b5, 'secfalse' is more like TS_OK and sectrue is more like TS_BUILD(0xAAAA)... I will think about it.

#define ts_ok(status) (_ts_checked(status).code == TS_OK.code)

// Returns `true` if status code is NOT `TS_OK`
#define ts_error(status) (_ts_checked(status).code != TS_OK.code)
Copy link
Contributor

@andrewkozlik andrewkozlik Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there will be multiple error states, then we should also have ts_eq() and possibly ts_neq(), although that's equivalent to !ts_eq(), but might be more easy to read. Similarly ts_error() vs. !ts_ok().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be also nice to introduce a ts_switch(status) macro which expands to switch(_ts_checked(status).code).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the mechanism needs to allow easy translation of error codes and the ability to suppress certain errors or treat them as success. This should make that possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added ts_eq() and ts_string(), see 8b027fb.

The following code cannot be compiled since TS_xxx is a structure. For some reason, the compiler cannot deduce that TS_OK.code is a constant expression... :-(.

switch (status.code) {
  case TS_OK.code:
  case TS_ERROR.code:
}

Maybe we can give up using case for error codes and always use if. But I'll try to find some solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#define ts_case(status) case status.code

ts_switch(f()) {
  ts_case(TS_OK):
  // ...
  break;
default:
  // ...
}

Copy link
Contributor

@andrewkozlik andrewkozlik Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned IRL some issue with the compiler not being able to treat the case value as a constant. In C++ defining TS_OK etc. using constexpr would solve this, but that probably isn't an option in C. Another idea is something like this:

#define ts_case(status) case status##__value
#define TS_OK__value TS_BUILD(0)
#define TS_OK ((const ts_t){TS_OK__value})

Comment on lines 64 to 71
#define _ts_checked(status) \
({ \
ts_t _checked = (status); \
if ((_checked.code & 0xFFFF) != (_checked.code >> 16)) { \
__fatal_error("ts_check() error", __FILE_NAME__, __LINE__); \
} \
_checked; \
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much size and time overhead this will add throughout the entire code base and whether it's worth it. It could also be a barrier to optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the assembly code yet, so I can't tell what the cost is. That's just an idea and we can omit this code if we think it's not worth it. I think this could be also optional.

Comment on lines 106 to 108
status = drv2625_set_reg(DRV2625_REG_LIBRARY,
LIB_SEL | DRV2625_REG_LIBRARY_GAIN_25);
verify_ok(status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use of the status variable here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you split this into two lines, you can place a breakpoint at the line containing verify_ok() to observe the operation status. This pattern is very useful.

#define verify_init() __attribute__((unused)) ts_t __status = TS_OK;

// Returns the current status.
#define verify_status() (__status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of the macros above is confusing. verify_*() usually have a flow control side effect, but these don't. Maybe consider #define TS_RETURN return __status and not allow accessing __status directly at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what about this:

ts_t module_function(int arg1, int arg2) {
  // Initialize the `__status` variable and set it to `TS_OK`.
  // This should be on the first line inside the function (rule 1).
  TS_INIT;

  // Check arguments to ensure they are greater than 0.
  TS_CHECK_ARG(arg1 > 0);
  TS_CHECK_ARG(arg2 > 0);
  
  // Check the result of another function with a `ts_t` return value.
  // The `status` value is properly propagated.
  // Splitting it into two lines makes debugging easier (rule 2).
  ts_t status = another_function();
  TS_CHECK_OK(status);
  
  // Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`.
  // The HAL_xxx error is properly mapped to TS_xxx and propagated.
  HAL_StatusTypeDef hal_status = HAL_function();
  TS_CHECK_HAL_OK(hal_status);
  
  // Check a generic condition inside the function and propagate the status code
  // if the condition is not fulfilled.
  TS_CHECK(arg1 * arg2 < 1000, TS_ERROR);

  // Alternatively, use the secure variant working with `secbool`.
  secbool ready = usb_ready();
  TS_CHECK_SEC(ready);
  
  // In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`)
  TS_RETURN;
  
error:
  // Potential cleanup code goes here.
  // ...

  // Return the content of the `__status` variable.
  // In case of no error, it's `TS_OK`.
  TS_RETURN;
}

But I personally prefer:

  // In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`)
  return TS_STATUS;
  
error:
  // Potential cleanup code goes here.
  // ...

  // Return the content of the `__status` variable.
  // In case of no error, it's `TS_OK`.
  return TS_STATUS;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm getting used to it. Maybe I like TS_RETURN more... :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made these changes here 7b70f0a

@andrewkozlik
Copy link
Contributor

I created a document in notion summarizing my concerns.
https://www.notion.so/satoshilabs/Unify-error-handling-in-C-aebe0fe5afa1417996329350869835c4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🔎 Needs review
Development

Successfully merging this pull request may close these issues.

3 participants