-
-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent error handling #4008
base: main
Are you sure you want to change the base?
Conversation
|
The weird definition and usage of It might be usable to have macro shorthand for To me |
core/embed/lib/error_handling.h
Outdated
ts_t _status = status; \ | ||
if (ts_error(_status)) { \ | ||
__status = _status; \ | ||
goto error; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the idea of using macros that perform jumps or return. It looks like a function, but has an unexpected side effect for someone who is not intimately familiar with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, it's a lot about getting used to it. People are always confused when they see it for the first time. However, this way of coding is very compact, and my experience is that it ends up being very easy to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In past I've used more weird name instead of verify_xxx()
so the side effect was expected;-) Maybe different name could help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another perspective is that using this macro enables us to log all errors in the debug or emulator build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't doubt that it is very compact and I am sure that if all devs in the team would work with it daily, it would be very easy to read and they would find nothing unexpected about the side effects. However, I think that most of the development now gets done in Python and in Rust. So if a dev needs to work with the C code only occasionally, then they have to relearn the idioms again each time. Furthermore, since we are open-source, anyone who wants to look at our code would have to familiarize themselves with this idiom before being able to understand what's going on.
One example of macros with side effects that we have in our code are those in legacy/firmware/fsm.c
: CHECK_PIN
, CHECK_PIN_UNCACHED
etc. I feel it's always a real pain to touch the code that uses these macros. They make the code compact, but it would be easier for me if they were inlined.
That being said I am not strictly against this idiom. I am just very hesitant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your point of view. Unexpected side effects are almost always inappropriate - CHECK_PIN()
and its counterparts are good examples. But I don't believe that verify_xxx()
falls into this category since it's a cross-cutting concept used throughout the entire codebase (hopefully in the future:-), and the side effect is quite apparent and local.
Our code is open source and should be readable for anybody without special 'training'. However, I think it already contains a lot of complexity, so it's not true that it's accessible to everybody for a long time.
But yes, I understand your concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do decide to go ahead with this idiom, then we should at least use uppercase for the macros that have side effects to emphasize to the reader that they are macros. verify_ok()
-> TS_CHECK()
Yes, good point. Hamming distance between
I'm not sure that the
Yes, it's a bit unusual and requires some getting used to.
|
core/embed/lib/error_handling.h
Outdated
// Status code is hardened against fault injections | ||
// by storing the negated value in the upper 16 bits. | ||
typedef enum { | ||
TS_OK = TS_BUILD(0), | ||
TS_ERROR = TS_BUILD(1), // Generic error | ||
TS_ERROR_BUSY = TS_BUILD(2), // Busy | ||
TS_ERROR_TIMEOUT = TS_BUILD(3), // Timeout | ||
TS_ERROR_NOTINIT = TS_BUILD(4), // Not initialized | ||
TS_ERROR_ARG = TS_BUILD(4), // Invalid argument | ||
TS_ERROR_IO = TS_BUILD(5), // I/O error | ||
} ts_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what the set of error codes should be here. This particular set seems very specific to trezorhal. On the other hand trezor-crypto, for example, would use a completely different set of codes, like TS_ERROR_SIGNATURE_INVALID
. Would this enum just be a large pile of error codes from various areas?
My impression is that if it is supposed to be a universal error type, then the set should be either very trivial (sectrue
, secfalse
) or maybe it should be something abstract like the protobuf FailureType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach is similar to the errno codes used in the libc - all codes are defined in one place. If you need a new error code, define it here, but try to come up with the most generic name so that someone else can use it as well.
However, we can change it. The error type does not need to be an enum; we can use #define
and allow modules to define their own error codes. For example, the storage or optiga driver can then have specific error codes defined elsewhere.
Not sure what's better. Both ways have their own pros and cons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, ts_t
shouldn't be an enum
. All the values are non-zero, so if it is accidentally used as
if (module_function()) {
// ...
}
instead of
if (verify_ok(module_function())) {
// ...
}
then it will always evaluate to true without warning. We need to do something like:
typedef struct {
int status_code;
} ts_t;
#define TS_OK ((const ts_t) {TS_BUILD(0)})
// ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. That's a good point. This approach eliminates a lot of bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed ts_t
to a struct in this commit cf075b5. This change helped me to find some bugs that were introduced during the sdcard driver refactoring...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By changing the enum to an integer, we have lost the debugger's ability to show error codes in a readable form. However, we can easily fix this by using Natvis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also thought about module-specific error codes. Look at the notes below. In the end, we only need to add two new codes that are not so specific and can be used elsewhere:
One common set of error codes also allows us to create a function that returns an error name string.
typedef enum _optiga_pin_result {
OPTIGA_PIN_SUCCESS = 0, // The operation completed successfully.
OPTIGA_PIN_INVALID, // The PIN is invalid.
OPTIGA_PIN_COUNTER_EXCEEDED, // The PIN try counter limit was exceeded.
OPTIGA_PIN_ERROR, // Optiga processing or communication error.
} optiga_pin_result;
OPTIGA_PIN_SUCCESS => TS_OK
OPTIGA_PIN_INVALID => TS_ERROR_ARG
OPTIGA_PIN_COUNTER_EXCEEDED => new error code: TS_ERROR_TRY_LIMIT
OPTIGA_PIN_ERROR => TS_ERROR_IO
typedef enum _optiga_sign_result {
OPTIGA_SIGN_SUCCESS = 0, // The operation completed successfully.
OPTIGA_SIGN_INACCESSIBLE, // The signing key is inaccessible.
OPTIGA_SIGN_ERROR, // Invalid parameters or Optiga processing or
// communication error.
} optiga_sign_result;
OPTIGA_SIGN_SUCCESS => TS_OK
OPTIGA_SIGN_INCACESSIBLE => new error code: TS_ERROR_SIGNING_KEY
OPTIGA_SIGN_ERROR => TS_ERROR_IO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of mapping the codes like OPTIGA_PIN_INVALID
-> TS_ERROR_ARG
. That just seems error-prone, e.g. if some sub-function returns TS_ERROR_ARG
, then we have to watch out to translate that to OPTIGA_PIN_ERROR
or TS_ERROR_IO
, to distinguish it from an invalid PIN. It's a pretty important distinction. If the PIN is invalid, then the user should be offered to retry, but if there is a runtime error, then we probably want the user to restart their device, rather than panic and deplete their PIN attempts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand. Maybe this case requires a more specific error code. We should also consider whether error codes should be used for situations like 'invalid PIN.' Is it really an error, or is it just invalid user input? Should be treated the same way as i/o error or invalid argument?
core/embed/lib/error_handling.h
Outdated
// Jumps to `error` label if the condition is not `sectrue`. | ||
#define verify_sec(seccond, status) \ | ||
do { \ | ||
if ((seccond) != sectrue) { \ | ||
__status = status; \ | ||
goto error; \ | ||
} \ | ||
} while (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the macros defined here, it looks like we are not getting rid of secbool. Our code currently uses bool
, secbool
, a custom enum (e.g. optiga code) or a custom set of integers (e.g. trezor-crypto) to indicate the return status. The most familiar, compact and easy to use is bool
. A more secure variant is secbool
, which should have better resilience against fault injection, but it can be cumbersome to use. The enums are also fine, but they lack type safety. Finally, the custom integers are an abomination. I think that we should be aiming to narrow down and consolidate these different approaches. That could mean for example making secbool less cumbersome to use and possibly building on top of it, but not building yet another approach alongside it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about it too. However, I still don't know how to reasonably connect it. Using two different types to represent boolean values isn't ideal.
Maybe after the this change cf075b5, 'secfalse' is more like TS_OK
and sectrue
is more like TS_BUILD(0xAAAA)
... I will think about it.
6ddf428
to
96048d6
Compare
core/embed/lib/error_handling.h
Outdated
#define ts_ok(status) (_ts_checked(status).code == TS_OK.code) | ||
|
||
// Returns `true` if status code is NOT `TS_OK` | ||
#define ts_error(status) (_ts_checked(status).code != TS_OK.code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there will be multiple error states, then we should also have ts_eq()
and possibly ts_neq()
, although that's equivalent to !ts_eq()
, but might be more easy to read. Similarly ts_error()
vs. !ts_ok()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be also nice to introduce a ts_switch(status)
macro which expands to switch(_ts_checked(status).code)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the mechanism needs to allow easy translation of error codes and the ability to suppress certain errors or treat them as success. This should make that possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added ts_eq()
and ts_string()
, see 8b027fb.
The following code cannot be compiled since TS_xxx
is a structure. For some reason, the compiler cannot deduce that TS_OK.code is a constant expression... :-(.
switch (status.code) {
case TS_OK.code:
case TS_ERROR.code:
}
Maybe we can give up using case
for error codes and always use if. But I'll try to find some solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define ts_case(status) case status.code
ts_switch(f()) {
ts_case(TS_OK):
// ...
break;
default:
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned IRL some issue with the compiler not being able to treat the case value as a constant. In C++ defining TS_OK
etc. using constexpr
would solve this, but that probably isn't an option in C. Another idea is something like this:
#define ts_case(status) case status##__value
#define TS_OK__value TS_BUILD(0)
#define TS_OK ((const ts_t){TS_OK__value})
core/embed/lib/error_handling.h
Outdated
#define _ts_checked(status) \ | ||
({ \ | ||
ts_t _checked = (status); \ | ||
if ((_checked.code & 0xFFFF) != (_checked.code >> 16)) { \ | ||
__fatal_error("ts_check() error", __FILE_NAME__, __LINE__); \ | ||
} \ | ||
_checked; \ | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how much size and time overhead this will add throughout the entire code base and whether it's worth it. It could also be a barrier to optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the assembly code yet, so I can't tell what the cost is. That's just an idea and we can omit this code if we think it's not worth it. I think this could be also optional.
status = drv2625_set_reg(DRV2625_REG_LIBRARY, | ||
LIB_SEL | DRV2625_REG_LIBRARY_GAIN_25); | ||
verify_ok(status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use of the status
variable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you split this into two lines, you can place a breakpoint at the line containing verify_ok()
to observe the operation status. This pattern is very useful.
core/embed/lib/error_handling.h
Outdated
#define verify_init() __attribute__((unused)) ts_t __status = TS_OK; | ||
|
||
// Returns the current status. | ||
#define verify_status() (__status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of the macros above is confusing. verify_*()
usually have a flow control side effect, but these don't. Maybe consider #define TS_RETURN return __status
and not allow accessing __status
directly at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, what about this:
ts_t module_function(int arg1, int arg2) {
// Initialize the `__status` variable and set it to `TS_OK`.
// This should be on the first line inside the function (rule 1).
TS_INIT;
// Check arguments to ensure they are greater than 0.
TS_CHECK_ARG(arg1 > 0);
TS_CHECK_ARG(arg2 > 0);
// Check the result of another function with a `ts_t` return value.
// The `status` value is properly propagated.
// Splitting it into two lines makes debugging easier (rule 2).
ts_t status = another_function();
TS_CHECK_OK(status);
// Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`.
// The HAL_xxx error is properly mapped to TS_xxx and propagated.
HAL_StatusTypeDef hal_status = HAL_function();
TS_CHECK_HAL_OK(hal_status);
// Check a generic condition inside the function and propagate the status code
// if the condition is not fulfilled.
TS_CHECK(arg1 * arg2 < 1000, TS_ERROR);
// Alternatively, use the secure variant working with `secbool`.
secbool ready = usb_ready();
TS_CHECK_SEC(ready);
// In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`)
TS_RETURN;
error:
// Potential cleanup code goes here.
// ...
// Return the content of the `__status` variable.
// In case of no error, it's `TS_OK`.
TS_RETURN;
}
But I personally prefer:
// In some cases it's useful to return before the cleanup code, (rule 3: don't return `TS_OK`)
return TS_STATUS;
error:
// Potential cleanup code goes here.
// ...
// Return the content of the `__status` variable.
// In case of no error, it's `TS_OK`.
return TS_STATUS;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I'm getting used to it. Maybe I like TS_RETURN more... :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made these changes here 7b70f0a
I created a document in notion summarizing my concerns. |
This PR proposes a consistent approach to error handling in the C codebase. The proposal includes the following:
ts_t
type) with predefined errors prefixed withTS_
(Trezor status) and built-in fault injection hardening.goto
statements to facilitate cleanup at the end of the function.ensure()
macro types to simplify top-level code.hal_status_to_ts()
mapping function and theTS_CHECK_HAL_OK()
macro.This PR consists of three commits. Two of them serve as examples: a refactored haptic driver and a refactored SD card driver.
Error handling pattern:
We can now use the
ensure_xxx()
set of macros similar toverify_xxx()
for handling fatal errors. Theensure_xxx()
macros display an error message on the screen and shut down the device as before. (Note: This feature is not yet complete in this PR.)I understand that this is a completely new approach, which is not very common and requires some time to get used to. However, it is very simple and quickly understandable. I will not merge this until we reach a consensus that this is the right approach. Please feel free to provide any comments or suggestions.