Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Rust Errors more idiomatic #1655

Open
mrinalwadhwa opened this issue Jul 31, 2021 · 24 comments
Open

Make Rust Errors more idiomatic #1655

mrinalwadhwa opened this issue Jul 31, 2021 · 24 comments
Assignees

Comments

@mrinalwadhwa
Copy link
Member

mrinalwadhwa commented Jul 31, 2021

I wrote

 let mut forwarding_address = String::new();
 std::io::stdin().read_line(&mut forwarding_address)?;

Got

error[E0277]: `?` couldn't convert the error to `ockam::Error`
  --> examples/alice.rs:15:54
   |
15 |   std::io::stdin().read_line(&mut forwarding_address)?;
   |                                                      ^ the trait `From<std::io::Error>` is not implemented for `ockam::Error`
   |
   = note: the question mark operation (`?`) implicitly performs a conversion on the error value using the `From` trait
   = help: the following implementations were found:
             <ockam::Error as From<CredentialError>>
             <ockam::Error as From<NodeError>>
             <ockam::Error as From<OckamError>>
             <ockam::Error as From<ockam_core::routing::error::RouteError>>
           and 7 others
   = note: required by `from`

Update:

I think we need to take a deeper look at the design of our Errors so that they carry source, backtrace, cause etc. as they propagate within Ockam or surface to a user. We chose the code and domain approach because we'll need to expose them over FFI. But that shouldn't hurt the experience of Rust users. Rust Users should get source, backtrace, cause etc.

@mcastorina
Copy link

Hey, I would like to help implement this, but I'm new to ockam. I see the error definition has an error code and domain string.

pub struct Error {
    code: u32,

    #[cfg(feature = "std")]
    domain: String,
}

What should those be when there is a std::io::Error?

@mrinalwadhwa
Copy link
Member Author

@mcastorina thank you for picking this up 🙏
@jared-s @SanjoDeundiak @antoinevg what are your thoughts on good values here.

I think we need to take a deeper look at the design of our Errors so that they carry source, backtrace, cause etc. as they propagate within Ockam or surface to a user. We chose the code and domain approach because we'll need to expose them over FFI. But that shouldn't hurt the experience of Rust users. Rust Users should get source, backtrace, cause etc.

@mrinalwadhwa mrinalwadhwa changed the title ockam::Error should implement From<std::io::Error> Make Rust Errors more idiomatic Aug 3, 2021
@mcastorina
Copy link

I'm excited to contribute! :)

Perhaps https://github.com/dtolnay/thiserror will be useful here.

@SanjoDeundiak
Copy link
Member

@mrinalwadhwa I'm not sure we should implement Fromstd::io::Error for ockam::Error. That may lead to worse errors experience. E.g. in this case when std::io::Error is thrown, from a function caller side I would expect error that says "Invalid forwarder address user input" or something like that, but if we implement default conversion from std::io::Error caller will get something like "IO error", which is less informative.

@SanjoDeundiak
Copy link
Member

As for extending Error information with stacktrace and additional data, that would be great, but I'm not very familiar with how that works and how that can or can't be compatible with our Actor framework. There's some research work to be done by someone

@devalain
Copy link
Contributor

devalain commented Aug 7, 2021

@mrinalwadhwa I your example, you are writing an application so I don't think you should return ockam::Result but anyhow::Result instead in order to be more idiomatic (see anyhow vs thiserror) then you can use Context which is an easy way to produce useful error messages for the user. Plus, the error in your example has nothing to do with ockam, it comes from reading the standard input which depends on the OS.

Anyway, IMHO @mcastorina is right about the usefulness of thiserror: it can be used on structs and does not bring breaking change, it supports backtrace, source, From<Error> automatic implementation. As for ffi, the implementation can be adapted to the possible changes.

@antoinevg
Copy link
Contributor

antoinevg commented Aug 10, 2021

Nice to see love for dtolnay/thiserror and dtolnay/anyhow! IMO they're excellent defaults when it comes to Rust error handling.

That said, Ockam has some special needs.

@mrinalwadhwa already mentioned the need to support FFI but our existing error handling strategy also has a number of important advantages for targeting embedded (no_std) platforms:

  1. The code: u32 field only consumes 4 bytes and the domain: String field is optional. This matters because we want to be able to target resource constrained platforms. 1
  2. We rely on explicit conversion to a universal ockam_core::Error type which allows us to have per-crate error definitions while avoiding the Box<dyn std::error::Error> pattern. This matters because no_std:
    • does not have access to the std::error::Error trait and
    • does not have access to a heap allocated Box type. 2

So the big question is, without losing these advantages, how can we get features like:

  • backtraces / error-chaining
  • converting a numeric error code back to a string representation of its originating enum type
  • human-readable error messages

It would be nice to also support these on no_std but, just as with FFI targets, it's not the end of the world if we can't have all of them.

i.e. Neither thiserror or anyhow 3 provide full no_std support but this is not necessarily a problem if it's possible to integrate them with our existing error implementation in a way that won't require a bunch of #[cfg(not(feature="std"))] boilerplate to avoid breaking no_std builds.

In general, the situation is not great when it comes to idiomatic unified error-handling and no_std and is unlikely to improve in the future:

So maybe we should also be looking at using a collection of simple crates & code to implement our wish list individually?

A couple that spring to mind are:

  • yaahc/displaydoc - provides thiserror-ish human-readable errors via a doc comment and string interpolation (Also has no_std support)
  • core-error/core-error - an attempt to provide a unified Error trait implementation for no_std
  • richardanaya/no_error - A simple error library for no_std + no_alloc Rust

Can anyone suggest some more?

Finally, Rust's error-handling story is starting to solidify but there are still a few big changes coming (e.g. RFC-2504) so it would be nice to not lock the library or dependent apps into anything that's going to be deprecated.


[1] One change I would like to see here is to define domain as a &'static str rather than String as, again, String needs alloc when we're targeting no_std.

[2] We can get Box via alloc but requiring alloc for something as fundamental as Error handling is probably not a great idea.

[3] The anyhow crate does provide partial no_std support but this comes at the cost of relying on alloc.

@adrianbenavides
Copy link
Member

adrianbenavides commented Aug 11, 2021

#1655 (comment) @mrinalwadhwa In your example, you are writing an application so I don't think you should return ockam::Result but anyhow::Result

+1 to use anyhow when writing client applications.

#1655 (comment) So the big question is, without losing these advantages, how can we get features like:

  • backtraces / error-chaining
  • converting a numeric error code back to a string representation of its originating enum type
  • human-readable error messages

I agree with @antoinevg that the best thing for now is to wait. Given my recent experience (which you can read below), points 1 (backtraces) and 2 (u32 to strrepr) should be easier down the road if errors don't contain parameters. Finally, point 3 (human-readable error messages) should be feasible if the core::fmt::Display trait is implemented, right?


Here's my two cents on whether enum's variants should contain parameters to build custom messages or not and how this impact the above three points.

In the project I've been working on for the past two years (close to 100k lines) we use thiserror. We don't use any feature other than the Error and Display trait implementations to save us from writing some boilerplate code. That is, we don't make use of the backtrace or source features.

We follow a similar approach to what's done in the ockam crates. We have a struct like the following:

pub struct Error {
    code: ErrorCode,
    friendly_message: String,
    internal_message: String,
}

where:

  • ErrorCode: is a string code, where multiple errors can return the same code (project requirements)
  • internal_message: prints the Display implementation, which contains a detailed description of the error
  • friendly_message: prints the message we want to display to the final user (API), usually a simplified version of the internal_message

Then we have a bunch of enums with the errors of a given layer (api, commands, infrastructure, core, ...) implementing the From<AbcdError> for Error, where each variant can have arbitrary parameters. For example:

#[derive(thiserror::Error, Debug)]
pub enum CustomError {
    #[error("score below threshold [expected={}][actual={}]", _0, _1)]
    ScoreBelowThreshold(u32, u32),
    #[error("unknown error")]
    Unknown,
}

We really struggled for a while deciding how to deal with error handling. We were also new to Rust. And, honestly, we are not 100% happy with this approach, but it has proved to be pretty solid so far. Having a good error handling strategy can make a huge difference in the developer experience, that's for sure.

The biggest disadvantage we've found: variants with parameters make debugging easier but are limiting if you want to use thiserror's source and backtrace features, for example. Having parameters also prevents us from converting an ErrorCode back to the original error. So unless it's strictly necessary, I would use the parameters to build log messages and keep the errors variants without parameters.

@thomcc
Copy link
Contributor

thomcc commented Aug 18, 2021

I've been thinking about this a bit, and feel like I should write down my thoughts in case I'm getting something massively wrong about the ockam codebase — totally possible, I'm very new. A lot of this is informed by my work on the stdlib and mozilla/application-services (which had some similar constraints, but, well, fewer).

I think a design like the std::io::Error type is probably most appropriate for this. This should allow us to avoid giving up the ability to expose things on the other side of the FFI cleanly (at least in some cases). In particular, the fact that in io::Error, you always have an ErrorKind — for our errors we could enforce always having an error code and a domain (which I would change to a &'static str, or even a &'static CStr, or something that wraps it anyway, for FFI compat).

It hopefully also would allow us to avoid allocation in many cases, see the internal io::Error::const_new function (or if rust-lang/rust#87869 has landed, io::const_io_error!... unless it has been renamed, which seems possible), and even for the custom error case, eventually (perhaps not immediately), something based on unsize or stackbox could be used to have a Box<dyn Error+Send+Sync> that doesn't use the heap (it's also possible that heapless has options for non-heap-allocating boxed trait objects, I haven't looked). It would also allow natural support for backtraces, of course, on platforms and configurations where they're available.

Unfortunately, the approach you get with something like thiserror on an enum tends to be better if you don't have to ferry user errors through it. It also grows fairly large over time, and is hard to expose to the FFI, except manually. We could probably design our own proc macro for this (mostly for assigning error codes, which is otherwise tedious) and I do still consider that an option (I haven't made up my mind here), but... My experience is it's a bit of a headache, and in https://github.com/mozilla/application-services became a bit unwieldy. This approach feels sloppy and bad to use from user code unless you curtail that, which can be very difficult (although it's probably the cleanest and simplest to write, ironically).

Also, direct use of anyhow (or similar, such as eyre) from library crates isn't great, since those types don't implement Error (they can't for coherence reasons), which means users can't wrap them. While we could invent something similar (I'm pretty familiar with the internals, and have patches in anyhow's low level guts), it wouldn't get the ergonomics we need, and I'm unsure that we could do it in a way that maintians both the no_std support we want, as well as the ability to expose to the FFI.

All that said, one constraint I haven't fully worked through is serialization. In #967 the errors were reworked due to the need to be deserializable, and that... complicates the design somewhat, especially the desire to have static data. Hrm.

Sorry that this is a bit disjointed, I felt like getting it down when I had the thoughts rather than waiting until morning (which may have been advisable).

(This is slightly related to #1562, which is partially about FFI error ergonomics)

@antoinevg
Copy link
Contributor

Great article about std::io::Error for everyone (like me!) who didn't realize just how neat the implementation is:

https://matklad.github.io/2020/10/15/study-of-std-io-error.html

@thomcc
Copy link
Contributor

thomcc commented Aug 18, 2021

Yep, it's a great and easily-overlooked example of Rust API design (and matklad's whole blog is top-notch, I highly recommend it).

@thomcc
Copy link
Contributor

thomcc commented Aug 18, 2021

All that said, one constraint I haven't fully worked through is serialization. In #967 the errors were reworked due to the need to be deserializable, and that... complicates the design somewhat, especially the desire to have static data. Hrm.

Having slept on it, I think there are two approaches here I hadn't really considered yesterday (and it's possible that the right solution will involve both in some capacity).

  1. Use heapless::String. Even if we have the stdlib, it doesn't seem unreasonable to put a max length on the "error domain" string, although maybe I'm mistaken here.

  2. Have deserialization convert a &'static str into a String.

For example:

// This is a private inner type that isn't exposed outside of ockam_core.
enum Repr {
   Static { code: u32, domain: &'static str, ... },
   NonStatic { code: u32, domain: String, ... },
   // ...
}

When serializing Repr::Static, it would contain data independent of which variant it used, but it would deserialize as Repr::NonStatic. This means we'd need to write manual Serialize/Deserialize implementations, but I don't think this is the end of the world.

A variant of this approach is to pack the &'static str and String into one bundle, similar to a Cow<'static, str>. (Other crates to see along these lines are https://github.com/thomcc/arcstr, and https://github.com/maciejhirsz/beef, although I don't think of these we'd actually use).

I still have to work through my thoughts here on whether or not this actually buys us anything (aside from some perf), but it's an interesting consideration I hadn't thought of when making the writeup last night.

(Sorry for all the messages, this is both one of my first issues here, and also something I think is critically important to having a good-feeling API, so I'd rather over-share my thought process than under-share it. It also serves to document some of the logic for future users)

@mrinalwadhwa
Copy link
Member Author

Multiple messages are great. Don't hold back .. your train of thought is insightful and I'm learning a lot.
Same for thoughts above from @antoinevg & @adrianbenavides.

@thomcc
Copy link
Contributor

thomcc commented Aug 23, 2021

Some further thoughts on this came to me over the weekend.

For one, I don't really know what the use of the error domain is, besides helping out the other side of the FFI, perhaps.

The second takes some explanation:

One hesitation I have with the io::Error-like scheme is that it won't work with ?. I alluded to this kind of issue before by mentioning that anyhow/eyre can't implement Error for coherence reasons, but essentially the issue is:

In order for ? to work, some error type (lets call it MyError but obviously in this case it would be ockam_core::Error — I'm being vague just because anyhow::Error and eyre::Error hit the same issue) needs to be From<UserErrorType>. If you want to handle this for all possible user error types, you want to do this as

// This impl is referenced to later as (*1)
impl<E: std::error::Error> From<E> for MyError { ... snip ... }

Then, to allow users to interact with our errors nicely, we need to implement std::error::Error. This will let them put it in a type-erased error container like Box<dyn Error>, io::Error, or even {anyhow, eyre}::Error (or failure, from back in the day):

// This impl is referenced to later as (*2)
impl std::error::Error for MyError { ... snip ... }

Unfortunately, we can't. Consider the case above where we invoke (*1) on MyError, that is, we're calling From<MyError> for MyError. Well, there are actually two of these. There's the (*1) we wrote above, and then there's one in the stdlib. Every type T already impls From<T>. This is partially so that x.into() is a noop if you're already the required type, and generally is useful for generic programming. The coherence issue is that you can't have both of these.

(There are additional complexities too — From<E> for Box<dyn Error> existing tends not to help matters when trying to work around this issue...)

There are two solutions to this:

  1. Get rid of From<E: Error> (e.g. (*1)), the way io::Error does, and have users go through an Error::new-constructor which hurts ergonomics.
  2. Get rid of impl std::error::Error (e.g. (*2)), the way anyhow/eyre (or failure in the past...) do, and make it harder for your users to "get away" from your error type. I think there might be things we can do here, but we don't really want to be in a space competing with anyhow/eyre on our error type — it would be needless.

I've been leaning towards 1, part of this is that I don't know how we can ever have a From impl that assigns the correct Error domain, and assigns a reasonable Error code that the user chooses1...

But the downside here is that it's friction for anybody writing an ockam::Worker implementation — kind of a lot of friction. Users would have to quite frequently do .map_err(|e| ockam::Error::new(e, ...))? everywhere you need to return an error inside an implementation of your ockam::Worker, which kind of sucks2. However, the solution-space here could just be to tweak the Worker trait, or the API on Context that accepts a worker... Which makes me think that this is related to the problems in #1564 or at least could share part of a solution.

That is, one way you could imagine this working is:

  1. We add an type Error: std::error::Error + Send + Sync to ockam::Worker.
  2. Users write code for an ockam::Worker<Error = UserErrorType, ...>.
  3. We automatically rewrite it to use ockam::Error by converting the return values, which effectively type-erases the Error parameter.

This still has a few issues. The main one being that now we might end up with ockam::Errors that wrap ockam::Errors, which is kinda janky. In practice it's not that big of a deal... but an approach based fully on adjusting the #[ockam::worker] (and possibly #[ockam::node]) proc macros might be able to do some magic under the hood to avoid it (faking specialization via autoref, for example).

I haven't gotten that far in experimenting here (it basically came to me over the weekend that this approach might be solve more problems), but so far it seems promising.


1: I'm also not sure how to ensure users pick a good error code, though — even as it is we probably have cases where the error code contains a u32 payload, which depending on the value, can theoretically cause it to overlap with another error code. It's not clear this can happen in practice, though, probably not.

I also have been imagining that maybe implementing a proc-macro that auto-assigns error domains / codes like #[ockam::error(code = ...)], similar to a thiserror-style thing. I'm not sure this is worth the trouble, though.

2: It's not as bad for io::Error of course, because you don't implement the io traits yourself very often, you just use them. Whereas, I get the impression that implementing ockam::Workers is a significant part of the well, "whole point".

@SanjoDeundiak
Copy link
Member

@thomcc I wonder how backtrace looks like if error happened inside a Worker/populated to the Worker. When you debug Workers, backtrace that debugger shows you is useless.

@thomcc
Copy link
Contributor

thomcc commented Aug 26, 2021

I need to do some experimentation to answer that, but it's a good point.

@thomcc
Copy link
Contributor

thomcc commented Dec 6, 2021

Over the weekend I've been revisiting the error design, which has been fairly fruitful — I didn't really understand the issues or design constraints last time... and I think the proposals I made would not have helped.

Or, they'd help a little, but would have left many of the same problems that exist in the current design1. (In my defense, it was my first task).

Anyway, after a lot more understanding of ockam, how it's used, and its constraints, and a bunch of research... I think I have come to a good understanding about the use cases ockam_core::Error needs to support, and how to get there.

Note that... this is mostly ignoring things that I think are bigger issues that I intend to address in another write up (that I have in progress), like swallowing errors, serialization, and several more.

Anyway, here's my current thinking about what ockam_core::Error should be good at:

ockam_core::Error requirements

ockam_core::Error has (at least) three important use cases:

  1. Communicating errors in a way that other code could interpret and interpret and handle. This should work across API and protocol boundaries, even to remote machines.

    The intent behind the current error design seems to address this case, as you need to know both of those things. The implementation is very hard to use in practice though, and having only this seems to cause problems.

    It is worth noting, that the current node implementation ignores errors from message handling after logging them. In order to support doing better things than this, the system probably needs to be able to distinguish between different severities of errors.

    That said, I have another writeup on this that I'm working on, so I won't get into it here too much.

  2. Communicating an error in a way that someone who is not an expert in ockam2 can understand the error quickly, and if the error indicates a problem they need to fix, resolve the problem easily -- even if they are .

    This also should work across remote machines, but probably doesn't need to be that much more than decent error message.

    Needless to say, we ockam_core::Error's current implementation does not help here.

  3. Communicating an error with sufficient context and detail so that a developer working with or debugging Ockam (or ockam-using code) can identify and fix the problem.

    Largely speaking, this needs to work very well on the same machine, but (at least in debug builds or when an environment variable is enabled), it should have some support for transmitting it remotely. (That said, I think most of the unserializable stuff here can be handled by converting to a string or similar).

    This is also only relevant for certain kinds of errors — routine I/O failures probably don't need this kind of handling, at least by default

    As you probably already know, ockam_core::Error currently has no support for this.

How to get there (... roughly)

(Caveat: I have a bit more of the design worked out than this, enough to have cleared up most holes, but still not 100% and probably am not summarizing it in enough details anyway)

I think all three of these have different solutions. My current thinking for how to handle the use cases above is (in reverse order):

  • For the 3rd of those — lots of context aimed at developers debugging Ockam — is the easiest3, since it's what the Rust ecosystem has focused most on. Location::caller() and #[track_caller] to get the file/line/etc where the error is generated, and even collecting a Backtrace and/or SpanTrace at the site of capture if an env var is flipped. (Anyway, I'm aware that I'm handwaving this one, and my plan here is actually pretty cool, but I haven't finished it and want to move on)
  • For the 2nd — error messages a technical non-expert can understand — probably requires either requesting an explicit error message. It might be enough to accept a cause error too. This seems an easy-ish one, and will help a lot especially moving forward, but if you can't tell

  • Finally, the first despite "sort of working for that use case" has several very significant areas where it can be improved — I don't think our crates should be in charge of assigning their own error codes for the most part (and I think we want to store real enums, meaning we get a human readable name for the code, etc).

    Fixing this without causing issues for FFI and/or targets that can't rely on the message/metadata being present is... the hardest — see Appendix A for the current sketch of the design.

Anyway, thats all I have now.

Appendix: Replacing ockam_core::Error::code.

Basic idea: two new enums (at least) are added to ockam_core that describe "canonical error codes" and "canonical error origins". Something like:

  • enum ockam_core::ErrorOrigin, which describes the component at the level of detail like Transport, Vault, Node, etc (probably things like Application and Unknown too).

    We can probably automatically assign this in most cases based on the module_path of the code that produced the error, which we should be able to get.

    Note that this explicitly is not at the level of granularity of crates, modules, or even conceptual components — (e.g, ockam_transport_tcp and ockam_transport_websocket are both ErrorOrigin::Transport)

  • enum ockam_core::ErrorKind, which looks a lot like std::io::ErrorKind, but with a set of errors likely to happen inside ockam, rather than a general set for all I/O.

The main idea behind these is that most code that checks errors will probably prefer writing checks like

use ockam_core::{ErrorKind, ErrorOrigin};
let is_transpor_timeout = (
    e.origin() == ockam_core::ErrorOrigin::Transport &&
    e.kind() == ockam_core::ErrorKind::Timeout
);
if is_transport_timeout { ... }

which then handles all transports, rather than having to figure out which error code number comes from TCP, which from WS, etc.

Footnotes

  1. For example, the error codes would be just as bad. A proc-macro to help generating them would solve essentially none of their problems, even if it would be nifty. Backtrace/SpanTrace would be nice to have (they are are better than nothing), but they are not going be magic — our notion of a "call" for us is not the same as a stack frame, and its probably not a tracing span either... Etc

  2. Perhaps they're a developer getting started with it, someone who is operating a ockam-using tool, or invoking the Ockam CLI tool.

  3. Assuming fixing the log-and-ignore error handling we have in some places.

@thomcc
Copy link
Contributor

thomcc commented Apr 8, 2022

In #2566 @spacekookie and I overhauled the error API.

While I believe that this is largely an improvement, It revealed it has a number of areas needing further refinement, and several aspects did not pan out as well as I had hoped:

no_std insufficiently considered during the design:

This part of the design completely failed to pan out, IMO. The root issue is that the trait bounds used to accept "any error" under feature = "std" are non-viable under no_std -- we have a compat trait, but nobody implements it.

The largest problems this causes are two-fold:

  1. This means that with this change, code that previously could ignore no_std now must special case it.

  2. We essentially lose all ability to have any error information beyond the code in the no_std case. This will make it even harder to debug what went wrong than it was before.

Needs improved usage examples

While there's a lot of documentation for the error kinds/codes there's insufficient guidance on usage of the error container. For example, I invisioned we'd use Error::new(kind, origin, "message ...") to construct things if no cause is present, for example. (This is basically how something like std::io::Error::new gets used)

Too much boilerplate in per-crate error enums

In practice I think we should not need the per-crate error enums much. They encourage us to discard error causes, and require a significant amount of boilerplate.

Instead, I'd rather us propagate errors directly

Thoughts for improvement

No more serialization constraint

One development since this design is that I think we no longer actually need errors to be serializable (I believe it was only used by VaultWorker, which has been removed).

This would simplify the design, and allow us to support downcasting to the cause (or even allow anyhow-style downcasting to any member of the set of causes). This is desirable because it allows users of ockam to return an error from a worker and losslessly get it back from the other side.

Blanket impls on no_std

The biggest area of uncertaintly I have is how to improve things so that the no_std situation is less of a problem.

  1. I think one possible fix may be to add blanket impls to ockam_core::compat::error::Error under no_std on anything which is Display+Debug. That said, this is a meaningful divergence from the trait it attempts compatibility with, so it probably requires some rethinking the trait entirely.

  2. Without the serialization constraint, it would be easy for us to store:

    • The caller location -- The API returns a &'static Location<'static> to us, which is only a single pointer in size, and requires no allocation. I suspect this is acceptable on embedded (that said, we would probably allow disabling it, since it the location info adds to the static data in the binary).
    • The error that was passed in -- in particular, we could use the same trick dyn*/ThinBox intend to use and store it inline in the case where it's smaller than a pointer. This would hopefully allow supporting some information about the error cause.

Much of this can be done in a more incremental way than was attempted this time.

@spacekookie
Copy link
Contributor

spacekookie commented Apr 8, 2022

I agree with most of your observations, except:

Too much boilerplate in per-crate error enums

In practice I think we should not need the per-crate error enums much. They encourage us to discard error causes, and require a significant amount of boilerplate.

Instead, I'd rather us propagate errors directly

The problem I see with this approach is that we will return errors that don't mean anything to users. If you look at the breakdown from the ockam_node miro board, tons of errors are because of an internal sender problem. Throwing a tokio error at the user is extremely unergonomic, so we'll want to wrap the error with some amount of context at least.
Similar to decoding issues. serde_bare errors are very obtuse so instead we should wrap them in some context. Down the line, at least in ockam_node and ockam I would like to keep some kind of failure journal. Because lots of our errors are red herrings (i.e. tokio failed or bare failed) we need to be able to keep track of other events that happened in the system, for example when a receive fails, we can check the error journal or some other state, see that the node was shut down, and then return a Rejected instead, etc.

Overall crate-local errors (or even module-local errors) should be considered part of the API, not just an afterthought. And especially not something where we should just propagate errors from layers that our users neither know or care about

@thomcc
Copy link
Contributor

thomcc commented Apr 8, 2022

so we'll want to wrap the error with some amount of context at least

This is what the Error::context is for, mostly. The use I imagined was failure/anyhow context chaining like:

some_tokio_func().await.map_err(|e| {
    Error::new(Origin::Node, Kind::Io, e)
    	.context("reason", "some_tokio_func() is upset!")
})

(That said, I don't blame you for not realizing this -- the fact that it's key/value so that stuff like errors from workers could contain which worker type/instance caused the issue probably made this confusing)

Or is there a reason that's not viable?

@spacekookie
Copy link
Contributor

I think I need to play around with the attached context API a bit more. It's certainly doable. I still think that having a per-API error facade to abstract the most common problems is a good design decision.
That being said, the current one is bloated and kinda useless so I'll look into how and what can be improved here

@thomcc
Copy link
Contributor

thomcc commented Apr 8, 2022

especially not something where we should just propagate errors from layers that our users neither know or care about

I partially agree. I think you should have the underlying error available in some way -- dropping it on the floor because it seems like an implementation detail is not useful behavior for a library.

And I think a lot of them are potentially relevant to to users of the API. For example, serde errors:

  • deserialization errors could be them sending data on one end that doesn't match on the other. (some context that would be useful would be the type name of the type which couldn't be deserialized to)
  • serialization errors in serde should basically never be ignored -- almost always indicates misuse, at least when serializing to data in memory. Users should fix their types when that occurs, since it means the output format can't support the type (for example, JSON can't support serializing a Map<K, V> for if K is not a string/integer, etc)

(The use of BARE makes everything more annoying here since it has basically no ability to provide even remotely useful info...)

I also think tokio errors are a weird case. Tokio errors have a few flavors:

  1. IO errors: These are definitely relevant to the user, but are probably not coming from ockam_node.
  2. Channel errors: Most of the time, Send/Recv errors may indicate bugs in the node/router, probably (I suppose except during shutdown?).
  3. Timer errors: except during shudown, these indicate load is too high. Something needs to backoff somehow, I think this is an error that needs to be handled by the node/router.

A lot of these I feel like we are in the habit of just sending to without thought. The very vague hope was that the fact that there's easy anyhow/failure/Box<dyn Error> autoconversion of any error type to ockam_core::Error would eventually force us to think about this kind of stuff more. (That said I didn't expect this to happen during the initial landing)

I think I need to play around with the attached context API a bit more.

Ah, yeah, I think this is maybe what I mean by needing improved usage examples. It's probably pretty unclear that that's what that part is for. I had planned on integrating it myself, which would help there, but... well, yeah (thank you so much for taking it over though seriously).

That said I think this part of the API does have some real wonkiness:

  • The context keys are arbitrary strings. This is flexible to the point of being unclear. There should be an enum or something (maybe allowing arbitrary strings for cases not otherwise handled).

  • The context key should be optional. the most common case for the context in failure/anyhow in my experience is noting what you were doing when the error happened, so it should support that

  • Definitely need a extension trait adding .context and .with_context methods to Result<T, Error> (which would do nothing for Ok(_), but add it to the error).

  • Assuming we don't need deserialization of these, the context entries should possibly store a &'static Location<'static>.

Down the line, at least in ockam_node and ockam I would like to keep some kind of failure journal

Hmm... I don't really know what you mean by a failure journal.

Overall crate-local errors (or even module-local errors) should be considered part of the API

Yeah I mean, the need to go through traits really limits flexibility here. Same reason std::io::Error can be used to round-trip user errors, as well as being a giant catch-all error type which can represent any problem that you can have when doing any sort of interaction with the OS.

@thomcc
Copy link
Contributor

thomcc commented Apr 8, 2022

I still think that having a per-API error facade to abstract the most common problems is a good design decision

So I'm not against this, for sure. In the direct user-facing API-surface there are very likely pieces where it doesn't really make sense to use something type-erased like ockam_core::Error.

In the internals I'm unsure. I think we need to be considered about error handling there, and the errors aren't API-facing. If there's specific cases that need to be handled separately it makes sense though... (Hm, I guess I see how something like the Result<Result<_, _>, _> stuff in that sled blog post might be a better way for some pieces here).

I'd... really like to try and avoid making it too easy/automatic to ignore errors (at least in some of the code).

@thomcc
Copy link
Contributor

thomcc commented Apr 8, 2022

Ah, okay. I'm pretty sure I have a solution to how to make the design not require different trait bounds on no_std vs std (in no_std mode nothing was convertible into the error trait, because no_std-compatible code tends to not use a fallback for errors and just guards on #[cfg(feature = "std")]).

  1. Using a more complete error1 trait shim in no_std.
  2. Using a specialization hack (like autoref or valuable-style typeid checks) for a clean fallback for types that do not implement the error trait, solving the rest of the problem.

This allows the function signatures to work without differences between no_std and std -- presumably embedded is pretty restricted with what it can store. It will also hopefully improve ergonomics on all cases, especially of the context API.


Also, it does turn out that nothing needs error roundtrip serialization (when the initial work was done, it was still needed, but it got removed since then). I had a quick meeting with @spacekookie this morning about this and it also seems that supervisors will not need it either (they need serialization of some data, but no requirement for deserialization)

This enables... major cleanups to the internals and the model. Actually, almost all the complexity in the code (and much of the complexity in the model) came from how painful serialization of some of this data is.

Footnotes

  1. We're already using core2::io as the std::io fallback, so switching to core2::error seems fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants