Improved Error format #6676

markfields · 2021-07-09T15:09:25Z

markfields
Jul 9, 2021
Maintainer

[edit: removing waypoints for now, see previous version for context on that]

On the MS Office Fluid team we've been struggling with how difficult it is to aggregate across the error messages that we log (see #5426). I've spent quite some time in both the code and the production logs to learn what we can do differently in how we instrument errors to yield quicker insight when analyzing error signals at scale.

My proposal is to standardize on the following simple interface for errors raised or handled through the Fluid Framework:

interface IFluidErrorBase extends Readonly<Partial<Error>> {
  readonly errorType: string; // Term succinctly describing what kind of error occurred (e.g. AuthorizationError)
  readonly fluidErrorCode: string; // Term describing a more specific case or location in code where the error originated
}

We already use errorType throughout, both to enable type narrowing to a specific TypeScript type, and to give a fairly low-granularity categorization of our errors. fluidErrorCode is new, and provides a property distinct from message where we can further specify the "what" of an error (or sometimes the "where") with static strings that are well-suited for aggregation.
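
To illustrate the two roles described above, here is a minimal sketch (not the actual Fluid code). Only IFluidErrorBase and the claims property on IAuthorizationError come from this thread; the errorType strings, the fluidErrorCode values, and the countByCode helper are made up for the example. errorType drives type narrowing, while the static fluidErrorCode gives a stable aggregation key even when message varies.

```typescript
interface IFluidErrorBase extends Readonly<Partial<Error>> {
  readonly errorType: string;
  readonly fluidErrorCode: string;
}

// Hypothetical shape: only the `claims` property is mentioned in this thread.
interface IAuthorizationError extends IFluidErrorBase {
  readonly errorType: "authorizationError";
  readonly claims: string;
}

// Type guard: errorType tells us which interface the error adheres to.
function isAuthorizationError(e: IFluidErrorBase): e is IAuthorizationError {
  return e.errorType === "authorizationError";
}

// Count errors per (errorType, fluidErrorCode) pair, ignoring the message text.
function countByCode(errors: readonly IFluidErrorBase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of errors) {
    const key = `${e.errorType}/${e.fluidErrorCode}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Illustrative data: messages differ, codes stay static.
const logged: IFluidErrorBase[] = [
  { errorType: "authorizationError", fluidErrorCode: "tokenExpired", message: "Token expired at 10:01" },
  { errorType: "authorizationError", fluidErrorCode: "tokenExpired", message: "Token expired at 10:07" },
  { errorType: "throttlingError", fluidErrorCode: "serviceBusy", message: "Retry after 5s" },
];
const counts = countByCode(logged);
```

Grouping on message here would produce three buckets; grouping on the code produces two.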

Looking at how we instrument errors today, a few patterns muddy the water; they will be cleaned up as part of this effort. Here are some examples:

"Wrapping" an arbitrary error by extracting message, stack, and possibly errorType and creating a new object with a particular errorType. (See usages of wrapError)
Taking an existing error or error message (e.g. returned from the service) and prefixing it with additional information (see here or here)

After considering each use of wrapError, I don't believe we ever actually want to properly wrap an error where we change the errorType - the errorType is inherent to what originally went wrong. Rather, we are just looking to annotate the error for logging - or possibly to mix in other properties exposed via a TypeScript type. Annotating for logging can be achieved more plainly with the ILoggingError interface and a new function, annotateError:

interface ILoggingError {
  addTelemetryProperties(props: ITelemetryProperties): void;
  getTelemetryProperties(): ITelemetryProperties;
}

function annotateError(
  error: unknown,
  props: ITelemetryProperties,
): IFluidErrorBase & ILoggingError {
  // check typeguards for both IFluidErrorBase and ILoggingError.
  // If it's not ILoggingError then mixin ILoggingError members and add the props
  // If it's not IFluidErrorBase then use error.name as errorType
  // Likewise, try to fill in `message`, `name` and `stack` if any are missing.
}

Note that annotateError will return the same error object, just with additional stuff mixed in. The exception is when the provided error is not an Object, in which case we'll use it as the message of a new Error object.
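
For readers who want something concrete, here is one possible way the commented steps could be filled in. This is a sketch under assumptions, not the implementation from the PR: ITelemetryProperties is replaced by a simplified stand-in, the "(no message)" placeholder is invented, and defaulting fluidErrorCode to "none" borrows the convention mentioned later in this thread for errors from an unknown source.

```typescript
// Simplified stand-in for the real ITelemetryProperties type (assumption).
type ITelemetryProperties = Record<string, string | number | boolean | undefined>;

interface IFluidErrorBase extends Readonly<Partial<Error>> {
  readonly errorType: string;
  readonly fluidErrorCode: string;
}

interface ILoggingError {
  addTelemetryProperties(props: ITelemetryProperties): void;
  getTelemetryProperties(): ITelemetryProperties;
}

function annotateError(
  error: unknown,
  props: ITelemetryProperties,
): IFluidErrorBase & ILoggingError {
  // Anything that is not an object becomes the message of a new Error.
  // `any` is deliberate: members are mixed into an object of unknown shape.
  const target: any =
    typeof error === "object" && error !== null ? error : new Error(String(error));

  // Mix in the ILoggingError members if they are missing.
  if (
    typeof target.addTelemetryProperties !== "function" ||
    typeof target.getTelemetryProperties !== "function"
  ) {
    const bag: ITelemetryProperties = {};
    target.addTelemetryProperties = (p: ITelemetryProperties): void => {
      Object.assign(bag, p);
    };
    target.getTelemetryProperties = (): ITelemetryProperties => ({ ...bag });
  }
  target.addTelemetryProperties(props);

  // Fill in whatever is missing; existing values are never overwritten.
  if (typeof target.name !== "string") { target.name = "Error"; }
  if (typeof target.message !== "string") { target.message = "(no message)"; }
  if (typeof target.stack !== "string") {
    // Assumption: this stack points at annotateError, not the original throw site.
    target.stack = new Error(target.message).stack;
  }
  if (typeof target.errorType !== "string") { target.errorType = target.name; }
  if (typeof target.fluidErrorCode !== "string") { target.fluidErrorCode = "none"; }
  return target;
}

// Usage: the returned value is the very same TypeError, now annotated.
const original = new TypeError("boom");
const annotated = annotateError(original, { packageName: "driver" });
```

This sketch does not handle frozen or sealed objects, where mixing in members would fail.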

See #6764 for PR introducing IFluidErrorBase, and the evolution of annotateError: Two functions, annotateErrorObject and normalizeError, both of which yield a valid IFluidErrorBase.

Execution Plan

Here's the sequence of issues we're working through:

markfields · 2021-07-09T15:41:48Z

markfields
Jul 9, 2021
Maintainer Author

Another change I'd like to start staging is to transition from errorType to simply using the standard Error.name property. And in transitioning to that, I'd also propose we use some namespacing, to explicitly distinguish FF errors from something like TypeError - and even Driver v. Container v. Runtime error names.

2 replies

markfields Jul 9, 2021
Maintainer Author

@anthony-murphy you raised an objection to this:

I'm also not a huge fan of getting rid of error type, and i don't think it is synonymous with error.name. error.name seems more like the error code concept.

Compare https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error with our various ErrorType enums - feels like the same abstraction level to me. Especially as we use it for type narrowing to indicate what interface a particular error adheres to for conveying additional context (like IAuthorizationError.claims)

markfields Jul 9, 2021
Maintainer Author

cc @jvoels, who intuitively expected the stuff we have in errorType to live in Error.name as I was explaining how our errors are shaped the other day.

anthony-murphy · 2021-07-09T16:40:50Z

anthony-murphy
Jul 9, 2021
Maintainer

I feel like this is getting over complicated. Is finding where an error occurred really that hard? Do we really need another concept? It seems like we're trying to rebuild the stack, which we will still have. The only trouble I ever had finding a location was with asserts without messages in functions that had multiple asserts, or when all the function names had been minified after the last public method.

I'm also not a huge fan of getting rid of error type, and i don't think it is synonymous with error.name. error.name seems more like the error code concept. rather than invest in changing this i'd rather see us use more structured error apis between our layers. then all thrown errors are just truly unknown things that should be investigated

5 replies

anthony-murphy Jul 9, 2021
Maintainer

If we have errors in our apis, I think we still need much of this in the case of unknown errors. A good example of this is the DataProcessingError, which points to an unhandled exception along the data processing path, either ops or summary, and signifies a critical error.

anthony-murphy Jul 9, 2021
Maintainer

I also think something like waypoint encourages the catch and rethrow pattern, which I believe is an anti-pattern. The only time we should catch is when we want to do something. Usually this is near an entry point, and results in failure, as we should fail fast in the presence of unknown errors. The fastest way to fail is to just let the error bust the stack, and let the topmost frame catch and close everything down.

markfields Jul 9, 2021
Maintainer Author

I feel like this is getting over complicated

It's getting simpler, compared to the state of the code today. The reality is, we are doing this kind of annotation throughout the code, but in a nonstandard way that thwarts any kind of aggregate analysis. As part of this work I will be removing all notion of "wrapping" errors, which is a step in the right direction I think we all agree. I see the introduction of waypoints as a concession - to maintain some of the logging that's been attempted in the code today, but isolating it. Maybe it will be useless, and if so, we can follow-up and remove it. I've kept it out of the core IFluidErrorBase interface precisely because its value is uncertain, but it's an easy "bonus" we can get in the logs.

I'm not sure why catch-annotate-rethrow is an antipattern. A key part of this design is to preserve the original error object, so the net result is just additive, in terms of letting folks tack on logging info. CreateProcessingError is doing exactly this, except it's busted because it's overloading errorType - like @vladsud commented in a recent PR which I now fully agree with. It should just catch-annotate-throw the original error (and we'll use a new property to indicate DPE state, update our monitors and queries accordingly, etc).
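
A rough sketch of catch-annotate-throw as described here, to show that the original error object (and its name) survives. The helper name runDataProcessing and the property name dataProcessingError are hypothetical; the thread only says that "a new property" would indicate DPE state.

```typescript
// Hypothetical helper: runs `fn` and, if it throws, annotates the thrown
// object in place and rethrows that same object.
function runDataProcessing<T>(fn: () => T): T {
  try {
    return fn();
  } catch (error) {
    if (typeof error === "object" && error !== null) {
      // `dataProcessingError` is an illustrative property name, not a real API.
      (error as Record<string, unknown>).dataProcessingError = true;
    }
    // Rethrow the original object; its name and stack are untouched.
    throw error;
  }
}

// Stand-in for some op-processing code that fails.
function applyOp(): never {
  throw new RangeError("sequence number out of range");
}

let caught: unknown;
try {
  runDataProcessing(applyOp);
} catch (error) {
  caught = error;
}
```

The handler at the top still sees a RangeError, now carrying the extra flag, rather than a new error with an overloaded errorType.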

markfields Jul 9, 2021
Maintainer Author

rather than invest in changing this i'd rather see us use more structured error apis between our layers. then all thrown errors are just truly unknown things that should be investigated

We don't have to wait until we change our API surface to return structured errors to get to the point where we can distinguish "truly unknown things". With this proposal, even in a world where we're still throwing our own errors, just by filtering on errorType (even better if we do namespaced names in Error.name) you'll be able to distinguish our error types/names from unknown ones, which will have types like Error or TypeError. Today, everything ends up stamped genericError, which is useless.
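
As a sketch of the kind of filter this would enable, assuming namespaced names: the "fluid:" prefix, the LoggedError record shape, and the sample data below are all invented for illustration and are not part of the proposal.

```typescript
// Hypothetical log record and namespace prefix; neither is part of the proposal.
interface LoggedError {
  readonly errorType: string;
  readonly message: string;
}
const FLUID_PREFIX = "fluid:";

// True when the error carries one of our namespaced types.
function isRaisedByFluid(e: LoggedError): boolean {
  return e.errorType.startsWith(FLUID_PREFIX);
}

// Illustrative data only.
const records: LoggedError[] = [
  { errorType: "fluid:driver:throttlingError", message: "Retry after 5s" },
  { errorType: "TypeError", message: "Cannot read property 'x' of undefined" },
  { errorType: "Error", message: "unexpected" },
];

// Everything without our prefix is a "truly unknown thing" worth investigating.
const unknownErrors = records.filter((r) => !isRaisedByFluid(r));
```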

markfields Jul 9, 2021
Maintainer Author

I just said this about CreateProcessingError, but want to be a little more nuanced after re-reading that PR conversation:

except it's busted because it's overloading errorType - like @vladsud commented in a recent PR

Here's Vlad's comment and your response (talking about his 2nd point). I re-read your response, and my thinking is definitely a departure here from your point about unknown errors being intentionally diverted here.

But I think the split we really want is expected v. unexpected. Surely there are unexpected errors that would occur that we raise (e.g. an unexpected server response that we generate a JS Error from), v. some unknown error. So I'd rather preserve the original error - whatever form it takes - and annotate that it interrupted data processing.

markfields · 2021-07-09T20:18:54Z

markfields
Jul 9, 2021
Maintainer Author

I spoke with @anthony-murphy directly, and agreed to take waypoints out for now. It's an additive change that can come later, if it becomes clear that it will be valuable. As of now, it may be me overgeneralizing a pattern that only appears in a few places.

0 replies

vladsud · 2021-07-12T20:56:35Z

vladsud
Jul 12, 2021
Maintainer

My main feedback would be to find an incremental way to move forward and learn on the way. Quite often these types of changes are disruptive and we realize new constraints or better ways of doing something on the way, but redoing is expensive (compat cost).

I think the annotate pattern is good. I agree with Tony that the number of places where we catch and rethrow should be low, but there are obvious places where it happens directly or indirectly (summarization is an example where it's done indirectly without throwing). Plus that will allow us to not use string concatenation, and instead add more props indicating "what" and maybe "why" (i.e. "socket.io" + "disconnect" + "transport close", but allow consumption / aggregation in Kusto based on the whole thing or parts of it).
It's orthogonal, but we should also consider if we should use exceptions for expected errors (vs. using return value / type to clearly indicate that condition is expected, and make all non-expected errors more catastrophic).
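
The point about separate "what" and "why" props instead of string concatenation can be sketched roughly like this. The property names (source, event, reason), the countBy helper, and the sample rows are assumptions for illustration; in practice the grouping would happen in the telemetry query rather than in code.

```typescript
type Props = Readonly<Record<string, string>>;

// Illustrative events; the property names are hypothetical.
const events: Props[] = [
  { source: "socket.io", event: "disconnect", reason: "transport close" },
  { source: "socket.io", event: "disconnect", reason: "ping timeout" },
  { source: "socket.io", event: "disconnect", reason: "transport close" },
];

// Count rows grouped by any subset of the properties.
function countBy(rows: readonly Props[], keys: readonly string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = keys.map((k) => row[k]).join(" | ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Aggregate on the whole thing, or only on parts of it.
const byWhole = countBy(events, ["source", "event", "reason"]);
const byEvent = countBy(events, ["source", "event"]);
```

With a single concatenated message string, the second grouping would require parsing the text back apart.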

As for wrapping - I agree with you that it should be avoided. That said, there might be places where code needs to communicate a more generic error code to consumers (generic error(503) -> throttling error). I think when such cases happen, we should strongly consider logging the original error as is but with an additional correlation ID, and then raising a new error with a child correlation ID pointing to the original error. That way we can connect the dots from telemetry, while keeping the API our consumers need to deal with clean (they should not care how we arrived at this state). Basically, in places where logging & API (error reporting) are not aligned, we should not try to put a square peg in a round hole.
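
One way the correlation ID idea might look, as a sketch only. The ID scheme, the in-memory telemetry array, the property names errorId and parentErrorId, and the toThrottlingError helper are all hypothetical.

```typescript
interface LogEntry {
  readonly errorId: string;
  readonly parentErrorId?: string;
  readonly message: string;
}

const telemetry: LogEntry[] = []; // stand-in for a real telemetry sink
let nextId = 0;
const newErrorId = (): string => `err-${++nextId}`; // hypothetical ID scheme

type CorrelatedError = Error & {
  errorType: string;
  errorId: string;
  parentErrorId: string;
};

// Logs the original error as is (with its own ID), then returns a cleaner
// error for consumers that points back at it through parentErrorId.
function toThrottlingError(original: Error): CorrelatedError {
  const parentErrorId = newErrorId();
  telemetry.push({ errorId: parentErrorId, message: original.message });

  return Object.assign(new Error("Service is throttling requests"), {
    errorType: "throttlingError",
    errorId: newErrorId(),
    parentErrorId,
  });
}

const raised = toThrottlingError(new Error("generic error (503)"));
```

Consumers only see the throttling error; whoever reads the logs can follow parentErrorId back to the 503.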

1 reply

markfields Jul 15, 2021
Maintainer Author

Thanks Vlad.

As for the incremental path, I've already got a sequence of PRs in progress to go step-by-step, which has been useful. I'll update the original post with those PRs/issues.

I think converting some APIs to returning strongly typed error objects in failure cases is great, and can be explored independent of this, as it comes up. In the meantime, one feature of my work will be to make it immediately obvious via fluidErrorCode whether an error was raised by us v. came from an unknown source - the latter will have error code none. I'm looking forward to having this pivot in our telemetry very much.

As for unique IDs per error and correlating when wrapping, this is something I'd pondered and would be pretty simple to add. I've opened an issue about it in case we want to have further discussion before implementing. #6746

vladsud · 2021-07-16T18:12:54Z

vladsud
Jul 16, 2021
Maintainer

to make it immediately obvious via fluidErrorCode whether an error was raised by us v. came from an unknown source

I think it's more nuanced than that. We can raise the disconnect event (and mark where it was raised in our code), but some of the payload comes from server.
Similarly, we may raise a summarizer failed event, but the exception came from who knows where, so the error object can't be fully trusted.
I believe it's much more about tagging individual properties and having a system to ensure we do not log questionable data in environments where it's unsafe (we can log everything in DF) than trying to do what you suggest.
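
A very rough sketch of what per-property tagging could look like. The tag names, the TaggedValue shape, and the propsToLog helper are assumptions made up for this example, not an existing API.

```typescript
// Hypothetical tags and shapes, for illustration only.
type Tag = "safe" | "questionable";

interface TaggedValue {
  readonly value: string | number | boolean;
  readonly tag: Tag;
}
type TaggedProps = Readonly<Record<string, TaggedValue>>;

// Returns only the properties that may be logged in the current environment.
function propsToLog(
  props: TaggedProps,
  allowQuestionable: boolean,
): Record<string, string | number | boolean> {
  const out: Record<string, string | number | boolean> = {};
  for (const key of Object.keys(props)) {
    const entry = props[key];
    if (entry !== undefined && (entry.tag === "safe" || allowQuestionable)) {
      out[key] = entry.value;
    }
  }
  return out;
}

// A disconnect event: we know where we raised it, but part of the payload
// came from the server.
const disconnectProps: TaggedProps = {
  raisedAt: { value: "connectionManager", tag: "safe" },
  serverReason: { value: "text supplied by the server", tag: "questionable" },
};

const restricted = propsToLog(disconnectProps, false);
const everything = propsToLog(disconnectProps, true);
```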

0 replies
