-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report Normalization failures to Sentry #15695
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the second screenshot, I see the title as DbtDatabaseError
-- is this the title we see in the list of incidents in sentry's incident view? if so could we change that to be more unique for each issue?
case "warn" -> logger.warn(logMsg); | ||
case "error" -> logAndCollectErrorMessage(logMsg); | ||
default -> logger.info(jsonLine.asText()); // this shouldn't happen but logging it to avoid hiding unexpected lines. | ||
default -> logger.info(logMsg); // this shouldn't happen but logging it to avoid hiding unexpected lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like it would render jsonLines differently than the current impl -- what's the upside to doing it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did have some wider changes in this part and missed changing this bit back, reverted.
// The relevant error message is often the line following the "Database Error..." line | ||
// e.g. "Column 10 in UNION ALL has incompatible types: DATETIME, TIMESTAMP" | ||
boolean nextLine = false; | ||
for (String errorLine : streamFactory.getDbtErrors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any way to DRY this with the method in SentryExceptionHelper
?
@@ -166,4 +169,43 @@ private static Optional<List<SentryException>> buildJavaSentryExceptions(final S | |||
return Optional.of(sentryExceptions); | |||
} | |||
|
|||
private static Optional<List<SentryException>> buildNormalizationDbtSentryExceptions(final String stacktrace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how common is it that the line immediately following Database error
is the "interesting" one? e.g: is it 100% of the time based on a reasonable empirical sample size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was going for a classic 80-20 coverage move here, using the following data:
all normalization-system-errors (with the new failure reasons)
1,110 rows
Filtering for 'database errors'
736 rows
Nominal 736/1110 (66%)
Unique 368/442 (83%)
694/736 (~94%) follow the pattern of line immediately following "Database Error in model" is the single useful part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some futher investigation today looking deeper at the complete set of failures we've encountered (coming from those metabase links above) and decided it's not significant extra work to cover a lot of the possible error structures based on what we've seen so far:
The other 42 errors follow 1 of 3 patterns:
- SQL compilation error:\n (next line with detail)
- Invalid input (a line X after this one that is "context: {}" has detail)
- syntax error at or near "{value}" (next line with detail)
Currently implementing specific logic for these three extra edge cases so we cover all currently known database errors. (No doubt there are other edge cases that exist and we haven't encountered but we can build upon this parsing code over time.)
Also implementing logic for the rest of the error types (the other ~20% of unique errors we've seen):
- Unhandled error
- Compilation Error
- Runtime Error
Basing the logic from our dataset in Metabase (in previous comment links), now covering >95% of every dbt error we've seen since implementing normalization failure reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the title we see in the list of incidents in sentry's incident view? if so could we change that to be more unique for each issue?
These changes will also solve this point
...src/main/java/io/airbyte/scheduler/persistence/job_error_reporter/SentryExceptionHelper.java
Outdated
Show resolved
Hide resolved
// previous line was "Database Error..." so this is our useful message line | ||
if (nextLine) { | ||
usefulError = errorLine; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From reading this I think we're only building one SentryException from the stack trace, though I see the dbt stacks could have multiple errors (e.g. error 1 of 2 counts) - is there any value in building a SentryException for each of those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played around with this but decided to scrap it because:
- the dbt error logs are json formatted and do sometimes have an identifier (.node_info.unique_identifier), however this isn't present in every error log line so we can't rely on that for grouping.
- we also can't rely on simple order (for example look at the stacktrace in this test event). We could get clever and parse structures that are ordered but this would then still only cover the errors that are structured in a specific way when unfortunately they're wildly variable (e.g. check out all these with different structures).
side note: I will be pushing up some changes soon to cover grabbing a useful message from a wider array of the error structures but for now going to just identify events based on the first error message for simplicity. Will create an issue for incremental improvement on this to handle multiple exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, though seeing the stack trace parsing reminded me of the JSON structured logging format you had mentioned earlier on - mainly left a question of whether those could be used to potentially do this a bit cleaner
...src/main/java/io/airbyte/scheduler/persistence/job_error_reporter/SentryExceptionHelper.java
Show resolved
Hide resolved
...src/main/java/io/airbyte/scheduler/persistence/job_error_reporter/SentryExceptionHelper.java
Outdated
Show resolved
Hide resolved
@Phlair I wont' be able to review tomorrow so feel free to move ahead with Pedro's review! |
# Conflicts: # airbyte-workers/src/main/java/io/airbyte/workers/process/KubePodProcess.java
* bulk * simplification * voila * normalization version * key prefix & pmd fix * bits * test fix * handle more dbt error structures and DRY * format * better code comment * enum for keys * fix pmd * I _love_ pmd
What
Implementation for this issue: #14926
Reports Normalization caused failures to Sentry so we can track in oncall.
How
Building on the job_error_reporter framework and Normalization failure reasons, JobErrorReporter now checks for Normalization caused failure reasons and sends up to Sentry as events.
Added some notes next to each item describing the changes in the recommend reading order below.
Notes
Failure Reason
.Recommended reading order
DefaultNormalizationRunner
- some changes to building dbt TRACE msgSentryExceptionHelper
- building exception using just the useful error line (for grouping)JobErrorReporter
- addding handling for normalization failures