
Report Normalization failures to Sentry #15695

Merged: 22 commits, Aug 25, 2022

Conversation

@Phlair (Contributor) commented Aug 16, 2022

What

Implementation for this issue: #14926

Reports Normalization-caused failures to Sentry so we can track them in on-call.

How

Building on the job_error_reporter framework and the Normalization failure reasons, JobErrorReporter now checks for Normalization-caused failure reasons and sends them to Sentry as events.
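At a high level: the job's failure summary is scanned for failure reasons whose origin is Normalization, and each of those is forwarded to the error reporting client as a Sentry event. A minimal sketch of that shape (simplified stand-in types, not the actual Airbyte classes):

```java
// Illustrative sketch only: simplified stand-ins, not the real Airbyte classes.
import java.util.List;

public class NormalizationErrorReportingSketch {

  enum FailureOrigin { SOURCE, DESTINATION, NORMALIZATION }

  record FailureReason(FailureOrigin origin, String internalMessage, String stacktrace) {}

  interface ErrorReportingClient {
    void reportJobFailureReason(FailureReason reason);
  }

  // Scan the job's failure reasons and forward only normalization-caused ones to Sentry.
  static void reportNormalizationFailures(final List<FailureReason> failureReasons,
                                          final ErrorReportingClient client) {
    failureReasons.stream()
        .filter(reason -> reason.origin() == FailureOrigin.NORMALIZATION)
        .forEach(client::reportJobFailureReason);
  }

  public static void main(String[] args) {
    final ErrorReportingClient stdout =
        reason -> System.out.println("Reporting to Sentry: " + reason.internalMessage());
    reportNormalizationFailures(
        List.of(
            new FailureReason(FailureOrigin.NORMALIZATION, "Database Error in model users", "..."),
            new FailureReason(FailureOrigin.SOURCE, "connector crashed", "...")),
        stdout);
  }
}
```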

Added some notes next to each item in the recommended reading order below, describing the changes.

Notes

  • This doc details how we wanted to group dbt errors in Sentry. To do this, the SentryException is built by grabbing just the useful error info out of the stacktrace. The full stacktrace is still visible in Sentry under the Failure Reason.
  • Experimented with handling errors coming from multiple models, but ultimately couldn't find a consistent identifier in the dbt logs to separate these, so only the first error is used for grouping in Sentry. The rest of the errors are still present in the stacktrace for the event.
  • We need the normalization version to group properly and tie into Sentry releases, but we don't persist the normalization version outside of the code.
    • Decided to add it as a parameter to JobErrorReporter and then import NormalizationErrorFactory (1 of 2 sources of truth for this value, the other being the normalization dockerfile) into ServerApp & WorkerApp, where we build the JobErrorReporter (see the sketch after this list).
    • The tradeoff here is a less intrusive change than altering a db model and storing the normalization version there. This doesn't change where we update the normalization version (workflow stays consistent); however, it might not be a terrible idea in the future to persist the version of Normalization used in each attempt.
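For the version wiring specifically, the intent is just to hand the already-maintained constant to the reporter at construction time so events can be tagged with a normalization release. A hypothetical shape (not the actual ServerApp/WorkerApp code; names and values are illustrative):

```java
// Hypothetical wiring sketch: the normalization image/version constant that already
// lives in code is passed into the reporter when the app wires it up.
public class JobErrorReporterWiringSketch {

  static class JobErrorReporter {
    private final String normalizationImage;
    private final String normalizationVersion;

    JobErrorReporter(final String normalizationImage, final String normalizationVersion) {
      this.normalizationImage = normalizationImage;
      this.normalizationVersion = normalizationVersion;
    }

    // Used as the Sentry "release" so events group per normalization version.
    String normalizationRelease() {
      return normalizationImage + "@" + normalizationVersion;
    }
  }

  public static void main(String[] args) {
    // In the real apps these values would come from the existing source of truth in code.
    final JobErrorReporter reporter = new JobErrorReporter("airbyte/normalization", "0.2.10");
    System.out.println(reporter.normalizationRelease()); // airbyte/normalization@0.2.10
  }
}
```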

Recommended reading order

  1. DefaultNormalizationRunner - some changes to building dbt TRACE msg
  2. SentryExceptionHelper - building exception using just the useful error line (for grouping)
  3. JobErrorReporter - adding handling for normalization failures
  4. tests + minor changes

@Phlair (Contributor, Author) commented Aug 17, 2022

example of test error in Sentry:

[Screenshot: test error as it appears in Sentry, 2022-08-17 17:59:04]

[Screenshot: test error as it appears in Sentry, 2022-08-17 17:59:20]

@sherifnada (Contributor) left a comment:

In the second screenshot, I see the title as DbtDatabaseError -- is this the title we see in the list of incidents in Sentry's incident view? If so, could we change that to be more unique for each issue?

case "warn" -> logger.warn(logMsg);
case "error" -> logAndCollectErrorMessage(logMsg);
default -> logger.info(jsonLine.asText()); // this shouldn't happen but logging it to avoid hiding unexpected lines.
default -> logger.info(logMsg); // this shouldn't happen but logging it to avoid hiding unexpected lines.
Contributor:

this seems like it would render jsonLines differently than the current impl -- what's the upside to doing it this way?

@Phlair (Contributor, Author) replied:
Did have some wider changes in this part and missed changing this bit back; reverted.

// The relevant error message is often the line following the "Database Error..." line
// e.g. "Column 10 in UNION ALL has incompatible types: DATETIME, TIMESTAMP"
boolean nextLine = false;
for (String errorLine : streamFactory.getDbtErrors()) {
Contributor:

is there any way to DRY this with the method in SentryExceptionHelper?

@@ -166,4 +169,43 @@ private static Optional<List<SentryException>> buildJavaSentryExceptions(final S
return Optional.of(sentryExceptions);
}

private static Optional<List<SentryException>> buildNormalizationDbtSentryExceptions(final String stacktrace) {
Contributor:

How common is it that the line immediately following "Database Error" is the "interesting" one? E.g., is it 100% of the time, based on a reasonable empirical sample size?

@Phlair (Contributor, Author) replied:

So I was going for a classic 80-20 coverage move here, using the following data:

  • All normalization-system-errors (with the new failure reasons): 1,110 rows
  • Filtering for 'database errors': 736 rows (nominal 736/1110, 66%; unique 368/442, 83%)
  • 694/736 (~94%) follow the pattern where the line immediately following "Database Error in model" is the single useful part.

@Phlair (Contributor, Author) commented Aug 18, 2022:

Did some further investigation today, looking deeper at the complete set of failures we've encountered (from those Metabase links above), and decided it's not significant extra work to cover a lot of the possible error structures based on what we've seen so far:

The other 42 errors follow 1 of 3 patterns:

  • SQL compilation error:\n (next line with detail)
  • Invalid input (the detail lives on a later line of the form "context: {}")
  • syntax error at or near "{value}" (next line with detail)

Currently implementing specific logic for these three extra edge cases so we cover all currently known database errors. (No doubt there are other edge cases we haven't encountered yet, but we can build upon this parsing code over time.)

Also implementing logic for the rest of the error types (the other ~20% of unique errors we've seen):

  • Unhandled error
  • Compilation Error
  • Runtime Error

Basing the logic on our dataset in Metabase (links in the previous comment), this now covers >95% of every dbt error we've seen since implementing normalization failure reasons; the extraction approach is sketched below.
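To make that concrete, here's roughly what the extraction boils down to, simplified to just the "useful line follows a known prefix" patterns (illustrative prefixes and helper names, not the final SentryExceptionHelper code):

```java
import java.util.List;
import java.util.Optional;

public class DbtUsefulErrorLineSketch {

  // Error prefixes whose *following* non-empty line usually carries the useful detail.
  private static final List<String> NEXT_LINE_PREFIXES = List.of(
      "Database Error in model",
      "SQL compilation error:",
      "syntax error at or near");

  // Walks the dbt error lines and returns the first "useful" line, i.e. the line
  // immediately following a known error prefix. Empty if no known pattern matched,
  // in which case the caller falls back to grouping on the raw message.
  static Optional<String> extractUsefulError(final List<String> dbtErrorLines) {
    boolean takeNextLine = false;
    for (final String rawLine : dbtErrorLines) {
      final String line = rawLine.trim();
      if (takeNextLine && !line.isEmpty()) {
        return Optional.of(line);
      }
      if (NEXT_LINE_PREFIXES.stream().anyMatch(line::contains)) {
        takeNextLine = true;
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    final List<String> lines = List.of(
        "Database Error in model users (models/generated/airbyte_tables/users.sql)",
        "Column 10 in UNION ALL has incompatible types: DATETIME, TIMESTAMP");
    System.out.println(extractUsefulError(lines).orElse("<no known pattern>"));
  }
}
```

The "Invalid input" case needs a slightly different walk (its detail lives on a later "context: {}" line), so that one gets its own branch in the real parsing.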

@Phlair (Contributor, Author) commented Aug 18, 2022:

> is this the title we see in the list of incidents in sentry's incident view? if so could we change that to be more unique for each issue?

These changes will also address this point.

// previous line was "Database Error..." so this is our useful message line
if (nextLine) {
  usefulError = errorLine;
  break;
Contributor:

From reading this I think we're only building one SentryException from the stack trace, though I see the dbt stacks could have multiple errors (e.g. error 1 of 2 counts) - is there any value in building a SentryException for each of those?

@Phlair (Contributor, Author) commented Aug 18, 2022:

I played around with this but decided to scrap it because:

  • the dbt error logs are JSON formatted and do sometimes have an identifier (.node_info.unique_identifier); however, this isn't present in every error log line, so we can't rely on it for grouping (see the sketch after this comment).
  • we also can't rely on simple order (for example, look at the stacktrace in this test event). We could get clever and parse structures that are ordered, but this would still only cover the errors that are structured in a specific way, when unfortunately they're wildly variable (e.g. check out all these with different structures).

Side note: I will be pushing up some changes soon to cover grabbing a useful message from a wider array of error structures, but for now I'm going to just identify events based on the first error message for simplicity. Will create an issue for incremental improvement on this to handle multiple exceptions.
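For reference, this is the kind of identifier lookup I experimented with for the grouping question above; a toy Jackson sketch (field name as observed in the dbt logs, and only sometimes present, which is exactly the problem):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DbtLogIdentifierSketch {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Returns the model identifier from a structured dbt log line when present,
  // or "" when the field is missing, which is why it can't be relied on for grouping.
  static String modelIdentifier(final String jsonLogLine) throws Exception {
    final JsonNode logLine = MAPPER.readTree(jsonLogLine);
    return logLine.path("node_info").path("unique_identifier").asText("");
  }

  public static void main(String[] args) throws Exception {
    final String withId = "{\"node_info\":{\"unique_identifier\":\"model.airbyte.users\"},\"msg\":\"Database Error\"}";
    final String withoutId = "{\"msg\":\"Database Error\"}";
    System.out.println(modelIdentifier(withId));              // model.airbyte.users
    System.out.println(modelIdentifier(withoutId).isEmpty()); // true: nothing to group on
  }
}
```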

@pedroslopez (Contributor) left a comment:

This looks good, though seeing the stack trace parsing reminded me of the JSON structured logging format you had mentioned earlier on - mainly left a question of whether those logs could be used to do this a bit more cleanly.

@sherifnada (Contributor) commented:

@Phlair I won't be able to review tomorrow so feel free to move ahead with Pedro's review!

@Phlair Phlair merged commit ea44a0c into master Aug 25, 2022
@Phlair Phlair deleted the george/normalization-sentry branch August 25, 2022 10:44
rodireich pushed a commit that referenced this pull request Aug 25, 2022
* bulk

* simplification

* voila

* normalization version

* key prefix & pmd fix

* bits

* test fix

* handle more dbt error structures and DRY

* format

* better code comment

* enum for keys

* fix pmd

* I _love_ pmd
Labels: area/platform, area/scheduler, area/server, area/worker