
Report Normalization failures to Sentry #15695

Merged: 22 commits, Aug 25, 2022

Conversation

@Phlair (Contributor) commented Aug 16, 2022

What

Implementation for this issue: #14926

Reports Normalization-caused failures to Sentry so we can track them in on-call.

How

Building on the job_error_reporter framework and the Normalization failure reasons, JobErrorReporter now checks for Normalization-caused failure reasons and sends them to Sentry as events.
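At a high level: the job's failure summary is scanned for failure reasons whose origin is Normalization, and each of those is forwarded to the error reporting client as a Sentry event. A minimal sketch of that shape (simplified stand-in types, not the actual Airbyte classes):

```java
// Illustrative sketch only: simplified stand-ins, not the real Airbyte classes.
import java.util.List;

public class NormalizationErrorReportingSketch {

  enum FailureOrigin { SOURCE, DESTINATION, NORMALIZATION }

  record FailureReason(FailureOrigin origin, String internalMessage, String stacktrace) {}

  interface ErrorReportingClient {
    void reportJobFailureReason(FailureReason reason);
  }

  // Scan the job's failure reasons and forward only normalization-caused ones to Sentry.
  static void reportNormalizationFailures(final List<FailureReason> failureReasons,
                                          final ErrorReportingClient client) {
    failureReasons.stream()
        .filter(reason -> reason.origin() == FailureOrigin.NORMALIZATION)
        .forEach(client::reportJobFailureReason);
  }

  public static void main(String[] args) {
    final ErrorReportingClient stdout =
        reason -> System.out.println("Reporting to Sentry: " + reason.internalMessage());
    reportNormalizationFailures(
        List.of(
            new FailureReason(FailureOrigin.NORMALIZATION, "Database Error in model users", "..."),
            new FailureReason(FailureOrigin.SOURCE, "connector crashed", "...")),
        stdout);
  }
}
```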

Added some notes next to each item in the recommended reading order below, describing the changes.

Notes

  • This doc details how we wanted to group dbt errors in Sentry. To do this, the SentryException is built by grabbing just the useful error info out of the stacktrace. The full stacktrace is still visible in Sentry under the Failure Reason.
  • Experimented with handling errors coming from multiple models, but ultimately couldn't find a consistent identifier in the dbt logs to separate these, so only the first error is used for grouping in Sentry. The rest of the errors are still present in the stacktrace for the event.
  • We need the normalization version to group properly and tie into Sentry releases, but we don't persist the normalization version outside of the code.
    • Decided to add it as a parameter to JobErrorReporter and then import NormalizationErrorFactory (1 of 2 sources of truth for this value, the other being the normalization dockerfile) into ServerApp & WorkerApp, where we build the JobErrorReporter (see the sketch after this list).
    • The tradeoff here is a less intrusive change than altering a db model and storing the normalization version there. This doesn't change where we update the normalization version (workflow stays consistent); however, it might not be a terrible idea in the future to persist the version of Normalization used in each attempt.
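For the version wiring specifically, the intent is just to hand the already-maintained constant to the reporter at construction time so events can be tagged with a normalization release. A hypothetical shape (not the actual ServerApp/WorkerApp code; names and values are illustrative):

```java
// Hypothetical wiring sketch: the normalization image/version constant that already
// lives in code is passed into the reporter when the app wires it up.
public class JobErrorReporterWiringSketch {

  static class JobErrorReporter {
    private final String normalizationImage;
    private final String normalizationVersion;

    JobErrorReporter(final String normalizationImage, final String normalizationVersion) {
      this.normalizationImage = normalizationImage;
      this.normalizationVersion = normalizationVersion;
    }

    // Used as the Sentry "release" so events group per normalization version.
    String normalizationRelease() {
      return normalizationImage + "@" + normalizationVersion;
    }
  }

  public static void main(String[] args) {
    // In the real apps these values would come from the existing source of truth in code.
    final JobErrorReporter reporter = new JobErrorReporter("airbyte/normalization", "0.2.10");
    System.out.println(reporter.normalizationRelease()); // airbyte/normalization@0.2.10
  }
}
```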

Recommended reading order

  1. DefaultNormalizationRunner - some changes to building dbt TRACE msg
  2. SentryExceptionHelper - building exception using just the useful error line (for grouping)
  3. JobErrorReporter - adding handling for normalization failures
  4. tests + minor changes

@Phlair (Contributor, Author) commented Aug 17, 2022

example of test error in Sentry:

[Screenshot: test error as it appears in Sentry, 2022-08-17 17:59:04]

[Screenshot: test error as it appears in Sentry, 2022-08-17 17:59:20]

@sherifnada (Contributor) left a comment:

In the second screenshot, I see the title as DbtDatabaseError -- is this the title we see in the list of incidents in Sentry's incident view? If so, could we change that to be more unique for each issue?

case "warn" -> logger.warn(logMsg);
case "error" -> logAndCollectErrorMessage(logMsg);
default -> logger.info(jsonLine.asText()); // this shouldn't happen but logging it to avoid hiding unexpected lines.
default -> logger.info(logMsg); // this shouldn't happen but logging it to avoid hiding unexpected lines.
Contributor:

this seems like it would render jsonLines differently than the current impl -- what's the upside to doing it this way?

@Phlair (Contributor, Author) replied:
Did have some wider changes in this part and missed changing this bit back; reverted.

// The relevant error message is often the line following the "Database Error..." line
// e.g. "Column 10 in UNION ALL has incompatible types: DATETIME, TIMESTAMP"
boolean nextLine = false;
for (String errorLine : streamFactory.getDbtErrors()) {
Contributor:

is there any way to DRY this with the method in SentryExceptionHelper?

@@ -166,4 +169,43 @@ private static Optional<List<SentryException>> buildJavaSentryExceptions(final S
return Optional.of(sentryExceptions);
}

private static Optional<List<SentryException>> buildNormalizationDbtSentryExceptions(final String stacktrace) {
Contributor:

How common is it that the line immediately following "Database Error" is the "interesting" one? E.g., is it 100% of the time, based on a reasonable empirical sample size?

@Phlair (Contributor, Author) replied:

So I was going for a classic 80-20 coverage move here, using the following data:

  • All normalization-system-errors (with the new failure reasons): 1,110 rows
  • Filtering for 'database errors': 736 rows (nominal 736/1110, 66%; unique 368/442, 83%)
  • 694/736 (~94%) follow the pattern where the line immediately following "Database Error in model" is the single useful part.

@Phlair (Contributor, Author) commented Aug 18, 2022:

Did some further investigation today, looking deeper at the complete set of failures we've encountered (from those Metabase links above), and decided it's not significant extra work to cover a lot of the possible error structures based on what we've seen so far:

The other 42 errors follow 1 of 3 patterns:

  • SQL compilation error:\n (next line with detail)
  • Invalid input (the detail lives on a later line of the form "context: {}")
  • syntax error at or near "{value}" (next line with detail)

Currently implementing specific logic for these three extra edge cases so we cover all currently known database errors. (No doubt there are other edge cases we haven't encountered yet, but we can build upon this parsing code over time.)

Also implementing logic for the rest of the error types (the other ~20% of unique errors we've seen):

  • Unhandled error
  • Compilation Error
  • Runtime Error

Basing the logic on our dataset in Metabase (links in the previous comment), this now covers >95% of every dbt error we've seen since implementing normalization failure reasons; the extraction approach is sketched below.
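To make that concrete, here's roughly what the extraction boils down to, simplified to just the "useful line follows a known prefix" patterns (illustrative prefixes and helper names, not the final SentryExceptionHelper code):

```java
import java.util.List;
import java.util.Optional;

public class DbtUsefulErrorLineSketch {

  // Error prefixes whose *following* non-empty line usually carries the useful detail.
  private static final List<String> NEXT_LINE_PREFIXES = List.of(
      "Database Error in model",
      "SQL compilation error:",
      "syntax error at or near");

  // Walks the dbt error lines and returns the first "useful" line, i.e. the line
  // immediately following a known error prefix. Empty if no known pattern matched,
  // in which case the caller falls back to grouping on the raw message.
  static Optional<String> extractUsefulError(final List<String> dbtErrorLines) {
    boolean takeNextLine = false;
    for (final String rawLine : dbtErrorLines) {
      final String line = rawLine.trim();
      if (takeNextLine && !line.isEmpty()) {
        return Optional.of(line);
      }
      if (NEXT_LINE_PREFIXES.stream().anyMatch(line::contains)) {
        takeNextLine = true;
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    final List<String> lines = List.of(
        "Database Error in model users (models/generated/airbyte_tables/users.sql)",
        "Column 10 in UNION ALL has incompatible types: DATETIME, TIMESTAMP");
    System.out.println(extractUsefulError(lines).orElse("<no known pattern>"));
  }
}
```

The "Invalid input" case needs a slightly different walk (its detail lives on a later "context: {}" line), so that one gets its own branch in the real parsing.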

@Phlair (Contributor, Author) commented Aug 18, 2022:

> is this the title we see in the list of incidents in sentry's incident view? if so could we change that to be more unique for each issue?

These changes will also address this point.

// previous line was "Database Error..." so this is our useful message line
if (nextLine) {
  usefulError = errorLine;
  break;
Contributor:

From reading this I think we're only building one SentryException from the stack trace, though I see the dbt stacks could have multiple errors (e.g. error 1 of 2 counts) - is there any value in building a SentryException for each of those?

@Phlair (Contributor, Author) commented Aug 18, 2022:

I played around with this but decided to scrap it because:

  • the dbt error logs are JSON formatted and do sometimes have an identifier (.node_info.unique_identifier); however, this isn't present in every error log line, so we can't rely on it for grouping (see the sketch after this comment).
  • we also can't rely on simple order (for example, look at the stacktrace in this test event). We could get clever and parse structures that are ordered, but this would still only cover the errors that are structured in a specific way, when unfortunately they're wildly variable (e.g. check out all these with different structures).

Side note: I will be pushing up some changes soon to cover grabbing a useful message from a wider array of error structures, but for now I'm going to just identify events based on the first error message for simplicity. Will create an issue for incremental improvement on this to handle multiple exceptions.
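For reference, this is the kind of identifier lookup I experimented with for the grouping question above; a toy Jackson sketch (field name as observed in the dbt logs, and only sometimes present, which is exactly the problem):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DbtLogIdentifierSketch {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Returns the model identifier from a structured dbt log line when present,
  // or "" when the field is missing, which is why it can't be relied on for grouping.
  static String modelIdentifier(final String jsonLogLine) throws Exception {
    final JsonNode logLine = MAPPER.readTree(jsonLogLine);
    return logLine.path("node_info").path("unique_identifier").asText("");
  }

  public static void main(String[] args) throws Exception {
    final String withId = "{\"node_info\":{\"unique_identifier\":\"model.airbyte.users\"},\"msg\":\"Database Error\"}";
    final String withoutId = "{\"msg\":\"Database Error\"}";
    System.out.println(modelIdentifier(withId));              // model.airbyte.users
    System.out.println(modelIdentifier(withoutId).isEmpty()); // true: nothing to group on
  }
}
```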

@pedroslopez (Contributor) left a comment:

This looks good, though seeing the stack trace parsing reminded me of the JSON structured logging format you had mentioned earlier on - mainly left a question of whether those logs could be used to do this a bit more cleanly.

@sherifnada (Contributor) commented:

@Phlair I won't be able to review tomorrow so feel free to move ahead with Pedro's review!

@Phlair Phlair merged commit ea44a0c into master Aug 25, 2022
@Phlair Phlair deleted the george/normalization-sentry branch August 25, 2022 10:44
rodireich pushed a commit that referenced this pull request Aug 25, 2022
* bulk

* simplification

* voila

* normalization version

* key prefix & pmd fix

* bits

* test fix

* handle more dbt error structures and DRY

* format

* better code comment

* enum for keys

* fix pmd

* I _love_ pmd
Labels: area/platform, area/scheduler, area/server, area/worker