Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with Sentry performance monitoring for BigQuery destination #9328

Closed
tuliren opened this issue Jan 6, 2022 · 10 comments · Fixed by #9745
Closed

Experiment with Sentry performance monitoring for BigQuery destination #9328

tuliren opened this issue Jan 6, 2022 · 10 comments · Fixed by #9745
Assignees
Labels
area/benchmark area/connectors Connector related issues area/databases priority/high High priority type/enhancement New feature or request

Comments

@tuliren
Copy link
Contributor

tuliren commented Jan 6, 2022

Acceptance criteria

  1. Introduce Sentry SDK to warehouse destination connector
    • Let's start with BigQuery destination.
    • It can either be a pure Java SDK, or Log4j SDK.
    • Compare the two different ways to install Sentry (or any other approach if there is more), and determine which way is better.
  2. Collect exceptions with the Sentry client.
    • If there is a way to automatically collect them all (e.g. through the Log4j SDK), let's try that.
    • Otherwise, let's try to collect one or two frequent exceptions, and see if it is useful.
      • We can systematically determine which exceptions to collect in the future.
  3. Add custom instrumentation to track the integration runner.
    • It is unclear what information Sentry will collect.
    • Determine whether this is useful.
  4. Add JDBC instrumentation to track the JDBC communication.
    • Again it is unclear what exactly will be sent to Sentry.
    • Determine whether this is helpful.
  5. Add documentation to Sentry configuration for Java-based connectors.

I have created a new Sentry project: test-java-connector. Feel free to use this one, or create new testing project if necessary.

@tuliren tuliren changed the title Experiment Sentry performance monitoring Experiment Sentry performance monitoring for database benchmark Jan 6, 2022
@tuliren tuliren changed the title Experiment Sentry performance monitoring for database benchmark Experiment Sentry performance monitoring for Java source database Jan 6, 2022
@tuliren tuliren changed the title Experiment Sentry performance monitoring for Java source database Experiment with Sentry performance monitoring for Java source database Jan 6, 2022
@alexandr-shegeda alexandr-shegeda moved this to Ready for implementation in GL Roadmap Jan 6, 2022
@sherifnada
Copy link
Contributor

@tuliren Could we actually switch this to be focused on destinations? With the recent timeline change to getting snowflake in GA, it would be very helpful to do this experiment for snowflake so we can start collecting metrics about it asap.

@tuliren tuliren changed the title Experiment with Sentry performance monitoring for Java source database Experiment with Sentry performance monitoring for Java destination Jan 7, 2022
@tuliren tuliren changed the title Experiment with Sentry performance monitoring for Java destination Experiment with Sentry performance monitoring for BigQuery destination Jan 7, 2022
@tuliren
Copy link
Contributor Author

tuliren commented Jan 7, 2022

I have updated this issue to start with BigQuery, instead of databases.

@tuliren tuliren added the priority/high High priority label Jan 7, 2022
@tuliren
Copy link
Contributor Author

tuliren commented Jan 7, 2022

@alexandr-shegeda, @oustynova, can this issue be prioritized starting next week (2022-01-10)? We just realized that it is more important than we previously thought. If this project cannot be prioritized any time soon, I can work on it.

@alexandertsukanov alexandertsukanov moved this from Ready for implementation to Implementation in progress in GL Roadmap Jan 11, 2022
@alexandertsukanov alexandertsukanov moved this from Implementation in progress to Internal review in GL Roadmap Jan 11, 2022
@alexandertsukanov alexandertsukanov moved this from Internal review to Implementation in progress in GL Roadmap Jan 11, 2022
@tuliren tuliren added this to the ConnCore Jan 19 milestone Jan 16, 2022
@tuliren
Copy link
Contributor Author

tuliren commented Jan 16, 2022

@alexandertsukanov, any update or ETA of this issue?

@alexandertsukanov
Copy link
Contributor

alexandertsukanov commented Jan 18, 2022

@tuliren Hi Liren!
I created the ticket to get the access for the cloud to test how logging works at the cloud https://github.com/airbytehq/airbyte-cloud/issues/830.
Also, I consider to use Log4j SDK for logging as I think it will allow us to make logging more flexible. In local dev environment I was successful with that.

However, I Have a couple of questions:

  • Regarding https://docs.sentry.io/platforms/java/performance/instrumentation/custom-instrumentation/. Do we know what part of Airbyte do we need to track with transactions? (Do you want to track performance of IntegrationRunner(destination).run(args) by its-self or something more deeper?)
  • Do we know what DSL for Sentry will be used in cloud?
  • You say in point airbytehq/airbyte-platform-internal#3: "Add JDBC instrumentation to track the JDBC communication". As far as I know BigQuery doesn't use JDBC underneath, correct me if I am wrong. So, Should I make PR only for BigQuery destination and BigQuery destination denormailzed in scope of this task?

Thanks.

@tuliren
Copy link
Contributor Author

tuliren commented Jan 19, 2022

I consider to use Log4j SDK for logging as I think it will allow us to make logging more flexible. In local dev environment I was successful with that.

Cool.

Regarding https://docs.sentry.io/platforms/java/performance/instrumentation/custom-instrumentation/. Do we know what part of Airbyte do we need to track with transactions? (Do you want to track performance of IntegrationRunner(destination).run(args) by its-self or something more deeper?)

You can start with instrumenting the IntegrationRunner#run method and see what it gives us.

Do we know what DSL for Sentry will be used in cloud?

What do you mean by DSL?

You say in point checkin catalogs #3: "Add JDBC instrumentation to track the JDBC communication". As far as I know BigQuery doesn't use JDBC underneath, correct me if I am wrong. So, Should I make PR only for BigQuery destination and BigQuery destination denormailzed in scope of this task?

Oh, right. My bad. Is it too late to switch to Snowflake destination? If not, the SnowflakeInsertDestination and SnowflakeInternalStagingDestination would be a good candidate. It's a JDBC destination. If it is too much trouble to switch, you can ignore this bullet point for now.

@alexandertsukanov
Copy link
Contributor

@tuliren
What do you mean by DSL?

Oh, right. My bad. Is it too late to switch to Snowflake destination? If not, the SnowflakeInsertDestination and SnowflakeInternalStagingDestination would be a good candidate. It's a JDBC destination. If it is too much trouble to switch, you can ignore this bullet point for now.,

  • I am finishing with BigQuery and BigQuery denormilized. I can whether to add integration with the Sentry for Snowflake in this PR (I don't think it will take much time) or in scope of other task.

Also, should we have some filter for sensitive information? (Some keywords we should ignore in logs, etc).

@alexandertsukanov
Copy link
Contributor

I believe we can should use Performance Metrics, at least in IntegrationRunner#run, as its provide us useful info about about Duration Breakdown here is an example:

image

You can give me your email I can provide you an access to my test-application, after you can see which info we have in Performance metrics. Thanks.

@tuliren
Copy link
Contributor Author

tuliren commented Jan 19, 2022

DSN

You can use yours for now. We will figure out these variables to inject into the cloud later.

I am finishing with BigQuery and BigQuery denormilized. I can whether to add integration with the Sentry for Snowflake in this PR (I don't think it will take much time) or in scope of other task.

Cool. It would be very helpful if you can add the integration for Snowflake as well.

should we have some filter for sensitive information?

Secrets should already be removed from the logs. There may be some edge cases. We can look into this later.

You can give me your email I can provide you an access to my test-application, after you can see which info we have in Performance metrics.

This looks nice. Thanks. I will DM you my email.

@alexandertsukanov alexandertsukanov moved this from Implementation in progress to Internal review in GL Roadmap Jan 24, 2022
@alexandertsukanov alexandertsukanov moved this from Internal review to Airbyte review in GL Roadmap Jan 26, 2022
@tuliren tuliren assigned tuliren and unassigned alexandertsukanov Jan 29, 2022
@tuliren
Copy link
Contributor Author

tuliren commented Jan 29, 2022

Current status

  • Sentry integration is working.
  • Using Sentry Log4j seems to mess up with routine logging. Let's use the more generic Sentry Java SDK for now.

Remaining TODOs

  • [P0] Debug Sentry exception emission
  • [P0] Add connector version as a tag.
  • [P1] Add more granular transaction to at least measure runtime of each command.
  • [P1] Add a release tag to Sentry.
  • [P1] Publish the release tag to Sentry in CI.
  • [P1] Move the environment variable to .env.
  • [P2] Set an environment to distinguish production from dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/benchmark area/connectors Connector related issues area/databases priority/high High priority type/enhancement New feature or request
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants