Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Disable Failing Connections feature #10877

Closed
wants to merge 39 commits into from

Conversation

terencecho
Copy link
Contributor

@terencecho terencecho commented Mar 5, 2022

What

Adds feature to automatically disable failing connections. This addresses issue 9715. When this feature is turned on, it will disable a connection after MAX_FAILURE_JOBS_IN_A_ROW consecutive failures (currently set to 100 jobs) or after MAX_DAYS_OF_STRAIGHT_FAILURE days of only failed jobs (currently set to 14 days).

todo:

  • Notify user when connection has been auto-disable. Will be followed up through this issue.
  • figure out why listJobStatusWithConnection is not filtering by timestamp in unit tests
  • move max job failure limits to config

How

When the feature flag is turned on, after a job failure, the DisableActivity will check if either of the above criteria has been met. If so, it will set the connection status to INACTIVE.

Recommended reading order

  1. airbyte-scheduler/persistence/src/main/java/io/airbyte/scheduler/persistence/DefaultJobPersistence.java
  2. airbyte-workers/src/main/java/io/airbyte/workers/temporal/scheduling/activities/AutoDisableConnectionActivityImpl.java
  3. the rest

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Mar 5, 2022
@terencecho terencecho requested a review from benmoriceau March 5, 2022 01:26
public class DisableActivityImpl implements DisableActivity {

@VisibleForTesting
public static final int MAX_FAILURE_JOBS_IN_A_ROW = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be coming from the .env file, which means that we need to update the Configs interface to add getter for those variables

public void disableConnection(final DisableActivityInput input) {
try {
// lists job in descending order by created_at
final List<Job> jobs = jobPersistence.listJobs(ConfigType.SYNC,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should include the reset connection type here. To be check with product.

@terencecho terencecho temporarily deployed to more-secrets March 7, 2022 22:11 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 7, 2022 22:11 Inactive
@terencecho
Copy link
Contributor Author

terencecho commented Mar 9, 2022

Not ready for review.
what still needs to be done:

  • figure out why listJobStatusWithConnection is not filtering by timestamp in unit tests
  • move max job failure limits to config

@terencecho terencecho temporarily deployed to more-secrets March 9, 2022 08:35 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 9, 2022 08:35 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:18 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:18 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:22 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:23 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:40 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:40 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:43 Inactive
@terencecho terencecho temporarily deployed to more-secrets March 10, 2022 22:43 Inactive
cgardens and others added 17 commits March 14, 2022 00:48
* Remove deprecated/unused enum values from json schema, migration to update records to use corrected values

* make migration-specific classes handle any string, and remove extranneous comments/annotations for readability. also test that an unrecognized enum value is left alone and doesn't cause deserialization errors

* gradle format

* fix test

* fix jobTrackerTest with new enum values
Co-authored-by: benmoriceau <benmoriceau@users.noreply.github.com>
…bucket (#10856)

* Add a test for listObjects permission to destination-s3 connector

* add testIAMUserHasListObjectPermission method to S3Destination
  and call this method from S3Destination::check. Method throws
  an exception if IAM user does not have listObjects permission
  on the destination bucket

* add a unit test to S3DestinationTest to verify that S3Destination::check
  fails if listObjects throws an exception

* add a unit test to S3DestinationTest to verify that S3Destination::check
  succeeds if listObjects succeeds

* Add S3DestinationConfigFactory in order to be able to mock S3 client
  used in S3Destination::check

* Addressing review comments:

 - separate positive and negative unit tests
 - fix formatting
 - reuse s3 client for both positive and negative tests

* Add information about PR #10856 to the changelog

* Prepare for publishing new version:
 * Bump version to 0.2.10 in Dockerfile
 * Bump version to 0.2.10 in changelog

* Update destination-s3 version in connector index

* Update seed spec for destination-s3 connector
Co-authored-by: benmoriceau <benmoriceau@users.noreply.github.com>
Add a metric to monitor the monitoring app's rate of publishing metrics. Though this isn't perfect, it gives us some insight into whether metric publishing is okay or running into issues.
* Set up script to remove old instances.

* Switch to push for testing.

* Debugging.

* Authenticate.

* Try again.

* Set region.

* Add JQ.

* Add iso check.

* Add curr time. Add JQ query.

* Correctly set env.

* Format command.

* Format command.

* Should be inverted.

* Only get running.

* Filter for instance name.

* Space instead of comma.

* Debug.

* Put back running instance filter.

* Undo changes.

* Test.

* Move comment.
@CLAassistant
Copy link

CLAassistant commented Mar 14, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added area/api Related to the api area/connectors Connector related issues area/documentation Improvements or additions to documentation area/frontend area/protocol area/server kubernetes normalization labels Mar 14, 2022
@terencecho terencecho temporarily deployed to more-secrets March 14, 2022 07:56 Inactive
@terencecho terencecho closed this Mar 14, 2022
@terencecho terencecho temporarily deployed to more-secrets March 14, 2022 07:56 Inactive
@terencecho
Copy link
Contributor Author

This PR has been moved to #11099.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Related to the api area/connectors Connector related issues area/documentation Improvements or additions to documentation area/platform issues related to the platform area/protocol area/scheduler area/server area/worker Related to worker kubernetes normalization
Projects
None yet
Development

Successfully merging this pull request may close these issues.