make flaky test less sensitive #19976

Merged · 3 commits from msiega/fix-flaky-test into master on Dec 1, 2022

Conversation

@mfsiega-airbyte (Contributor) commented Dec 1, 2022

What

Make flaky test less sensitive.

How

Don't insist that the sync complete in a single attempt. Previously, if the sync needed a retry for any reason, the test would fail. That check is too sensitive, and the resulting flakiness is interfering with the normal dev process.
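
In test terms, the relaxation looks roughly like this (a sketch only, reusing the helpers already present in the acceptance test; the actual diff is in the review below):

// Still required: the sync must eventually succeed, even if the worker was
// scaled down and back up while it was running. waitForSuccessfulJob throws
// if the job never reaches a successful state, which fails the test.
waitForSuccessfulJob(apiClient.getJobsApi(), connectionSyncRead.getJob());

// No longer required: succeeding within exactly one attempt. The assertion
// below is removed, so an internal retry no longer fails the whole test.
// assertEquals(1, numAttempts);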

@supertopher (Contributor) commented:

This looks great to me, but I'm pretty short on context for this part of the codebase, so I'm shying away from approval. Thanks for working on this so quickly.

Michael and I had a DM conversation about this. I think my team can do more to set expectations around responsibility. I'll establish a clearer DRI and a better process going forward. Thanks for helping us with the first run.

Comment on lines -133 to -139
final long numAttempts = apiClient.getJobsApi()
    .getJobInfo(new JobIdRequestBody().id(connectionSyncRead.getJob().getId()))
    .getAttempts()
    .size();

// it should be able to accomplish the resume without an additional attempt!
assertEquals(1, numAttempts);
@lmossman (Contributor) commented Dec 1, 2022

I think the reason this assertion was being made was that we explicitly wanted to test that if the worker is scaled down and back up in the middle of a sync (e.g. deploying a new change to cloud prod), then those syncs will NOT be failed but will instead continue on as normal and succeed.

If we get rid of this assertion, a regression where scaling down a worker in the middle of a sync actually causes that sync to fail could go unnoticed.

@mfsiega-airbyte Do we have a sense of why these sync attempts are failing in CI? I feel like we should be able to rely on the fact that our acceptance test sync attempts are succeeding, otherwise other tests will likely be breaking as well.

@mfsiega-airbyte (Contributor, Author) replied:

Here's my understanding; bear in mind that I don't actually have any context on this test beyond looking at it today.

This test (even after this PR) still validates that the sync succeeds even if the worker is scaled down and back up. Note that on the preceding line we call waitForSuccessfulJob; if the sync fails, waitForSuccessfulJob will eventually throw and the test will fail.

What the assertion deleted by this PR verifies is that the sync succeeds within a single attempt. Is there a reason it's important that it succeed within a single attempt? And in other places where our acceptance tests look for successful syncs, do they also verify that the sync succeeds within a single attempt?

If not, it's hard to see why we're verifying that here (except I guess that it's a cool implementation detail).
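
For anyone without context on the helper: conceptually, waitForSuccessfulJob polls the job until it reaches a terminal state and fails the test unless that state is succeeded. A rough sketch of that behavior follows (illustrative only; the Airbyte API client types JobsApi, JobRead, JobStatus, and JobIdRequestBody are assumed to be imported, and the real helper in the acceptance-test code may differ):

import java.time.Duration;
import java.time.Instant;
import java.util.Set;

import static org.junit.jupiter.api.Assertions.assertEquals;

// Conceptual sketch of the helper's behavior, not the real implementation.
class SyncWaitSketch {

  // Poll the job until it reaches a terminal state; fail unless that state is
  // SUCCEEDED. This is what makes the remaining check sufficient for
  // "the sync survives a worker scale-down/up".
  static void waitForSuccessfulJob(final JobsApi jobsApi, final JobRead originalJob) throws Exception {
    final Set<JobStatus> terminal = Set.of(JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.CANCELLED);
    final Instant deadline = Instant.now().plus(Duration.ofMinutes(10)); // illustrative timeout
    JobRead job = originalJob;
    while (!terminal.contains(job.getStatus())) {
      if (Instant.now().isAfter(deadline)) {
        throw new IllegalStateException("sync did not reach a terminal state in time: " + job);
      }
      Thread.sleep(Duration.ofSeconds(5).toMillis());
      // Re-fetch the job; the job id is stable across attempts even if the worker restarts.
      job = jobsApi.getJobInfo(new JobIdRequestBody().id(originalJob.getId())).getJob();
    }
    assertEquals(JobStatus.SUCCEEDED, job.getStatus(), "sync failed after the worker was scaled down and back up");
  }
}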

@lmossman (Contributor) replied:

That's a good point about our other acceptance tests.

I think this was trying to ensure that scaling down and up doesn't actually result in any "failures" at all, whether at the attempt or the overall sync level (i.e. it should be invisible to the user that the scaling down/up happened at all). But since our other tests are not ensuring this, it probably doesn't make sense to do that here. And as long as the overall sync succeeds, that's the more important part.

Given that, I think I'm fine with this change. But it might be worthwhile to separately look into why these attempts are failing, since I think that is still unexpected.

@mfsiega-airbyte (Contributor, Author) replied:

Agree on all of it. I see what you mean about scaling being invisible.

I'll go ahead and merge this, but I agree we should understand why some attempts are failing; ideally we'd then be able to re-enable these checks.

@mfsiega-airbyte merged commit b4a334e into master on Dec 1, 2022
@mfsiega-airbyte deleted the msiega/fix-flaky-test branch on December 1, 2022 20:55