KAFKA-14143: Exactly-once source connector system tests (KIP-618) #11783
C0urante merged 1 commit into apache:trunk
Conversation
Force-pushed 1bce968 to 7259d38.
Converting to draft until upstream PRs are reviewed.
Force-pushed c962806 to 1369ad7.
Force-pushed 503f389 to b1383e1.
Force-pushed 775bb0d to dd8135c.
Force-pushed ed2cb48 to 9b2e4fe.
tombentley left a comment
Thanks @C0urante! I left a few comments.
If src_seqno_max==0 then we've not really proven that EOS is actually able to make progress in the presence of worker restarts, right?
Yeah, that's correct. Added a check for that case.
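For illustration, such a guard in the ducktape test could look roughly like the following. This is a minimal sketch only: it assumes `src_seqno_max` holds the highest sequence number observed from the source connector's records, and the variable names are illustrative rather than the exact ones in the patch.

```python
# Hypothetical guard: fail the test outright if the source connector produced nothing,
# since a zero-progress run would let the exactly-once assertions pass vacuously.
assert src_seqno_max > 0, \
    "No records were produced by the source connector; " \
    "cannot verify that exactly-once delivery makes progress across worker restarts"
```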
I wonder if we really should restart the nodes in the same order each time.
For context, this follows the same logic in the test_bounces case. I guess we could do something a little less repetitive; I've offset the order by 1 with each successive restart.
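Rotating the bounce order by one node per round could be sketched as below. This is a hedged illustration, not the patch's actual code; `cc` stands in for the Connect service under test, and the `stop_node`/`start_node` helper names are assumptions.

```python
# Hypothetical sketch: shift the starting node by one on each successive round of bounces
# so that workers are not always restarted in the same order.
nodes = list(cc.nodes)
for round_num in range(num_rounds):
    offset = round_num % len(nodes)
    for node in nodes[offset:] + nodes[:offset]:
        cc.stop_node(node, clean_shutdown=clean)
        cc.start_node(node)
```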
Agreed that no data making it through the test is something to avoid, but I think always giving the consumer groups time to recover is a lenient way of achieving that. Perhaps the 2nd restart should not have the timeout, precisely to ensure that the messy case still doesn't cause problems.
This is also copied from the test_bounces case. It should probably be updated to refer not to consumer groups but to worker rebalances and task startups.
I guess if we want to be rigorous here, we could do some rolling bounces with this cushioning in place, and some without. Perhaps two rounds of each?
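A rough sketch of that "two rounds of each" idea is below. It is only an illustration under assumed names (`bounce_all_workers`, `STARTUP_CUSHION_SEC`, and the `cc` Connect service handle), not the helpers actually used in the test.

```python
import time

STARTUP_CUSHION_SEC = 30  # illustrative value, not the one used in the test

def bounce_all_workers(cc, clean, with_cushion):
    # Roll through every worker; optionally give rebalances and task startup time to settle.
    for node in cc.nodes:
        cc.stop_node(node, clean_shutdown=clean)
        cc.start_node(node)
        if with_cushion:
            time.sleep(STARTUP_CUSHION_SEC)

# Two "gentle" rounds with recovery time, then two "messy" rounds without it.
for with_cushion in (True, True, False, False):
    bounce_all_workers(cc, clean=False, with_cushion=with_cushion)
```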
Force-pushed 862206e to 3b41de7.
@C0urante, I'll take a look this week. Before that, I'd like to know: have you run these system tests locally?
I ran the tests locally when I first wrote them, repeatedly (I believe I ran everything in a loop overnight and made sure it all came out green). In the year since then, enough has changed (including getting a new laptop) that I'm no longer able to run them locally. Attempts to do so using Docker have led to some hung JVMs and appear to be due to environmental issues. If there's dedicated hardware out there to run these on, it'd be nice if we could leverage that for these tests. Otherwise, I can try to diagnose my local Docker issues and/or experiment with an alternative testing setup.
- Did we validate sink tasks in the test?
- Should we mention "exactly once" here?
- Good point, no, we do not.
- I believe it is already mentioned? I've cleaned up the description a bit but left in the "deliver messages exactly once" part.
Could you also help update the description in test_bounce? It tests not only clean bounces, right?
Force-pushed 5f72c1a to 259f747.
I've been able to get one green run of the test locally, but all other attempts have failed with timeouts, even when bumping the permitted duration for worker startup from one minute to five. I also fixed a typo that would have broken the
Yes, that's what I saw when running in my local env. I think we need to make sure it works well before we can merge it. @jsancio, this is the last PR for KIP-618. We'd like to put it into v3.3, but we need to make sure the newly added/updated system tests don't break anything. Could you help run them and confirm? Thanks.
I've started seeing this worker startup error in the logs for my local runs: I've seen it both with the new
Force-pushed 259f747 to cdebeac.
Ah, thanks for the update Luke! In the meantime, I've discovered and addressed a few more issues that surfaced during my local runs yesterday:
- Since follower workers don't retry when zombie fencing requests to the leader fail (which is intentional, as we want to be able to surface failures caused by things like insufficient ACLs to perform a round of fencing), it's possible that a task hosted on a follower may fail during startup if the leader has just been bounced and its worker process hasn't started yet. I've added a small step to restart any failed tasks after all the bounces have completed and before we check that the connector and its tasks are healthy (see the sketch below).
- Since the REST API is available before workers have actually completed startup, it's also possible for requests to fence zombies (and submit task configs) to reach the leader before it has been able to read a session key from the config topic. I've tweaked the herder logic to catch this case and return a 503 error with a user-friendly message. I experimented with other approaches, such as automatically refreshing the leader's view of the config topic in this case and/or handling request signature validation on the herder's tick thread (which would ensure that the worker had completed startup and read to the current end of the config topic), but the additional complexity didn't seem worth the benefits, since those options would still be incomplete for cases like the one described above.
- It's also possible that, when hard-bouncing a worker, a transaction opened by one of its tasks gets left hanging. If the task has begun to write offsets, startup for subsequent workers will be blocked on the expiration of that transaction, which by default takes 60 seconds. This can cause test failures because we usually wait 60 seconds for workers to complete startup. To address this, I've lowered the transaction timeout to 10 seconds. Ideally, we could proactively abort any open transactions left behind by prior task generations during zombie fencing, but it's probably too late to add that kind of logic in time for the 3.3.0 release. I've filed https://issues.apache.org/jira/browse/KAFKA-14091 to track this.
- There's also a possible NPE in

I've kicked off another local run of
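For reference, restarting any failed tasks through the Connect REST API after the bounces could be sketched like this. The `base_url` parameter, helper name, and use of the `requests` library are assumptions for illustration; only the task status and restart endpoints themselves are standard Connect REST API.

```python
import requests

def restart_failed_tasks(base_url, connector_name):
    # Look up the connector's task statuses and restart any task that failed,
    # e.g. because its zombie-fencing request hit a leader that was mid-bounce.
    status = requests.get(f"{base_url}/connectors/{connector_name}/status").json()
    for task in status.get("tasks", []):
        if task["state"] == "FAILED":
            task_id = task["id"]
            requests.post(f"{base_url}/connectors/{connector_name}/tasks/{task_id}/restart")
```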
Force-pushed cdebeac to 25d1679.
Out of the 100 runs, 94 passed. The 6 failures all appear to be environmental, as they were encountered while trying to allocate a console consumer at the end of the test with this stack trace:
@showuon any news? Hoping we can get this merged and backported soon so that the stability of 3.3 isn't impacted; let me know if there's anything I can do to help.
Force-pushed 25d1679 to 2f62b74.
@showuon @mimaison @tombentley any chance we could revisit this soon? I think we've probably missed the 3.3.0 boat, but it'd be nice to get this in just to have the testing framework available in case we need to do any debugging for KIP-618 after the release.
Still working on it. Hope we can get the results this week.
@showuon and @C0urante here is the system test job: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5109/ (not sure if you are authorized to see the job). @yashmayya also submitted this fix: #12575. Should we merge that and cherry-pick it to 3.3? At a glance it seems like a lower-risk change, and it fixes the failing system tests.
@jsancio We can't access the Jenkins URL you shared (
@mimaison Sounds good to me. I marked it as a 3.3.0 blocker, but I need help reviewing the PR. I'll post the job result here when it finishes. It usually takes a few hours to complete.
Thank you @jsancio! As long as the system tests pass, I'm good to merge it.
Yes, if the system tests pass, I'm in favor of merging this into 3.3.
@jsancio, do you have the test results for the system test? And they both passed.
@showuon I tried running the system tests twice in Confluent's infrastructure, and we ran into infrastructure issues in both cases. Feel free to merge this change and cherry-pick it to 3.3. Thanks! I started another build, but I would wait for its result if you want to merge this PR.
Let's wait for the result. Otherwise, this merge could also block the 3.3 release.
@jsancio looks like all the Connect tests were green, and the failures were unrelated. LGTY?
@C0urante sounds good. Do you want to merge it and cherry-pick it to 3.3? I can also merge it if you update the description with what you want me to write in the commit message.
I can handle the merge. Thanks @jsancio, @showuon, @mimaison, and @tombentley for the review!
Implements system tests for KIP-618.
Relies on changes from:
Also includes a minor quality-of-life improvement to clarify why some internal REST requests to workers may fail while that worker is still starting up.
Reviewers: Tom Bentley <tbentley@redhat.com>, Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, Mickael Maison <mickael.maison@gmail.com>