Skip to content

KAFKA-14535: Fix flaky EndToEndAuthorization tests which were sensitive to ACL change reordering#13086

Merged
omkreddy merged 1 commit intoapache:trunkfrom
gharris1727:kafka-14535-acl-test-flake
Jan 7, 2023
Merged

KAFKA-14535: Fix flaky EndToEndAuthorization tests which were sensitive to ACL change reordering#13086
omkreddy merged 1 commit intoapache:trunkfrom
gharris1727:kafka-14535-acl-test-flake

Conversation

@gharris1727
Copy link
Contributor

@gharris1727 gharris1727 commented Jan 6, 2023

The ACL change methods (create, delete) are eventually consistent across a Kafka cluster. As part of that, changes to the same resource made to different brokers may be reordered. In this test, a delete operation initiated in noConsumeWithoutDescribeAclSetup was being reordered to an unexpected time later in the test, causing an expected ACL to be missing.

To fix this, add a wait condition that asserts that the delete operations initiated in noConsumeWithoutDescribeAclSetup are completely applied before returning from the method.

Note: test failures were present for both the ZK and KRAFT iterations of the test, but the ZK iteration encountered it much more often, and was used to diagnose the reordering.

This is a typo in the original implementation of this method in https://github.com/apache/kafka/pull/1908/files#diff-c38532e2fcf5cce4cf30a824bc6bef9953c05d9472db51c3778a0286f6f01296R311 . And was made worse when the CREATE acl was added but never deleted: https://github.com/apache/kafka/pull/4795/files#diff-c38532e2fcf5cce4cf30a824bc6bef9953c05d9472db51c3778a0286f6f01296R293-R308 .
However, I don't think that this became flakey until #12843 which replaced the AclCommand with the AdminClient, and removed the synchronization to wait for the ACL operation to resolve before proceeding with the test.

This does not require a change to the ACL mechanisms, as I believe the reordering is an expected part of the consistency algorithm, which includes a CAS operation to ZK. This is also not a security risk, as this only affects successful ACL change operations, and should not allow unsuccessful ACL change operations to insert invalid ACLs.

Signed-off-by: Greg Harris greg.harris@aiven.io

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

…ve to ACL change reordering

The ACL change methods (create, delete) are eventually consistent across a Kafka cluster.
As part of that, changes to the same resource made to different brokers may be reordered.
In this test, a delete operation initiated in noConsumeWithoutDescribeAclSetup was being
reordered to an unexpected time later in the test, causing an expected ACL to be missing.

To fix this, add a wait condition that asserts that the delete operations initiated in
noConsumeWithoutDescribeAclSetup are completely applied before returning from the method.

Note: test failures were present for both the ZK and KRAFT iterations of the test, but
the ZK iteration encountered it much more often, and was used to diagnose the reordering.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727
Copy link
Contributor Author

The last people working in this area were @pprovenzano @omkreddy and @soarez, if any of you have some time i'd appreciate a review on this test which flakes ~10-20% of the time. Thanks!

Copy link
Contributor

@omkreddy omkreddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. LGTM

@omkreddy
Copy link
Contributor

omkreddy commented Jan 7, 2023

Test failures are unrelated, merging the PR

@omkreddy omkreddy merged commit d798ec7 into apache:trunk Jan 7, 2023
rajinisivaram added a commit to confluentinc/kafka that referenced this pull request Jan 9, 2023
…9-jan-2023

* apache/trunk: (16 commits)
  KAFKA-14570: Fix parenthesis in verifyFullFetchResponsePartitions output (apache#13072)
  MINOR: Remove public mutable fields from ProducerAppendInfo (apache#13091)
  KAFKA-14558: Move/Rewrite LastRecord, TxnMetadata, BatchMetadata, ProducerStateEntry, and ProducerAppendInfo to the storage module (apache#13043)
  KAFKA-14535: Fix flaky EndToEndAuthorization tests which were sensitive to ACL change reordering (apache#13086)
  KAFKA-9087 Replace log high watermark by future log high watermark wh… (apache#13075)
  MINOR: add error reason when controller failed to handle events (apache#13050)
  MINOR: doc: note how JDK-8136913 can affect client SASL (apache#13071)
  2023 (apache#13083)
  KAFKA-14279; Add 3.3.x to core compatibility tests (apache#13076)
  MINOR Fixed doc generation for LogConfig class in genTopicConfigDocs. (apache#13079)
  ...
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…ve to ACL change reordering (apache#13086)

The ACL change methods (create, delete) are eventually consistent across a Kafka cluster.
As part of that, changes to the same resource made to different brokers may be reordered.
In this test, a delete operation initiated in noConsumeWithoutDescribeAclSetup was being
reordered to an unexpected time later in the test, causing an expected ACL to be missing.

To fix this, add a wait condition that asserts that the delete operations initiated in
noConsumeWithoutDescribeAclSetup are completely applied before returning from the method.

Note: test failures were present for both the ZK and KRAFT iterations of the test, but
the ZK iteration encountered it much more often, and was used to diagnose the reordering.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants