
KAFKA-15119: Support incremental syncTopicAcls in MirrorSourceConnector #13913

Closed

hudeqi wants to merge 5 commits into apache:trunk from hudeqi:increment_sycAcl

Conversation

@hudeqi
Contributor

@hudeqi hudeqi commented Jun 25, 2023

Motivation

In the syncTopicAcls thread of MirrorSourceConnector, the full set of "TopicAclBindings" for the replicated topics of the source cluster is listed on every cycle and then applied in full to the target cluster. As a result, a large number of identical "TopicAclBindings" are repeatedly sent through "targetAdminClient", which is redundant. In addition, if too many "TopicAclBindings" are applied at once, the target cluster may take a long time to process the "createAcls" request; this backs up the target cluster's request queue and increases the processing latency of other request types.

Solution

Like the "knownConsumerGroups" variable in MirrorCheckpointConnector, the connector can cache the "TopicAclBindings" it has already synced and only send the newly added bindings on each cycle, which solves the problems described above.
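A minimal sketch of this caching scheme, with ACL bindings modeled as plain strings for illustration (the real connector works with Kafka's AclBinding objects; the class and method names here are assumptions, not the PR's code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: cache previously synced bindings and only
// forward the ones not seen on an earlier cycle.
public class IncrementalAclSync {
    private final Set<String> knownTopicAclBindings = new HashSet<>();

    // Returns only the bindings that were not sent on a previous cycle.
    public Set<String> filterNewBindings(List<String> listedBindings) {
        Set<String> toCreate = new HashSet<>(listedBindings);
        toCreate.removeAll(knownTopicAclBindings); // drop already-synced bindings
        knownTopicAclBindings.addAll(toCreate);    // remember what is about to be sent
        return toCreate;
    }
}
```

On the first cycle every listed binding is returned; on later cycles only bindings that are new relative to the cache are returned, which is what keeps repeated "createAcls" traffic off the target cluster.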

@hudeqi
Contributor Author

hudeqi commented Jun 25, 2023

@C0urante @mimaison Hi, please review this PR when you have time, thanks!

@hudeqi
Contributor Author

hudeqi commented Jun 25, 2023

If this improvement is reasonable, I will add related unit tests.

@hudeqi
Contributor Author

hudeqi commented Jun 29, 2023

Hi! Please review this PR when you are free, thank you! @C0urante

Contributor

@C0urante C0urante left a comment

Thanks @hudeqi! Looks good for the most part, left a few small comments.

Can we also add some unit testing coverage for this logic?

@hudeqi
Contributor Author

hudeqi commented Jul 3, 2023

Hi, I have updated the PR; the changes are shown in the unresolved conversations. @C0urante

@hudeqi
Contributor Author

hudeqi commented Jul 7, 2023

ping again @C0urante

Contributor

@C0urante C0urante left a comment

Thanks @hudeqi! Left a few nits on the testing logic, but also found one other possible issue in the main logic if an update fails to go through with our admin client.

@hudeqi
Contributor Author

hudeqi commented Jul 11, 2023

I have added a unit test for the "createAcl failure" case, thanks for the review! @C0urante

@hudeqi hudeqi requested a review from C0urante July 11, 2023 04:17
@hudeqi
Contributor Author

hudeqi commented Jul 11, 2023

I would like to ask a question by the way: why don't we synchronize the WRITE permission of TopicAclBinding and GroupAclBinding in MirrorSourceConnector? @C0urante @gharris1727

}
}));
knownTopicAclBindings = new HashSet<>(bindings);
knownTopicAclBindings.removeAll(failedBindings);
Contributor

Isn't failedBindings likely to be empty at this point unless the admin client is able to perform the request to create the ACL bindings exceptionally fast (which also happens to be the scenario we're covering in the unit test)?
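The timing concern can be reproduced with a plain CompletableFuture: the failure callback only runs when the future completes, so a removeAll issued right after submitting the request sees an empty failedBindings set. This is a hypothetical stand-alone sketch, not the PR's code:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the race: the "admin client" future fails
// asynchronously, after the caller has already pruned the cache.
public class AsyncRemovePitfall {
    public static int demo() {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        Set<String> failedBindings = Collections.synchronizedSet(new HashSet<>());

        // Stand-in for the createAcls future; it fails 50 ms later.
        CompletableFuture<Void> createAcls = new CompletableFuture<>();
        createAcls.whenComplete((v, t) -> {
            if (t != null) failedBindings.add("binding-1");
        });
        pool.schedule(() -> createAcls.completeExceptionally(new RuntimeException("boom")),
                      50, TimeUnit.MILLISECONDS);

        Set<String> known = new HashSet<>(Set.of("binding-1", "binding-2"));
        known.removeAll(failedBindings); // callback has not fired yet: removes nothing
        pool.shutdown();
        return known.size();             // the failed binding stays cached
    }
}
```

A fix along the lines the review suggests would be to wait for the createAcls request futures to complete before pruning the cache.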

@hudeqi hudeqi requested a review from C0urante July 12, 2023 08:30
Contributor

@C0urante C0urante left a comment

Hmmm... the more I think about it, the more hesitant I am to include this change.

There are some issues with the proposed implementation, but even if those are addressed (and I believe they can be), I don't know if it's safe at all to cache our view of the target cluster's ACLs. The current behavior is responsible not only for creating initial ACL bindings, but also for continually re-applying them if they are changed. I'm worried that we may break existing users' setups if we change this logic.

Users can also increase the value for the sync.topic.acls.interval.seconds property if the existing ACL syncing logic is placing too high a workload on their Kafka cluster.
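For reference, that interval is a connector-level setting; a hedged example of what this could look like (the value is illustrative, not a recommendation):

```properties
# MM2 connector configuration (illustrative value)
sync.topic.acls.enabled = true
# stretch the ACL sync cycle to reduce load on the target cluster
sync.topic.acls.interval.seconds = 3600
```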

Separately, to answer your question about total mirroring of ACL bindings, replicated topics were originally intended to be read-only for everything except MM2 (see the relevant section in KIP-382). There's a Jira ticket to add support for mirroring of more than just READ ACLs, but it's currently unassigned and will require a KIP to address.

@hudeqi
Contributor Author

hudeqi commented Jul 14, 2023

Special thanks for your separate reply! @C0urante I have followed up in the Jira ticket.

Going back to this PR, please correct me if I'm wrong, thanks! Although it is named incremental ACL synchronization, it does not change the existing behavior. The reason: knownTopicAclBindings holds every ACL that MirrorSourceConnector has observed for the replicated topics of the source cluster and has already synced to the target cluster. On initial startup, all relevant ACLs are synced; after that, whenever a source cluster ACL changes, it is synced to the target cluster again as a new ACL (so what is cached is not a view of the target cluster's ACLs, but a view of the source cluster's ACLs). As for your point that "the current behavior is responsible not only for creating initial ACL bindings, but also for continually re-applying them if they are changed": that logic is unchanged, because the AclBinding class implements equals layer by layer, so the resync of a changed AclBinding should not be missed.

In addition, the effect I mentioned on "the accumulation of the request queue of the target cluster" and "the processing delay of other request types" is not groundless: it is a serious problem we found in production. The screenshots below show how much full versus incremental synchronization matters for producer latency on the target cluster:
[Screenshots: target cluster producer latency under full vs. incremental ACL synchronization]

@C0urante
Contributor

@hudeqi Sorry, I think there's a misunderstanding here. I'm not claiming that MM2 would be incapable of detecting changes in source cluster ACLs with this change; I'm worried that it would be unable to detect (or really, just overwrite) changes in target cluster ACLs, if they were made by a separate user/process from the MM2 cluster.

@hudeqi
Contributor Author

hudeqi commented Jul 17, 2023

@hudeqi Sorry, I think there's a misunderstanding here. I'm not claiming that MM2 would be incapable of detecting changes in source cluster ACLs with this change; I'm worried that it would be unable to detect (or really, just overwrite) changes in target cluster ACLs, if they were made by a separate user/process from the MM2 cluster.

Yes, the logic does change before and after this patch (to confirm there is no misunderstanding: if an ACL is updated on the target cluster but not on the source cluster, the source cluster's ACLs can no longer overwrite the target's). But looking at this document, it seems the goal of this ACL synchronization is only to propagate changes in the source cluster's ACLs? I don't know if I understand it right.

@gharris1727
Contributor

The original MM2 KIPs use very few words to describe very large parts of its functionality, often leaving things very under-specified, which I think is the case here. I don't think that the original proposal gives us enough to decide for or against this change.

Personally, I think that if a user can get themselves into a situation where they:

  1. Have ACL sync enabled
  2. Can externally observe some difference between the source and target
  3. The difference has existed for (2x) longer than the sync interval

They are reasonable to conclude that the system is misbehaving, either because the source or target system is unhealthy, or MM2 is unhealthy, or MM2 has a bug in it. If caching causes the above situation to occur, I don't think that caching is a viable solution.

I'd be interested in trying other strategies such as:

  1. Intentionally un-batching these requests so as to spread them evenly across the poll interval
  2. Performing a target read-before-write to replace (potentially expensive?) write calls with read calls
  3. Waiting for previous requests to finish before initiating subsequent ones
  4. Exponentially backing off after failures

@hudeqi In your environment, are you noticing the load on the source system from the ACL reads? Do you have more MM2s connected to the target cluster or the source cluster? I'm wondering if (2) would actually be helpful, or if reads and writes have approximately the same cost.
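Strategy (2), read-before-write, could look roughly like this: list the target's current bindings first and only create the ones that are missing. Plain Sets stand in for the admin client's describeAcls/createAcls calls, and all names here are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative read-before-write sketch: diff the source's desired
// bindings against what the target currently reports.
public class ReadBeforeWrite {
    // Returns only the bindings present on the source but absent from
    // the target; only these would be written back.
    public static Set<String> bindingsToCreate(Set<String> sourceBindings,
                                               Set<String> targetBindings) {
        Set<String> missing = new HashSet<>(sourceBindings); // start from the source view
        missing.removeAll(targetBindings);                   // drop what the target already has
        return missing;
    }
}
```

Unlike the caching approach, this consults the live target state on every cycle, so ACL changes made on the target by another user or process would still be detected and re-applied.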

@hudeqi hudeqi marked this pull request as draft September 7, 2023 02:21
@github-actions

github-actions bot commented Dec 6, 2023

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Dec 6, 2023
@hudeqi hudeqi closed this Apr 18, 2024