
KAFKA-15119: Support incremental syncTopicAcls in MirrorSourceConnector #13913

Closed

hudeqi wants to merge 5 commits into apache:trunk from hudeqi:increment_sycAcl

Conversation

@hudeqi
Contributor

@hudeqi hudeqi commented Jun 25, 2023

Motivation

In the syncTopicAcls thread of MirrorSourceConnector, the full set of "TopicAclBindings" for the replicated topics of the source cluster is listed on every cycle and then applied in full to the target cluster. As a result, a large number of identical "TopicAclBindings" are repeatedly sent through "targetAdminClient", which is redundant. In addition, if too many "TopicAclBindings" are applied at once, the target cluster may take a long time to process the "createAcls" request; this backs up the target cluster's request queue and increases the processing latency of other request types.

Solution

Like the "knownConsumerGroups" variable in MirrorCheckpointConnector, the connector can cache the "TopicAclBindings" it has already synced and only send the newly added bindings on each cycle, which solves the problems described above.
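A minimal sketch of this caching scheme, with ACL bindings modeled as plain strings for illustration (the real connector works with Kafka's AclBinding objects; the class and method names here are assumptions, not the PR's code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: cache previously synced bindings and only
// forward the ones not seen on an earlier cycle.
public class IncrementalAclSync {
    private final Set<String> knownTopicAclBindings = new HashSet<>();

    // Returns only the bindings that were not sent on a previous cycle.
    public Set<String> filterNewBindings(List<String> listedBindings) {
        Set<String> toCreate = new HashSet<>(listedBindings);
        toCreate.removeAll(knownTopicAclBindings); // drop already-synced bindings
        knownTopicAclBindings.addAll(toCreate);    // remember what is about to be sent
        return toCreate;
    }
}
```

On the first cycle every listed binding is returned; on later cycles only bindings that are new relative to the cache are returned, which is what keeps repeated "createAcls" traffic off the target cluster.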

@hudeqi
Contributor Author

hudeqi commented Jun 25, 2023

@C0urante @mimaison Hi, please review this PR when you have time, thanks!

@hudeqi
Contributor Author

hudeqi commented Jun 25, 2023

If this improvement is reasonable, I will add related unit tests.

@hudeqi
Contributor Author

hudeqi commented Jun 29, 2023

Hi! Please review this PR when you are free, thank you! @C0urante

Contributor

@C0urante C0urante left a comment

Thanks @hudeqi! Looks good for the most part, left a few small comments.

Can we also add some unit testing coverage for this logic?

@hudeqi
Contributor Author

hudeqi commented Jul 3, 2023

Hi, I have updated the PR; the changes are shown in the unresolved conversations. @C0urante

@hudeqi
Contributor Author

hudeqi commented Jul 7, 2023

ping again @C0urante

Contributor

@C0urante C0urante left a comment

Thanks @hudeqi! Left a few nits on the testing logic, but also found one other possible issue in the main logic if an update fails to go through with our admin client.

@hudeqi
Contributor Author

hudeqi commented Jul 11, 2023

I have added a unit test for the "createAcl failure" case, thanks for the review! @C0urante

@hudeqi hudeqi requested a review from C0urante July 11, 2023 04:17
@hudeqi
Contributor Author

hudeqi commented Jul 11, 2023

I would like to ask a question by the way: why don't we synchronize the WRITE permission of TopicAclBinding and GroupAclBinding in MirrorSourceConnector? @C0urante @gharris1727

}
}));
knownTopicAclBindings = new HashSet<>(bindings);
knownTopicAclBindings.removeAll(failedBindings);
Contributor

Isn't failedBindings likely to be empty at this point unless the admin client is able to perform the request to create the ACL bindings exceptionally fast (which also happens to be the scenario we're covering in the unit test)?
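The timing concern can be reproduced with a plain CompletableFuture: the failure callback only runs when the future completes, so a removeAll issued right after submitting the request sees an empty failedBindings set. This is a hypothetical stand-alone sketch, not the PR's code:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the race: the "admin client" future fails
// asynchronously, after the caller has already pruned the cache.
public class AsyncRemovePitfall {
    public static int demo() {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        Set<String> failedBindings = Collections.synchronizedSet(new HashSet<>());

        // Stand-in for the createAcls future; it fails 50 ms later.
        CompletableFuture<Void> createAcls = new CompletableFuture<>();
        createAcls.whenComplete((v, t) -> {
            if (t != null) failedBindings.add("binding-1");
        });
        pool.schedule(() -> createAcls.completeExceptionally(new RuntimeException("boom")),
                      50, TimeUnit.MILLISECONDS);

        Set<String> known = new HashSet<>(Set.of("binding-1", "binding-2"));
        known.removeAll(failedBindings); // callback has not fired yet: removes nothing
        pool.shutdown();
        return known.size();             // the failed binding stays cached
    }
}
```

A fix along the lines the review suggests would be to wait for the createAcls request futures to complete before pruning the cache.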

@hudeqi hudeqi requested a review from C0urante July 12, 2023 08:30
Contributor

@C0urante C0urante left a comment

Hmmm... the more I think about it, the more hesitant I am to include this change.

There are some issues with the proposed implementation, but even if those are addressed (and I believe they can be), I don't know if it's safe at all to cache our view of the target cluster's ACLs. The current behavior is responsible not only for creating initial ACL bindings, but also for continually re-applying them if they are changed. I'm worried that we may break existing users' setups if we change this logic.

Users can also increase the value for the sync.topic.acls.interval.seconds property if the existing ACL syncing logic is placing too high a workload on their Kafka cluster.
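For reference, that interval is a connector-level setting; a hedged example of what this could look like (the value is illustrative, not a recommendation):

```properties
# MM2 connector configuration (illustrative value)
sync.topic.acls.enabled = true
# stretch the ACL sync cycle to reduce load on the target cluster
sync.topic.acls.interval.seconds = 3600
```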

Separately, to answer your question about total mirroring of ACL bindings, replicated topics were originally intended to be read-only for everything except MM2 (see the relevant section in KIP-382). There's a Jira ticket to add support for mirroring of more than just READ ACLs, but it's currently unassigned and will require a KIP to address.

@hudeqi
Contributor Author

hudeqi commented Jul 14, 2023

Special thanks for your separate reply! @C0urante I have followed up in the Jira ticket.

Going back to this PR, please correct me if I'm wrong, thanks! Although it is named incremental ACL synchronization, it does not change the existing behavior. The reason: knownTopicAclBindings holds every ACL that MirrorSourceConnector has observed for the replicated topics of the source cluster and has already synced to the target cluster. On initial startup, all relevant ACLs are synced; after that, whenever a source cluster ACL changes, it is synced to the target cluster again as a new ACL (so what is cached is not a view of the target cluster's ACLs, but a view of the source cluster's ACLs). As for your point that "the current behavior is responsible not only for creating initial ACL bindings, but also for continually re-applying them if they are changed": that logic is unchanged, because the AclBinding class implements equals layer by layer, so the resync of a changed AclBinding should not be missed.

In addition, the effect I mentioned on "the accumulation of the request queue of the target cluster" and "the processing delay of other request types" is not groundless: it is a serious problem we found in production. The screenshots below show how much full versus incremental synchronization matters for producer latency on the target cluster:
[Screenshots: target cluster producer latency under full vs. incremental ACL synchronization]

@C0urante
Contributor

@hudeqi Sorry, I think there's a misunderstanding here. I'm not claiming that MM2 would be incapable of detecting changes in source cluster ACLs with this change; I'm worried that it would be unable to detect (or really, just overwrite) changes in target cluster ACLs, if they were made by a separate user/process from the MM2 cluster.

@hudeqi
Contributor Author

hudeqi commented Jul 17, 2023

@hudeqi Sorry, I think there's a misunderstanding here. I'm not claiming that MM2 would be incapable of detecting changes in source cluster ACLs with this change; I'm worried that it would be unable to detect (or really, just overwrite) changes in target cluster ACLs, if they were made by a separate user/process from the MM2 cluster.

Yes, the logic does change before and after this patch (to confirm there is no misunderstanding: if an ACL is updated on the target cluster but not on the source cluster, the source cluster's ACLs can no longer overwrite the target's). But looking at this document, it seems the goal of this ACL synchronization is only to propagate changes in the source cluster's ACLs? I don't know if I understand it right.

@gharris1727
Contributor

The original MM2 KIPs use very few words to describe very large parts of its functionality, often leaving things very under-specified, which I think is the case here. I don't think that the original proposal gives us enough to decide for or against this change.

Personally, I think that if a user can get themselves into a situation where they:

  1. Have ACL sync enabled
  2. Can externally observe some difference between the source and target
  3. The difference has existed for (2x) longer than the sync interval

They are reasonable to conclude that the system is misbehaving, either because the source or target system is unhealthy, or MM2 is unhealthy, or MM2 has a bug in it. If caching causes the above situation to occur, I don't think that caching is a viable solution.

I'd be interested in trying other strategies such as:

  1. Intentionally un-batching these requests so as to spread them evenly across the poll interval
  2. Performing a target read-before-write to replace (potentially expensive?) write calls with read calls
  3. Waiting for previous requests to finish before initiating subsequent ones
  4. Exponentially backing off after failures

@hudeqi In your environment, are you noticing the load on the source system from the ACL reads? Do you have more MM2s connected to the target cluster or the source cluster? I'm wondering if (2) would actually be helpful, or if reads and writes have approximately the same cost.
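Strategy (2), read-before-write, could look roughly like this: list the target's current bindings first and only create the ones that are missing. Plain Sets stand in for the admin client's describeAcls/createAcls calls, and all names here are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative read-before-write sketch: diff the source's desired
// bindings against what the target currently reports.
public class ReadBeforeWrite {
    // Returns only the bindings present on the source but absent from
    // the target; only these would be written back.
    public static Set<String> bindingsToCreate(Set<String> sourceBindings,
                                               Set<String> targetBindings) {
        Set<String> missing = new HashSet<>(sourceBindings); // start from the source view
        missing.removeAll(targetBindings);                   // drop what the target already has
        return missing;
    }
}
```

Unlike the caching approach, this consults the live target state on every cycle, so ACL changes made on the target by another user or process would still be detected and re-applied.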

@hudeqi hudeqi marked this pull request as draft September 7, 2023 02:21
@github-actions

github-actions bot commented Dec 6, 2023

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Dec 6, 2023
@hudeqi hudeqi closed this Apr 18, 2024