-
Notifications
You must be signed in to change notification settings - Fork 400
Enable leader election on controller when scaled #1146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…oller Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements leader election for the kagent controller when running with multiple replicas to ensure only one instance actively reconciles resources. The changes include creating necessary RBAC resources, adding a helper function to determine when leader election is needed, and configuring the controller with the appropriate environment variable.
- Adds RBAC Role and RoleBinding for leader election with permissions for coordination.k8s.io leases and events
- Introduces
kagent.leaderElectionEnabledhelper that evaluates to true when controller replicas > 1 - Sets
LEADER_ELECTenvironment variable in the controller ConfigMap based on replica count
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| helm/kagent/templates/rbac/leader-election-role.yaml | New Role granting permissions for lease coordination and event creation required for leader election |
| helm/kagent/templates/rbac/leader-election-rolebinding.yaml | New RoleBinding connecting the leader election Role to the controller ServiceAccount |
| helm/kagent/templates/controller-configmap.yaml | Adds LEADER_ELECT environment variable to enable/disable leader election based on replica count |
| helm/kagent/templates/_helpers.tpl | Adds helper function to determine if leader election should be enabled (replicas > 1) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| {{- if eq (include "kagent.leaderElectionEnabled" .) "true" }} | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: RoleBinding | ||
| metadata: | ||
| name: {{ include "kagent.fullname" . }}-leader-election-rolebinding | ||
| namespace: {{ include "kagent.namespace" . }} | ||
| labels: | ||
| {{- include "kagent.controller.labels" . | nindent 4 }} | ||
| roleRef: | ||
| apiGroup: rbac.authorization.k8s.io | ||
| kind: Role | ||
| name: {{ include "kagent.fullname" . }}-leader-election-role | ||
| subjects: | ||
| - kind: ServiceAccount | ||
| name: {{ include "kagent.fullname" . }}-controller | ||
| namespace: {{ include "kagent.namespace" . }} | ||
| {{- end }} No newline at end of file |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new leader election RBAC resources (Role and RoleBinding) lack test coverage. Consider adding tests to helm/kagent/tests/rbac_test.yaml to verify:
- The Role and RoleBinding are created when
controller.replicas > 1 - The resources are NOT created when
controller.replicas == 1 - The Role has correct permissions for leases and events
- The RoleBinding references the correct ServiceAccount and Role
| IMAGE_REGISTRY: {{ .Values.controller.agentImage.registry | default .Values.registry | quote }} | ||
| IMAGE_REPOSITORY: {{ .Values.controller.agentImage.repository | quote }} | ||
| IMAGE_TAG: {{ coalesce .Values.controller.agentImage.tag .Values.tag .Chart.Version | quote }} | ||
| LEADER_ELECT: {{ include "kagent.leaderElectionEnabled" . | quote }} |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LEADER_ELECT environment variable and the kagent.leaderElectionEnabled helper function lack test coverage. Consider adding tests to helm/kagent/tests/controller-deployment_test.yaml to verify:
LEADER_ELECTis set to "true" whencontroller.replicas > 1LEADER_ELECTis set to "false" whencontroller.replicas == 1
| IMAGE_REGISTRY: {{ .Values.controller.agentImage.registry | default .Values.registry | quote }} | ||
| IMAGE_REPOSITORY: {{ .Values.controller.agentImage.repository | quote }} | ||
| IMAGE_TAG: {{ coalesce .Values.controller.agentImage.tag .Values.tag .Chart.Version | quote }} | ||
| LEADER_ELECT: {{ include "kagent.leaderElectionEnabled" . | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: LEADER_ELECTION_ENABLED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to match the flag in the controller https://github.com/kagent-dev/kagent/blob/main/go%2Fpkg%2Fapp%2Fapp.go#L127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, maybe at some point in the future we can clarify
| {{/* | ||
| Check if leader election should be enabled (more than 1 replica) | ||
| */}} | ||
| {{- define "kagent.leaderElectionEnabled" -}} | ||
| {{- gt (.Values.controller.replicas | int) 1 -}} | ||
| {{- end -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on replicas feels potentially a tad dangerous to me. In the future we could potentially use an HPA or something like that. Maybe let's just set it to true by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how dangerous having this configured based on replicas really is... And, I suspect the underlying concern/issue may lie in the fact that the controller is currently responsible for both resource reconiliation and serving the API.
From a controller perspective, leader election is only needed when there's more than 1 controller pod - to prevent duplicate reconiliation of resources. As for HPA, controllers are never1 scaled based on HPA.
However, in the case of kagent I acknowledge that that might not hold true given that the API is exposed via the controller. I would hope however that if it gets to the point of really needing to scale out/in the API, we'd consider splitting this out into it's own deployment instead.2
Maybe let's just set it to true by default?
Totally OK with adding this as an explicit configuration that defaults to true if that's what you prefer. But I would also then add another error that checks that someone doesn't disable this when replicas > 1.
Also, just so you are aware, enabling this with only 1 replica will add a short-delay when rolling out updates since the new pod will only become the leader after a short hand-off delay. WDYT?
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I buy all of this, the reality is that the A2A gateway should probably be decoupled from the controller at some point, so this is a good first pass :)
All controllers require leader election (NeedLeaderElection: true) but PR kagent-dev#1146 conditionally enabled it only when replicas > 1. This causes a deadlock with 1 replica (the default) where controllers wait for leader election that never happens. Leader election works correctly with 1 replica (immediately becomes leader) and should always be enabled since all controllers require it. Fixes reconciliation deadlock when running with single controller replica. Signed-off-by: Julien Dubeau <julien.dubeau@amplitude.com> Signed-off-by: Julien Dubeau <julien@amplitude.com>
All controllers explicitly require leader election (NeedLeaderElection: true): - remote_mcp_server_controller.go:56 - agent_controller.go:71 - modelconfig_controller.go:64 - mcp_server_tool_controller.go:64 - service_controller.go:51 However, PR kagent-dev#1146 made leader election conditional (enabled only when replicas > 1). When running with 1 replica (the default), controllers wait for leader election that never happens, causing a reconciliation deadlock where no resources are processed. This fix removes the conditional helper and always enables leader election: - Deleted kagent.leaderElectionEnabled helper from _helpers.tpl - Hardcoded LEADER_ELECT="true" in controller-configmap.yaml - Removed conditional wrappers from leader-election RBAC resources Leader election works correctly with 1 replica (pod immediately becomes leader) and is required for controllers to function. Making it unconditional clarifies that it is not an optional feature but a requirement. Fixes reconciliation deadlock when running with single controller replica. Signed-off-by: Julien Dubeau <julien.dubeau@amplitude.com> Signed-off-by: Julien Dubeau <julien@amplitude.com>
This PR enables leader election on the controller if it is configured with one than 1 replica to ensure that only 1 replica is actively reconciling watched manifests. It also ensures that the necessary RBAC manifests are created. Final part of kagent-dev#1133 (excluding kagent-dev#1138). --------- Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com> Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
This PR enables leader election on the controller if it is configured with one than 1 replica to ensure that only 1 replica is actively reconciling watched manifests. It also ensures that the necessary RBAC manifests are created.
Final part of #1133 (excluding #1138).