Conversation

@onematchfox
Contributor

This PR enables leader election on the controller if it is configured with more than 1 replica, ensuring that only 1 replica actively reconciles watched manifests. It also ensures that the necessary RBAC manifests are created.

Final part of #1133 (excluding #1138).

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR implements leader election for the kagent controller when running with multiple replicas to ensure only one instance actively reconciles resources. The changes include creating necessary RBAC resources, adding a helper function to determine when leader election is needed, and configuring the controller with the appropriate environment variable.

  • Adds RBAC Role and RoleBinding for leader election with permissions for coordination.k8s.io leases and events
  • Introduces kagent.leaderElectionEnabled helper that evaluates to true when controller replicas > 1
  • Sets LEADER_ELECT environment variable in the controller ConfigMap based on replica count

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Files changed:

  • helm/kagent/templates/rbac/leader-election-role.yaml: New Role granting permissions for lease coordination and event creation required for leader election
  • helm/kagent/templates/rbac/leader-election-rolebinding.yaml: New RoleBinding connecting the leader election Role to the controller ServiceAccount
  • helm/kagent/templates/controller-configmap.yaml: Adds LEADER_ELECT environment variable to enable/disable leader election based on replica count
  • helm/kagent/templates/_helpers.tpl: Adds helper function to determine if leader election should be enabled (replicas > 1)


Comment on lines +1 to +17
{{- if eq (include "kagent.leaderElectionEnabled" .) "true" }}
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "kagent.fullname" . }}-leader-election-rolebinding
namespace: {{ include "kagent.namespace" . }}
labels:
{{- include "kagent.controller.labels" . | nindent 4 }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: {{ include "kagent.fullname" . }}-leader-election-role
subjects:
- kind: ServiceAccount
name: {{ include "kagent.fullname" . }}-controller
namespace: {{ include "kagent.namespace" . }}
{{- end }} No newline at end of file
Copilot AI Dec 2, 2025

The new leader election RBAC resources (Role and RoleBinding) lack test coverage. Consider adding tests to helm/kagent/tests/rbac_test.yaml to verify:

  • The Role and RoleBinding are created when controller.replicas > 1
  • The resources are NOT created when controller.replicas == 1
  • The Role has correct permissions for leases and events
  • The RoleBinding references the correct ServiceAccount and Role
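
If the chart uses the helm-unittest plugin (which the suggested tests/rbac_test.yaml path implies), that coverage might look roughly like the following sketch; suite and test names are illustrative, not part of this PR:

```yaml
# Hypothetical additions to helm/kagent/tests/rbac_test.yaml (helm-unittest syntax)
suite: leader election rbac
templates:
  - templates/rbac/leader-election-role.yaml
  - templates/rbac/leader-election-rolebinding.yaml
tests:
  - it: creates leader election RBAC when replicas > 1
    set:
      controller.replicas: 2
    asserts:
      # each listed template should render exactly one document
      - hasDocuments:
          count: 1
  - it: omits leader election RBAC when replicas == 1
    set:
      controller.replicas: 1
    asserts:
      - hasDocuments:
          count: 0
```

Assertions on the Role's rules and the RoleBinding's subjects could be added with helm-unittest's `equal`/`contains` checks in the same style.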

IMAGE_REGISTRY: {{ .Values.controller.agentImage.registry | default .Values.registry | quote }}
IMAGE_REPOSITORY: {{ .Values.controller.agentImage.repository | quote }}
IMAGE_TAG: {{ coalesce .Values.controller.agentImage.tag .Values.tag .Chart.Version | quote }}
LEADER_ELECT: {{ include "kagent.leaderElectionEnabled" . | quote }}
Copilot AI Dec 2, 2025

The LEADER_ELECT environment variable and the kagent.leaderElectionEnabled helper function lack test coverage. Consider adding tests to helm/kagent/tests/controller-deployment_test.yaml to verify:

  • LEADER_ELECT is set to "true" when controller.replicas > 1
  • LEADER_ELECT is set to "false" when controller.replicas == 1
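
Assuming the helm-unittest plugin, a sketch of those two cases might look like this (the env var lives in the ConfigMap template, so the tests target that file; names are illustrative):

```yaml
# Hypothetical helm-unittest cases for the LEADER_ELECT ConfigMap entry
tests:
  - it: sets LEADER_ELECT to "true" when replicas > 1
    template: templates/controller-configmap.yaml
    set:
      controller.replicas: 3
    asserts:
      - equal:
          path: data.LEADER_ELECT
          value: "true"
  - it: sets LEADER_ELECT to "false" when replicas == 1
    template: templates/controller-configmap.yaml
    set:
      controller.replicas: 1
    asserts:
      - equal:
          path: data.LEADER_ELECT
          value: "false"
```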

IMAGE_REGISTRY: {{ .Values.controller.agentImage.registry | default .Values.registry | quote }}
IMAGE_REPOSITORY: {{ .Values.controller.agentImage.repository | quote }}
IMAGE_TAG: {{ coalesce .Values.controller.agentImage.tag .Values.tag .Chart.Version | quote }}
LEADER_ELECT: {{ include "kagent.leaderElectionEnabled" . | quote }}
Contributor

Nit: LEADER_ELECTION_ENABLED

Contributor

Fair, maybe at some point in the future we can clarify

Comment on lines +110 to +115
{{/*
Check if leader election should be enabled (more than 1 replica)
*/}}
{{- define "kagent.leaderElectionEnabled" -}}
{{- gt (.Values.controller.replicas | int) 1 -}}
{{- end -}}
Contributor

Relying on replicas feels potentially a tad dangerous to me. In the future we could potentially use an HPA or something like that. Maybe let's just set it to true by default?

Contributor Author

Not sure how dangerous having this configured based on replicas really is... And I suspect the underlying concern/issue may lie in the fact that the controller is currently responsible for both resource reconciliation and serving the API.

From a controller perspective, leader election is only needed when there's more than 1 controller pod - to prevent duplicate reconciliation of resources. As for HPA, controllers are never[1] scaled based on HPA.

However, in the case of kagent I acknowledge that that might not hold true, given that the API is exposed via the controller. I would hope, however, that if it gets to the point of really needing to scale the API out/in, we'd consider splitting it out into its own deployment instead.[2]

Maybe let's just set it to true by default?

Totally OK with adding this as an explicit configuration that defaults to true if that's what you prefer. But I would then also add a check that errors if someone disables it while replicas > 1.

Also, just so you are aware, enabling this with only 1 replica adds a short delay when rolling out updates, since the new pod only becomes the leader after a short hand-off period. WDYT?

Footnotes

  1. I challenge you to find me an example in the wild - would love to see how they dynamically shard ownership of resources.

  2. Even in the case of HPA introduced on the controller, this could be changed to activate based on maxReplicas instead.
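
The explicit-toggle alternative discussed above could look roughly like this; the `controller.leaderElection.enabled` values key is hypothetical and not part of this PR:

```yaml
# Hypothetical _helpers.tpl sketch: an explicit toggle defaulting to true,
# guarded so rendering fails if it is disabled while replicas > 1.
# Note: `| default true` is avoided on purpose, since Helm's `default`
# treats an explicit `false` as empty and would silently override it.
{{- define "kagent.leaderElectionEnabled" -}}
{{- $enabled := true -}}
{{- if hasKey (.Values.controller.leaderElection | default dict) "enabled" -}}
{{- $enabled = .Values.controller.leaderElection.enabled -}}
{{- end -}}
{{- if and (not $enabled) (gt (.Values.controller.replicas | int) 1) -}}
{{- fail "leader election must remain enabled when controller.replicas > 1" -}}
{{- end -}}
{{- $enabled -}}
{{- end -}}
```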

Contributor

Ok I buy all of this, the reality is that the A2A gateway should probably be decoupled from the controller at some point, so this is a good first pass :)

@EItanya EItanya merged commit 790b0e3 into kagent-dev:main Dec 2, 2025
23 checks passed
jdubeau123 added a commit to jdubeau123/kagent that referenced this pull request Dec 4, 2025
All controllers require leader election (NeedLeaderElection: true) but PR kagent-dev#1146
conditionally enabled it only when replicas > 1. This causes a deadlock with
1 replica (the default) where controllers wait for leader election that never happens.

Leader election works correctly with 1 replica (immediately becomes leader) and
should always be enabled since all controllers require it.

Fixes reconciliation deadlock when running with single controller replica.

Signed-off-by: Julien Dubeau <julien.dubeau@amplitude.com>
Signed-off-by: Julien Dubeau <julien@amplitude.com>
jdubeau123 added a commit to jdubeau123/kagent that referenced this pull request Dec 4, 2025
All controllers explicitly require leader election (NeedLeaderElection: true):
- remote_mcp_server_controller.go:56
- agent_controller.go:71
- modelconfig_controller.go:64
- mcp_server_tool_controller.go:64
- service_controller.go:51

However, PR kagent-dev#1146 made leader election conditional (enabled only when replicas > 1).
When running with 1 replica (the default), controllers wait for leader election
that never happens, causing a reconciliation deadlock where no resources are processed.

This fix removes the conditional helper and always enables leader election:
- Deleted kagent.leaderElectionEnabled helper from _helpers.tpl
- Hardcoded LEADER_ELECT="true" in controller-configmap.yaml
- Removed conditional wrappers from leader-election RBAC resources

Leader election works correctly with 1 replica (pod immediately becomes leader)
and is required for controllers to function. Making it unconditional clarifies
that it is not an optional feature but a requirement.

Fixes reconciliation deadlock when running with single controller replica.

Signed-off-by: Julien Dubeau <julien.dubeau@amplitude.com>
Signed-off-by: Julien Dubeau <julien@amplitude.com>
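
Per the fix described in that commit message, the ConfigMap entry becomes unconditional; a minimal sketch of the resulting line:

```yaml
# helm/kagent/templates/controller-configmap.yaml after the described fix:
# the helper call is replaced by a hardcoded value.
LEADER_ELECT: "true"
```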
GTRekter pushed a commit to GTRekter/kagent that referenced this pull request Dec 4, 2025
This PR enables leader election on the controller if it is configured
with more than 1 replica to ensure that only 1 replica is actively
reconciling watched manifests. It also ensures that the necessary RBAC
manifests are created.

Final part of kagent-dev#1133 (excluding kagent-dev#1138).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
@onematchfox onematchfox deleted the enable-leader-election branch December 8, 2025 07:37