[kubernetes] Introduce leader election among agents for event collection #3476
Conversation
leader_elector.py
looks good to me, mostly comments about the lifecycle management
utils/kubernetes/kubeutil.py
Outdated
res.raise_for_status()
return res

def post_to_apiserver(self, url, data, timeout=3):
To make it more explicit that data is dumped as JSON, can you rename that method (and the put variant) to post_json_to_apiserver?
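For illustration only, a minimal sketch of what the renamed helper might look like, assuming it wraps requests directly and serializes the payload to JSON (the header handling and any session/auth details are assumptions, not this PR's actual code):

import json
import requests

def post_json_to_apiserver(self, url, data, timeout=3):
    """POST a JSON-serialized payload to the apiserver and return the response."""
    # Hypothetical sketch: the real method may reuse an authenticated session instead
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, data=json.dumps(data), headers=headers, timeout=timeout)
    res.raise_for_status()
    return res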
""" | ||
# Leader election | ||
self.kubeutil.refresh_leader() |
I'm not a fan of triggering leader election from kube_event_retriever (separation of concerns).
I'd rather just have kubernetes.py call kubeutil.refresh_leader() at the beginning of the check if the leader_candidate option is true.
Am I missing an added value of triggering leader election here instead of in the check?
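For reference, a minimal sketch of the suggested call site in kubernetes.py, assuming the check reads a leader_candidate instance option; the collect_events option and the _update_kube_events helper are assumed names used only for illustration:

def check(self, instance):
    # Refresh leader status once per check run, only if this agent opted into the election
    if instance.get('leader_candidate', False):
        self.kubeutil.refresh_leader()

    # Only the current leader reports cluster events
    if instance.get('collect_events', False) and self.kubeutil.is_leader:
        self._update_kube_events(instance)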
utils/kubernetes/kubeutil.py
Outdated
# leader status triggers event collection
self.is_leader = False
self.leader_elector = None
if os.environ.get('DD_LEADER_CANDIDATE'):
On the final version, I'd rather have a config option with an envvar binding; it might be easier for on-host installs and more consistent with the rest of the confs.
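A minimal sketch of what such a binding could look like, assuming a leader_candidate option in datadog.conf with the temporary DD_LEADER_CANDIDATE envvar kept as an override (the helper itself is hypothetical, not part of this PR):

import os

def is_leader_candidate(agentConfig):
    # datadog.conf option, with the envvar as an override for container installs
    value = os.environ.get('DD_LEADER_CANDIDATE', agentConfig.get('leader_candidate', ''))
    return str(value).lower() in ('true', 'yes', '1')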
yeah will add something in datadog.conf, this is temporary
utils/kubernetes/leader_elector.py
Outdated
CM_ENDPOINT = '/namespaces/{namespace}/configmaps'
CM_NAME = 'datadog-leader-elector'
CREATOR_LABEL = 'creator'
ACQUIRE_TIME_LABEL = 'acquired_time'
nit: ACQUIRE_TIME_LABEL_NAME and CREATOR_LABEL_NAME would be more explicit
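A trivial sketch of the suggested renaming:

CREATOR_LABEL_NAME = 'creator'
ACQUIRE_TIME_LABEL_NAME = 'acquired_time'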
utils/kubernetes/leader_elector.py
Outdated
The leader needs to refresh its status by overriding the acquire-time label in the CM meta.

This mechanism doesn't ensure uniqueness of the leader because of clock skew.
A clock sync between nodes in the cluster is highly recommended to minimize this issue.
let's just s/highly recommended/required/ and put it in the documentation (and say "basically, if you use NTP, you're fine").
utils/kubernetes/leader_elector.py
Outdated
elif len(res) == 1:
    cm = res[0]
    acquired_time = cm['metadata'].get('labels', {}).get(ACQUIRE_TIME_LABEL)
    self.last_acquire_time = datetime.datetime.strptime(acquired_time, "%Y-%m-%dT%H:%M:%S.%f")
Do we need it to be human-readable, or could we use a timestamp instead?
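A minimal sketch of the timestamp variant, reusing cm and ACQUIRE_TIME_LABEL from the quoted diff; storing the value with str(time.time()) on acquisition is an assumption, not this PR's code:

import time

# Writing the label when the lock is acquired: a plain epoch timestamp as a string
acquire_time_label_value = str(time.time())

# Reading it back needs no format string
acquired_time = cm['metadata'].get('labels', {}).get(ACQUIRE_TIME_LABEL)
if acquired_time is not None:
    self.last_acquire_time = float(acquired_time)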
One nit, otherwise 🍭
tests/core/test_leader_elector.py
Outdated
from tests.core.test_kubeutil import KubeTestCase


HEALTH_ENDPOINT = '/healthz'
Should you import the constants from the LeaderElector class instead of copying them?
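A minimal sketch of the suggested import, using the constants shown earlier from leader_elector.py; whether they live at module level or as LeaderElector class attributes determines the exact form, so this is an assumption:

# Instead of redefining the values in the test module:
from utils.kubernetes.leader_elector import CM_NAME, CREATOR_LABEL, ACQUIRE_TIME_LABEL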
Several tests fail because we mocked
Note: Please remember to review the Datadog Contribution Guidelines
if you have not yet done so.
What does this PR do?
Implement a leader election mechanism among agents on Kubernetes to collect events automatically.
Motivation
It's a pain to create a single deployment for the one agent responsible for collecting events. It's also not recommended to have all agents report events, due to the load this puts on the apiserver (and the redundant data sent by all agents).
This feature will allow having event collection enabled only once, without needing a snowflake agent deployment. All agents can be started with the leader_candidate: true option and take part in this election. Only the leading agent will report events.
Testing Guidelines
Unit tests were added; integration tests are, err... manual for now.
Additional Notes
Linked with DataDog/integrations-core#687