[kubernetes] Introduce leader election among agents for event collection #3476

Merged: xvello merged 9 commits into master from haissam/k8s-leader-event-collect on Aug 22, 2017

Member

@hkaj hkaj commented Aug 17, 2017

Note: Please remember to review the Datadog Contribution Guidelines
if you have not yet done so.

What does this PR do?

Implements a leader election mechanism among agents on Kubernetes so that events are collected automatically by a single agent.

Motivation

It's a pain to create a separate deployment for the one agent responsible for collecting events. It's also not recommended to have all agents report events, because of the load this puts on the apiserver (and the redundant data sent by all agents).

This feature will allow event collection to be enabled only once, without needing a snowflake agent deployment. All agents can be started with the leader_candidate: true option and take part in the election. Only the leading agent will report events.

Testing Guidelines

Unit tests were added, integration tests are err... manual for now.

Additional Notes

Linked with DataDog/integrations-core#687

@hkaj hkaj force-pushed the haissam/k8s-leader-event-collect branch from c0b45a4 to e592299 Compare August 21, 2017 08:26
Contributor

@xvello xvello left a comment

leader_elector.py looks good to me, mostly comments about the lifecycle management

        res.raise_for_status()
        return res

    def post_to_apiserver(self, url, data, timeout=3):
Contributor

To make it more explicit that data is dumped in json, can you rename that method (and the put variant) to post_json_to_apiserver?

"""
# Leader election
self.kubeutil.refresh_leader()
Contributor

I'm not a fan of triggering leader election from kube_event_retriever (separation of concerns).

I'd rather just have kubernetes.py call kubeutil.refresh_leader() at the beginning of the check if the leader_candidate option is true.

Am I missing an added value of triggering leader election here instead of the check?

Member Author

You're right
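
The separation suggested above can be sketched as follows. This is only an illustration, not the code merged in this PR: `FakeKubeUtil`, `FakeEventRetriever`, `get_event_array` and `run_check` are hypothetical names, and the election outcome is hard-coded instead of coming from the apiserver. The point is that the check refreshes the election when `leader_candidate` is set, while the retriever stays unaware of it.

```python
class FakeKubeUtil(object):
    """Hypothetical stand-in for kubeutil; only what this sketch needs."""

    def __init__(self, wins_election):
        self.is_leader = False
        self._wins_election = wins_election
        self.refresh_calls = 0

    def refresh_leader(self):
        # The real method would talk to the apiserver; here the outcome is fixed.
        self.refresh_calls += 1
        self.is_leader = self._wins_election


class FakeEventRetriever(object):
    """Hypothetical retriever that knows nothing about leader election."""

    def get_event_array(self):
        return [{"reason": "Scheduled"}]


def run_check(kubeutil, retriever, leader_candidate):
    """Run one check iteration and return the events to report."""
    if leader_candidate:
        # The election is refreshed by the check, not by the retriever.
        kubeutil.refresh_leader()
    if not kubeutil.is_leader:
        # Non-leaders (and non-candidates) report no events.
        return []
    return retriever.get_event_array()
```

With this layout, an agent that is not a candidate never calls `refresh_leader()`, and the retriever can be tested without mocking the election.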

# leader status triggers event collection
self.is_leader = False
self.leader_elector = None
if os.environ.get('DD_LEADER_CANDIDATE'):
Contributor

In the final version, I'd rather have an option line with an envvar binding; it might be easier for on-host installs, and more consistent with the rest of the confs.

Member Author

yeah will add something in datadog.conf, this is temporary

CM_ENDPOINT = '/namespaces/{namespace}/configmaps'
CM_NAME = 'datadog-leader-elector'
CREATOR_LABEL = 'creator'
ACQUIRE_TIME_LABEL = 'acquired_time'
Contributor

nit: ACQUIRE_TIME_LABEL_NAME and CREATOR_LABEL_NAME would be more explicit

The leader needs to refresh its status by overriding the acquire-time label in the CM meta.

This mechanism doesn't ensure uniqueness of the leader because of clock skew.
A clock sync between nodes in the cluster is highly recommended to minimize this issue.
Contributor

let's just s/highly recommended/required/ and put it in the documentation (and say "basically, if you use ntp, you're fine")
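
To make the clock-skew concern concrete, here is a rough sketch of a time-based lease check. It is an assumption about how such a check could look, not the logic in leader_elector.py: `LEASE_DURATION`, `lease_expired` and `may_claim_leadership` are hypothetical, and the 60-second value is made up for the example.

```python
import datetime

# Hypothetical lease duration; the PR excerpt does not show the real value.
LEASE_DURATION = datetime.timedelta(seconds=60)


def lease_expired(last_acquire_time, now, lease_duration=LEASE_DURATION):
    """Return True when the leader has not refreshed its label in time."""
    return now - last_acquire_time > lease_duration


def may_claim_leadership(last_acquire_time, now, skew):
    """Evaluate the same lease from a node whose clock runs `skew` ahead."""
    return lease_expired(last_acquire_time, now + skew)
```

If the leader refreshed the label 50 seconds ago, a node with an accurate clock still sees a valid lease, but a node whose clock is 20 seconds ahead sees it as expired and may take over while the first leader is still active. That is why synced clocks matter here.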

elif len(res) == 1:
    cm = res[0]
    acquired_time = cm['metadata'].get('labels', {}).get(ACQUIRE_TIME_LABEL)
    self.last_acquire_time = datetime.datetime.strptime(acquired_time, "%Y-%m-%dT%H:%M:%S.%f")
Contributor

Do we need it to be human-readable or could we use a timestamp instead?
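
For comparison, the two encodings could look like this. Only the format string comes from the line under review; `parse_iso_label`, `to_epoch_label` and `parse_epoch_label` are hypothetical helpers, and the sketch assumes naive datetimes expressed in UTC.

```python
import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # format string from the line under review
EPOCH = datetime.datetime(1970, 1, 1)


def parse_iso_label(value):
    """Parse the human-readable form, e.g. '2017-08-21T08:26:00.000000'."""
    return datetime.datetime.strptime(value, ISO_FORMAT)


def to_epoch_label(moment):
    """Encode a naive UTC datetime as whole seconds since the epoch."""
    return str(int((moment - EPOCH).total_seconds()))


def parse_epoch_label(value):
    """Decode the digits-only form back into a naive UTC datetime."""
    return EPOCH + datetime.timedelta(seconds=int(value))
```

The epoch form is shorter and digits-only, at the cost of sub-second precision in this sketch and of being harder to read when inspecting the ConfigMap by hand.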

@xvello xvello modified the milestones: 5.18.0, 5.17 Aug 21, 2017
@hkaj hkaj force-pushed the haissam/k8s-leader-event-collect branch from b0a89f1 to 1e458be Compare August 21, 2017 21:28
Contributor

@xvello xvello left a comment

One nit, otherwise 🍭

from tests.core.test_kubeutil import KubeTestCase


HEALTH_ENDPOINT = '/healthz'
Contributor

Should you import the constants from the LeaderElector class instead of copying them?

Contributor

xvello commented Aug 22, 2017

Several tests fail because we mocked retrieve_json_auth's response with a dict. We could reuse this MockResponse object
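
For context, a response mock along these lines would cover the two calls visible in this PR (`raise_for_status()` and returning the response for the caller to decode). This is a hypothetical minimal version, not the existing MockResponse object referred to above, and it raises `RuntimeError` where the requests library would raise its own HTTP error type.

```python
class MockResponse(object):
    """Hypothetical minimal stand-in for a requests-style response."""

    def __init__(self, json_data, status_code=200):
        self.json_data = json_data
        self.status_code = status_code

    def json(self):
        # Return the canned payload, as a decoded response body would.
        return self.json_data

    def raise_for_status(self):
        # Simplified: any 4xx/5xx status is treated as an error.
        if self.status_code >= 400:
            raise RuntimeError("HTTP error %d" % self.status_code)
```

A plain dict has neither `json()` nor `raise_for_status()`, so code written against a response object breaks when the mock returns a dict.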

@hkaj hkaj force-pushed the haissam/k8s-leader-event-collect branch from 707d624 to 08bc76e Compare August 22, 2017 13:04
@hkaj hkaj changed the title [WIP][kubernetes] implement leader election among agent for event collection [kubernetes] implement leader election among agent for event collection Aug 22, 2017
@xvello xvello force-pushed the haissam/k8s-leader-event-collect branch 2 times, most recently from 713aeb2 to 8c7da43 Compare August 22, 2017 14:25
@xvello xvello force-pushed the haissam/k8s-leader-event-collect branch from 8c7da43 to dc59d26 Compare August 22, 2017 14:41
@xvello xvello merged commit d5b021d into master Aug 22, 2017
@xvello xvello deleted the haissam/k8s-leader-event-collect branch August 22, 2017 15:10
@hkaj hkaj changed the title [kubernetes] implement leader election among agent for event collection [kubernetes] implement leader election among agents for event collection Nov 7, 2017
@hkaj hkaj changed the title [kubernetes] implement leader election among agents for event collection [kubernetes] Introduce leader election among agents for event collection Nov 7, 2017
2 participants