[Ingest Manager] Improve agent unenrollment #67409

nchaulet · 2020-05-26T20:29:03Z

Description

Work in progress

Currently the agent unenrollment is done with the following:

we invalidate API keys (access and output API keys)
the agent receives a 401 on its next checkin and interprets that as an unenrollment

We can improve this process with a graceful unenrollment that ensures the unenrollment worked correctly.

We should also provide a way to do an immediate unenrollment that invalidates the API keys without a graceful shutdown.

Possible implementations

We can send a new action with a new action type, UNENROLL. The agent can then do everything it needs to (uninstalling endpoint, sending its last events), and when it acks the action we invalidate the API keys and change the agent status.

We probably need a background job that, after a defined amount of time, cleans up agents that did not ack the UNENROLL action. (Is it possible to do this in Kibana?)

I suggest we introduce a new action in Fleet, UNENROLL:

Fleet sends the action
The agent does its cleanup: finishes sending logs, uninstalls
The agent acks the UNENROLL action to Fleet; in response, Fleet deactivates all the agent-related API keys and marks the agent as unenrolled

We probably want a means to force-unenroll an agent (invalidating the API keys directly, without sending the action to the agent):

a background job that does this automatically some amount of time after we sent UNENROLL
an API to do that
or both
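
The flow proposed above can be sketched roughly as follows. This is a minimal in-memory illustration, not Kibana or Fleet code: `FleetSketch`, `sendUnenroll`, `ackUnenroll`, and `forceUnenroll` are hypothetical names, and invalidating API keys is modeled as flipping a boolean.

```typescript
type AgentStatus = "enrolled" | "unenrolling" | "unenrolled";

interface AgentRecord {
  id: string;
  status: AgentStatus;
  apiKeysValid: boolean; // stands in for the access and output API keys
  pendingActions: string[];
}

// Hypothetical in-memory stand-in for the Fleet side of the proposal.
class FleetSketch {
  private agents = new Map<string, AgentRecord>();

  enroll(id: string): AgentRecord {
    const agent: AgentRecord = { id, status: "enrolled", apiKeysValid: true, pendingActions: [] };
    this.agents.set(id, agent);
    return agent;
  }

  get(id: string): AgentRecord {
    const agent = this.agents.get(id);
    if (!agent) throw new Error(`unknown agent: ${id}`);
    return agent;
  }

  // Graceful path: queue UNENROLL. Keys stay valid so the agent can still ack.
  sendUnenroll(id: string): void {
    const agent = this.get(id);
    if (agent.status !== "enrolled") return;
    agent.status = "unenrolling";
    agent.pendingActions.push("UNENROLL");
  }

  // Graceful path: the agent acked UNENROLL, so the keys can now be invalidated.
  ackUnenroll(id: string): void {
    const agent = this.get(id);
    if (agent.status !== "unenrolling") return; // ignore acks we did not ask for
    this.finish(agent);
  }

  // Force path: invalidate keys directly, without waiting for the agent.
  forceUnenroll(id: string): void {
    this.finish(this.get(id));
  }

  private finish(agent: AgentRecord): void {
    agent.apiKeysValid = false;
    agent.status = "unenrolled";
    agent.pendingActions = [];
  }
}
```

In this sketch the force path is the last step of the graceful path, so a background job or an API could call `forceUnenroll` on an agent that is still waiting to ack.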


elasticmachine · 2020-05-26T20:29:04Z

Pinging @elastic/ingest-management (Team:Ingest Management)

ph · 2020-06-26T18:10:49Z

Would like your feedback on this @blakerouse @michalpristas

ph · 2020-06-26T18:30:10Z

@nchaulet Is it possible that "graceful" and "force" are the same thing with a different timeout value?

nchaulet · 2020-06-26T19:47:24Z

@ph the way I see it, the "force" unenrollment is the final step of the graceful one.

Scenario 1, graceful unenrollment:
-> we send the UNENROLL action to the agent -> the agent finishes, uninstalls endpoint -> sends ACK -> we invalidate API keys

Scenario 2, force unenrollment (compromised tokens, for example):
-> we invalidate API keys

ph · 2020-06-26T20:17:27Z

OK, I think I agree with you on this.

In the case of scenario 1, we do a clean shutdown triggered by a specific action.

For scenario 2, the only thing the Agent will receive is a 401, so we probably need a mix: retry X times in case it's transient (a badly configured proxy, for example), and after X retries we put the agent in halt mode, uninstall endpoint, and try to reconnect?
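
One way to picture the "retry X times, then halt" idea is a small counter of consecutive 401 responses. This is only a sketch under assumptions: `UnauthorizedTracker`, `record`, and the decision names are made up for illustration, the real agent is not written in TypeScript, and the value of X is left to the caller.

```typescript
type CheckinDecision = "continue" | "retry" | "halt";

// Hypothetical helper: counts consecutive 401s seen on checkin.
class UnauthorizedTracker {
  private consecutive401 = 0;
  private readonly maxRetries: number;

  constructor(maxRetries: number) {
    this.maxRetries = maxRetries;
  }

  record(httpStatus: number): CheckinDecision {
    if (httpStatus !== 401) {
      // Any other response means the 401 streak was transient; start over.
      this.consecutive401 = 0;
      return "continue";
    }
    this.consecutive401 += 1;
    // Up to maxRetries 401s are treated as possibly transient; one more means halt.
    return this.consecutive401 > this.maxRetries ? "halt" : "retry";
  }
}
```

With `new UnauthorizedTracker(2)`, the first two consecutive 401s yield `"retry"` and the third yields `"halt"`; what halt mode then does (uninstalling endpoint, trying to reconnect) is outside this sketch.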

But concerning scenario 1, it is possible that we never receive an ack from the agent, so we should also have a "timeout" period to invalidate the key.

WDYT @ruflin @blakerouse @michalpristas

nchaulet · 2020-06-26T20:24:05Z

> But concerning scenario 1, it is possible that we never receive an ack from the agent, so we should also have a "timeout" period to invalidate the key.

Yes, we should have a timeout, or we can have a new status for the agent and allow users to force unenroll if the graceful shutdown did not work.
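
The timeout option could look something like the selection step below, which a periodic job might run before invalidating keys. It is a sketch only: `UnenrollingAgent`, `unenrollStartedAt`, and `selectTimedOut` are hypothetical names, and how the job is scheduled in Kibana is not addressed here.

```typescript
interface UnenrollingAgent {
  id: string;
  unenrollStartedAt: number; // epoch milliseconds when UNENROLL was sent (assumed field)
  acked: boolean;
}

// Returns the ids of agents that never acked UNENROLL within the timeout.
function selectTimedOut(
  agents: UnenrollingAgent[],
  nowMs: number,
  timeoutMs: number
): string[] {
  return agents
    .filter((a) => !a.acked && nowMs - a.unenrollStartedAt >= timeoutMs)
    .map((a) => a.id);
}
```

The returned ids are the agents whose API keys would be invalidated without an ack, or that could be shown to the user as candidates for a manual force unenroll.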

ph · 2020-06-26T20:31:43Z

I think we should have a state machine on the Agent with a defined transition for the states.

But I think the behavior should be automatic. Let's assume that you gracefully unenroll an agent and leave Kibana. It's possible that you never come back, and it's possible that you can't find the agent that you have "gracefully unenrolled". I think it's fair to expect that the end result is an invalidated key.

Also, maybe force unenroll is possible while in the unenrolling state; this is probably a bad word to describe it?
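
A state machine with defined transitions could be sketched as a lookup table like the one below. The state and event names are assumptions made for illustration, not the agent's actual states; the table also allows force unenroll from the unenrolling state, as suggested above.

```typescript
type AgentState = "enrolled" | "unenrolling" | "unenrolled";
type AgentEvent =
  | "UNENROLL_SENT"
  | "UNENROLL_ACKED"
  | "UNENROLL_TIMEOUT"
  | "FORCE_UNENROLL";

// Allowed transitions; any (state, event) pair not listed here is rejected.
const transitions: Record<AgentState, Partial<Record<AgentEvent, AgentState>>> = {
  enrolled: { UNENROLL_SENT: "unenrolling", FORCE_UNENROLL: "unenrolled" },
  unenrolling: {
    UNENROLL_ACKED: "unenrolled",
    UNENROLL_TIMEOUT: "unenrolled",
    FORCE_UNENROLL: "unenrolled",
  },
  unenrolled: {},
};

function nextState(state: AgentState, event: AgentEvent): AgentState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`invalid transition: ${event} from ${state}`);
  }
  return target;
}
```

Keeping the transitions in one table makes it easy to see that every path out of `unenrolling` ends with the keys invalidated, whether through an ack, a timeout, or a forced unenroll.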

michalpristas · 2020-06-29T11:22:26Z

side note on the graceful period: we probably need to make that visible, as it means something went wrong with the agent-fleet communication. we might have an agent with removed processes and a failed ack, or an agent whose uninstall failed, with processes still running and no way to report the error (this error will never be reported because the token is revoked)

for this we need to make sure that admin has the information that this agent is misbehaving and needs manual resolution

ph · 2020-06-29T22:14:21Z

I like the idea here @michalpristas, not sure we can implement it for 7.9. This is something we should find a way to expose in the UI.

This also exposes the need for a better-defined "state" for the agent, a formal state machine. I think we could link it with the work done by @blakerouse on adding degraded.

michalpristas · 2020-06-30T06:06:15Z

i'm with you on state machine @ph
let's plan both of the issues for next release

nchaulet added the Team:Fleet label May 26, 2020

ph added the Ingest Management:beta1 label May 27, 2020

nchaulet mentioned this issue May 28, 2020

[Ingest Manager] Improve agent unenrollment elastic/beats#18825

Closed

nchaulet self-assigned this Jun 25, 2020

nchaulet mentioned this issue Jun 26, 2020

[Ingest Manager] Improve agent unenrollment with unenroll action #70031

Merged

hbharding mentioned this issue Jun 30, 2020

[Ingest Manager] Improve unenroll agent UX #67704

Closed

nchaulet closed this as completed in #70031 Jul 3, 2020
