[Ingest Manager] Improve agent unenrollment #67409

Closed
nchaulet opened this issue May 26, 2020 · 10 comments · Fixed by #70031

@nchaulet
Member

nchaulet commented May 26, 2020

Description

Work in progress

Currently, agent unenrollment works as follows:

  • we invalidate the API keys (access and output API keys)
  • the agent receives a 401 on its next checkin and interprets that as an unenroll

We can improve this process with a graceful unenrollment that ensures the unenrollment completed correctly.

We should also provide a way to do an immediate unenrollment that invalidates the API keys without a graceful shutdown.

Possible implementations

We can send a new action with a new action type, UNENROLL. The agent can then do everything it needs to (uninstalling endpoint, sending its last events), and when it acks the action we invalidate the API keys and change the agent status.

We probably need a background job that, after a defined amount of time, cleans up agents that did not ack the UNENROLL action. (Is it possible to do this in Kibana?)

I suggest we introduce a new action in Fleet, UNENROLL:

  • Fleet will send the action
  • Agent will do its work: finish sending logs, uninstall
  • Agent will ack the UNENROLL action to Fleet; in reaction, Fleet will deactivate all the agent-related API keys and mark the agent as unenrolled

We probably want a means to force-unenroll an agent (invalidating the API keys directly, without sending the action to the agent):

  • a background job that does this automatically once a set amount of time has passed since we sent UNENROLL
  • an API to do that
  • or both
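
To make the proposal above concrete, here is a minimal sketch of the Fleet side of the flow. Everything in it is hypothetical: `UnenrollService`, `Agent`, `InvalidateKeys` and the method names are made up for illustration and are not existing Fleet or Kibana APIs, and key invalidation is an injected callback rather than a real Elasticsearch call.

```typescript
// Hypothetical types for illustration; not the actual Fleet data model.
type AgentStatus = "enrolled" | "unenrolling" | "unenrolled";

interface Agent {
  id: string;
  status: AgentStatus;
  apiKeyIds: string[]; // access and output API key ids
}

// Stand-in for whatever actually revokes the keys.
type InvalidateKeys = (keyIds: string[]) => void;

class UnenrollService {
  private readonly agents: Map<string, Agent>;
  private readonly invalidateKeys: InvalidateKeys;
  // Maps a pending UNENROLL action id to the agent it was sent to.
  private readonly pendingActions = new Map<string, string>();
  private nextActionId = 1;

  constructor(agents: Map<string, Agent>, invalidateKeys: InvalidateKeys) {
    this.agents = agents;
    this.invalidateKeys = invalidateKeys;
  }

  // Graceful path, step 1: record an UNENROLL action for the agent.
  requestUnenroll(agentId: string): string {
    const agent = this.getAgent(agentId);
    agent.status = "unenrolling";
    const actionId = `unenroll-${this.nextActionId++}`;
    this.pendingActions.set(actionId, agentId);
    return actionId;
  }

  // Graceful path, step 2: the agent acked, so finish the unenrollment.
  acknowledge(actionId: string): void {
    const agentId = this.pendingActions.get(actionId);
    if (agentId === undefined) {
      throw new Error(`unknown or already acked action: ${actionId}`);
    }
    this.pendingActions.delete(actionId);
    this.forceUnenroll(agentId);
  }

  // Force path: invalidate the keys directly, without waiting for the agent.
  forceUnenroll(agentId: string): void {
    const agent = this.getAgent(agentId);
    if (agent.status === "unenrolled") return; // nothing left to revoke
    this.invalidateKeys(agent.apiKeyIds);
    agent.apiKeyIds = [];
    agent.status = "unenrolled";
  }

  private getAgent(agentId: string): Agent {
    const agent = this.agents.get(agentId);
    if (agent === undefined) throw new Error(`unknown agent: ${agentId}`);
    return agent;
  }
}
```

In this sketch the graceful path ends by calling the force path, so both a background job and an API endpoint could reuse `forceUnenroll` as the single place where keys get invalidated.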
@nchaulet nchaulet added the Team:Fleet Team label for Observability Data Collection Fleet team label May 26, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@ph
Contributor

ph commented Jun 26, 2020

Would like your feedback on this @blakerouse @michalpristas

@ph
Contributor

ph commented Jun 26, 2020

@nchaulet Is it possible that "graceful" and "force" are the same thing with different timeout values?

@nchaulet
Member Author

@ph the way I see it, the "force" unenrollment is the end of the graceful one.

Scenario 1, graceful unenrollment:
-> we send the UNENROLL action to the agent -> the agent finishes up, uninstalls endpoint -> sends ACK -> we invalidate API keys

Scenario 2, force unenrollment (compromised tokens, for example):
-> we invalidate API keys

@ph
Contributor

ph commented Jun 26, 2020

OK, I think I agree with you on this.

In the case of scenario 1, we do a clean shutdown from a specific action.

For scenario 2, the only thing the Agent will receive is a 401, so we probably need a mix: retry X times in case it's transient (a badly configured proxy, for example), then after X retries we put the agent in halt mode, uninstall endpoint, and try to reconnect?
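
A rough sketch of the agent-side retry idea, written in TypeScript only for illustration. `UnauthorizedTracker`, `CheckinDecision` and `maxRetries` are hypothetical names, and what "halt mode" actually does (uninstalling endpoint, trying to reconnect) is still an open question, so it is deliberately left out.

```typescript
// What the agent should do after a checkin response (hypothetical names).
type CheckinDecision = "continue" | "retry" | "halt";

class UnauthorizedTracker {
  private consecutive401 = 0;
  private readonly maxRetries: number;

  constructor(maxRetries: number) {
    this.maxRetries = maxRetries;
  }

  // Feed in the HTTP status code of each checkin attempt.
  record(statusCode: number): CheckinDecision {
    if (statusCode !== 401) {
      // Any other response means the 401 streak was transient; reset it.
      this.consecutive401 = 0;
      return "continue";
    }
    this.consecutive401 += 1;
    // Allow maxRetries retries; the next 401 after that halts the agent.
    return this.consecutive401 > this.maxRetries ? "halt" : "retry";
  }
}
```

With `maxRetries` set to 2, the first two 401s in a row give `"retry"` and the third gives `"halt"`, while a successful checkin in between resets the count.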

But concerning scenario 1, it is possible that we never receive an ack from the agent, so we should also have a "timeout" period to invalidate the key.

WDYT @ruflin @blakerouse @michalpristas

@nchaulet
Member Author

But concerning scenario 1, it is possible that we never receive an ack from the agent, so we should also have a "timeout" period to invalidate the key.

Yes, we should have a timeout, or we can have a new status for the agent and allow users to force unenroll if the graceful shutdown did not work.
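
The timeout check could be sketched roughly like this, as a pure function that a background job might call periodically. `PendingUnenroll`, `expiredUnenrollments` and `timeoutMs` are hypothetical names; how the job is scheduled and where pending actions are stored are not decided here.

```typescript
// One UNENROLL action that was sent to an agent (hypothetical shape).
interface PendingUnenroll {
  agentId: string;
  requestedAt: number; // epoch milliseconds when UNENROLL was sent
  acked: boolean;
}

// Returns the ids of agents that never acked within the timeout,
// i.e. the candidates for having their API keys invalidated anyway.
function expiredUnenrollments(
  pending: PendingUnenroll[],
  now: number,
  timeoutMs: number
): string[] {
  return pending
    .filter((p) => !p.acked && now - p.requestedAt >= timeoutMs)
    .map((p) => p.agentId);
}
```

The same list could instead drive a new agent status shown to users, if we prefer to let them trigger the force unenroll manually.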

@ph
Contributor

ph commented Jun 26, 2020

I think we should have a state machine on the Agent with a defined transition for the states.

But I think the behavior should be automatic. Let's assume that you gracefully unenroll an agent and leave Kibana. It's possible that you never come back, and it's possible that you can't find the agent that you have "gracefully unenrolled". I think it's fair to expect the end result to be that the key is invalidated.

Also, maybe force unenroll is possible while in the unenrolling state; this is probably a bad word to describe it?
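
As a starting point for discussion, the transitions could be written down as a table like the one below. The state and event names (`AgentState`, `AgentEvent`, `TIMEOUT`, and so on) are assumptions for this sketch only, not an agreed design, and it covers just the unenrollment part of the agent lifecycle.

```typescript
// Hypothetical states and events covering only unenrollment.
type AgentState = "enrolled" | "unenrolling" | "unenrolled";
type AgentEvent =
  | "UNENROLL_REQUESTED"
  | "ACK_RECEIVED"
  | "TIMEOUT"
  | "FORCE_UNENROLL";

// Every allowed transition is listed explicitly; anything else is invalid.
const transitions: Record<AgentState, Partial<Record<AgentEvent, AgentState>>> = {
  enrolled: {
    UNENROLL_REQUESTED: "unenrolling",
    FORCE_UNENROLL: "unenrolled",
  },
  unenrolling: {
    ACK_RECEIVED: "unenrolled",
    TIMEOUT: "unenrolled", // automatic end result: keys get invalidated
    FORCE_UNENROLL: "unenrolled", // force is allowed while unenrolling
  },
  unenrolled: {},
};

function nextState(state: AgentState, event: AgentEvent): AgentState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`invalid transition: ${event} while ${state}`);
  }
  return next;
}
```

Here `TIMEOUT` and `FORCE_UNENROLL` both lead from unenrolling to unenrolled, which matches the idea that force unenroll is simply the end of the graceful path.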

@michalpristas

side note on graceful period: we probably need to make that visible, as this means something went wrong with the agent-fleet communication. we might have an agent with removed processes and a failed ack, or we might have an agent with a failed uninstall, still running processes, that failed to report the error (and this error will never be reported because the token is revoked)

for this we need to make sure that the admin has the information that this agent is misbehaving and needs manual resolution

@ph
Contributor

ph commented Jun 29, 2020

I like the idea here @michalpristas, but I'm not sure we can implement it for 7.9. This is something we should find a way to expose in the UI.

This also exposes the need for a better-defined "state" for the agent, a formal state machine. I think we could link it with the work that has been done by @blakerouse by adding degraded.

@michalpristas

i'm with you on state machine @ph
let's plan both of the issues for next release
