This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Correctly handle enqueued events affected by agent policy changes #49

Closed
Tracked by #118 ...
cmacknz opened this issue Jun 7, 2022 · 5 comments
Labels: 8.6-candidate, estimation:Month (Task that represents a month of work), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.6.0

Comments

cmacknz (Member) commented Jun 7, 2022

We need to think through all the edge cases that can arise when events in the shipper queue are affected by an agent policy change. For a concrete example, consider the case where a user removes an integration but events collected by that integration still reside in the shipper queue:

  1. User creates Agent policy rev. 1 containing integration A and integration B.
  2. Fleet Server generates an API key with append permission to write to the data streams for integrations A and B.
  3. Elastic Agent receives and runs Agent policy rev. 1.
  4. Elastic Agent needs to persist events to disk (events from integrations A and B are persisted on disk).
  5. User removes integration B; the Agent policy is updated to rev. 2.
  6. Fleet Server generates an API key with append permission to write to the data stream for integration A only.
  7. Elastic Agent receives and runs Agent policy rev. 2.
  8. Elastic Agent acknowledges the configuration.
  9. Fleet Server invalidates the old Elasticsearch API key.

In the case above, the events for the removed integration B can never be ingested by Elasticsearch once the API key has been changed. This sequence of events is worse with the disk queue because the number of affected events can be much larger, but it applies to the memory queue as well.
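In practice, the symptom at the output would be an authorization failure rather than a transient error. Below is a minimal sketch of how an output might tell a permanently undeliverable event from a retryable one, assuming a hypothetical deliveryError type and that Elasticsearch surfaces the problem as an HTTP 401/403 or a security_exception bulk item error:

```go
package output

import "net/http"

// deliveryError is a hypothetical summary of one failed bulk item.
type deliveryError struct {
	status  int    // HTTP-style status for the item, e.g. 403
	errType string // Elasticsearch error type, e.g. "security_exception"
}

// isPermanent reports whether retrying the event could ever succeed.
// With an invalidated or down-scoped API key the failure is an
// authorization error, so infinite retry would never drain the event.
func isPermanent(e deliveryError) bool {
	switch e.status {
	case http.StatusUnauthorized, http.StatusForbidden:
		return true
	}
	return e.errType == "security_exception"
}
```

A reliable classification along these lines is what option 1 below would depend on.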

We must also consider that not every policy change causes a problem. For example, changing the number of output workers does not affect events in the queue.

For policy changes that do affect enqueued events, there are several paths forward we could take to solve this problem:

  1. Decide that it is safe to drop events for integration B, and have a mechanism to do so reliably when the API key changes. This option is complicated by the fact that the shipper pipeline is unaware of agent policy changes and that failed events can be configured to retry indefinitely.
  2. Ensure all affected events have been successfully sent and removed from the queue before acknowledging the policy change. In the V2 agent control protocol the agent could send the shipper an expected state of stopped, which the shipper takes as a signal to flush all events. The agent does not consider the change done and the policy rolled out until that unit reports back an observed state of stopped. So as soon as the shipper receives stopped as the expected state, it reports stopping (i.e. the flush has started), then stopped (i.e. the flush is complete). A sketch of this handshake follows this list.
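The following is a minimal sketch of the stopped-handshake described in option 2. The state names, Queue interface, and function signatures are illustrative stand-ins, not the actual elastic-agent-client API:

```go
package shipper

// UnitState mirrors the expected/observed states exchanged over the V2
// control protocol. The names here are illustrative, not the real enum.
type UnitState int

const (
	StateHealthy UnitState = iota
	StateStopping
	StateStopped
)

// Queue is the minimal queue behaviour this sketch needs.
type Queue interface {
	// Flush blocks until every enqueued event has been acknowledged by
	// the output, or returns an error if that becomes impossible.
	Flush() error
}

// handleExpectedState reacts to a new expected state from the agent.
// report sends the shipper's observed state back to the agent; the agent
// does not acknowledge the policy change until it observes StateStopped.
func handleExpectedState(expected UnitState, q Queue, report func(UnitState)) error {
	if expected != StateStopped {
		return nil // other transitions are out of scope for this sketch
	}
	report(StateStopping) // the flush has started
	if err := q.Flush(); err != nil {
		return err // the agent keeps observing "stopping"; rollout stays blocked
	}
	report(StateStopped) // all affected events are out of the queue
	return nil
}
```

Because the flush completes before the policy change is acknowledged, Fleet Server would only invalidate the old API key once the queue no longer holds events that need it.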

Option 2 avoids data loss, but is the most complex path forward. There are multiple ways we could ensure all events affected by a policy change are drained from the shipper queue before acknowledging the policy change:

  1. Have the agent provision a second instance of the shipper process, with new events routed to the second instance. The policy change is considered acknowledged when the original shipper exits successfully after flushing all events to the output. The system would need to handle the case where the first shipper never exits successfully. The primary downside of this solution is that it temporarily doubles the number of queues and connections made to the output.
  2. Have the shipper internally provision a second instance of its data pipeline, with all new events routed to the new pipeline. This is the same as the first option but with the pipeline duplicated inside a single shipper process. The number of connections can be kept constant, but the number of queues is doubled.
  3. A policy change emits a special meta event into the pipeline. When this event is read at the output, the shipper knows all affected events have been flushed through the queue and it acknowledges the policy change. This avoids duplicating the queues and connections (a sketch follows this list).
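Here is a minimal sketch of the marker-event idea in option 3, with invented names (event.policyMarker, Pipeline, ackPolicy), since no such mechanism exists in the shipper today:

```go
package shipper

// event is either a normal data event or a policy-change marker.
type event struct {
	policyMarker int // non-zero: marks the end of events for policy revision N
	payload      []byte
}

// Pipeline stands in for the shipper's queue plus its output worker.
type Pipeline struct {
	queue chan event
	// ackPolicy is invoked once every event enqueued before the marker
	// has passed through the output.
	ackPolicy func(revision int)
}

// onPolicyChange enqueues the marker behind all events collected under the
// old revision; the queue preserves ordering.
func (p *Pipeline) onPolicyChange(oldRevision int) {
	p.queue <- event{policyMarker: oldRevision}
}

// outputLoop drains the queue. Real output code would batch, retry and
// acknowledge events; this sketch only shows where the marker is handled.
func (p *Pipeline) outputLoop(send func([]byte) error) {
	for ev := range p.queue {
		if ev.policyMarker != 0 {
			p.ackPolicy(ev.policyMarker) // everything before the marker has been sent
			continue
		}
		_ = send(ev.payload) // error handling and retries elided
	}
}
```

This sketch relies on the queue preserving ordering between the marker and the events enqueued before it; retries that reorder events would need additional bookkeeping.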

This is a complex issue with many possible solutions. Evaluate each of the proposed solutions (and consider new ones) to decide which path we should take to solve this issue. The outcome of this issue should be a meta issue with an implementation plan for solving this problem.

@cmacknz cmacknz added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team 8.5-candidate labels Jun 7, 2022
@rdner rdner added the estimation:Month Task that represents a month of work. label Jul 14, 2022
@rdner rdner assigned leehinman and unassigned rdner Sep 6, 2022
@jlind23 jlind23 added v8.6.0 and removed v8.5.0 labels Sep 7, 2022
leehinman (Contributor) commented

closing in favor of #118

cmacknz (Member, Author) commented Sep 26, 2022

Rather than closing, I moved this to be tracked under #118 since I don't see another issue covering this behaviour there (unless I missed something).

@cmacknz cmacknz reopened this Sep 26, 2022
cmacknz (Member, Author) commented Oct 18, 2022

Let's treat this like a design issue. The outcome for this issue should be a decision on which approach we take to handle this situation, with follow up issues created for implementation of that approach.

@leehinman leehinman changed the title Correctly handle enqueued events affected by agent policy changes [Design] Correctly handle enqueued events affected by agent policy changes Oct 27, 2022
leehinman (Contributor) commented

Design doc here

@pierrehilbert pierrehilbert changed the title [Design] Correctly handle enqueued events affected by agent policy changes Correctly handle enqueued events affected by agent policy changes Feb 7, 2023
leehinman (Contributor) commented

Closing this, #286 is the implementation issue.
