This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Correctly handle enqueued events affected by agent policy changes #49

Closed
Tracked by #118 ...
cmacknz opened this issue Jun 7, 2022 · 5 comments
Labels: 8.6-candidate, estimation:Month (Task that represents a month of work), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.6.0

Comments

cmacknz (Member) commented Jun 7, 2022

We need to think through all the edge cases that can arise when events in the shipper queue are affected by an agent policy change. For a concrete example, consider the case where a user removes an integration but events collected by that integration still reside in the shipper queue:

  1. User creates Agent policy rev. 1 containing integration A and integration B.
  2. Fleet Server generates an API key with append permission to write to the data streams for integrations A and B.
  3. Elastic Agent receives and runs Agent policy rev. 1.
  4. Elastic Agent needs to persist events to disk (events from integrations A and B are persisted on disk).
  5. User removes integration B; the Agent policy is updated to rev. 2.
  6. Fleet Server generates an API key with append permission to write to the data stream for integration A only.
  7. Elastic Agent receives and runs Agent policy rev. 2.
  8. Elastic Agent acknowledges the configuration.
  9. Fleet Server invalidates the old Elasticsearch API key.

In the case above, the events for the removed integration B can never be ingested by Elasticsearch once the API key has been changed. This sequence of events is worse with the disk queue because the number of affected events can be much larger, but it applies to the memory queue as well.
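In practice, the symptom at the output would be an authorization failure rather than a transient error. Below is a minimal sketch of how an output might tell a permanently undeliverable event from a retryable one, assuming a hypothetical deliveryError type and that Elasticsearch surfaces the problem as an HTTP 401/403 or a security_exception bulk item error:

```go
package output

import "net/http"

// deliveryError is a hypothetical summary of one failed bulk item.
type deliveryError struct {
	status  int    // HTTP-style status for the item, e.g. 403
	errType string // Elasticsearch error type, e.g. "security_exception"
}

// isPermanent reports whether retrying the event could ever succeed.
// With an invalidated or down-scoped API key the failure is an
// authorization error, so infinite retry would never drain the event.
func isPermanent(e deliveryError) bool {
	switch e.status {
	case http.StatusUnauthorized, http.StatusForbidden:
		return true
	}
	return e.errType == "security_exception"
}
```

A reliable classification along these lines is what option 1 below would depend on.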

We must also consider that not every policy change causes a problem. For example, changing the number of output workers does not affect events in the queue.

For policy changes that do affect enqueued events, there are several paths forward we could take to solve this problem:

  1. Decide that it is safe to drop events for integration B, and have a mechanism to do so reliably when the API key changes. This option is complicated by the fact that the shipper pipeline is unaware of agent policy changes and that failed events can be configured to retry indefinitely.
  2. Ensure all affected events have been successfully sent and removed from the queue before acknowledging the policy change. In the V2 agent control protocol the agent could send the shipper an expected state of stopped, which the shipper takes as a signal to flush all events. The agent does not consider the change done and the policy rolled out until that unit reports back an observed state of stopped. So as soon as the shipper receives stopped as the expected state, it reports stopping (i.e. the flush has started), then stopped (i.e. the flush is complete). A sketch of this handshake follows this list.
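The following is a minimal sketch of the stopped-handshake described in option 2. The state names, Queue interface, and function signatures are illustrative stand-ins, not the actual elastic-agent-client API:

```go
package shipper

// UnitState mirrors the expected/observed states exchanged over the V2
// control protocol. The names here are illustrative, not the real enum.
type UnitState int

const (
	StateHealthy UnitState = iota
	StateStopping
	StateStopped
)

// Queue is the minimal queue behaviour this sketch needs.
type Queue interface {
	// Flush blocks until every enqueued event has been acknowledged by
	// the output, or returns an error if that becomes impossible.
	Flush() error
}

// handleExpectedState reacts to a new expected state from the agent.
// report sends the shipper's observed state back to the agent; the agent
// does not acknowledge the policy change until it observes StateStopped.
func handleExpectedState(expected UnitState, q Queue, report func(UnitState)) error {
	if expected != StateStopped {
		return nil // other transitions are out of scope for this sketch
	}
	report(StateStopping) // the flush has started
	if err := q.Flush(); err != nil {
		return err // the agent keeps observing "stopping"; rollout stays blocked
	}
	report(StateStopped) // all affected events are out of the queue
	return nil
}
```

Because the flush completes before the policy change is acknowledged, Fleet Server would only invalidate the old API key once the queue no longer holds events that need it.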

Option 2 avoids data loss, but is the most complex path forward. There are multiple ways we could ensure all events affected by a policy change are drained from the shipper queue before acknowledging the policy change:

  1. Have the agent provision a second instance of the shipper process, with new events routed to the second instance. The policy change is considered acknowledged when the original shipper exits successfully after flushing all events to the output. The system would need to handle the case where the first shipper never exits successfully. The primary downside of this solution is that it temporarily doubles the number of queues and connections made to the output.
  2. Have the shipper internally provision a second instance of its data pipeline, with all new events routed to the new pipeline. This is the same as the first option but with the pipeline duplicated inside a single shipper process. The number of connections can be kept constant, but the number of queues is doubled.
  3. A policy change emits a special meta event into the pipeline. When this event is read at the output, the shipper knows all affected events have been flushed through the queue and it acknowledges the policy change. This avoids duplicating the queues and connections (a sketch follows this list).
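Here is a minimal sketch of the marker-event idea in option 3, with invented names (event.policyMarker, Pipeline, ackPolicy), since no such mechanism exists in the shipper today:

```go
package shipper

// event is either a normal data event or a policy-change marker.
type event struct {
	policyMarker int // non-zero: marks the end of events for policy revision N
	payload      []byte
}

// Pipeline stands in for the shipper's queue plus its output worker.
type Pipeline struct {
	queue chan event
	// ackPolicy is invoked once every event enqueued before the marker
	// has passed through the output.
	ackPolicy func(revision int)
}

// onPolicyChange enqueues the marker behind all events collected under the
// old revision; the queue preserves ordering.
func (p *Pipeline) onPolicyChange(oldRevision int) {
	p.queue <- event{policyMarker: oldRevision}
}

// outputLoop drains the queue. Real output code would batch, retry and
// acknowledge events; this sketch only shows where the marker is handled.
func (p *Pipeline) outputLoop(send func([]byte) error) {
	for ev := range p.queue {
		if ev.policyMarker != 0 {
			p.ackPolicy(ev.policyMarker) // everything before the marker has been sent
			continue
		}
		_ = send(ev.payload) // error handling and retries elided
	}
}
```

This sketch relies on the queue preserving ordering between the marker and the events enqueued before it; retries that reorder events would need additional bookkeeping.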

This is a complex issue with many possible solutions. Evaluate each of the proposed solutions (and consider new ones) to decide which path we should take to solve this issue. The outcome of this issue should be a meta issue with an implementation plan for solving this problem.

@cmacknz cmacknz added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team 8.5-candidate labels Jun 7, 2022
@rdner rdner added the estimation:Month Task that represents a month of work. label Jul 14, 2022
@rdner rdner assigned leehinman and unassigned rdner Sep 6, 2022
@jlind23 jlind23 added v8.6.0 and removed v8.5.0 labels Sep 7, 2022
leehinman (Contributor) commented

closing in favor of #118

cmacknz (Member, Author) commented Sep 26, 2022

Rather than closing, I moved this to be tracked under #118 since I don't see another issue covering this behaviour there (unless I missed something).

@cmacknz cmacknz reopened this Sep 26, 2022
cmacknz (Member, Author) commented Oct 18, 2022

Let's treat this like a design issue. The outcome for this issue should be a decision on which approach we take to handle this situation, with follow up issues created for implementation of that approach.

@leehinman leehinman changed the title Correctly handle enqueued events affected by agent policy changes [Design] Correctly handle enqueued events affected by agent policy changes Oct 27, 2022
leehinman (Contributor) commented

Design doc here

@pierrehilbert pierrehilbert changed the title [Design] Correctly handle enqueued events affected by agent policy changes Correctly handle enqueued events affected by agent policy changes Feb 7, 2023
leehinman (Contributor) commented

Closing this, #286 is the implementation issue.
