Fix issues with output reloading #17381
@@ -22,6 +22,7 @@ import (
 	"github.com/elastic/beats/v7/libbeat/common"
 	"github.com/elastic/beats/v7/libbeat/common/reload"
 	"github.com/elastic/beats/v7/libbeat/outputs"
+	"github.com/elastic/beats/v7/libbeat/publisher"
 	"github.com/elastic/beats/v7/libbeat/publisher/queue"
 )
@@ -34,7 +35,8 @@ type outputController struct {
 	monitors Monitors
 	observer outputObserver

 	queue     queue.Queue
+	workQueue workQueue

 	retryer  *retryer
 	consumer *eventConsumer

Review comment: This is another change I made to simplify things. Now the output controller creates a work queue in its constructor, and the same one is used by all output workers across time. This removes the need to drain the old work queue into a new one when outputs are reloaded and new output workers are created in the process.

Reply: Good change!
@@ -50,7 +52,7 @@ type outputGroup struct {
 	timeToLive int // event lifetime
 }

-type workQueue chan *Batch
+type workQueue chan publisher.Batch

 // outputWorker instances pass events from the shared workQueue to the outputs.Client
 // instances.
@@ -62,18 +64,19 @@ func newOutputController(
 	beat beat.Info,
 	monitors Monitors,
 	observer outputObserver,
-	b queue.Queue,
+	queue queue.Queue,
 ) *outputController {
 	c := &outputController{
-		beat:     beat,
-		monitors: monitors,
-		observer: observer,
-		queue:    b,
+		beat:      beat,
+		monitors:  monitors,
+		observer:  observer,
+		queue:     queue,
+		workQueue: makeWorkQueue(),
 	}

 	ctx := &batchContext{}
-	c.consumer = newEventConsumer(monitors.Logger, b, ctx)
-	c.retryer = newRetryer(monitors.Logger, observer, nil, c.consumer)
+	c.consumer = newEventConsumer(monitors.Logger, queue, ctx)
+	c.retryer = newRetryer(monitors.Logger, observer, c.workQueue, c.consumer)
 	ctx.observer = observer
 	ctx.retryer = c.retryer
@@ -86,27 +89,26 @@ func (c *outputController) Close() error {
 	c.consumer.sigPause()
 	c.consumer.close()
 	c.retryer.close()
+	close(c.workQueue)

 	if c.out != nil {
 		for _, out := range c.out.outputs {
 			out.Close()
 		}
-		close(c.out.workQueue)
 	}

 	return nil
 }

 func (c *outputController) Set(outGrp outputs.Group) {
-	// create new outputGroup with shared work queue
+	// create new output group with the shared work queue
 	clients := outGrp.Clients
-	queue := makeWorkQueue()
 	worker := make([]outputWorker, len(clients))
 	for i, client := range clients {
-		worker[i] = makeClientWorker(c.observer, queue, client)
+		worker[i] = makeClientWorker(c.observer, c.workQueue, client)
 	}
 	grp := &outputGroup{
-		workQueue:  queue,
+		workQueue:  c.workQueue,
 		outputs:    worker,
 		timeToLive: outGrp.Retry + 1,
 		batchSize:  outGrp.BatchSize,
@@ -119,7 +121,6 @@ func (c *outputController) Set(outGrp outputs.Group) {
 			c.retryer.sigOutputRemoved()
 		}
 	}
-	c.retryer.updOutput(queue)
 	for range clients {
 		c.retryer.sigOutputAdded()
 	}
@@ -136,12 +137,13 @@ func (c *outputController) Set(outGrp outputs.Group) {

 	// restart consumer (potentially blocked by retryer)
-	c.consumer.sigContinue()
+	c.consumer.sigUnWait()

 	c.observer.updateOutputGroup()
 }

 func makeWorkQueue() workQueue {
-	return workQueue(make(chan *Batch, 0))
+	return workQueue(make(chan publisher.Batch, 0))
 }

Review comment: I found that this was needed to make the test pass. That said, I agree that this is something the retryer should automatically call internally rather than "leak" this responsibility to the controller. I'll look into why it's not doing that.

Reply: Yeah, this sounds like a potential bug to me.

Reply: Addressed in 6a6680c.
 // Reload the output

Review comment: This turned out to be a key change in getting the test to pass (i.e. ensuring that all batches are eventually published despite rapid reloading of outputs multiple times).

Consider the case where the consumer is paused (so `paused == true` here) but it still has a batch of events that it has consumed from the queue but not yet sent to the work queue. In this scenario, because of the `!paused` clause here, we would fall into the `else` clause, which would set `out = nil`. This would then cause the last `select` in this method to block (unless, by luck of timing, a signal to unpause the consumer just happened to come along).

Now, if we have a valid output group `c.out` and a `batch`, we don't pay attention to the `paused` state; this ensures that this batch will be sent to that output group's work queue via that final `select`'s final `case`.

Reply: Oh wow, nice fix!