
Introduce spooling to disk #6581

Merged
merged 4 commits into from
Apr 5, 2018

Conversation

urso

@urso urso commented Mar 16, 2018

This PR implements the queue.Queue interface, adding spooling-to-disk functionality to all beats. The queue interface requires all queues to provide 'connections' for Producers and Consumers. The interface also requires all ACKs to be executed asynchronously. Once an event is ACKed, it must not be presented to a Consumer (the outputs worker queue) anymore.

The new queue type is marked as Beta-feature.

In the spool, the events are stored in a dynamic ring buffer, in one single file only. The maximum file size must be configured at startup (default 100MiB). The file layer needs to use the available file space for event data and additional meta data. Events are first written into a write buffer, which is flushed once it is full or when the flush timeout triggers. No limit on event sizes is enforced. An event bigger than the write buffer will still be accepted by the spool, but will trigger a flush right away. A successful flush also requires 2 fsync operations (1. data, 2. file header update). The spool blocks all producers once it cannot allocate any more space within the spool file. Writing never grows the file past the configured maximum file size. All producers are handled by the inBroker. There is no direct communication between producers and consumers. All required signaling and synchronisation is provided by the file layer.
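
A minimal sketch of the flush policy just described (flush when the write buffer runs full or when the flush timeout fires). The writeBuffer type, its fields and methods are illustrative assumptions, not the PR's actual inBroker or file-layer code, and the fsync/transaction handling is reduced to a print statement:

package main

import (
	"bytes"
	"fmt"
	"sync"
	"time"
)

type writeBuffer struct {
	mu      sync.Mutex
	buf     bytes.Buffer
	limit   int           // configured write buffer size
	timeout time.Duration // flush_timeout
	timer   *time.Timer
}

// Publish appends an encoded event. Events larger than the buffer are still
// accepted, but force an immediate flush, as described above.
func (w *writeBuffer) Publish(encoded []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.buf.Len() == 0 {
		w.timer = time.AfterFunc(w.timeout, w.Flush) // arm flush_timeout on first event
	}
	w.buf.Write(encoded)
	if w.buf.Len() >= w.limit {
		w.flushLocked()
	}
}

func (w *writeBuffer) Flush() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.flushLocked()
}

func (w *writeBuffer) flushLocked() {
	if w.buf.Len() == 0 {
		return
	}
	if w.timer != nil {
		w.timer.Stop()
		w.timer = nil
	}
	// In the real queue this is where data pages and the file header are
	// written and fsync'ed (two fsyncs per successful flush).
	fmt.Printf("flushing %d bytes\n", w.buf.Len())
	w.buf.Reset()
}

func main() {
	w := &writeBuffer{limit: 1 << 20, timeout: time.Second}
	w.Publish([]byte(`{"message": "hello"}`))
	w.Flush()
}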

Once the write buffer is flushed, a signal is returned to each individual Producer, notifying it that its events have been published. This signal is used by filebeat/winlogbeat to update the registry file.

Consumers are handled by the outBroker. Consumers request a batch of events. The broker reads up to N messages from the file and forwards these. The reading process is 'readonly' and does not update the on-disk read pointers (only in-memory read pointers are updated).
The file is memory-mapped for reading, which significantly increases the process's reported memory usage.

The outputs asynchronously ACK batches. The ACK signals are processed by the broker's ackLoop. Due to load-balancing or retries, ACKs can be received out of order. The broker guarantees that ACKs are applied in the same order the events have been read from the queue. Once a continuous set of events (starting from the last on-disk read pointer) is ACKed, the on-disk read pointer is updated and the space occupied by ACKed events is freed. As free space is tracked by the file, the file meta-data must be updated. If no more space for file meta-data updates is available, the file can potentially grow a few pages past max-size. Growing is required to guarantee progress (otherwise the spool might be stalled forever). In case the file did grow on ACK, the file layer will try to free the space with later write/ACK operations, potentially truncating the file again.
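
A hedged illustration of the ACK-ordering rule described here, reduced to its core: batches may be ACKed out of order, but only a contiguous, fully-ACKed prefix releases space and advances the on-disk read pointer. The ackTracker type and sequence numbering are assumptions for illustration, not the broker's ackLoop code:

package main

import "fmt"

type ackTracker struct {
	next  uint64          // first batch sequence number not yet released
	acked map[uint64]bool // out-of-order ACKs waiting for their predecessors
}

func newAckTracker() *ackTracker {
	return &ackTracker{acked: map[uint64]bool{}}
}

// Ack records an ACK for batch seq and returns how many batches, starting at
// the current read pointer, can now be released (i.e. freed on disk).
func (t *ackTracker) Ack(seq uint64) int {
	t.acked[seq] = true
	released := 0
	for t.acked[t.next] {
		delete(t.acked, t.next)
		t.next++
		released++
	}
	return released
}

func main() {
	t := newAckTracker()
	fmt.Println(t.Ack(2)) // 0: batch 0 is still outstanding
	fmt.Println(t.Ack(0)) // 1: batch 0 released, batch 1 still missing
	fmt.Println(t.Ack(1)) // 2: batches 1 and 2 released together
}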

The file layer is provided by go-txfile. The file is split into pages of equal size. The layer provides transactional access to pages only. All writes (ACK, flushing the write buffer) are handled concurrently, using write transactions. The reader is isolated from concurrent writes/reads using a read transaction. The last committed transaction state is never overwritten in the file. If a beat crashes during a write transaction, the most recent committed transaction is still available, so the beat can continue from the last known state upon restart. No additional repair phase is required.

Known limitations (To be addressed in future PRs):

  • File maximum size cannot be changed once the file is generated:
    • Add support to grow max size
    • Add support to shrink max size. Shrinking could be made dynamic, trying to return space once pages at the end of the file are freed
    • Add a command to transfer the queue into a new queue file, in case fragmentation prevents dynamic shrinking or the user doesn't want to wait.
  • Monitoring metrics do not report already available events upon beats restart (requires some more changes to the queue interface itself).
  • No file related metrics available yet.
  • If the file is too small and the write buffer too big, the queue can become stalled. A potential solution requires all of the following (a validation sketch follows this list):
    • Limit maximum event size
    • Generously preallocate the meta-area (reserve pages for meta-data only on file creation)
    • Ensure the usable data area is always > 2*write buffer -> partial write buffer flush?
    • Startup check validating the combination of max_size, data vs. meta area size, event size, write buffer size
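
A rough sketch of what the startup check proposed in the last bullet could look like; the meta-area reservation, field names and thresholds are assumptions for illustration, not the shipped configuration validation:

package main

import (
	"errors"
	"fmt"
)

type spoolConfig struct {
	MaxSize     uint64 // configured maximum spool file size
	PageSize    uint64
	WriteBuffer uint64
	MaxEvent    uint64 // hypothetical hard limit on a single encoded event
}

func (c spoolConfig) Validate() error {
	meta := 32 * c.PageSize // generously pre-allocated meta area (assumption)
	if c.MaxSize <= meta {
		return errors.New("max_size leaves no room for event data")
	}
	data := c.MaxSize - meta
	if data < 2*c.WriteBuffer {
		return errors.New("usable data area must be at least twice the write buffer")
	}
	if c.MaxEvent > c.WriteBuffer {
		return errors.New("a single event must fit into the write buffer")
	}
	return nil
}

func main() {
	cfg := spoolConfig{MaxSize: 100 << 20, PageSize: 4096, WriteBuffer: 1 << 20, MaxEvent: 512 << 10}
	fmt.Println(cfg.Validate()) // <nil>
}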

@urso urso added the in progress Pull request is currently in progress. label Mar 16, 2018
handlers []func()
}

type Settings struct {


exported type Settings should have comment or be unexported

@@ -8,7 +8,7 @@ import (
"github.com/elastic/beats/libbeat/publisher/queue"
)

type QueueFactory func() queue.Queue
type QueueFactory func(t *testing.T) queue.Queue


exported type QueueFactory should have comment or be unexported

@@ -68,7 +81,7 @@ func configTest(t *testing.T, typ string, configs []string, fn func(t *testing.T
for _, config := range configs {
config := config
t.Run(testName(typ, config), func(t *testing.T) {
t.Parallel()
// t.Parallel()
Author

TODO: restore capability to run tests in parallel (will need randomized file names)

type outLogger struct {
}

var defaultLogger logger = logp.NewLogger("spool")


should omit type logger from declaration of var defaultLogger; it will be inferred from the right-hand side

Member

@ruflin ruflin left a comment

I skimmed through the code and left some high level comments. It definitely needs a more detailed review, also with knowledge of the file spooler code. One thing that is missing is docs for this PR.

One way I was looking at the code is whether we could release this early. As the queue registers itself and only comes into action in case it's enabled, it should have no effect on the rest of the code.

This is a feature that needs lots of automated and manual testing to make sure it's bullet proof. Because of this I'm tempted to merge this rather soonish as experimental or beta so people can start to play around with it and provide feedback.

For the testing, one thing I'm especially curious about is how it works across restarts. Does it continue sending events, for example? To me it seems some additional system tests would be best here.

# is full or the flush_timeout is triggered.
# Once ACKed by the output, events are removed immediately from the queue,
# making space for new events to be persisted.
#spool:
Member

Why not call this file or disk? It would make it more obvious that it is persistent.

Contributor

We should probably align with the naming/convention of the Logstash persistent queue if/when possible.

In Logstash we are using queue.type, accepting these two values: memory (default) and persisted

Author

@urso urso Mar 28, 2018

@ph @ruflin

Hmmm... introducing queue.type would be a bc-breaking change. As one can have only one queue type, the queue config follows the dictionary-style config pattern like:

queue.<type>:

What's wrong with spool?

Right now this one implements a FIFO. For metricbeat use cases we might actually introduce another persistent LIFO queue type, which would need yet another type name. How would you name either of these then?
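
To illustrate the dictionary-style pattern @urso describes: the queue section maps exactly one type name to that type's settings, so a separate queue.type field is unnecessary. This is a simplified sketch with assumed names, not libbeat's actual config loading:

package main

import (
	"errors"
	"fmt"
)

// selectQueueType returns the single configured queue type, or the default.
func selectQueueType(queue map[string]interface{}) (string, interface{}, error) {
	switch len(queue) {
	case 0:
		return "mem", nil, nil // no queue section configured: fall back to the memory queue
	case 1:
		for name, settings := range queue {
			return name, settings, nil
		}
	}
	return "", nil, errors.New("only one queue type may be configured")
}

func main() {
	name, settings, err := selectQueueType(map[string]interface{}{
		"spool": map[string]interface{}{"file": map[string]interface{}{"size": "100MiB"}},
	})
	fmt.Println(name, settings, err)
}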

Contributor

Re bc-breaking: after looking at filebeat.reference.yml

#queue:
  # Queue type by name (default 'mem')
  # The memory queue will present all available events (up to the outputs
  # bulk_max_size) to the output, the moment the output is ready to server
  # another batch of events.
  #mem:
    # Max number of events the queue can buffer.

You are right, actually I think our config makes it cleaner. @ruflin @kvch I would vote to keep @urso's suggestion.
Concerning the naming: spool or disk or file, I think spool is a really common name for that kind of setting.

Contributor

For LIFO vs. FIFO, it's more like a priority. It could also become more complex and take the kind of events into consideration, e.g. on a storage system disk usage vs. CPU (probably a bad example).

Member

I don't think we should introduce a type because of the reasons mentioned by @urso. There can only be one queue at a time.

For me spooling can be to disk or memory. Historically we had a spooler in filebeat which kept the data in memory. Another option would be to call it file?

+1 on the proposal from @ph about the priority. FIFO or LIFO is a config option of the queue. It can mean a completely different implementation in the background, but the user should not have to worry about that.

Author

Regarding priority I was initially thinking the same. But if we have a priority setting, the user must be allowed to change the setting between restarts.

The async ACK behaviour of the queue makes the on-disk structure a little more complicated, though. When introducing stack-like functionality, we will end up with many holes. That is, freeing space will be somewhat more complicated in the LIFO case. I'd like to solve the LIFO case separately, potentially merging both cases into a common file format later on. A priority-based queue using heaps might become even more complicated.

Contributor

@urso Looks like a good compromise.

@ruflin You have a point; disk or file at the discretion of @urso.

# will have no more effect.
#file:
# Location of spool file. This value must be configured.
#path: ""
Member

Should this default to our path.data directory?

Contributor

I believe we should provide a default, and path.data makes sense in that case, so we can remove the "This value must be configured".

Author

Yeah, was thinking the same. Will default to ${path.data}/spool.dat.

@@ -148,6 +148,50 @@ auditbeat.modules:
# if the number of events stored in the queue is < min_flush_events.
#flush.timeout: 1s

# The spool queue will store events in a local spool file, before
Member

I assume only 1 queue can be used at the same time. Should we mention that in a comment on line 130?

Author

@urso urso Mar 28, 2018

Do we need to? If you configure 2 queue types, beats will report an error that only one is allowed. It's using the dictionary-style plugin config we have in quite a few places in beats.

Member

I mainly brought it up as people keep trying to use 2 outputs in Beats, and they will for queues too. So having it documented makes it easy to point people to it, and they know it already from the docs before running into it.

@@ -0,0 +1 @@
// Do not remove
Member

Perhaps add a note on why it should not be removed.

Author

Will change the tests to create temporary directories. It was just nice to have the files available after tests (for inspection).

When removing the file, git will remove the directory -> tests would fail.

}

// reg := monitoring.NewRegistry()
pipeline, err := pipeline.Load(info, nil, config.Pipeline, config.Output)
if err != nil {
return err
return fmt.Errorf("Loading pipeline failed: %v", err)
Member

Start the error message with a lower-case letter.

const (
// Note: Never change order. Codec IDs must not change in the future. Only
// adding new IDs is allowed.
codecUnknown codecID = iota
Member

Not sure I understand why we need different codecs. It seems it could be configurable through the config but it's not listed in the reference file. I assume it would also cause issues if one changes the codec after the beat has been running.

It's good for testing to have the different codecs, but should we just set one for now?

Contributor

@ruflin the codec is encoded in the header so we can choose the correct encoding on reading. So the queue on disk supports events with mixed codecs, and users can change that setting and it should be fine even if there are already events on disk.

Contributor

I see that we use CBORL as the default serialization method; we should probably benchmark that, use the fastest possible route, and keep a trace of that.

Also, I am not sure at this point that we should support different formats; it seems premature and will increase the support area.

Author

The JSON codec is nice for debugging (assuming I would have to use a hex editor). Checking out a file in a hex editor can be complicated enough. No need to make debugging more complicated by having binary data only.

The UBJSON can be removed.

Benchmarks are available in the go-structform tests (using beats events):

BenchmarkDecodeBeatsEvents/packetbeat/structform-json-4         	     200	   9210052 ns/op	  74.12 MB/s	 3251196 B/op	   62062 allocs/op
BenchmarkDecodeBeatsEvents/packetbeat/structform-ubjson-4       	     200	   7440986 ns/op	  87.19 MB/s	 3171845 B/op	   56362 allocs/op
BenchmarkDecodeBeatsEvents/packetbeat/structform-cborl-4        	     200	   7310081 ns/op	  81.46 MB/s	 3171266 B/op	   56361 allocs/op
BenchmarkDecodeBeatsEvents/metricbeat/structform-json-4         	     500	   3333191 ns/op	  62.41 MB/s	 1387710 B/op	   24852 allocs/op
BenchmarkDecodeBeatsEvents/metricbeat/structform-ubjson-4       	     500	   2955721 ns/op	  72.92 MB/s	 1387472 B/op	   24758 allocs/op
BenchmarkDecodeBeatsEvents/metricbeat/structform-cborl-4        	     500	   2816873 ns/op	  67.70 MB/s	 1387136 B/op	   24758 allocs/op
BenchmarkDecodeBeatsEvents/filebeat/structform-json-4           	      30	  46966885 ns/op	 105.49 MB/s	15055009 B/op	  390517 allocs/op
BenchmarkDecodeBeatsEvents/filebeat/structform-ubjson-4         	      30	  38593911 ns/op	 126.92 MB/s	14874314 B/op	  390499 allocs/op
BenchmarkDecodeBeatsEvents/filebeat/structform-cborl-4          	      50	  36702929 ns/op	 123.75 MB/s	14874551 B/op	  390500 allocs/op
BenchmarkEncodeBeatsEvents/packetbeat/structform-json-4         	     200	   5970464 ns/op	 114.26 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/packetbeat/structform-ubjson-4       	     500	   3749375 ns/op	 173.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/packetbeat/structform-cborl-4        	     500	   2829715 ns/op	 210.43 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/metricbeat/structform-json-4         	    1000	   2274028 ns/op	  91.03 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/metricbeat/structform-ubjson-4       	    1000	   1672348 ns/op	 128.88 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/metricbeat/structform-cborl-4        	    1000	   1236829 ns/op	 154.18 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/filebeat/structform-json-4           	      50	  32523323 ns/op	 151.97 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/filebeat/structform-ubjson-4         	     100	  22533875 ns/op	 217.37 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeBeatsEvents/filebeat/structform-cborl-4          	     100	  15919294 ns/op	 285.32 MB/s	       0 B/op	       0 allocs/op

ubjson is faster on decode, cborl is much faster on encode :)

Author

@urso urso Mar 29, 2018

Btw, the codec setting is not documented for now. But as I have some plans to replace the codec in the future, I added the codec to the header, so upgrades can be smooth (no need to drop old events).

Contributor

@urso +1 for thinking ahead on that; since it's beta we have the luxury of changing things.

Member

For not documenting it: If the config is there, it should be in the reference file. We did this "trick" in the past and I think it didn't benefit us as people discovered the option and started using it. Better have it in the reference file as experimental and say what it does.

If I understand the comment from ph correctly, a user can have codec A and then configure codec B on restart, and everything will keep working smoothly? The events are read in with codec A, and the next time the file is written, it will be all in codec B.

Author

@urso urso Apr 2, 2018

Added codec setting to the reference file.

If I understand the comment from ph correctly, a user can have codec A and then configure codec B on restart, and everything will keep working smoothly? The events are read in with codec A, and the next time the file is written, it will be all in codec B.

The codec can be changed between restarts. The setting is only used by the writer. The reader can easily deal with all codecs within a file.
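
A hedged sketch of per-event codec tagging in this spirit: the writer stores a codec ID with every event, so the reader can pick the right decoder even when the configured codec changed between restarts and the file holds a mix of encodings. The byte layout, names, and the use of encoding/json as a stand-in codec are assumptions, not the queue's on-disk format:

package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type codecID byte

const (
	codecUnknown codecID = iota
	codecJSON
	codecCBOR // stand-in for a binary codec; not wired up in this sketch
)

// encodeEvent prefixes the encoded event with the codec ID used to write it.
func encodeEvent(id codecID, event map[string]interface{}) ([]byte, error) {
	payload, err := json.Marshal(event) // JSON stands in for the configured codec
	if err != nil {
		return nil, err
	}
	return append([]byte{byte(id)}, payload...), nil
}

// decodeEvent dispatches on the stored codec ID, not on the current setting.
func decodeEvent(raw []byte) (map[string]interface{}, error) {
	if len(raw) == 0 {
		return nil, errors.New("empty event")
	}
	var event map[string]interface{}
	switch codecID(raw[0]) {
	case codecJSON:
		err := json.Unmarshal(raw[1:], &event)
		return event, err
	default:
		return nil, fmt.Errorf("no decoder wired up for codec id %d in this sketch", raw[0])
	}
}

func main() {
	raw, _ := encodeEvent(codecJSON, map[string]interface{}{"message": "hello"})
	fmt.Println(decodeEvent(raw))
}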

}

if c.MaxSize < humanize.MiByte {
errs = append(errs, errors.New("max size must be larger 1MiB"))
Member

What happens if someone has a multiline event of 2MB and the MaxSize here is 1.1MiB?

Author

Well, there are some "limits" that are not supported.

If an event is > the total file size, we will block forever.

As event sizes are dynamic and we don't want to employ hard limits on event sizes, we have kind of a weird effect here. There are potential parameter combinations + event sizes that can block the queue. The best we could do is introduce a max event size and drop events if they hit the limit. This way we can precompute the worst page usage of an event and figure out whether max size must be even larger. It's kinda tricky; that's why I haven't done it yet.
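
A back-of-the-envelope helper for the precomputation @urso mentions: with a hard cap on event size, the worst-case number of pages one event can occupy is easy to bound and can feed a startup check of max_size. The page size and per-event overhead below are assumed values for illustration only:

package main

import "fmt"

const (
	pageSize      = 4096
	perEventMeta  = 32      // assumed per-event framing/header overhead in bytes
	maxEventBytes = 2 << 20 // hypothetical hard limit on a single encoded event
)

// pagesForEvent rounds the framed event size up to whole pages.
func pagesForEvent(eventBytes int) int {
	framed := eventBytes + perEventMeta
	return (framed + pageSize - 1) / pageSize
}

func main() {
	worst := pagesForEvent(maxEventBytes)
	fmt.Printf("worst case: %d pages (%d bytes) per event\n", worst, worst*pageSize)
}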

# is full or the flush_timeout is triggered.
# Once ACKed by the output, events are removed immediately from the queue,
# making space for new events to be persisted.
#spool:
Member

We should mark this beta at first, in the code as a log message, here in the config and in the docs.

Author

Done.

// the producer. The producers event (after close) might be processed or
// ignored in the future.
//
// stateBlocked transitions:
Member

In the case of Filebeat this means the harvesters are blocked. In the case of Metricbeat, does this mean the latest incoming events will be dropped? So if the output becomes available again, all events in the queue will be sent, and the events which came in between "queue full" and "output available again" will be lost.

Contributor

I believe that would be the behavior we want; not sure how it is done in the current form of this PR.

Few notes:

  1. If the queue is blocked for a short time, a small amount of metrics will be lost and metrics will smooth over the small gaps, which should still give us a good view of the situation?

  2. If the queue is blocked for a long period of time, we could drop a huge amount of metrics. I guess the monitoring cluster should detect that behavior by either a) a full queue for a long period of time, or b) not receiving any metrics snapshot for some time.

Author

@urso urso Mar 28, 2018

It's the same behavior as for the in-memory queue. If the queue is full, it is full! It blocks (no event is dropped).

Block vs. drop (besides having similar effects on metricbeat) do have subtle differences. Metricbeat could report/monitor long block periods by having a watchdog that checks whether metricsets are actually reporting events within the configured period (or a multiple of it); see the sketch below.

It's by design. Administrators pre-allocate/assign disk space for use by beats, and beats guarantees (best effort) it does not use any more disk space once the queue is full. Better to have beats block when the queue is full than make a system run out of disk space.
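
A minimal sketch of the watchdog idea mentioned above (assumed names, not Metricbeat code): record the time of the last successfully published event and warn if nothing was reported within a few collection periods:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type watchdog struct {
	period   time.Duration // the metricset's configured collection period
	lastUnix atomic.Int64  // UnixNano of the last published event
}

func (w *watchdog) Reported() { w.lastUnix.Store(time.Now().UnixNano()) }

// Check warns when no events were published for several periods, which would
// hint at a full (blocked) queue.
func (w *watchdog) Check() {
	last := time.Unix(0, w.lastUnix.Load())
	if stalled := time.Since(last); stalled > 3*w.period {
		fmt.Printf("no events published for %v; queue may be full/blocked\n", stalled)
	}
}

func main() {
	w := &watchdog{period: time.Second}
	w.Reported()
	w.Check() // quiet: an event was just reported
}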

#file:
# Location of spool file. This value must be configured.
#path: ""

Contributor

My vim shows an unnecessary tab here.


# Configure file permissions if file is created. The default value is 0600.
#permissions: 0600

Contributor

Also, an unnecessary tab here.


type size uint64

func (c *pathConfig) Validate() error {
Contributor

Yay for Validate functions!

active atomic.Bool
done chan struct{}

handlers []func()
Contributor

I think calling the list of functions which users set to run when a spoolCtx is closed handlers is too general. Handlers can be called whenever they are needed, but this list contains only a special kind of handler which runs while closing the session.
I would rather name it onCloseCBs, onCloseCallbacks or closingCallbacks. This change would make it more obvious when these functions are called and what their role is.

Contributor

agreed ^

Author

Urgh... I felt dirty when introducing the handlers. Now I'm feeling even more "dirty" giving the impression we want multiple of these callbacks. Let's see if I can get rid of handlers or other callbacks.

Also note, these are for internal use only. No external lib/component is ever allowed to install a custom handler! The idea of spoolCtx is to provide a very small required subset of context.Context in order to deal with queue shutdown.
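
For illustration, a minimal sketch of such a shutdown-only context (assumed names and structure; the PR's spoolCtx differs in detail): just a done channel plus Closed/Close, the small subset of context.Context the queue needs:

package main

import (
	"fmt"
	"sync"
)

type spoolCtx struct {
	once sync.Once
	done chan struct{}
}

func newSpoolCtx() *spoolCtx { return &spoolCtx{done: make(chan struct{})} }

// Done exposes the channel producers/consumers select on during shutdown.
func (c *spoolCtx) Done() <-chan struct{} { return c.done }

// Closed reports whether Close has been called.
func (c *spoolCtx) Closed() bool {
	select {
	case <-c.done:
		return true
	default:
		return false
	}
}

// Close is idempotent; closing the channel wakes up all waiters.
func (c *spoolCtx) Close() { c.once.Do(func() { close(c.done) }) }

func main() {
	ctx := newSpoolCtx()
	fmt.Println(ctx.Closed()) // false
	ctx.Close()
	fmt.Println(ctx.Closed()) // true
}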

Author

Haha... handlers hasn't been used for quite some time -> removing it \o/

func (c *consumer) Get(sz int) (queue.Batch, error) {
log := c.ctx.logger

if c.closed.Load() || c.ctx.Closed() {
Contributor

It would be nice to wrap c.closed.Load into a function named c.Closed. Just like it is done in spoolCtx.

Contributor

@ph ph left a comment

Good job @urso !

I went through the code, looks good. I have noted a few things to change. Can you check the logging statements in the source code? I believe you were using trace with formatted strings before moving to log.Debug, so we need to clear out any unnecessary spaces.

On my side, I need to do the following:

  1. Revise the tests; at first glance I think we are missing a few unit tests on some of the classes.

  2. Real world human testing on it.

I expect we do these tasks in follow-up PRs:

  1. E2E integration tests?
  2. Stress tests?
  3. Recovery scenario?



# Maximum duration after which events are flushed, if the write buffer
# is not full yet. The default value is 1s.
#flush_timeout: 1s
Contributor

This is related to fsync to disk, should we add a mention that lower values could impact performance?

Author

I'd rather not document here the effects of lower and higher values. Maybe we want to add some of the tradeoffs to the final asciidoc.

if (m & 0200) == 0 {
errs = append(errs, errors.New("file must be writable by current user"))
}
}
Contributor

Open question: Should we do a strict check on the permissions of the PQ when starting up? Is there a risk of it containing sensitive data?

Author

Yeah, an extra check would make sense. As the queue is not encrypted, there is indeed a chance of it holding sensitive data.

Author

NewSpool checks the file permissions if the file exists.

type decoder struct {
// TODO: replace with a more intelligent custom buffer, so as to limit the amount of memory
// the buffer holds onto after reading a 'large' event.
// Alternatively use structform decoder types (with shared io.Buffer).
Contributor

Maybe remove the TODO and create an issue so we can discuss pros and cons of doing it.

err := b.queue.ACK(uint(ackCh.total))
if err != nil {
log.Debug("ack failed with:", err)
time.Sleep(1 * time.Second)
Contributor

My understanding of this part is that we want to give some backoff on the acking when an error occurs.
Suggestion: use an exponential backoff with a fixed cap; using 1s by default seems a bit high?
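
A small sketch of that suggestion (exponential backoff with a fixed cap) around the ACK retry; the constants and function names are illustrative, not proposed defaults:

package main

import (
	"errors"
	"fmt"
	"time"
)

// ackWithBackoff retries ack until it succeeds, doubling the wait up to a cap.
func ackWithBackoff(ack func() error) {
	const (
		initial = 50 * time.Millisecond
		maxWait = 1 * time.Second
	)
	backoff := initial
	for {
		if err := ack(); err == nil {
			return
		}
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxWait {
			backoff = maxWait
		}
	}
}

func main() {
	tries := 0
	ackWithBackoff(func() error {
		tries++
		if tries < 3 {
			return errors.New("disk write failed") // simulated transient I/O error
		}
		return nil
	})
	fmt.Println("acked after", tries, "tries")
}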

Author

The question is how often this could happen. The only reason ACKing fails is disk/write failures (network shares).

With exponential backoff we might want to introduce more configurable parameters. I'd prefer not to introduce additional config parameters for dealing with this kind of fatal disk/IO issue.

Contributor

OK, good for now.

b.active = req
b.total = total
b.timer.Start()
log.Debug(" outbroker (stateActive): switch to stateWithTimer")
Contributor

unneeded space?

b.initState()

case n := <-b.sigFlushed:
// yay, more events \o/
Contributor

🎉

states, s.clients = s.clients[:n], s.clients[n:]
s.mux.Unlock()
return states
}
Contributor

nitpick, I would have used a defer for the unlock.


@kvch
Contributor

kvch commented Mar 26, 2018

I've finished the first round of review. I added comments with my questions/requests and agree with most of what @ph noted. I am waiting for your changes/responses.

I tested it manually and I haven't found anything problematic so far.

@ph
Contributor

ph commented Mar 26, 2018

I did some tests manually too over the weekend; it was green for me, but it's still high-level testing.

@urso @kvch I think we need to come up with a solution for stress testing this. Do we want to do some chaos-monkey suite for that?

@urso
Author

urso commented Mar 28, 2018

@ph > So we need to clear out any unnecessary space.

The spaces in logp.Debug are on purpose. It's like a 'multiline' log. The indentation helps me a little to keep context. :/

@ph
Contributor

ph commented Mar 29, 2018

The spaces in logp.Debug are on purpose. It's like a 'multiline' log. The indentation helps me a little to keep context. :/

@urso Wouldn't using a contextual logger help in that case?

l := logp.NewLogger("spool").With("ABC", "WYZ")
l.Debugw("something wrong", "abc", 1234)

:(

@ph
Contributor

ph commented Mar 29, 2018

Some files need to be generated:

NOTICE.txt: needs update
auditbeat/auditbeat.reference.yml: needs update
filebeat/filebeat.reference.yml: needs update
heartbeat/heartbeat.reference.yml: needs update
metricbeat/metricbeat.reference.yml: needs update
packetbeat/docs/fields.asciidoc: needs update
packetbeat/packetbeat.reference.yml: needs update
winlogbeat/winlogbeat.reference.yml: needs update
make: *** [check] Error 1

@urso
Author

urso commented Mar 29, 2018

Using a contextual logger wouldn't help in that case?

All log statements are from the same thread + only one thread exists. As the state machines can return to the very same state, the spaces help a little in figuring out which logs are part of the current execution of this state. With context-based logging, I'd have to create an index and prepare a tmp logger for every step.

Also note, the package provides its own local logger, not relying on logp, which pretty much leaks the zap (sugar) implementation. The state machine is in the fast path, being executed on every single event. I'd rather not allocate another temporary context per event.

@ph
Contributor

ph commented Mar 29, 2018

Also note, the package provides it's local logger, not relying on logp, which pretty much leaks the implementation of zap (sugar) implementation. The state machine is in the fast path, being executed on every single event. I'd rather prefer not to allocate another temporary context per event.

Makes sense; anyway, it's more for debugging so I think it's OK in that case.

Contributor

@ph ph left a comment

Code looks good with the changes, will do a bit of testing.


case codecCBORL:
visitor = cborl.NewVisitor(&e.buf)
case codecUBJSON:
visitor = ubjson.NewVisitor(&e.buf)
Contributor

Thanks for clarifying that!

cfgPerm := settings.Mode.Perm()

// check if file has permissions set, that must not be set via config
if (perm | cfgPerm) != cfgPerm {
Contributor

We might want to use common.IsStrictPerms() to skip that?

@ph
Contributor

ph commented Mar 29, 2018

make check fails with an asciidoc from packetbeat, are you missing a commit?

@ph
Contributor

ph commented Mar 29, 2018

@urso I did some testing on the spool with the following scenario:

Scenario:

  • Filebeat
  • /tmp/ph.log 10000 events. (172KB)
  • Output only elasticsearch configured
  • Output Elasticsearch is down.
  • Started with debug -d "*"
  • Removed the registry and spool.dat between runs

Memory queue

Start Filebeat with the default memory queue; FB should block because ES is down, but you should see that we are trying to connect to it. You should also see a few publish statements with dumps of the events.

2018-03-29T16:29:45.149-0400	DEBUG	[elasticsearch]	elasticsearch/client.go:666	ES Ping(url=http://localhost:9200)
2018-03-29T16:29:45.150-0400	DEBUG	[elasticsearch]	elasticsearch/client.go:670	Ping request failed with: Get http://localhost:9200: dial tcp 127.0.0.1:9200: getsockopt: connection refused

Spool

Now enable spooling to disk with the default settings and start FB: you see one event in the logs but no connection error from ES.

2018-03-29T16:40:24.407-0400	DEBUG	[publish]	pipeline/processor.go:277	Publish event: {
  "@timestamp": "2018-03-29T20:40:24.406Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "7.0.0-alpha1"
  },
  "source": "/tmp/ph.log",
  "offset": 14,
  "prospector": {
    "type": "log"
  },
  "input": {
    "type": "log"
  },
  "beat": {
    "name": "sashimi",
    "hostname": "sashimi",
    "version": "7.0.0-alpha1"
  },
  "message": "0 hello world"
}
2018-03-29T16:40:24.407-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 0, After: 0, Pending: 0
2018-03-29T16:40:24.407-0400	DEBUG	[registrar]	registrar/registrar.go:286	Registry file updated. 1 states written.
2018-03-29T16:40:34.407-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:34.407-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:34.407-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 0, After: 0, Pending: 0
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:362	Check file for harvesting: /tmp/ph.log
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:448	Update existing file for harvesting: /tmp/ph.log, offset: 14
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:500	Harvester for file is still running: /tmp/ph.log
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 1, After: 1, Pending: 0
2018-03-29T16:40:34.407-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 0, After: 0, Pending: 0
2018-03-29T16:40:44.409-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:44.409-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:44.409-0400	DEBUG	[input]	input/input.go:124	Run input
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 0, After: 0, Pending: 0
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:147	Start next scan
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:362	Check file for harvesting: /tmp/ph.log
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:448	Update existing file for harvesting: /tmp/ph.log, offset: 14
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:500	Harvester for file is still running: /tmp/ph.log
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 1, After: 1, Pending: 0
2018-03-29T16:40:44.409-0400	DEBUG	[input]	log/input.go:168	input states cleaned up. Before: 0, After: 0, Pending: 0

Starting Elasticsearch afterwards doesn't change the situation.

@ph
Contributor

ph commented Mar 29, 2018

As discussed in Slack @urso, I've run the stress tests and they are all green on my machine.

PASS
ok  	github.com/elastic/beats/libbeat/publisher/pipeline/stress	101.419s

@urso
Author

urso commented Apr 2, 2018

@ph Fixed filebeat hanging. The fix with details is in this commit, plus a safety-net commit in case the user sets flush events to -1.

@urso urso changed the title [WIP] Introduce spooling to disk Introduce spooling to disk Apr 3, 2018
@urso urso added review and removed in progress Pull request is currently in progress. labels Apr 3, 2018
@ph
Contributor

ph commented Apr 3, 2018

@urso Your last commit fixed the situation; the beats now recover and send events to Elasticsearch.

I have done a bit more testing and found another issue: this PR breaks the update to disk of the registry file.

Scenario

Environment

  • Elasticsearch is stopped.
  • Spool default settings

Test 1

  1. Start Filebeat
  2. Read a 10000 events file
  3. Let FB run for 3 minutes
  4. Offset is accurate in the log
2018-04-03T10:01:31.863-0400    DEBUG   [input] log/input.go:448        Update existing file for harvesting: /tmp/ph.log, offset: 168890
2018-04-03T10:01:31.864-0400    DEBUG   [input] log/input.go:500        Harvester for file is still running: /tmp/ph.log
  5. Offset is not in sync in the registry on disk?
[{"source":"/tmp/ph.log","offset":0,"timestamp":"2018-04-03T10:01:01.859088-04:00","ttl":-1,"type":"log","FileStateOS":{"inode":4311757072,"device":16777220}}]
  6. Start Elasticsearch
  7. FB recovers and sends the 10 000 events to ES http://localhost:9200/filebeat-7.0.0-alpha1-2018.04.03/_count
  8. Offset is not updated in the registry on disk.
  9. Offset is OK in memory.
  10. Restarting FB rereads all the events.
  11. Registry on disk is not updated.

@ph
Contributor

ph commented Apr 3, 2018

I did a few tests concerning ES being unreachable and recovering in the middle of a read; I didn't see any problems there.

@ph
Contributor

ph commented Apr 3, 2018

Found an issue with the Logstash output which might be linked to the issue with the registry.

Scenario

Spooling to disk is off.

  • Use the Logstash output without SSL, default settings (pipelining: 2)
  • FB reads a 500 000 events file. (EOF is reached)
  • Check Elasticsearch: 500 000 events are present.

Spooling to disk is on.

  • Use the Logstash output without SSL, default settings (pipelining: 2)
  • FB reads a 500 000 events file. (EOF is reached)
  • Check Elasticsearch: only ~31K events are present.
  • No more events are sent

@ph
Contributor

ph commented Apr 3, 2018

Same issue with Redis; I would expect Kafka to have the same problem. Goroutine dump at https://gist.github.com/ph/410ccdbe48cd00cd42603a0ec47cdc6a

@urso
Author

urso commented Apr 3, 2018

Checking the stack trace + the filebeat offset update problems: the queue was missing a signal to the publisher pipeline that the file had been flushed. Signaling is done via channels. After a few flushes the signalling channels filled up and the producers' ACK handling deadlocked.
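
A simplified illustration of this class of bug (not the broker's actual code): flush notifications go over a small buffered channel, and if the consumer side never drains it, the sender blocks as soon as the buffer fills up and the ACK path stalls:

package main

import (
	"fmt"
	"time"
)

func main() {
	flushed := make(chan int, 2) // small signal buffer

	go func() {
		for i := 1; ; i++ {
			flushed <- i // blocks forever once the buffer is full and nobody reads
			fmt.Println("sent flush signal", i)
		}
	}()

	time.Sleep(100 * time.Millisecond) // nobody drains `flushed`
	fmt.Println("pending signals:", len(flushed), "- sender is now blocked")
}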

@ph
Contributor

ph commented Apr 3, 2018

@urso More testing of Logstash with recovering and killing the instance. It worked; I had some duplicates, which are expected, but didn't lose events.

Can we clean up this log statement a bit? I think we should have a space between collected and 2048.

2018-04-03T15:56:47.446-0400    DEBUG   [spool] spool/outbroker.go:239    outbroker (stateActive): events collected2048 2048 <nil>

Also, the registry is now correctly updated, and the logs show it correctly.

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉. We have one failure in the stress tests; I presume it's a timing issue on Travis?

@urso
Author

urso commented Apr 3, 2018

Couldn't reproduce the stress tests failing. I'm still investigating. CI should be at least somewhat stable before merging.

@elastic elastic deleted a comment from houndci-bot Apr 4, 2018
@ruflin
Member

ruflin commented Apr 4, 2018

I think this small change is worth a changelog entry ;-)

@ph
Contributor

ph commented Apr 4, 2018

Amazing work @urso! Waiting on Jenkins to go green.

urso added 4 commits April 5, 2018 18:26
- Improve stability by increasing test duration, timeouts and watchdog
  timer
- Add test start/stop messages if run with `-v`. Help with travis timing
  out tests with 10min without any output
- Add all active go-routine stack traces to errors
@ph
Contributor

ph commented Apr 5, 2018

One of the metricbeat test suites failed on Travis, related to a MySQL test: it just took too much time to execute and the job was killed by their watchdog. It's the first time I've seen this error; I believe it's a red herring and I've restarted the job.

@ph ph merged commit 8ffb220 into elastic:master Apr 5, 2018
@@ -1223,6 +1287,18 @@
"revision": "8e703b9968693c15f25cabb6ba8be4370cf431d0",
"revisionTime": "2016-08-17T18:24:57Z"
},
{
"checksumSHA1": "H7tCgNt2ajKK4FBJIDNlevu9MAc=",
"path": "github.com/urso/go-bin",
Member

@urso Could we move this and the dependency below to the elastic org?

@andrewkroh andrewkroh mentioned this pull request Apr 12, 2018
9 tasks
@urso urso mentioned this pull request Apr 13, 2018
39 tasks
@urso urso deleted the feature/queue-spool branch February 19, 2019 18:54
@urso urso removed the needs_docs label Nov 15, 2019