Spooling to disk GA #6859

Closed
3 of 39 tasks
urso opened this issue Apr 13, 2018 · 6 comments
Assignees
Labels
ext-goal External goal of an iteration libbeat meta Team:Integrations Label for the Integrations team Team:Services (Deprecated) Label for the former Integrations-Services team v7.14.0

Comments


urso commented Apr 13, 2018

Add spooling to disk to Beats. Spooling all events to disk is useful if the output is blocked or not fast enough to deal with bursts of events. With spooling to disk available, Metricbeat modules will not be blocked, and Filebeat has a way of copying events from very fast rotating log files.

Requirements:

  • Consistency and ability to recover on failures/crash
  • Configurable limit on the queue size in disk space usage. Block if the queue is full.
  • Async Producer ACK signal
    • The Beats pipeline requires an async ACK of the last N events on flush. This gives Filebeat the chance to update the registry file once events have been flushed to the spool file.
  • Async Consumer ACKing
    • Events must only be removed from the queue after the async ACK signal from the output has been received. Allow for resends between restarts if dequeued events have not been ACKed yet.
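
The blocking and consumer-ACK requirements above can be illustrated with a minimal in-memory sketch. It is not the spool implementation: `AckQueue`, `Event`, and every method name here are hypothetical, nothing is persisted to disk, and the producer-side ACK is left out. It only shows that reading does not remove events, that only an ACK frees space for blocked producers, and that un-ACKed events are handed out again after a (simulated) restart.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Event is a stand-in for a Beats event (hypothetical type for this sketch).
type Event string

// AckQueue is a hypothetical bounded queue. events[:inflight] have been read
// by the consumer but not ACKed yet; events[inflight:] are still unread.
type AckQueue struct {
	mu       sync.Mutex
	notFull  *sync.Cond
	capacity int
	events   []Event
	inflight int
}

func NewAckQueue(capacity int) *AckQueue {
	q := &AckQueue{capacity: capacity}
	q.notFull = sync.NewCond(&q.mu)
	return q
}

// Publish blocks the producer while the queue is full.
func (q *AckQueue) Publish(e Event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.events) >= q.capacity {
		q.notFull.Wait()
	}
	q.events = append(q.events, e)
}

// Read returns up to n unread events and marks them in flight.
// The events stay in the queue until they are ACKed.
func (q *AckQueue) Read(n int) []Event {
	q.mu.Lock()
	defer q.mu.Unlock()
	unread := q.events[q.inflight:]
	if n < 0 {
		n = 0
	}
	if n > len(unread) {
		n = len(unread)
	}
	batch := append([]Event(nil), unread[:n]...)
	q.inflight += n
	return batch
}

// Ack removes the n oldest in-flight events and wakes blocked producers.
func (q *AckQueue) Ack(n int) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n < 0 || n > q.inflight {
		return errors.New("ack count exceeds in-flight events")
	}
	q.events = q.events[n:]
	q.inflight -= n
	q.notFull.Broadcast()
	return nil
}

// Restart simulates a process restart: events that were read but never
// ACKed become readable again, so they can be resent.
func (q *AckQueue) Restart() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inflight = 0
}

// Len reports how many events the queue still holds (read or unread).
func (q *AckQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.events)
}

func main() {
	q := NewAckQueue(2)
	q.Publish("a")
	q.Publish("b")

	fmt.Println(q.Read(2), q.Len()) // both events read, both still stored
	q.Restart()                     // output never ACKed before the restart
	fmt.Println(q.Read(2))          // the same events are delivered again
	_ = q.Ack(2)
	fmt.Println(q.Len()) // space is reclaimed only after the ACK
}
```

A disk-backed queue would additionally have to make the read and ACK positions durable, which is where the consistency and crash-recovery requirement comes in.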

Tasks:

  • Add spooling to disk feature: Introduce spooling to disk #6581
  • Add documentation: Add file spool to queue docs #6902
  • Fix exported queue monitoring metrics
  • Telemetry on configured queue type
  • Add new IO metrics
  • (optional) Support for Write-Ahead-Log file to reduce number of costly fsync operations: External Write Ahead Log with relaxed guarantees go-txfile#25
  • testing:
    • libbeat end-to-end test for growing/shrinking existing spool files (see TestResizeFile in go-txfile)
    • Separate regular stress testing (related: MacOS X Panic when running test for github.com/elastic/beats/libbeat/publisher/queue/spool #8490). See: libbeat/scripts/cmd/stress_pipeline and libbeat/publisher/pipeline/stress.
    • Check spool file does not break if disk has not enough space to finish a write transaction:
      • Windows NTFS
      • MacOS
      • Ext3
      • Ext4
      • XFS
      • btrfs
    • Improve unit test coverage:
      • Failing IO operations
      • Full queue blocked -> unblock if events are ACKed
      • Shutdown with/without pending events
      • Check ACK signals are sent if buffers are flushed
      • ACK loop correctly combines ACK counts if former ACK IO op failed
      • Flush timeout for producer/consumer part of spool
      • Test support/encoding of timestamps: Spool to disk not working with time.Time fields #10099 (consider a special encoding for timestamps so the Go type can be recovered when parsing)
  • Resilience improvements:
    • Correctly handle go-txfile errors to prevent potential deadlock
    • (optional) Introduce per-event checksum. Without a checksum, parsing might fail anyway
    • Introduce per event page checksum. (Bump queue version + support for reading old/new queue)
    • Optional startup check that queue linking is not broken (all pages are reachable)
    • Try to repair by checking/reusing second-last transaction state
    • Reclaim unreachable pages if no existing on-disk transaction can be recovered.
    • Top-level queue of queues -> allow coexistence of old and new event schema on upgrades + reduce amount of data loss if on-disk structures are broken.
  • Debugging support
    • CLI tool to report file internals/structure/metrics
    • CLI tool to print all events in queue to JSON
    • (optional) Special Beat command/CLI tool to drain spool file to ES/Logstash
  • Reported issues to be investigated:

opsnull commented Aug 17, 2018

Any update?

Contributor

ph commented Aug 21, 2018

@opsnull an initial beta version of spooling to disk was included in the 6.3.0 release.

@monicasarbu monicasarbu changed the title Spooling to disk Spooling to disk GA Feb 28, 2019
@andrewkroh
Member

Regarding "Re-evaluate monitoring metrics", it would be useful to be able to observe the number of events in the queue as well as the age of the oldest item.
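
As a rough illustration of those two metrics, here is a small sketch that tracks the number of pending events and the age of the oldest one. `QueueStats` and its methods are hypothetical names for this example and are not part of libbeat; the sketch also assumes events are ACKed strictly in enqueue order.

```go
package main

import (
	"fmt"
	"time"
)

// demoStart is an arbitrary fixed timestamp so the example is deterministic.
var demoStart = time.Date(2019, 1, 1, 0, 0, 0, 0, time.UTC)

// QueueStats is a hypothetical tracker that keeps one enqueue timestamp
// per pending event, oldest first.
type QueueStats struct {
	enqueuedAt []time.Time
}

// Enqueued records that an event entered the queue at the given time.
func (s *QueueStats) Enqueued(at time.Time) {
	s.enqueuedAt = append(s.enqueuedAt, at)
}

// Acked drops the n oldest pending events (assumes in-order ACKs).
func (s *QueueStats) Acked(n int) {
	if n < 0 {
		n = 0
	}
	if n > len(s.enqueuedAt) {
		n = len(s.enqueuedAt)
	}
	s.enqueuedAt = s.enqueuedAt[n:]
}

// Count is the number of events currently in the queue.
func (s *QueueStats) Count() int {
	return len(s.enqueuedAt)
}

// OldestAge is how long the oldest pending event has been waiting,
// or zero if the queue is empty.
func (s *QueueStats) OldestAge(now time.Time) time.Duration {
	if len(s.enqueuedAt) == 0 {
		return 0
	}
	return now.Sub(s.enqueuedAt[0])
}

func main() {
	var s QueueStats
	s.Enqueued(demoStart)
	s.Enqueued(demoStart.Add(10 * time.Second))

	now := demoStart.Add(30 * time.Second)
	fmt.Println(s.Count(), s.OldestAge(now)) // 2 30s

	s.Acked(1)
	fmt.Println(s.Count(), s.OldestAge(now)) // 1 20s
}
```

For a disk-backed queue, keeping a timestamp per event in memory may not be practical; storing the enqueue time with each event or page on disk would be one alternative.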

@urso urso assigned urso, ph and kvch and unassigned ph, urso and kvch Nov 17, 2019
@urso urso removed the enhancement label Dec 10, 2019
Contributor

ph commented Dec 11, 2019

Looking at the updated list and the "Check spool file does not break if disk has not enough space to finish a write transaction" task: is that list prioritized? I'm not sure how common btrfs is in production. Red Hat is dropping support for it: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html. But I believe btrfs is the default on SUSE.

Contributor

ph commented Dec 11, 2019

As a dev, I think having CLI tooling to inspect and recover the PQ is really important, and we should probably do it first. Being able to extract the actual data outside of Beats would be beneficial and would give better confidence.

@andresrc andresrc added Team:Services (Deprecated) Label for the former Integrations-Services team [zube]: Inbox and removed [zube]: Meta labels Jan 27, 2020
@andresrc andresrc unassigned kvch and ph Mar 3, 2020
@andresrc andresrc added Team:Integrations Label for the Integrations team and removed Team:Beats labels Mar 6, 2020
@andresrc andresrc added the ext-goal External goal of an iteration label Jul 31, 2020
@faec faec mentioned this issue Nov 16, 2020
9 tasks
@ph ph added the v7.14.0 label Apr 28, 2021
Collaborator

jlind23 commented Mar 31, 2022

Closing this for now, as it will be done through the shipper work: elastic/elastic-agent-shipper#7
cc @cmacknz
