Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_unix_stream_syslog fails at CI #2604

Closed
MOZGIII opened this issue May 14, 2020 · 10 comments · Fixed by #2618
Closed

test_unix_stream_syslog fails at CI #2604

MOZGIII opened this issue May 14, 2020 · 10 comments · Fixed by #2618
Assignees
Labels
domain: topology Anything related to Vector's topology code type: bug A code related bug.

Comments

@MOZGIII
Copy link
Contributor

MOZGIII commented May 14, 2020

The test_unix_stream_syslog occasionally fails at CI: https://github.com/timberio/vector/pull/2579/checks?check_run_id=674631518. This is from my docker-enabling branch, but there's nothing that suggests we should let this pass. On the contrary - this new setup is great because it uncovers this previously hidden problem.

The failing assertion is this one: https://github.com/timberio/vector/blob/0a0f8d5b6478237d7402ca9e356c9ff4ea6f8497/tests/syslog.rs#L186

This is a rather odd issue. How I see it, the problem is that when the topology stops, the messages in the inner buffers aren't released properly. Isn't the new two-phase shutdown system supposed to help solve this issue?

There are other tests that fail in similar conditions - and the way we solved that was by adding a timeout, effectively muting the problem. But I think there's a real issue somewhere there, and we get to the bottom of it.

Related:

@binarylogic
Copy link
Contributor

@ktff would you mind taking a look? Assuming it is, in fact, related to the shut down changes.

@binarylogic binarylogic added domain: topology Anything related to Vector's topology code type: bug A code related bug. labels May 14, 2020
@MOZGIII
Copy link
Contributor Author

MOZGIII commented May 14, 2020

It may be that the recent shutdown changes are supposed to help with thins, but they're not enough and some additional work has to be done at each source/sink - what would explain this. Or maybe the recent changes to the shutdown code don't address this problem. It's also possible that the issue is due to overlooks in how sinks/sources are built - for instance, I stumbled upon the code at the file source that seems to be added explicitly to correctly handle the shutdown races:

https://github.com/timberio/vector/blob/1bfecc66814800cb32fd7175d2626b9fcd6b52bf/src/sources/file/mod.rs#L284-L301

But this seems like non-trivial stuff, and some sources might have skipped a similar thing when it was needed.

@ktff
Copy link
Contributor

ktff commented May 15, 2020

It's a simpler issue. There are buffers/frame of bytes we received but haven't yet decoded in the source. In the unix stream case, it's a batch of bytes we received from the OS which will be seen as delivered to the senders side, but we haven't yet decoded them while the shutdown process only takes care of delivering decoded events.

I completely missed that detail while wiring ShutdownSignal.

Sources that have this buffer:

  • vector
    • Frame in FramedRead.
  • syslog
    • Frames in UdpFramed, FrameRead.
  • socket
    • Frames in UdpFramed, FrameRead.
  • stdin
    • Buffer in stdin channel.
  • journald
    • Buffer in BufReader.
  • statsd
    • Frame in UdpFramed.

Sources that maybe have this buffer deep in some library:

  • kafka
  • docker

Fix

Modifying UdpFramed and FrameRead to expose state of their buffers would fix this for most of the sources. Implementing this feature directly in tokio would be the best way. Alternatively we could implement UdpFramed and FrameRead our self or find library with it.

BufReader exposes state of their buffer, we should just use that.

Then all that remains is stdin channel, for which I' don't think we can do anything about it.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented May 15, 2020

I wonder if the workaround I added to the buffering tests is relevant:
https://github.com/timberio/vector/blob/1eaee144b7be0da821c172486f3079f8e5e7c5a4/tests/buffering.rs#L112-L117

@ktff
Copy link
Contributor

ktff commented May 15, 2020

Yes, it's probably the same issue.

@ktff
Copy link
Contributor

ktff commented May 16, 2020

For sources in udp mode this doesn't need to be changed since it doesn't have any delivery guarantee.

Also since there are buffers/log files in docker and journald outside of the sources, journald and docker sources don't need to be changed.

With that only sources with FramedRead need to be fixed.

@ktff
Copy link
Contributor

ktff commented May 16, 2020

kafka source doesn't have this problem.

@ktff
Copy link
Contributor

ktff commented May 16, 2020

I wonder if the workaround I added to the buffering tests is relevant:

@MOZGIII The problem in that case is probably that the channel used for sending to the mock source has a buffer of 1 message. If that's true then the tests should only fail with difference of 1 event.

@lukesteensen
Copy link
Member

Modifying UdpFramed and FrameRead to expose state of their buffers would fix this for most of the sources. Implementing this feature directly in tokio would be the best way. Alternatively we could implement UdpFramed and FrameRead our self or find library with it.

From what I can tell, the tokio Framed abstraction does have some mechanism for dealing with this situation:

https://github.com/tokio-rs/tokio/blob/9b6744cc8eca65719b92d22932a47eb47387ce00/tokio-util/src/codec/framed_impl.rs#L124-L133

This makes me think that we should try to make sure the resource wrapped with Framed returns EOF properly, so that we can just drain the resulting stream.

@MOZGIII
Copy link
Contributor Author

MOZGIII commented May 21, 2020

I caught this error locally too, just wanted to mention it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: topology Anything related to Vector's topology code type: bug A code related bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants