Vector locks up in some pipeline configurations. #11633
To add on to this, what we're seeing here may not be isolated to this configuration but simply easy to reproduce with it. We have started to see soak runs fail because of NaN exceptions. If you pull the underlying capture data, what's evident is that Vector will accept load briefly and then soft-lock. For instance, this soak run fails, but when pulling the underlying capture files for fluent-elasticsearch you see that Vector ran load the whole time. On the other hand, you can find samples that look quite different -- http-to-http-noacks from the same linked run, for instance. All of this is to say there seem to be related problems with other configs, but it's not clear whether the problem identified above is unique to this config or is a symptom of a general problem.
This commit adds an assert to fail the analysis if a variant has no samples, as opposed to kicking out an inexplicable exception. This probably relates to issue #11633 and appears to impact http-to-http-noacks most frequently. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
We have started to see -- REF #11711 -- that this soak in particular has "lockup" behavior in the soak rig, see #11633. Its presence in the soaks is causing users to re-run the soaks multiple times until a run happens to go through without this soak locking up, which is undesirable. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
As part of this issue, we should also re-enable the soak that was disabled.
Ah, yes, thank you for calling that out. Totally spaced on doing so.
To share more details in this ongoing investigation, we've arrived at the following.

This was tested on 9aabd36, a recent edition of Vector. Run the following commands:

```shell
VECTOR_THREADS=8 firejail --noprofile --cpu=2,3,4,5,6,7,8,9 target/release/vector -c .tmp/pipeline-lock-up/http-to-blackhole.toml
firejail --noprofile --cpu=1 target/release/http_gen --config-path .tmp/pipelines-lock-up/http-to-http-noack.yaml
```

With Lading generating very large payloads in this setup, Vector locks up almost immediately (evident in the internal metrics, or in the info logs showing the events the blackhole has received).

Analysis

We've likely ruled out a few potential issues:
So far, the lock up seems like a race condition/deadlock problem. The current candidate for investigation is LimitedSender/LimitedReceiver, which leverage semaphores to coordinate work, though it's still unclear why permits might not be freed (a sketch of that general shape is at the end of this comment). If you compile the project with
And run until a lock up occurs, you'll see lines in
One hypothesis from @blt is that the underlying Note
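To make the LimitedSender/LimitedReceiver suspicion above concrete, here is a minimal sketch of a semaphore-bounded channel in the same general spirit. This is illustrative only -- the types, fields, and method names are assumptions for the example, not Vector's actual implementation -- but it shows the property under investigation: capacity is a pool of semaphore permits, senders take permits, and the receiver is responsible for handing them back.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Hypothetical bounded channel: capacity is tracked by a semaphore rather
// than by the queue itself. Senders take a permit per item; the receiver
// returns a permit when it consumes an item.
struct BoundedSender<T> {
    permits: Arc<Semaphore>,
    inner: mpsc::UnboundedSender<T>,
}

struct BoundedReceiver<T> {
    permits: Arc<Semaphore>,
    inner: mpsc::UnboundedReceiver<T>,
}

fn bounded<T>(limit: usize) -> (BoundedSender<T>, BoundedReceiver<T>) {
    let permits = Arc::new(Semaphore::new(limit));
    let (tx, rx) = mpsc::unbounded_channel();
    (
        BoundedSender { permits: Arc::clone(&permits), inner: tx },
        BoundedReceiver { permits, inner: rx },
    )
}

impl<T> BoundedSender<T> {
    async fn send(&self, item: T) -> Result<(), mpsc::error::SendError<T>> {
        // Wait for capacity. If permits are never returned, this await
        // never completes and the upstream topology backs off -- which is
        // what a soft-lock looks like from the outside.
        let permit = self.permits.acquire().await.expect("semaphore closed");
        // Keep the capacity consumed until the receiver gives it back.
        permit.forget();
        self.inner.send(item)
    }
}

impl<T> BoundedReceiver<T> {
    async fn recv(&mut self) -> Option<T> {
        let item = self.inner.recv().await?;
        // Return capacity to the senders. Any path that consumes an item
        // but skips this step (error branch, early drop, shutdown) leaks a
        // permit; leak enough of them and every sender parks forever.
        self.permits.add_permits(1);
        Some(item)
    }
}
```

If Vector's real channel follows a shape like this, the lock up would come down to finding a code path where items (and their permits) are consumed or dropped without the capacity ever being returned, or where waiting senders are never woken once it is.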
I went for a walk and thought about this some more. I think it's possible that the waker notification will arrive before data is available in the underlying queue, but I'm not sure that this would cost anything more than additional polls. So, more latency but not a full lock up.
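To illustrate that reasoning, here is a generic hand-rolled future over a shared queue (an assumption-laden sketch, not Vector's or tokio's actual code): as long as an unproductive poll re-registers the current waker and producers wake the consumer after making data visible, a "too early" wakeup only costs an extra poll. Turning extra latency into a genuine hang would require a waker to be lost, or a wake to never be issued after data becomes available.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

struct Shared<T> {
    queue: VecDeque<T>,
    waker: Option<Waker>,
}

// Future that resolves once the shared queue holds an item.
struct Recv<T>(Arc<Mutex<Shared<T>>>);

impl<T> Future for Recv<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut shared = self.0.lock().unwrap();
        if let Some(item) = shared.queue.pop_front() {
            return Poll::Ready(item);
        }
        // Woken with nothing to do (early or spurious wakeup): store the
        // *current* waker and report Pending. The cost is one wasted poll;
        // a later push can still wake us, so this alone cannot hang.
        shared.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

fn push<T>(shared: &Arc<Mutex<Shared<T>>>, item: T) {
    let mut guard = shared.lock().unwrap();
    guard.queue.push_back(item);
    // Wake only after the item is visible in the queue. A producer that
    // woke the consumer too early and then never woke it again once data
    // actually arrived is the kind of ordering bug that would turn extra
    // latency into a full lock up.
    if let Some(waker) = guard.waker.take() {
        waker.wake();
    }
}
```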
Oh, it might also be interesting to see if you can replicate the problem with #11732.
A note for the community
Problem
As a part of my work on #10144 I've recently started to notice that Vector -- see config below -- reliably locks up under load. When I say "lock" I mean that the load generation from lading halts as Vector signals back-off, and the blackhole sink no longer receives from upstream in the topology. Something important somewhere in Vector fills up and the program halts work.
Using the configuration files below you may reproduce this like so:
From the lading root:
Vector's blackhole will print its received-event totals for a while and then taper off.
Configuration
The referenced bootstrap log file is: