fix agent silent exit upon pipelines reloading #10346
Conversation
I know this is a WIP and you aren't expecting a full code review just yet, but I have some early feedback on the implementations of `PipelineState` and `PipelinesRegistry`.
```ruby
      success = create_block.call
      state.set_terminated(!success)
    else
      logger.error("Attempted to create a pipeline that already exists", :pipeline_id => pipeline_id)
```
I'm not confident from this code that this error message is accurate; from my reading, to hit this we have to be attempting to create a pipeline that (a) already exists, and (b) was previously in a terminated state.
This is the case where you are trying to create a new pipeline with an id that is already in the registry, but the pipeline itself is not in a terminated state, so we refuse to create it.
Let me know if this makes sense.
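For readers following the thread, here is a condensed Ruby sketch of how this branch could fit into the `compute` block, pieced together from the diff fragments in this review; the `PipelineState` constructor and the surrounding structure are simplified stand-ins, not the verbatim PR code:

```ruby
# Simplified sketch -- not the verbatim implementation.
@states.compute(pipeline_id) do |_, state|
  if state.nil?
    # Unknown pipeline id: create the pipeline and register a fresh state.
    success = create_block.call
    state = PipelineState.new(pipeline_id, pipeline) # hypothetical constructor
    state.set_terminated(!success)
  elsif state.terminated?
    # Known id whose previous pipeline has terminated: safe to re-create.
    success = create_block.call
    state.set_terminated(!success)
  else
    # Known id whose pipeline is still running (or reloading): refuse to
    # create a duplicate -- this is the branch the error message covers.
    logger.error("Attempted to create a pipeline that already exists",
                 :pipeline_id => pipeline_id)
  end
  state # the block's return value becomes the map entry for pipeline_id
end
```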
@yaauie thanks for the early feedback, all good, will definitely follow up. For now I am still at the stage of making this work correctly and making sure all tests are passing; then I will definitely proceed to refactor some of this per your suggestions.
@yaauie It's up to you if you want to wait for these items to be completed before making another review pass.
There is ONE last nasty failing test in the x-pack metrics and I am pretty sure it's a test timing issue, but I can't put my finger on the problem as I can't reproduce it locally. A lot of timing-dependent/brittle tests have been fixed in this PR, making them a lot more resilient to timing problems, but this one is tenacious.
I am now able to reproduce the x-pack test failure(s). Fix incoming.
Force-pushed from 8ec88b0 to dccbf77.
Green! ✅
Updated description to list changes.
I think this is definitely on the right track, especially considering the remaining checkboxes in your progress issue comment. I have left a variety of comments inline.
```ruby
def create_pipeline(pipeline_id, pipeline, &create_block)
  success = false

  @states.compute(pipeline_id) do |_, state|
```
There is a lot of complexity here to facilitate the creation and/or re-use of the `PipelineState` object; can it be made simpler? Can we pull any of this complexity into `PipelineState` itself? Is the `PipelineState` object even worth sharing?
Well, we're leveraging the `ConcurrentHashMap` concurrency features, and in particular the `compute` method semantics, which allow us to avoid using a global lock; that was the original purpose of using a `ConcurrentHashMap`. In all fairness, the only place the state object is leaked outside the `PipelinesRegistry` is in the `reload` method, and I think we could avoid that; I'll check. Other than that I think it is relatively straightforward. I'll go ahead and see if we can avoid leaking the state object and will also add comments to explain the logic. I did add a comment in the `PipelinesRegistry#initialize` method about the `compute` method usage.
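To make the `compute` semantics concrete, here is a minimal standalone JRuby sketch (the key and values are illustrative only): the block for a given key runs atomically, so per-pipeline-id mutual exclusion comes from the map itself instead of a global lock.

```ruby
require "java"

java_import java.util.concurrent.ConcurrentHashMap

states = ConcurrentHashMap.new

# compute runs the block atomically for this key: concurrent
# create/reload/terminate attempts on the same pipeline id are serialized,
# while operations on different ids proceed in parallel.
states.compute("main") do |_id, state|
  # Returning a value stores it; returning nil would remove the entry.
  state.nil? ? "fresh state" : state
end
```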
Thanks a lot @yaauie for the review pass - all really good comments. I will submit the discussed changes and create a follow-up issue for what I think we could defer to new PR(s). Let me know if you are good with that.
I'm okay with all of the above-mentioned deferrals to other PRs for cleanup, and think that this PR is a good balance between solving the problem at hand and not going too far into the weeds. You had one outstanding comment about trying to eliminate leaking the state, and I trust you to either resolve or defer that. 👍
This time I am pretty sure I nailed the metrics specs timing issues for good 🤞
I will not follow up on these remaining tasks for now:

Also, I tried writing a spec to "assess all `Agent#execute` exit conditions" but was not successful; the conditions to make that happen are super hard to reproduce in specs, and the testing for non-exit or for exit of

So at this point I would be ready to move forward with the PR + creating the discussed follow-up issues.
A complementary note about not removing the states upon pipeline termination: the only visible user-facing impact will be in the INFO log line where these non-running pipelines are reported. Personally I think this is actually valuable information, and the change is non-functional since it's only an INFO log line change and will not break BWC. Maybe @jsvd you want to chime in on this?
LGTM 👍🏼
Thanks @yaauie for the review.
Force-pushed from a72a033 to 9b0e3bd.
Unrelated.
Intermittent and timing-related x-pack Metrics spec failure which does not reproduce locally; will open a follow-up issue for this. Should not prevent merging. #10371
[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading
[6.5 clean backport of #10346] fix agent silent exit upon pipelines reloading
Fixes #10345
This PR introduces a new `PipelinesRegistry` class to encapsulate the old `Agent` `@pipelines` `ConcurrentHashMap`.

One of the problems was the termination condition of the agent, which checked that all pipeline threads were not `Thread#alive?` and assumed it could then exit. The problem is that the reload action shuts down a pipeline and only later starts its replacement, so during that window all pipeline threads could be seen as not `Thread#alive?`. This PR solves this by treating the reloading sequence as a period during which a pipeline is not considered dead, regardless of whether the shutdown part of the reload has occurred or not.
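As an illustration of the fix, here is a minimal Ruby sketch of the idea; the `loading` flag and the class shape are simplified stand-ins, not the actual implementation:

```ruby
# Simplified sketch -- illustrative names, not the actual Logstash code.
class PipelineState
  def initialize(pipeline)
    @pipeline = pipeline
    @loading  = false      # true while a reload is swapping pipelines
    @mutex    = Mutex.new
  end

  def set_loading(flag)
    @mutex.synchronize { @loading = flag }
  end

  def terminated?
    @mutex.synchronize do
      # A pipeline that is mid-reload is never reported as terminated, even
      # though its thread may be momentarily not alive between shutdown of
      # the old pipeline and startup of the new one.
      @loading ? false : !@pipeline.thread.alive?
    end
  end
end

# The agent's exit condition then becomes "every registered state is
# terminated", so a reloading pipeline keeps the agent running:
#   can_exit = states.values.all?(&:terminated?)
```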
Main Changes

- `PipelinesRegistry` is an abstraction for the registration and management of the pipeline states; it replaces the old `@pipelines` instance variable, which had a getter and was directly used in other parts of the code
- Added a new `@finished_run` `AtomicBoolean`, distinct from `@finished_execution`, and changed `@finished_execution` to always be true regardless of pipeline termination condition (see the sketch at the end of this description for the roles of `@finished_run` and `@finished_execution`)
- Changed `Pipeline#wait_until_started` to use `@finished_run` and document logic
- Refactored `create_pipeline`, `reload_pipeline` and `terminate_pipeline`
- Changed all code that used `@pipelines` to use the new registry collection methods

Other Changes

- Replaced the `generator` plugin in specs with a new `dummyblockinginput`. While fixing tests, the generator inputs were making it almost impossible to follow the debug traces of the build because they were outputting gigamounts of debug logs, one per emitted event. Also, the generator input is a waste of resources and probably slows down the tests too.
- Added a `HookRegistry#remove_hooks` method to the metrics input to be able to clean up the global and static `LogStash::PLUGIN_REGISTRY` when finished; this was causing weird problems in the metrics specs, where the `pipeline_started` action was being fired on unused instances of the plugin when instantiated multiple times in the specs.
- Reworked `x-pack/spec/monitoring/inputs/metrics_spec.rb` to be more resilient to timing-related errors (specs were always passing locally but not on slower Jenkins executions).
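As referenced above, here is a minimal sketch of how the two flags could interact with `wait_until_started`; `start_workers`, `running?`, `thread` and the overall shape are illustrative assumptions, not the actual `Pipeline` code:

```ruby
require "concurrent"

# Simplified sketch -- illustrative, not the actual Pipeline implementation.
class Pipeline
  def initialize
    # True once the pipeline has fully finished executing, whatever the
    # termination condition (normal stop, error, ...).
    @finished_execution = Concurrent::AtomicBoolean.new(false)
    # True only when the run loop completed; lets wait_until_started tell
    # "ran to completion before we looked" apart from "failed to start".
    @finished_run = Concurrent::AtomicBoolean.new(false)
  end

  def run
    start_workers
    @finished_run.make_true
  ensure
    @finished_execution.make_true
  end

  def wait_until_started
    loop do
      # A pipeline that already ran to completion (e.g. a fast input that
      # exits immediately) counts as a successful start.
      return true if @finished_run.true?
      return true if running?
      # The thread died without finishing its run: the start failed.
      return false unless thread.alive?
      sleep(0.01)
    end
  end
end
```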