fix agent silent exit upon pipelines reloading #10346

colinsurprenant · 2019-01-23T05:34:10Z

This PR introduces a new PipelinesRegistry class to encapsulate the old Agent @pipelines ConcurrentHashMap.

One of the problems was the agent's termination condition, which checked whether all pipeline threads were no longer Thread#alive? and, if so, assumed it could exit. However, the reload action shuts down a pipeline and later starts a new one, and during that window every pipeline thread could be seen as not Thread#alive?. This PR fixes the condition by treating the reload sequence as a period in which a pipeline is not considered dead, regardless of whether the shutdown part of the reload has occurred yet.
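The race described above can be illustrated with a minimal sketch. The `PipelineState` shape shown here, the `loading` flag, and the `naive_can_exit?` / `agent_can_exit?` helpers are assumptions made for illustration; they are not taken from the PR's code.

```ruby
# Hypothetical sketch: a pipeline state that is never reported as terminated
# while a reload is in progress, even if its old thread has already stopped.
class PipelineState
  def initialize(thread)
    @thread = thread
    @loading = false
    @lock = Mutex.new
  end

  # Marks the state as loading for the duration of the block.
  def reloading
    @lock.synchronize { @loading = true }
    yield self
  ensure
    @lock.synchronize { @loading = false }
  end

  def replace_thread(thread)
    @lock.synchronize { @thread = thread }
  end

  def terminated?
    @lock.synchronize { !@loading && !@thread.alive? }
  end
end

# Naive exit check: looks only at thread liveness.
def naive_can_exit?(threads)
  threads.none?(&:alive?)
end

# Reload-aware exit check: consults the states instead.
def agent_can_exit?(states)
  states.all?(&:terminated?)
end

old_thread = Thread.new { } # finishes immediately, like a pipeline that was shut down
old_thread.join
state = PipelineState.new(old_thread)

seen_during_reload = nil
new_thread = nil
state.reloading do |s|
  # The old pipeline is stopped and the new one has not started yet.
  seen_during_reload = [naive_can_exit?([old_thread]), agent_can_exit?([s])]
  new_thread = Thread.new { sleep }
  s.replace_thread(new_thread)
end
```

In the middle of the reload the naive check would let the agent exit, while the state-based check would not.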

Main Changes

new PipelinesRegistry is an abstraction for the registration and management of the pipeline states; it replaces the old @pipelines instance variable, which had a getter and was used directly in other parts of the code
correctly pass action thread parameters
fix periodic_pollers.rb wrong parameter names
introduce new @finished_run AtomicBoolean
change @finished_execution to always be true regardless of pipeline termination condition
document semantic of both @finished_run and @finished_execution
refactor non-autoreload agent loop
change Pipeline#wait_until_started to use @finished_run and document logic
change pipeline actions to use new registry create_pipeline, reload_pipeline and terminate_pipeline
change usage of @pipelines to use new registry collection methods

Other Changes

change usage of the generator plugin in specs to a new dummyblockinginput. While fixing tests, the generator inputs made it almost impossible to follow the debug traces of the build because they output huge amounts of debug logs, one per emitted event. The generator input is also a waste of resources and probably slows down the tests too.
made changes to make some specs more resilient to timing problems when testing the actions and agent.
added new HookRegistry#remove_hooks method to the metrics input to be able to clean up the global and static LogStash::PLUGIN_REGISTRY when finished - this was causing weird problems in the metrics specs where the pipeline_started action was being fired on unused instances of the plugin when instantiated multiple times in the specs.
refactored x-pack/spec/monitoring/inputs/metrics_spec.rb to be more resilient to timing related errors (specs were always passing locally but not on slower Jenkins execution).
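
The hook cleanup issue mentioned above can be sketched as follows. The class names, the `register_hooks` and `fire` methods, and the `:agent` emitter key are hypothetical and only illustrate why a stale listener keeps receiving events until it is removed; they do not reproduce the Logstash plugin registry API.

```ruby
# Hypothetical sketch of a hook registry with explicit cleanup.
class SketchHooksRegistry
  def initialize
    @hooks = Hash.new { |hash, key| hash[key] = [] }
  end

  def register_hooks(emitter, listener)
    @hooks[emitter] << listener unless @hooks[emitter].include?(listener)
  end

  # Drops a listener so later events no longer reach it.
  def remove_hooks(emitter, listener)
    @hooks[emitter].delete(listener)
  end

  def fire(emitter, event)
    @hooks[emitter].each do |listener|
      listener.public_send(event) if listener.respond_to?(event)
    end
  end
end

# Stand-in for a plugin instance that reacts to pipeline_started.
class SketchMetricsInput
  attr_reader :started_count

  def initialize
    @started_count = 0
  end

  def pipeline_started
    @started_count += 1
  end
end

registry = SketchHooksRegistry.new
stale = SketchMetricsInput.new
fresh = SketchMetricsInput.new

registry.register_hooks(:agent, stale)
registry.remove_hooks(:agent, stale) # cleanup once the first instance is finished
registry.register_hooks(:agent, fresh)
registry.fire(:agent, :pipeline_started)
```

Without the `remove_hooks` call, the unused `stale` instance would also receive `pipeline_started`.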

yaauie

I know this is a WIP and you aren't expecting a full code-review just yet, but I have some early feedback on the implementations of PipelineState and PipelinesRegistry.

logstash-core/lib/logstash/pipelines_registry.rb

yaauie · 2019-01-23T16:31:33Z

logstash-core/lib/logstash/pipelines_registry.rb

+              success = create_block.call
+              state.set_terminated(!success)
+            else
+              logger.error("Attempted to create a pipeline that already exists", :pipeline_id => pipeline_id)


I'm not confident from this code that this error message is accurate; from my reading, to hit this we have to be attempting to create a pipeline that (a) already exists, and (b) was previously in a terminated state.

This is the case where you are trying to create a new pipeline with an id that is already in the registry, but the pipeline itself is not in a terminated state, so we refuse to create it.

let me know if this makes sense
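
A minimal sketch of the cases discussed in this exchange, under stated assumptions: `SketchState` and `SketchRegistry` are hypothetical names, and a Mutex-guarded Hash stands in for the ConcurrentHashMap#compute call that the PR uses. Only `create_pipeline`, `set_terminated`, and the refusal of a live duplicate id come from the thread; whether a failed first creation is kept in the registry is an assumption of this sketch.

```ruby
# Hypothetical sketch of the three cases create_pipeline has to distinguish.
class SketchState
  attr_reader :pipeline

  def initialize(pipeline)
    @pipeline = pipeline
    @terminated = false
  end

  def terminated?
    @terminated
  end

  def set_terminated(value)
    @terminated = value
  end

  def set_pipeline(pipeline)
    @pipeline = pipeline
  end
end

class SketchRegistry
  def initialize
    @states = {}
    @lock = Mutex.new
  end

  # Returns true only if create_block ran and reported success.
  def create_pipeline(pipeline_id, pipeline, &create_block)
    @lock.synchronize do
      state = @states[pipeline_id]
      if state.nil?
        # Case 1: unknown id, register a fresh state.
        state = SketchState.new(pipeline)
        success = create_block.call
        state.set_terminated(!success)
        @states[pipeline_id] = state
        success
      elsif state.terminated?
        # Case 2: known id whose pipeline has terminated, so the id can be reused.
        state.set_pipeline(pipeline)
        success = create_block.call
        state.set_terminated(!success)
        success
      else
        # Case 3: known id whose pipeline is still live, so creation is refused.
        false
      end
    end
  end

  def terminated?(pipeline_id)
    @lock.synchronize { @states.fetch(pipeline_id).terminated? }
  end
end

registry = SketchRegistry.new
first  = registry.create_pipeline(:main, :pipeline_v1) { true } # new id
second = registry.create_pipeline(:main, :pipeline_v2) { true } # refused, still live
```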

logstash-core/lib/logstash/pipelines_registry.rb

colinsurprenant · 2019-01-23T19:21:18Z

@yaauie thanks for the early feedback, all good, will definitely follow up. For now I am still at the stage of making this work correctly and making sure all tests are passing; then I will proceed to refactor some of this per your suggestions.

colinsurprenant · 2019-01-24T23:22:31Z

@yaauie
Tests are actually green at this point - the only failure is the infamous unrelated WebMock::NetConnectNotAllowedError:.

Note that some spec changes might seem unrelated, for example the use of dummyblockinginput instead of the generator; the generator plugin actually generates huge amounts of debug logs and made it super difficult to trace logs in the Jenkins console. Also, using an unbounded generator when event generation is not actually required for the tests is a waste and ultimately slows tests down.
I will follow up with some of your refactor suggestions
I will be pushing a few minor cleanup + better comments
A spec should be added for the new PipelinesRegistry class
Verify if we should remove states upon pipeline stop to avoid memory leaks
Verify if we should make the PipelineState object immutable
add a spec to assess all Agent#execute exit conditions to make sure we don't change behaviours other than for the bugfix

It's up to you if you want to wait for these items to be completed to make another review pass

colinsurprenant · 2019-01-25T17:46:21Z

There is ONE last nasty failing test in xpack metrics and I am pretty sure it's a test timing issue, but I can't put my finger on the problem as I can't reproduce it locally. A lot of timing-dependent/brittle tests have been fixed in this PR, making these tests a lot more resilient to timing problems, but this one is tenacious.

colinsurprenant · 2019-01-25T18:42:25Z

I am now able to reproduce x-pack test failure(s). fix incoming.

colinsurprenant · 2019-01-27T04:53:03Z

Green! ✅

colinsurprenant · 2019-01-28T23:24:33Z

Updated description to list changes.

yaauie

I think this is definitely on the right track, especially considering the remaining checkboxes in your progress issue-comment. I have left a variety of comments in-line.

logstash-core/lib/logstash/pipeline_action/reload.rb

yaauie · 2019-01-28T23:59:49Z

logstash-core/lib/logstash/pipelines_registry.rb

+    def create_pipeline(pipeline_id, pipeline, &create_block)
+      success = false
+
+      @states.compute(pipeline_id) do |_, state|


There is a lot of complexity here to facilitate the creation and/or re-use of the PipelineState object; can it be made simpler? Can we pull any of this complexity into PipelineState itself? Is the PipelineState object even worth sharing?

Well, we're leveraging the ConcurrentHashMap concurrency features, and in particular the compute method semantics, which allow us to avoid using a global lock; that was the original purpose of using a ConcurrentHashMap. In all fairness, the only place the state object is leaked outside the PipelinesRegistry is in the reload method, and I think we could avoid that, I'll check. Other than that I think it is relatively straightforward. I'll go ahead and see if we can avoid leaking the state object and will also add comments to explain the logic. I did add a comment in the PipelinesRegistry#initialize method about the compute method usage.
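
To make the compute semantics concrete, here is a small sketch that runs on any Ruby. `ComputeMap` is a hypothetical stand-in for java.util.concurrent.ConcurrentHashMap#compute: it mimics the atomic read-modify-write per key, and the removal of the mapping when the block returns nil, but it uses a single coarse lock, which is exactly what the Java map avoids.

```ruby
# Hypothetical stand-in for ConcurrentHashMap#compute, for illustration only.
class ComputeMap
  def initialize
    @data = {}
    @lock = Mutex.new
  end

  # Atomically replaces the value for key with the block's result.
  # As with the Java method, a nil result removes the mapping.
  def compute(key)
    @lock.synchronize do
      result = yield(key, @data[key])
      if result.nil?
        @data.delete(key)
      else
        @data[key] = result
      end
      result
    end
  end

  def get(key)
    @lock.synchronize { @data[key] }
  end
end

counts = ComputeMap.new
threads = 8.times.map do
  Thread.new do
    # Each update sees the current value and writes the new one in one step,
    # so no increment is lost to a check-then-act race.
    1000.times { counts.compute(:main) { |_, current| (current || 0) + 1 } }
  end
end
threads.each(&:join)
```

The registry can rely on the same property: the decision to create, reuse, or refuse a state for a given pipeline id is made inside one atomic step.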

logstash-core/lib/logstash/pipelines_registry.rb

logstash-core/lib/logstash/state_resolver.rb

logstash-core/lib/logstash/agent.rb

logstash-core/lib/logstash/instrument/periodic_poller/dlq.rb

logstash-core/lib/logstash/pipeline_action/reload.rb

logstash-core/lib/logstash/pipelines_registry.rb

logstash-core/spec/logstash/pipeline_action/reload_spec.rb

colinsurprenant · 2019-01-29T04:45:45Z

Thanks a lot @yaauie for the review pass - all really good comments, will submit discussed changes and create a followup issue for what I think we could defer to new PR(s). Let me know if you are good with that.

yaauie · 2019-01-29T06:47:27Z

I'm okay with all of the above-mentioned deferrals for other PRs for cleanup, and think that this PR is a good balance between solving the problem at-hand and not going too far into the weeds.

You had one outstanding comment about trying to eliminate leaking the state, and I trust you to either resolve or defer that.

👍

colinsurprenant · 2019-01-29T18:46:34Z

This time I am pretty sure I nailed the metrics specs timing issues for good 🤞

colinsurprenant · 2019-01-29T20:57:25Z

@yaauie

I will not follow up on these remaining tasks for now:

Verify if we should remove states upon pipeline stop to avoid memory leaks
Verify if we should make the PipelineState object immutable
It's ok like this for now and the potential danger for a memory leak is very remote; someone would have to keep on changing pipeline names continuously for this to become a potential problem I believe. I will also create a followup issue for this.

Also, I tried writing a spec to «assess all Agent#execute exit conditions» but was not successful. The conditions to make that happen are super hard to reproduce in specs and the testing for non-exit or for exit of Agent#execute is also challenging. For now I will rely on the manual tests I did where the fix actually solves the manual reproduction scenario. Also, I checked and the Agent specs are fairly extensive so I am confident we are ok for that.

So at this point I would be ready to move forward with the PR + creating the discussed followup issues.

colinsurprenant · 2019-01-29T21:59:39Z

A complementary note about not removing the states upon pipeline termination: the only impact at this point is that Agent#non_running_pipelines will contain all pipelines that were terminated and never reloaded. The only condition where this will happen is when running multiple pipelines, removing or renaming one of them, and issuing a reload, which will terminate the removed or renamed pipeline definition.

The only visible user-facing impact will be in this INFO log line

    logger.info(
        "Pipelines running",
        :count => running_pipelines.size,
        :running_pipelines => running_pipelines.keys,
        :non_running_pipelines => non_running_pipelines.keys
    ) if converge_result.success? && converge_result.total > 0

where these non-running pipelines will be reported. Personally I think this is actually valuable information, and the change is non-functional since it's only an INFO log line change and will not break BWC.
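
As a rough illustration of the effect on that log line, the sketch below partitions a set of states into running and non-running ones. `LoggedState` and the pipeline ids are made up for the example; the real registry exposes its own collection methods.

```ruby
# Hypothetical sketch: terminated states stay in the registry, so they show up
# under :non_running_pipelines in the log fields.
LoggedState = Struct.new(:pipeline_id, :terminated)

states = [
  LoggedState.new(:main, false),
  LoggedState.new(:old_name, true), # terminated by a reload and never recreated
  LoggedState.new(:ingest, false)
]

running, non_running = states.partition { |state| !state.terminated }

log_fields = {
  :count => running.size,
  :running_pipelines => running.map(&:pipeline_id),
  :non_running_pipelines => non_running.map(&:pipeline_id)
}
```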

Maybe @jsvd you may want to chime in on this?

yaauie

LGTM 👍🏼

colinsurprenant · 2019-01-30T22:41:56Z

Thanks @yaauie for the review.
Will be creating followup issues, rebasing and merging shortly. Also it cleanly backports down to 6.5 so I suggest we do that.

colinsurprenant · 2019-01-31T20:05:50Z

Unrelated WebMock::NetConnectNotAllowedError: test failure as reported in #10274

colinsurprenant · 2019-01-31T20:29:11Z

Intermittent and timing related xpack Metrics spec failure which does not reproduce locally, will open followup issue for this. Should not prevent merging #10371

[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading

[6.5 clean backport of #10346] fix agent silent exit upon pipelines reloading

yaauie reviewed Jan 23, 2019

colinsurprenant force-pushed the silent_exit branch 3 times, most recently from 8ec88b0 to dccbf77 Compare January 27, 2019 03:56

colinsurprenant mentioned this pull request Jan 28, 2019

[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading #10355

Merged

colinsurprenant added the bug label Jan 28, 2019

yaauie reviewed Jan 29, 2019

colinsurprenant changed the title ~~[WIP] fix agent silent exit upon pipelines reloading~~ fix agent silent exit upon pipelines reloading Jan 29, 2019

colinsurprenant mentioned this pull request Jan 30, 2019

[6.5 clean backport of #10346] fix agent silent exit upon pipelines reloading #10367

Merged

yaauie approved these changes Jan 30, 2019

fix agent silent exit upon pipelines reloading

9b0e3bd

colinsurprenant force-pushed the silent_exit branch from a72a033 to 9b0e3bd Compare January 31, 2019 18:47

colinsurprenant added v6.7.0 v7.0.0-beta1 and removed v6.7.0 labels Jan 31, 2019

colinsurprenant mentioned this pull request Jan 31, 2019

[6.6 clean backport of #10346] fix agent silent exit upon pipelines reloading #10370

Merged

colinsurprenant merged commit f08b8c5 into elastic:master Jan 31, 2019

colinsurprenant added a commit that referenced this pull request Jan 31, 2019

fix agent silent exit upon pipelines reloading (#10355)

a4a72bf

[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading

colinsurprenant added a commit that referenced this pull request Jan 31, 2019

fix agent silent exit upon pipelines reloading (#10370)

8bef18a

[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading

colinsurprenant added a commit that referenced this pull request Feb 1, 2019

fix agent silent exit upon pipelines reloading (#10367)

892bb56

[6.5 clean backport of #10346] fix agent silent exit upon pipelines reloading

This was referenced Feb 6, 2019

verify if a PipelineState could hold a nil pipeline #10405

Open

refactor Agent to remove the need to passing self to pipeline actions #10406

Open

Refactor Agent/PipelineAction/PipelinesRegistry block passing strategy #10407

Open

yaauie mentioned this pull request Feb 7, 2019

ast/lir: simplify concurrent use of AST, which is globally stateful #10415

Closed
