fix agent silent exit upon pipelines reloading #10346

Merged
merged 1 commit into from
Jan 31, 2019

Conversation

colinsurprenant
Contributor

@colinsurprenant colinsurprenant commented Jan 23, 2019

Fixes #10345

This PR introduces a new PipelinesRegistry class to encapsulate the old Agent @pipelines ConcurrentHashMap.

One of the problems was the agent's termination condition, which checked whether all pipeline threads were no longer Thread#alive? and, if so, assumed the agent could exit. However, the reload action shuts down a pipeline and only later starts a new one, and during that window all pipeline threads could be seen as not Thread#alive?. This PR fixes that condition by treating the whole reload sequence as a period in which a pipeline is not considered dead, regardless of whether the shutdown part of the reload has occurred yet.
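
The race described above can be illustrated with a minimal sketch. `FakePipeline`, `old_can_exit?` and `new_can_exit?` are hypothetical names made up for this illustration; they are not the actual Logstash classes or the exact check used in this PR.

```ruby
# Hypothetical stand-in for a pipeline; not the actual Logstash classes.
FakePipeline = Struct.new(:alive, :reloading) do
  def alive?
    alive
  end
end

# Old-style check: exit as soon as no pipeline thread is alive.
def old_can_exit?(pipelines)
  pipelines.none?(&:alive?)
end

# Check in the spirit of this PR: a pipeline that is mid-reload is never
# counted as dead, whether or not its shutdown step has already happened.
def new_can_exit?(pipelines)
  pipelines.none? { |p| p.alive? || p.reloading }
end

main = FakePipeline.new(true, false)

# Reload begins: the old pipeline is shut down before the new one starts.
main.reloading = true
main.alive = false

old_decision = old_can_exit?([main]) # true: the agent would exit silently
new_decision = new_can_exit?([main]) # false: the agent keeps running

# Reload completes: the replacement pipeline is now running.
main.alive = true
main.reloading = false
```

The only difference between the two checks is the extra `reloading` flag, which covers the window between the shutdown of the old pipeline and the start of the new one.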

Main Changes

  • new PipelinesRegistry is an abstraction for the registration and management of pipeline states; it replaces the old @pipelines instance variable, which had a getter and was used directly in other parts of the code
  • correctly pass action thread parameters
  • fix periodic_pollers.rb wrong parameter names
  • introduce new @finished_run AtomicBoolean
  • change @finished_execution to always be true regardless of pipeline termination condition
  • document semantic of both @finished_run and @finished_execution
  • refactor non-autoreload agent loop
  • change Pipeline#wait_until_started to use @finished_run and document logic
  • change pipeline actions to use new registry create_pipeline, reload_pipeline and terminate_pipeline
  • change usage of @pipelines to use new registry collection methods
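
As a rough illustration of the registry idea listed above, here is a minimal sketch. `SketchPipelineState` and `SketchPipelinesRegistry` are hypothetical names, the `loading` flag is an assumption about how the reload window could be tracked, and a plain `Mutex` stands in for the `ConcurrentHashMap` used by the real code; the actual PipelinesRegistry may differ in all of these respects.

```ruby
# Hypothetical per-pipeline state; the real PipelineState may differ.
class SketchPipelineState
  attr_reader :pipeline

  def initialize(pipeline)
    @pipeline = pipeline
    @terminated = false
    @loading = false
  end

  # A pipeline in the middle of a reload is never reported as terminated.
  def terminated?
    @terminated && !@loading
  end

  def set_terminated(value)
    @terminated = value
  end

  def set_loading(value)
    @loading = value
  end

  def set_pipeline(pipeline)
    @pipeline = pipeline
  end
end

class SketchPipelinesRegistry
  def initialize
    @states = {}
    @lock = Mutex.new # assumption: stands in for ConcurrentHashMap#compute
  end

  # The block performs the actual startup and returns true on success.
  # An id that is registered and not terminated is refused.
  def create_pipeline(pipeline_id, pipeline)
    @lock.synchronize do
      state = @states[pipeline_id]
      return false if state && !state.terminated?
      state = @states[pipeline_id] = SketchPipelineState.new(pipeline)
      success = yield
      state.set_terminated(!success)
      success
    end
  end

  # The block performs the actual shutdown of the registered pipeline.
  def terminate_pipeline(pipeline_id)
    @lock.synchronize do
      state = @states[pipeline_id]
      return false unless state
      yield state.pipeline
      state.set_terminated(true)
      true
    end
  end

  # Runs the stop-then-start sequence while the state is flagged as loading.
  # The block returns [success, new_pipeline].
  def reload_pipeline(pipeline_id)
    state = @lock.synchronize { @states[pipeline_id] }
    return false unless state
    state.set_loading(true)
    begin
      success, new_pipeline = yield
      state.set_pipeline(new_pipeline) if success
      state.set_terminated(!success)
    ensure
      state.set_loading(false)
    end
    success
  end

  def running_pipelines
    @lock.synchronize { @states.reject { |_, s| s.terminated? } }
  end

  def non_running_pipelines
    @lock.synchronize { @states.select { |_, s| s.terminated? } }
  end
end
```

In this sketch the blocks passed to `create_pipeline` and `terminate_pipeline` run while the lock is held, so they must not call back into the registry.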

Other Changes

  • replace the generator plugin in specs with a new dummyblockinginput. While fixing tests, the generator inputs made it almost impossible to follow the debug traces of the build because they output huge amounts of debug logs, one per emitted event. The generator input is also a waste of resources and probably slows down the tests too.
  • made changes to make some specs more resilient to timing problems when testing the actions and agent.
  • added new HookRegistry#remove_hooks method to the metrics input to be able to clean up the global and static LogStash::PLUGIN_REGISTRY when finished - this was causing weird problems in the metrics specs where the pipeline_started action was being fired on unused instances of the plugin when instantiated multiple times in the specs.
  • refactored x-pack/spec/monitoring/inputs/metrics_spec.rb to be more resilient to timing related errors (specs were always passing locally but not on slower Jenkins execution).
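
To show what a blocking test input buys over an unbounded generator, here is a minimal sketch. `SketchBlockingInput` is a hypothetical class written for this illustration; it does not use the real Logstash plugin API and is not the dummyblockinginput added by this PR.

```ruby
# Hypothetical stand-in for a blocking test input; it does not use the real
# Logstash plugin API. `run` blocks without emitting events until `stop`.
class SketchBlockingInput
  def initialize
    @stopped = false
    @lock = Mutex.new
    @cond = ConditionVariable.new
  end

  # Blocks the calling thread; never pushes anything onto the queue.
  def run(_queue)
    @lock.synchronize do
      @cond.wait(@lock) until @stopped
    end
  end

  # Wakes up `run` so the input thread can finish.
  def stop
    @lock.synchronize do
      @stopped = true
      @cond.broadcast
    end
  end
end

queue = Queue.new
input = SketchBlockingInput.new
runner = Thread.new { input.run(queue) }

input.stop
runner.join
```

Because the input emits no events, it keeps a pipeline alive for the duration of a spec without producing any per-event debug logs.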

Member

@yaauie yaauie left a comment

I know this is a WIP and you aren't expecting a full code-review just yet, but I have some early feedback on the implementations of PipelineState and PipelinesRegistry.

      success = create_block.call
      state.set_terminated(!success)
    else
      logger.error("Attempted to create a pipeline that already exists", :pipeline_id => pipeline_id)
Member

I'm not confident from this code that this error message is accurate; from my reading, to hit this we have to be attempting to create a pipeline that (a) already exists, and (b) was previously in a terminated state.

Contributor Author

This is the case where you are trying to create a new pipeline with an id that is already in the registry but the pipeline itself is not in a terminated state, so we refuse to create it.

Contributor Author

let me know if this makes sense

@colinsurprenant
Contributor Author

@yaauie thanks for the early feedback, all good, will definitely follow up. For now I am still at the stage of making this work correctly and making sure all tests pass; then I will proceed to refactor some of this per your suggestions.

@colinsurprenant
Contributor Author

colinsurprenant commented Jan 24, 2019

@yaauie
Tests are actually green at this point - the only failure is the infamous unrelated WebMock::NetConnectNotAllowedError:.

  • Note that some spec changes might seem unrelated, for example the use of dummyblockinginput instead of the generator. The generator plugin actually generates huge amounts of debug logs, which made it super difficult to trace logs in the Jenkins console. Also, using an unbounded generator when event generation is not actually required for the tests is a waste and ultimately slows the tests down.

  • I will follow up on some of your refactor suggestions

  • I will be pushing a few minor cleanup + better comments

  • A spec should be added for the new PipelinesRegistry class

  • Verify if we should remove states upon pipeline stop to avoid memory leaks

  • Verify if we should make the PipelineState object immutable

  • add a spec to assess all Agent#execute exit conditions to make sure we don't change behaviours other than for the bugfix

It's up to you if you want to wait for these items to be completed to make another review pass

@colinsurprenant
Contributor Author

There is ONE last nasty failing test in x-pack metrics, and I am pretty sure it's a test timing issue, but I can't put my finger on the problem as I can't reproduce it locally. A lot of timing-dependent/brittle tests have been fixed in this PR, making them a lot more resilient to timing problems, but this one is tenacious.

@colinsurprenant
Contributor Author

I am now able to reproduce x-pack test failure(s). fix incoming.

@colinsurprenant colinsurprenant force-pushed the silent_exit branch 3 times, most recently from 8ec88b0 to dccbf77 Compare January 27, 2019 03:56
@colinsurprenant
Contributor Author

Green! ✅

@colinsurprenant
Contributor Author

Updated description to list changes.

Member

@yaauie yaauie left a comment

I think this is definitely on the right track, especially considering the remaining checkboxes in your progress issue-comment. I have left a variety of comments in-line.

    def create_pipeline(pipeline_id, pipeline, &create_block)
      success = false

      @states.compute(pipeline_id) do |_, state|
Member

There is a lot of complexity here to facilitate the creation and/or re-use of the PipelineState object; can it be made simpler? Can we pull any of this complexity into PipelineState itself? Is the PipelineState object even worth sharing?

Contributor Author

Well, we're leveraging the ConcurrentHashMap concurrency features, in particular the semantics of the compute method, which let us avoid a global lock; that was the original purpose of using a ConcurrentHashMap. In all fairness, the only place the state object leaks outside the PipelinesRegistry is the reload method, and I think we could avoid that; I'll check. Other than that I think it is relatively straightforward. I'll go ahead and see if we can avoid leaking the state object and will also add comments to explain the logic. I did add a comment in the PipelinesRegistry#initialize method about the compute method usage.
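
For readers unfamiliar with the `compute` semantics mentioned here, the following is a simplified sketch in plain Ruby. `SketchComputeMap` is a hypothetical class; it guards the whole map with a single `Mutex`, whereas Java's `ConcurrentHashMap` locks at a much finer granularity, so it only illustrates the atomic check-then-act contract, not the locking strategy.

```ruby
# Simplified stand-in for java.util.concurrent.ConcurrentHashMap#compute.
# Assumption: one Mutex guards the whole map here, whereas the Java class
# locks at a much finer granularity.
class SketchComputeMap
  def initialize
    @map = {}
    @lock = Mutex.new
  end

  # Atomically replaces the value for `key` with the block's result.
  # The block receives the key and the current value (nil when absent);
  # returning nil removes the entry, as in the Java method.
  def compute(key)
    @lock.synchronize do
      new_value = yield(key, @map[key])
      if new_value.nil?
        @map.delete(key)
      else
        @map[key] = new_value
      end
      new_value
    end
  end

  def get(key)
    @lock.synchronize { @map[key] }
  end
end

states = SketchComputeMap.new

# Check-then-act in one atomic step: create the state only when absent.
states.compute("main") { |_, state| state || { terminated: false } }

# Update the existing state without a separate lookup.
states.compute("main") { |_, state| state.merge(terminated: true) }
```

The point of the pattern is that the lookup, the decision and the update happen as one step per key, so two threads cannot both observe "absent" and both create a state.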

@colinsurprenant
Contributor Author

Thanks a lot @yaauie for the review pass - all really good comments, will submit discussed changes and create a followup issue for what I think we could defer to new PR(s). Let me know if you are good with that.

@yaauie
Member

yaauie commented Jan 29, 2019

I'm okay with all of the above-mentioned deferrals for other PRs for cleanup, and think that this PR is a good balance between solving the problem at-hand and not going too far into the weeds.

You had one outstanding comment about trying to eliminate leaking the state, and I trust you to either resolve or defer that.

👍

@colinsurprenant
Contributor Author

This time I am pretty sure I nailed the metrics specs timing issues for good 🤞

@colinsurprenant colinsurprenant changed the title from "[WIP] fix agent silent exit upon pipelines reloading" to "fix agent silent exit upon pipelines reloading" Jan 29, 2019
@colinsurprenant
Contributor Author

@yaauie

I will not followup on these remaining tasks for now:

  • Verify if we should remove states upon pipeline stop to avoid memory leaks
  • Verify if we should make the PipelineState object immutable
    It's ok like this for now and the potential danger for a memory leak is very remote; someone would have to keep on changing pipeline names continuously for this to become a potential problem I believe. I will also create a followup issue for this.

Also, I tried writing a spec to «assess all Agent#execute exit conditions» but was not successful. The conditions to make that happen are super hard to reproduce in specs and the testing for non-exit or for exit of Agent#execute is also challenging. For now I will rely on the manual tests I did where the fix actually solves the manual reproduction scenario. Also, I checked and the Agent specs are fairly extensive so I am confident we are ok for that.

So at this point I would be ready to move forward with the PR + creating the discussed followup issues.

@colinsurprenant
Contributor Author

A complementary note about not removing the states upon pipeline termination: the only impact at this point is that Agent#non_running_pipelines will contain all pipelines that were terminated and never reloaded. The only condition where this happens is when running multiple pipelines, removing or renaming a pipeline, and issuing a reload, which terminates the deleted/renamed pipeline definition.

The only visible user-facing impact will be in this INFO log line

    logger.info(
        "Pipelines running",
        :count => running_pipelines.size,
        :running_pipelines => running_pipelines.keys,
        :non_running_pipelines => non_running_pipelines.keys
    ) if converge_result.success? && converge_result.total > 0

where these non-running pipelines will be reported. Personally I think this is actually valuable information, and the change is non-functional since it's only an INFO log line change and will not break BWC.
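
As a small illustration of what that log line would carry after a pipeline is renamed and reloaded, here is a sketch. `pipelines_running_fields` is a hypothetical helper and the map of pipeline id to terminated flag is a simplification of the registry; neither is part of the actual code.

```ruby
# Hypothetical helper: builds the fields of the "Pipelines running" log line
# from a map of pipeline id => terminated flag.
def pipelines_running_fields(states)
  running = states.reject { |_, terminated| terminated }
  non_running = states.select { |_, terminated| terminated }
  {
    :count => running.size,
    :running_pipelines => running.keys,
    :non_running_pipelines => non_running.keys
  }
end

# :old_name was renamed in the config, so a reload terminated it and its
# state stayed in the registry.
fields = pipelines_running_fields({ :main => false, :old_name => true })
# fields[:running_pipelines]     is [:main]
# fields[:non_running_pipelines] is [:old_name]
```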

Maybe @jsvd you may want to chime in on this?

Member

@yaauie yaauie left a comment

LGTM 👍🏼

@colinsurprenant
Contributor Author

Thanks @yaauie for the review.
Will be creating followup issues, rebasing and merging shortly. Also it cleanly backports down to 6.5 so I suggest we do that.

@colinsurprenant
Contributor Author

Unrelated WebMock::NetConnectNotAllowedError: test failure as reported in #10274

@colinsurprenant
Contributor Author

Intermittent and timing related xpack Metrics spec failure which does not reproduce locally, will open followup issue for this. Should not prevent merging #10371

@colinsurprenant colinsurprenant merged commit f08b8c5 into elastic:master Jan 31, 2019
colinsurprenant added a commit that referenced this pull request Jan 31, 2019
[6.x clean backport of #10346] fix agent silent exit upon pipelines reloading
colinsurprenant added a commit that referenced this pull request Feb 1, 2019
[6.5 clean backport of #10346] fix agent silent exit upon pipelines reloading
Development

Successfully merging this pull request may close these issues.

silent exit on pipelines reloading
2 participants