-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #33446: Allow setting the time until a worker is considered lost #227
Conversation
manifests/init.pp
Outdated
@@ -132,6 +132,9 @@ | |||
# available, likely results in performance degradation due to I/O blocking and is not recommended in most cases. Modifying this parameter should | |||
# be done incrementally with benchmarking at each step to determine an optimal value for your deployment. | |||
# | |||
# @param worker_ttl | |||
# The number of seconds before a worker should be considered lost. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify here that this applies to the pulpcore workers for the tasking system, not the content or API gunicorn workers? Since that is likely to be confusing to many/most users.
It would be great to offer some guidance as well regarding tuning this parameter and in what circumstances might that need arise (during large content migrations? other workloads as well?). Can this be disabled by setting it to 0? What are the advantages and drawbacks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify here that this applies to the pulpcore workers for the tasking system, not the content or API gunicorn workers? Since that is likely to be confusing to many/most users.
I can add a bit more info. This text was taken directly from Pulp.
It would be great to offer some guidance as well regarding tuning this parameter and in what circumstances might that need arise (during large content migrations? other workloads as well?). Can this be disabled by setting it to 0? What are the advantages and drawbacks?
It would be great to have that, but since we don't know that I would opt not to include anything related to it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spec/acceptance/basic_spec.rb
contains a way to test Django settings directly:
describe command("PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager diffsettings") do
its(:stdout) { is_expected.to match(/^USE_NEW_WORKER_TYPE = True/) }
its(:exit_status) { is_expected.to eq 0 }
end
Note that while this is diffsettings
, that difference is computed from the defaults for Django, not for Pulpcore specifically. So even Pulpcore defaults may be tested here as long as they are not Django defaults.
With this, even in the case where the value is undefined
here you can test the running configuration should match WORKER_TTL = 30
. (Although I recommended setting an explicit default, I still think this is neat and worth drawing attention to)
It would be nice, and quite simple, to have the defaults as well as overrides tested in both spec/classes
and spec/acceptance
This change implements the WORKER_TTL setting which lets Pulp decide how long to wait for a worker to respond before considering it gone and cleaning up it's database and re-assiging work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the logic of only setting values if we explicitly want to override the default. @wbclark?
@@ -79,6 +79,11 @@ class { 'pulpcore': | |||
its(:exit_status) { is_expected.to eq 0 } | |||
end | |||
|
|||
describe command("PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager diffsettings") do | |||
its(:stdout) { is_expected.to match(/^WORKER_TTL = 30/) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this in the diffsettings
command above? That makes the execution faster since it doesn't need to call diffsettings
multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the logic of only setting values if we explicitly want to override the default. @wbclark?
My biggest concern with it is discoverability, so I'd like to see at least documentation like If this is undefined, Pulp will use a default value of 30 seconds
That has the same concern as explicitly setting the same default -- if the default changes on Pulp's side then how would we know?
In fact, since we have the diffsettings
test now, that test would inform us if Pulp's default changed, and then we could update both the test and the parameter documentation with the new default from Pulp.
So thinking about it more, I think leaving it optional and undefined by default is ideal, with documentation explaining the behavior.
I think we should not mention 30 seconds. That happens to be Pulp's default today, but what if 3.16 moves it 60? We can update the docs, but if we want to support both 3.15 and 3.16 then it becomes complex. I'd be fine with something like
Why would we care? Let's trust Pulp to make reasonable choices. |
My concern comes from past frustrations as an installer user, before I was an installer developer, reading options and having no idea what they do and what is the default behavior. I think you should be able to look at If Pulp changes the default to 60 and we support for a time Pulp releases with different defaults, the diffsettings test should catch that once the release with the change becomes available in repositories. Then we still have a lot of options, such as As an example of where it would be useful, let's say I am an installer user and I have seen some worker timeouts. I want to try tuning this setting to see if it alleviates the problem. I find the setting in What if, as in your example, Pulp changed its default? I now have to check what version of Pulp my system has and make sure that I am looking at the correct version of the Pulp documentation. Of course I /should/ do that, but it would be easy to overlook. I might change this value to 45 seconds, thinking that is an increase over the default 30, when in case it was a decrease over the newer default of 60. Now I might even be more confused and frustrated as the timeouts became much more frequent. I think the effort on our end would be small, compared to removing a lot of uncertainty for users and reducing the effort required for them to find clear and helpful information.
I don't think it has anything to do with a lack of trust in Pulp. It is more about making understanding the configuration transparent and accessible for installer users. |
To be fair, this module doesn't show up at all since it's an internal module. Based on my experience I think in this case it's good to have a preference for a very minimal config. IMHO this is not a setting that many users will change and really classify as an advanced feature. If you're playing around with that, I think it's reasonable to expect users to know Pulp already. |
@jlsherrill do you know how often this would be changed? To me it looks like pulp_installer doesn't expose it (https://docs.pulpproject.org/pulpcore/configuration/settings.html). If a decent number of users would change this, we would be more open that this is an option but if it's just a handful it would be better to keep it slightly hidden. That has some implementation implications. |
I know that we had to tune the similar setting for Pulp2 quite often on "less than optimal but still relevant" setups, but I think your question is "hiera vs installer option", and I'd answer Hiera. |
Although Pulp2 and Pulp3 are different technologies and we should not necessarily expect the same amount of tuning to be required for that reason, the diversity of deployments and workloads is essentially the same so I think it would be unsurprising if tuning is frequently required. After all, we expect it to be required in at least some cases, which is the reason for this PR and Redmine #33446 in the first place. That being said, I have a strong belief that installer option is greatly preferable to hiera, for the reason that an installer option does not require any puppet knowledge (in my downstream experience, a majority of users do not possess this, even many senior level engineers for whom this is their primary technical area that they work with on a daily basis). Also, the installer option is not difficult to implement or maintain and requires exposing the parameter in one more module, which we already do for similar parameters, and already did for similar parameters for Pulp2. Finally if the concern is reducing the sprawl of To be honest, I am surprised to see so many people voicing a preference for hiera in this case, as I have not seen articulated any particular drawbacks to the installer option while the benefits are very significant from my perspective. (Hence my attempt to speculate on what those might be, in the preceding paragraph... please do point out if there is something else I am missing). |
IIRC this option was added to Pulp to begin with, due to Katello needs. So based on that alone, I wouldn't expect adding it to their installer to be a priority, as their installer is not used by Katello. Also IIRC, Pulp has implemented two very different versions of this feature, in order to also deliver a hardcoded version which didn't introduce any new setting at all to an older Pulp3 release that was still used by Katello at the time. So perhaps that development history may have something to do with why their installer does not expose it? I wouldn't assume that is intentional (and I would guess that pulp_installer would accept such a PR if anybody went out of their way to create it) and even if it does turn out to be deliberate I don't see a particular reason that should matter here, given that this module doesn't aim for 1-to-1 correspondence with pulp_installer. |
Also, I did promise @jlsherrill that I would pick this one up as @ehelms is currently on leave. I will do that shortly. That version will keep the default as |
Ultimately, this is what I would be very happy with as a starting point: #230 I previously opened #229 as well, but I've marked that as a draft because it also contains quite a different style for the tests while delivering the feature using the current style should be a higher priority, hence the follow up linked above. |
They are different, sure, but the logic "worker sends heartbeats, is considered dead when not in X seconds" is the same, so given similar setups, I wouldn't expect drastically different behavior. But it's also probably not that important right now
True, and I kinda see the argument for
I think, my main pain point with it, is that any |
Merged as of 69bdebc |
This change implements the WORKER_TTL setting which lets Pulp decide
how long to wait for a worker to respond before considering it gone
and cleaning up it's database and re-assiging work.