Provide an option to control the PCS transition deadline #391

Open
wants to merge 1 commit into
base: develop
from

@jsollom-hpe jsollom-hpe commented Oct 30, 2024

Summary and Scope

Provide an option to control the PCS transition deadline

Is this change backwards incompatible, backwards compatible, or a backwards compatible bugfix?

Issues and Related PRs

List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.

  • Resolves [issue id](issue link)
  • Change will also be needed in <insert branch name here>
  • Future work required by [issue id](issue link)
  • Documentation changes required in [issue id](issue link)
  • Merge with/before/after <insert PR URL here>

Testing

List the environments in which these changes were tested.

Tested on:

  • drax
  • Local development environment
  • Virtual Shasta

Test description:

I set the deadline to two minutes rather than the default of one minute. I successfully rebooted all four compute nodes on drax using this more generous deadline.

How were the changes tested and success verified? If schema changes were part of this change, how were those handled in your upgrade/downgrade testing?

  • Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)?
  • Were continuous integration tests run? If not, why?
  • Was upgrade tested? If not, why?
  • Was downgrade tested? If not, why?
  • Were new tests (or test issues/Jiras) created for this change?

Risks and Mitigations

None.
Are there known issues with these changes? Any other special considerations?

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@jsollom-hpe jsollom-hpe requested a review from a team as a code owner October 30, 2024 16:43

@jsl-hpe jsl-hpe left a comment

This approval is for the sanity of the code alone. It does not indicate that I want it merged, only that the changes I reviewed appear sane and reasonable.

@@ -29,6 +29,7 @@
# code should either import this dict directly, or (preferably) access
# its values indirectly using a DefaultOptions object
DEFAULTS = {
'pcs_transition_deadline': 60,

@mharding-hpe mharding-hpe Nov 1, 2024

My only concern is that we are defaulting to 60 minutes for the deadline. Prior to this PR, our calls to PCS did not specify a deadline, so PCS used its own default value of 5 minutes.

With a deadline of 60 minutes, a node that is not behaving nicely with PCS could lead BOS to create a bunch of transitions in PCS for the same node before any of those transitions hit their deadlines. I don't know enough about how PCS works to know how it handles this situation, or whether there is potential for problems. This also differs from the CFS option we added, where by default everything behaved the same as before, and things changed only if the user set the option to a non-default value.

Thinking about it now, it almost seems like it would make more sense for us to specify a deadline that is no larger than the time left before BOS gives up on the node, right? For example, if we're trying to power on a node and there are only 4 more minutes before BOS times out and powers off the node to retry, then our power-on call shouldn't have a deadline of more than 4 minutes (since a transition after that point doesn't help BOS anyway, right?).

That doesn't mean we couldn't also have an option like this, but it seems like we'd want to take the lower of the two values: our time left before timeout and whatever is specified by this option.

Does that make sense?
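
The capping idea above could be sketched roughly as follows. This is only an illustration, not code from BOS: the function name `effective_deadline_minutes` and both of its parameters are hypothetical, the deadline is assumed to be expressed in whole minutes, and the one-minute floor is an assumption made so that the result is never zero.

```python
def effective_deadline_minutes(option_minutes: int,
                               seconds_until_bos_timeout: float) -> int:
    """Pick the deadline to send with a power transition request.

    Hypothetical helper: takes the lower of the configured option and the
    time BOS has left before it gives up on the node and retries.
    """
    # Whole minutes remaining before BOS times out (rounded down).
    remaining_minutes = int(seconds_until_bos_timeout // 60)
    # Assumption: never request less than one minute.
    remaining_minutes = max(1, remaining_minutes)
    return min(option_minutes, remaining_minutes)


# Example from the comment above: option set to 60, 4 minutes left.
print(effective_deadline_minutes(60, 4 * 60))
```

With the option at 60 and four minutes remaining, the sketch returns 4. With the option at 2 and four minutes remaining, the option wins and it returns 2.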
