Provide an option to control the PCS transition deadline #391

Open
wants to merge 1 commit into
base: develop
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased
+### Added
+- BOS option to control the deadline that BOS gives PCS to complete a transition
+
 ## [2.30.5] - 2024-10-15
 ### Fixed
 - Fix per-bootset CFS setting
7 changes: 7 additions & 0 deletions api/openapi.yaml.in
@@ -976,6 +976,13 @@ components:
         Options for the Boot Orchestration Service.
       type: object
       properties:
+        pcs_transition_deadline:
+          type: integer
+          description: |
+            The amount of time (in minutes) that BOS sets as the deadline for a PCS transition.
+          example: 1
+          minimum: 1
+          maximum: 1440
         cfs_read_timeout:
           type: integer
           description: |
5 changes: 5 additions & 0 deletions src/bos/common/options.py
@@ -29,6 +29,7 @@
 # code should either import this dict directly, or (preferably) access
 # its values indirectly using a DefaultOptions object
 DEFAULTS = {
+    'pcs_transition_deadline': 60,

@mharding-hpe (Contributor), Nov 1, 2024

My one concern is that we are defaulting to a 60-minute deadline. Prior to this PR, our calls to PCS did not specify a deadline, so they ended up using the PCS default of 5 minutes.

With a 60-minute deadline, a node that is not behaving nicely with PCS could cause BOS to create a bunch of transitions in PCS for the same node before any of them hit their deadlines. I don't know enough about how PCS works to know how it handles this situation, or whether there is potential for problems. This is also different from the CFS option we added, where by default everything behaved the same as before, and things only changed if the user set the option to a non-default value.
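
To put a rough number on that concern, here is a back-of-the-envelope sketch. It assumes, hypothetically, that BOS creates one new transition for the misbehaving node per retry interval and that none of them finishes before its deadline. The function name and the 5-minute retry interval are made up for illustration, and nothing here says how PCS actually treats overlapping transitions.

```python
import math


def max_overlapping_transitions(deadline_minutes: int,
                                retry_interval_minutes: int) -> int:
    """Upper bound on transitions for one node that are still inside their deadline.

    Assumption (not confirmed by the PR): a new transition is created every
    retry_interval_minutes and none completes early.
    """
    if deadline_minutes < 1 or retry_interval_minutes < 1:
        raise ValueError("both values must be at least 1 minute")
    return math.ceil(deadline_minutes / retry_interval_minutes)


# Hypothetical 5-minute retry interval:
print(max_overlapping_transitions(60, 5))  # proposed 60-minute default
print(max_overlapping_transitions(5, 5))   # PCS default of 5 minutes
```

Under those assumptions, the 60-minute default allows up to 12 live transitions for a single node, where the 5-minute PCS default allows 1.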

Thinking about it now, it almost seems like it would make more sense for us to specify a deadline no larger than the time left before BOS gives up on the node, right? For example, if we're trying to power on a node and there are only 4 minutes left before BOS times out and powers off the node to retry, then our power-on call shouldn't have a deadline of more than 4 minutes, since a transition that completes after that point doesn't help BOS anyway.

That doesn't mean we couldn't also have an option like this, but it seems like we'd want to take the lower of the two values: our time left before timeout and whatever this option specifies.

Does that make sense?
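
The "lower of the two values" idea could look something like the sketch below. This is only an illustration, not code from the PR: `effective_deadline_minutes` and its `seconds_until_bos_timeout` argument are hypothetical names, and how BOS would actually compute the time left for a component is not shown here. The result is rounded down to whole minutes so the deadline never exceeds the remaining time, then clamped to the 1-minute minimum that the new option's schema allows.

```python
from typing import Optional


def effective_deadline_minutes(option_minutes: int,
                               seconds_until_bos_timeout: Optional[float],
                               minimum_minutes: int = 1) -> int:
    """Pick the deadline to pass to PCS (hypothetical helper).

    option_minutes: value of the pcs_transition_deadline option.
    seconds_until_bos_timeout: time left before BOS gives up on the node,
        or None if no timeout applies (assumed input, not an existing BOS value).
    """
    if seconds_until_bos_timeout is None:
        return option_minutes
    # Round down so the deadline is never longer than the time BOS has left.
    remaining_minutes = int(seconds_until_bos_timeout // 60)
    # Never go below the smallest deadline the option schema permits.
    remaining_minutes = max(minimum_minutes, remaining_minutes)
    return min(option_minutes, remaining_minutes)


# 4 minutes left before BOS times out, option set to 60: the deadline becomes 4.
print(effective_deadline_minutes(60, 240))
```

An operator could then pass the result as `task_deadline_minutes` instead of the raw option value.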

     'cfs_read_timeout': 20,
     'cleanup_completed_session_ttl': "7d",
     'clear_stage': False,
@@ -63,6 +64,10 @@ def get_option(self, key: str) -> Any:
     # All these do is convert the response to the appropriate type for the option,
     # and return it.
 
+    @property
+    def pcs_transition_deadline(self) -> int:
+        return int(self.get_option('pcs_transition_deadline'))
+
     @property
     def cfs_read_timeout(self) -> int:
         return int(self.get_option('cfs_read_timeout'))
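
For readers unfamiliar with the pattern in this hunk, here is a minimal standalone sketch of a typed option accessor backed by a defaults dict. `SketchOptions` and its `overrides` argument are hypothetical stand-ins; the real BOS options classes and the way they fetch stored values are not shown in this diff, and the fallback to `DEFAULTS` is an assumption.

```python
from typing import Any, Dict, Optional

# Subset of the defaults shown in the diff.
DEFAULTS: Dict[str, Any] = {
    'pcs_transition_deadline': 60,
    'cfs_read_timeout': 20,
}


class SketchOptions:
    """Hypothetical stand-in for an options object (not the BOS class)."""

    def __init__(self, overrides: Optional[Dict[str, Any]] = None):
        # Values a user has set; anything missing falls back to DEFAULTS.
        self._overrides = overrides or {}

    def get_option(self, key: str) -> Any:
        return self._overrides.get(key, DEFAULTS[key])

    @property
    def pcs_transition_deadline(self) -> int:
        # Convert to int so callers always get a number of minutes.
        return int(self.get_option('pcs_transition_deadline'))


print(SketchOptions().pcs_transition_deadline)
print(SketchOptions({'pcs_transition_deadline': '15'}).pcs_transition_deadline)
```

The `int()` conversion in the property means a value stored as a string is still returned as an integer number of minutes.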
2 changes: 1 addition & 1 deletion src/bos/operators/power_off_forceful.py
@@ -59,7 +59,7 @@ def filters(self):
     def _act(self, components):
         if components:
             component_ids = [component['id'] for component in components]
-            pcs.force_off(nodes=component_ids)
+            pcs.force_off(nodes=component_ids, task_deadline_minutes=options.pcs_transition_deadline)
         return components
 
 
3 changes: 2 additions & 1 deletion src/bos/operators/power_off_graceful.py
@@ -26,6 +26,7 @@
 
 from bos.common.values import Action, Status
 from bos.operators.utils.clients import pcs
+from bos.operators.utils.clients.bos.options import options
 from bos.operators.base import BaseOperator, main
 from bos.operators.filters import BOSQuery, HSMState
 
@@ -55,7 +56,7 @@ def filters(self):
     def _act(self, components):
         if components:
             component_ids = [component['id'] for component in components]
-            pcs.soft_off(component_ids)
+            pcs.soft_off(component_ids, task_deadline_minutes=options.pcs_transition_deadline)
         return components
 
 
3 changes: 2 additions & 1 deletion src/bos/operators/power_on.py
@@ -39,6 +39,7 @@
 from bos.operators.utils.clients import pcs
 from bos.operators.utils.clients.ims import tag_image
 from bos.operators.utils.clients.cfs import set_cfs
+from bos.operators.utils.clients.bos.options import options
 from bos.operators.base import BaseOperator, main
 from bos.operators.filters import BOSQuery, HSMState
 from bos.server.dbs.boot_artifacts import record_boot_artifacts
@@ -88,7 +89,7 @@ def _act(self, components: Union[List[dict],None]):
             raise Exception(f"Error encountered setting CFS information: {e}") from e
         component_ids = [component['id'] for component in components]
         try:
-            pcs.power_on(component_ids)
+            pcs.power_on(component_ids, task_deadline_minutes=options.pcs_transition_deadline)
         except Exception as e:
             raise Exception(f"Error encountered calling CAPMC to power on: {e}") from e
         return components