DAOS-7485 control: Implement system reint to act on all pools #15551

Open: wants to merge 17 commits into base: master

tanabarr
Contributor

@tanabarr tanabarr commented Dec 3, 2024

Add a dmg system reint command that reintegrates a set of storage nodes or
ranks into all the pools they belong to. The command takes --ranks or
--rank-hosts in ranged format.

  • Shorten variable naming from Reintegrate to Reint
  • Don't export variables unnecessarily in cmd/dmg
  • Improve reporting of protobuf unmarshal errors
  • Add system reint to {cmd/dmg/,lib/control/,server/mgmt_}system.go
  • Add unit test coverage for new code

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr added the control-plane label (work on the management infrastructure of the DAOS Control Plane) Dec 3, 2024
@tanabarr tanabarr self-assigned this Dec 3, 2024

github-actions bot commented Dec 3, 2024

Ticket title is 'dmg command to drain and reintegrate nodes from all pools'
Status is 'In Review'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-7485

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/357/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/354/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/273/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/304/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/519/log

Base automatically changed from tanabarr/control-drainpools-pernode to master December 3, 2024 18:56
@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/375/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/360/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/369/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/359/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/364/log

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15551/5/testReport/

tanabarr and others added 8 commits December 11, 2024 18:18
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-reintpools-pernode branch from 45f3e40 to d96bbfd Compare December 11, 2024 18:18
@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15551/6/testReport/

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…-stack/daos into tanabarr/control-reintpools-pernode

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarrointel.com>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/7/execution/node/1211/log

@daltonbohning
Contributor

I suspect the copyright GHA is failing because the workflow is coming from a merge of this PR + master, but the source tree used is just this PR. Something I'll need to consider in the future for new GHA. Anyway, the copyright GHA is not required, so it can be ignored.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15551/8/testReport/

…intpools-pernode

Features: pool
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review December 16, 2024 11:03
@tanabarr tanabarr requested review from a team as code owners December 16, 2024 11:03
@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/9/execution/node/364/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/9/execution/node/361/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/9/execution/node/346/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/9/execution/node/367/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/9/execution/node/521/log

Features: pool
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/10/execution/node/1130/log

…intpools-pernode

Signed-off-by: Tom Nabarro <tom.nabarrointel.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Contributor

@mjmac mjmac left a comment

Functionally, looks OK to me. I have some quibbles with the (re-)naming, though.

return
}

func printSysOsaResults(out io.Writer, results []*control.SystemOsaResult) {
Contributor

This helper should have a comment. Also, what is Osa? Maybe a better name for that whole concept and family of types/functions might be PoolMembershipUpdates? Ultimately, the functionality is less about the system and more about the changes to pools, right?

@@ -134,7 +134,7 @@ func (m MgmtMethod) String() string {
 MethodPoolExclude: "PoolExclude",
 MethodPoolDrain: "PoolDrain",
 MethodPoolExtend: "PoolExtend",
-MethodPoolReintegrate: "PoolReintegrate",
+MethodPoolReint: "PoolReint",
Contributor

Why the rename? The other methods generally use full words, and Reint is less clear than Reintegrate without context. Plus, it makes this PR much noisier than it would be without the rename.

Contributor Author

it's just long and sticks out in the C code because of its length, wanted to be consistent. is it a strong enough objection that you want me to revert all the changes?

Contributor

These symbols are for humans, not computers. Truncating a word in an API surface because it formats inconveniently seems strange to me.

Contributor

FWIW, I think partly why long names stand out in C code is because of the conventions used in DAOS. E.g. the DAOS C code tends to do

my_return_var = some_long_function_name(some_long_argument_name1,
					some_long_argument_name2,
					some_long_argument_name3)

Whereas, IMO, for this reason and several others actually, I think this is much more reasonable

my_return_var = some_long_function_name(
	some_long_argument_name1,
	some_long_argument_name2,
	some_long_argument_name3)

Contributor Author

We all come from different programming experiences, and it's always about striking a balance, but I generally find that people coming from C backgrounds are uncomfortable with long variable names, which people from other backgrounds may be more comfortable with. So when interacting with C engine code I tend to try to be considerate of the preferences of the relevant code owners. "Reint" is an abbreviation that's been used variously around the code base and is relatively intuitive and unambiguous. @mjmac are you requesting that I revert the rename?

Contributor

Fair point with regard to the C code, but why does the Go code need to be truncated, too? Go norms focus on readability and maintainability. I don't have a problem with the C code conforming to the style in that codebase, but I do think the Go code should be reverted to the more verbose and readable conventions.

Contributor Author

ok

@@ -566,41 +566,49 @@ func SystemExclude(ctx context.Context, rpcClient UnaryInvoker, req *SystemExclu
return resp, convertMSResponse(ur, resp)
}

// SystemOsaResult describes the result of an OSA operation on a pool's ranks.
Contributor

@mjmac mjmac Dec 19, 2024

As noted elsewhere, this is not a very human-friendly type name. It requires the reader to go figure out what Osa or OSA is. That's more of an internal DAOS name, IMO. Better to describe the types and functions in terms of why and how they would be used, e.g. SystemPoolMembershipUpdate or something along those lines. Yes, it's more verbose, but this is a public API, and I don't think brevity is a virtue in public types and documentation.

Contributor Author

Membership is an overloaded term IMO but if you insist I can change to SystemPoolMembershipUpdate, is that required?

Contributor

Put yourself in the shoes of an API user who has not been hacking on DAOS for a couple of years. What the heck is a/an Osa? Sure, it can be looked up, but the bar for adding to an API surface should be higher than "it compiles and lets me move on to the next bug, go figure it out", right?

Contributor Author

I'm re-factoring to reduce duplication as per suggestion from @wangshilong and will rename appropriately

…failures

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…intpools-pernode

Features: control
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Labels
control-plane: work on the management infrastructure of the DAOS Control Plane

4 participants