DAOS-7485 control: Implement dmg system drain to act on all hosts #15506

Merged
daltonbohning merged 9 commits into master from tanabarr/control-drainpools-pernode
Dec 3, 2024

Conversation

tanabarr
Contributor

@tanabarr tanabarr commented Nov 15, 2024

Add dmg system drain command to drain a set of storage nodes or ranks
from all the pools they belong to. Takes --ranks or --rank-hosts in
ranged format. Improve unit test coverage for lib/control, cmd/dmg and
server/mgmt_system system-related functions.
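
For readers unfamiliar with "ranged format", the following is a minimal sketch of how an argument such as `0-3,7` could be expanded into individual ranks. The helper name `expandRanks` and the exact syntax it accepts are assumptions made for illustration; this is not the parser DAOS uses.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// expandRanks is a hypothetical helper: it expands a ranged string such as
// "0-3,7" into a sorted, de-duplicated slice of rank numbers.
func expandRanks(s string) ([]uint32, error) {
	seen := make(map[uint32]struct{})
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty element in %q", s)
		}
		// "a-b" is a range; a bare "a" is a single rank.
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.ParseUint(lo, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid rank %q: %w", lo, err)
		}
		last := first
		if isRange {
			if last, err = strconv.ParseUint(hi, 10, 32); err != nil {
				return nil, fmt.Errorf("invalid rank %q: %w", hi, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for r := first; r <= last; r++ {
			seen[uint32(r)] = struct{}{}
		}
	}
	ranks := make([]uint32, 0, len(seen))
	for r := range seen {
		ranks = append(ranks, r)
	}
	sort.Slice(ranks, func(i, j int) bool { return ranks[i] < ranks[j] })
	return ranks, nil
}

func main() {
	ranks, err := expandRanks("0-3,7")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ranks) // prints [0 1 2 3 7]
}
```

Under the same assumption, an invocation along the lines of `dmg system drain --ranks 0-3,7` would target ranks 0, 1, 2, 3 and 7.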

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

github-actions bot commented Nov 15, 2024

Ticket title is 'dmg command to drain and reintegrate nodes from all pools'
Status is 'In Review'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-7485

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15506/1/testReport/

Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr tanabarr force-pushed the tanabarr/control-drainpools-pernode branch from 8ae53a2 to 488acf7 Compare November 18, 2024 10:57
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…ainpools-pernode

Features: control
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr tanabarr self-assigned this Nov 21, 2024
@tanabarr tanabarr added the control-plane (work on the management infrastructure of the DAOS Control Plane) and usability (Changes specific to user facing tools or behaviour.) labels Nov 21, 2024
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/3/execution/node/1506/log

…ainpools-pernode

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr tanabarr marked this pull request as ready for review November 22, 2024 22:24
@tanabarr tanabarr requested review from a team as code owners November 22, 2024 22:24
@tanabarr
Contributor Author

Documentation and functional tests to be added in subsequent PRs. Ping reviewers.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/4/execution/node/1476/log

reporting and use labels over uuid when available

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…ainpools-pernode

Features: control
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/7/execution/node/1537/log

Contributor

@knard38 knard38 left a comment

Overall looks good to me.

@@ -1072,6 +1073,148 @@ func (svc *mgmtSvc) SystemExclude(ctx context.Context, req *mgmtpb.SystemExclude
return resp, nil
}

func (svc *mgmtSvc) SystemDrain(ctx context.Context, req *mgmtpb.SystemDrainReq) (*mgmtpb.SystemDrainResp, error) {
Contributor

NIT, this method could probably be split.

@tanabarr
Contributor Author

tanabarr commented Dec 2, 2024

NLT failing on unrelated dfuse Valgrind issue, requesting forced landing

@tanabarr tanabarr added the forced-landing (The PR has known failures or has intentionally reduced testing, but should still be landed.) label Dec 2, 2024
@@ -1072,6 +1073,148 @@ func (svc *mgmtSvc) SystemExclude(ctx context.Context, req *mgmtpb.SystemExclude
return resp, nil
}

func (svc *mgmtSvc) SystemDrain(ctx context.Context, req *mgmtpb.SystemDrainReq) (*mgmtpb.SystemDrainResp, error) {
Contributor

Needs a documentation comment

Contributor Author

Will fix in subsequent PR for SystemReint if that's okay

Comment on lines +1152 to +1153
// Use our incoming request and just replace relevant parameters on each iteration.
drainReq.Id = id
Contributor

minor - this optimization feels a little premature

Contributor Author

Will fix in subsequent PR for SystemReint if that's okay, will move into the inner loop

Contributor

This one is really minor, just felt like it added a small amount of confusion. If we are sending out lots of these drainReqs it may be justified to optimize. I leave it to your judgment whether to change it or not.
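
The alternative discussed in this thread, building a fresh request inside the inner loop rather than mutating one shared request, might look roughly like the sketch below. The `drainReq` type, its fields, and `buildDrainReqs` are hypothetical stand-ins for illustration and are not the generated mgmtpb types or the code in this PR.

```go
package main

import "fmt"

// drainReq is a hypothetical stand-in for the per-pool drain request; the
// field names are assumptions, not the generated mgmtpb type.
type drainReq struct {
	Sys  string
	ID   string
	Rank uint32
}

// buildDrainReqs creates a new request value for every (pool, rank) pair
// instead of reusing one request and overwriting its fields per iteration.
// poolIDs fixes the iteration order so the result is deterministic.
func buildDrainReqs(sys string, poolIDs []string, poolRanks map[string][]uint32) []drainReq {
	var reqs []drainReq
	for _, id := range poolIDs {
		for _, rank := range poolRanks[id] {
			// Each request is self-contained, so nothing set for a
			// previous pool or rank can leak into this one.
			reqs = append(reqs, drainReq{Sys: sys, ID: id, Rank: rank})
		}
	}
	return reqs
}

func main() {
	poolRanks := map[string][]uint32{
		"pool-a": {1, 2},
		"pool-b": {2},
	}
	for _, r := range buildDrainReqs("daos_server", []string{"pool-a", "pool-b"}, poolRanks) {
		fmt.Printf("%+v\n", r)
	}
}
```

The trade-off is one small allocation per request in exchange for requests that cannot carry stale fields between iterations, which matches the reviewer's remark that reuse is only worth it if many requests are sent.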


drainResp := &mgmtpb.PoolDrainResp{}
if err = proto.Unmarshal(drpcResp.Body, drainResp); err != nil {
errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
Contributor

Suggested change
- errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
+ errMsg = errors.Wrap(err, "unmarshal PoolDrain response").Error()

Contributor Author

Will fix in subsequent PR for SystemReint if that's okay

drainResp := &mgmtpb.PoolDrainResp{}
if err = proto.Unmarshal(drpcResp.Body, drainResp); err != nil {
errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
drainResp.Status = int32(daos.IOInvalid)
Contributor

In dRPC we have an error code for this kind of failure: drpc.UnmarshalingPayloadFailure()

IMO when aggregating these results, we really want to flag cases like this as dRPC communication errors, rather than making them look like daos_engine errors.

Contributor Author

@tanabarr tanabarr Dec 3, 2024

Agreed, I didn't add this code, just edited. Will fix in subsequent PR for SystemReint if that's okay
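
The reviewer's point about keeping dRPC communication failures apart from engine errors when aggregating results can be illustrated with the sketch below. All types and names here (`drainResult`, `summary`, `summarize`, `errBadPayload`) are hypothetical and chosen for illustration; the sketch does not use or reproduce the real drpc package, and the status value is an arbitrary placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadPayload is a hypothetical sentinel for a response body that could
// not be unmarshaled, i.e. a failure of the transport, not of the engine.
var errBadPayload = errors.New("unmarshaling payload failed")

// drainResult is a hypothetical per-pool, per-rank outcome.
type drainResult struct {
	PoolID string
	Rank   uint32
	Status int32 // engine status; 0 means success
	Err    error // set when no valid engine status was obtained
}

// summary counts outcomes in separate buckets so that communication
// failures are not reported as if the engine had rejected the request.
type summary struct {
	OK           int
	EngineFailed int
	CommFailed   int
}

func summarize(results []drainResult) summary {
	var s summary
	for _, r := range results {
		switch {
		case r.Err != nil:
			// No trustworthy status came back: communication failure.
			s.CommFailed++
		case r.Status != 0:
			// The engine answered and reported a failure.
			s.EngineFailed++
		default:
			s.OK++
		}
	}
	return s
}

func main() {
	results := []drainResult{
		{PoolID: "pool-a", Rank: 1},
		{PoolID: "pool-a", Rank: 2, Status: -1}, // placeholder engine status
		{PoolID: "pool-b", Rank: 2, Err: fmt.Errorf("pool-b: %w", errBadPayload)},
	}
	fmt.Printf("%+v\n", summarize(results)) // prints {OK:1 EngineFailed:1 CommFailed:1}
}
```

Keeping the error and the status in separate fields avoids writing a synthetic engine status into a response that was never successfully decoded.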

src/control/lib/control/system.go (resolved)
}

func (cmd *baseExcludeCmd) execute(clear bool) error {
// Execute is run when systemStartCmd activates.
func (cmd *systemStartCmd) Execute(_ []string) (errOut error) {
Contributor

minor - better not to shuffle the commands around if you can avoid it. Makes it look like more was changed than actually was.

Contributor Author

Yes, agreed, but this has to be balanced against keeping the ordering consistent across files, which makes development easier.

Contributor

Agreed in general, although in this case it made the changes harder to review. In the diff, systemStartCmd moved (presumably didn't change?) and was treated as new code. The new command ended up where systemStartCmd used to be and the diff portrays it as if you edited that command. I don't need a revert or anything, since I've already gotten through it, but in future it would be best to make those kinds of code moves separately from big PRs like this.

@tanabarr tanabarr requested a review from kjacque December 3, 2024 13:11
Contributor

@kjacque kjacque left a comment

I'm fine with addressing small fixes with the next PR in this series. Thanks Tom.

@tanabarr tanabarr requested a review from a team December 3, 2024 16:45
@daltonbohning daltonbohning merged commit 21a881a into master Dec 3, 2024
55 of 57 checks passed
@daltonbohning daltonbohning deleted the tanabarr/control-drainpools-pernode branch December 3, 2024 18:56
@tanabarr
Contributor Author

tanabarr commented Dec 3, 2024

review comments addressed in #15551

Labels
  • control-plane: work on the management infrastructure of the DAOS Control Plane
  • forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.
  • usability: Changes specific to user facing tools or behaviour.

5 participants