DAOS-7485 control: Implement dmg system drain to act on all hosts #15506
Conversation
Ticket title is 'dmg command to drain and reintegrate nodes from all pools'
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15506/1/testReport/
Force-pushed from 8ae53a2 to 488acf7 (Compare)
…ainpools-pernode
Features: control
Required-githooks: true
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/3/execution/node/1506/log
Documentation and functional tests to be added in subsequent PRs. Ping reviewers.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/4/execution/node/1476/log
reporting and use labels over uuid when available
Features: control
Required-githooks: true
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15506/7/execution/node/1537/log
Overall looks good to me.
@@ -1072,6 +1073,148 @@ func (svc *mgmtSvc) SystemExclude(ctx context.Context, req *mgmtpb.SystemExclude
	return resp, nil
}

func (svc *mgmtSvc) SystemDrain(ctx context.Context, req *mgmtpb.SystemDrainReq) (*mgmtpb.SystemDrainResp, error) {
NIT, this method could probably be split.
NLT failing on an unrelated dfuse Valgrind issue; requesting forced landing.
func (svc *mgmtSvc) SystemDrain(ctx context.Context, req *mgmtpb.SystemDrainReq) (*mgmtpb.SystemDrainResp, error) {
Needs a documentation comment
Will fix in subsequent PR for SystemReint if that's okay
// Use our incoming request and just replace relevant parameters on each iteration.
drainReq.Id = id
minor - this optimization feels a little premature
Will fix in subsequent PR for SystemReint if that's okay; will move it into the inner loop.
This one is really minor, just felt like it added a small amount of confusion. If we are sending out lots of these drainReqs it may be justified to optimize. I leave it to your judgment whether to change it or not.
drainResp := &mgmtpb.PoolDrainResp{}
if err = proto.Unmarshal(drpcResp.Body, drainResp); err != nil {
	errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
Suggested change:
-	errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
+	errMsg = errors.Wrap(err, "unmarshal PoolDrain response").Error()
Will fix in subsequent PR for SystemReint if that's okay
drainResp := &mgmtpb.PoolDrainResp{}
if err = proto.Unmarshal(drpcResp.Body, drainResp); err != nil {
	errMsg = errors.Wrap(err, "unmarshal PoolEvict response").Error()
	drainResp.Status = int32(daos.IOInvalid)
In dRPC we have an error code for this kind of failure: drpc.UnmarshalingPayloadFailure()
IMO when aggregating these results, we really want to flag cases like this as dRPC communication errors, rather than making them look like daos_engine errors.
Agreed; I didn't add this code, just edited it. Will fix in subsequent PR for SystemReint if that's okay.
}

func (cmd *baseExcludeCmd) execute(clear bool) error {
// Execute is run when systemStartCmd activates.
func (cmd *systemStartCmd) Execute(_ []string) (errOut error) {
minor - better not to shuffle the commands around if you can avoid it. Makes it look like more was changed than actually was.
Yes, agreed, but this has to be balanced against keeping some ordering consistency across files, which makes development easier.
Agreed in general, although in this case it made the changes harder to review. In the diff, systemStartCmd moved (presumably didn't change?) and was treated as new code. The new command ended up where systemStartCmd used to be, and the diff portrays it as if you edited that command. I don't need a revert or anything, since I've already gotten through it, but in future it would be best to make those kinds of code moves separately from big PRs like this.
I'm fine with addressing small fixes with the next PR in this series. Thanks Tom.
review comments addressed in #15551
Add a dmg system drain command to drain a set of storage nodes or ranks from all the pools they belong to. Takes --ranks or --rank-hosts in ranged format. Improve unit test coverage for lib/control, cmd/dmg and server/mgmt_system system-related functions.
Required-githooks: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: