DAOS-16477 mgmt: return suspect engines for pool healthy query (#15458) #15512
Ticket title is 'Provide admin interface to query hanging engines after massive failure'
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/1/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/1/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/1/testReport/
e646c43 to e8e6dd7
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/2/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/2/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/2/testReport/
e8e6dd7 to e7a7d80
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/3/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/3/testReport/
e7a7d80 to e415290
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/4/testReport/
e415290 to 1c57dd0
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15512/5/testReport/
1c57dd0 to 0ef9e87
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15512/6/execution/node/1131/log
1f7ed6a to 67bee2d
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15512/8/execution/node/1098/log
* DAOS-16477 mgmt: return suspect engines for pool healthy query

After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.

An example of output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
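The selection rule described in this commit message can be illustrated with a minimal, self-contained C sketch. It does not use the DAOS API: the `member_state` enum, the `engine` struct, and `collect_suspects()` are hypothetical stand-ins, and the sketch assumes that a suspect engine is one that failure detection reports as dead while the pool has not excluded it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a failure-detection status; not the DAOS enum. */
enum member_state { MEMBER_ALIVE, MEMBER_DEAD };

/* Hypothetical view of one engine as seen by a pool health query. */
struct engine {
	unsigned int      rank;
	enum member_state state;    /* status reported by failure detection */
	bool              excluded; /* already excluded (disabled) in the pool */
};

/* Assumption: an engine is suspect when it is dead but not yet excluded. */
static bool is_suspect(const struct engine *e)
{
	return e->state == MEMBER_DEAD && !e->excluded;
}

/*
 * Copy the ranks of suspect engines into out[] (at most cap entries) and
 * return the total number of suspect engines found.
 */
static size_t collect_suspects(const struct engine *engines, size_t n,
			       unsigned int *out, size_t cap)
{
	size_t found = 0;

	/* Passing NULL arrays with a non-zero length is a caller bug. */
	assert(engines != NULL || n == 0);
	assert(out != NULL || cap == 0);

	for (size_t i = 0; i < n; i++) {
		if (!is_suspect(&engines[i]))
			continue;
		if (found < cap)
			out[found] = engines[i].rank;
		found++;
	}
	return found;
}
```

With three engines where rank 1 is dead and excluded and rank 2 is dead but not excluded, this sketch reports only rank 2 as suspect, which mirrors the "Disabled ranks: 1" and "Suspect ranks: 2" lines in the example output.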
67bee2d to 95bdcdf
This port is for 2.6. Reviewers, please raise the priority of this. Thanks!
We'll need to backport a few other master commits related to "suspect"->"dead" and the documentation, is that right? One might want to use git log <path> to find out which.
Yes, we need to backport another two PRs (one rename and one fix, as I remember).
LGTM.
Nice work which should be very helpful for support tasks :-)
		rc == -DER_NONEXIST ? -1 : state.sms_status);
	if (rc == -DER_NONEXIST || state.sms_status == SWIM_MEMBER_DEAD) {
		rc = d_rank_list_append(rank_list, doms[i].do_comp.co_rank);
		if (rc)
NIT: from my understanding, an assert would be better suited here.
It might fail because it runs out of memory, so I suppose returning an error here is better?
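The trade-off discussed in this thread can be sketched in plain C. This is not the DAOS implementation: `rank_list`, `rank_list_append()`, and `collect_dead_ranks()` are hypothetical names, and the sketch assumes the append can fail only when memory allocation fails. The idea is that an assert suits conditions that indicate a programming bug, while an allocation failure is a runtime condition that the caller should see as an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable rank list; not the DAOS d_rank_list_t. */
struct rank_list {
	unsigned int *ranks;
	size_t        nr;
};

/* Append one rank; returns 0 on success or -ENOMEM if realloc fails. */
static int rank_list_append(struct rank_list *list, unsigned int rank)
{
	unsigned int *tmp;

	/* A NULL list is a caller bug, so an assert is appropriate here. */
	assert(list != NULL);

	tmp = realloc(list->ranks, (list->nr + 1) * sizeof(*tmp));
	if (tmp == NULL)
		return -ENOMEM; /* runtime failure: report it, do not abort */

	tmp[list->nr] = rank;
	list->ranks = tmp;
	list->nr++;
	return 0;
}

/*
 * Collect the indices flagged as dead into out, treating each index as a
 * rank. The first append error is propagated to the caller, which still
 * owns (and must free) whatever was appended so far.
 */
static int collect_dead_ranks(const bool *dead, size_t n, struct rank_list *out)
{
	for (size_t i = 0; i < n; i++) {
		int rc;

		if (!dead[i])
			continue;
		rc = rank_list_append(out, (unsigned int)i);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

Returning the error keeps an out-of-memory condition recoverable for the caller, whereas an assert on the append result would abort the whole process.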
ftest LGTM
@daos-stack/daos-gatekeeper would you consider force-landing this PR? Thanks!
The Build/Run DAOS/NLT test failure looks like an environment failure:
After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them.
This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.
An example of output of dmg pool query --health-only:
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: