DAOS-16477 mgmt: return suspect engines for pool healthy query #15458

Merged: 8 commits into master from shilongw/DAOS-16477_1, Nov 16, 2024

wangshilong
Contributor

After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR provides an administrative interface for querying suspect engines after a massive failure. The suspect engines are reported by the --health-only option of the daos and dmg pool query commands.

Example output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:

  • Disabled ranks: 1
  • Suspect ranks: 2
  • Rebuild busy, 0 objs, 0 recs
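
For anyone scripting against this output, a minimal sketch of pulling the disabled and suspect ranks out of the text shown above. It assumes the output keeps exactly this layout; the function name `parse_health` is hypothetical, and rank values are kept as raw strings because the format used for multiple ranks is not shown in this PR.

```python
import re

# Sample copied from the PR description (plain "-" bullets as printed by the CLI).
SAMPLE = """\
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
"""

# Matches "- Disabled ranks: <value>" or "- Suspect ranks: <value>".
RANKS_LINE = re.compile(
    r"^\s*[-•]\s*(Disabled|Suspect) ranks:\s*(.+?)\s*$", re.MULTILINE
)
# Matches the "state=<word>" field on the pool summary line.
STATE = re.compile(r"\bstate=(\w+)")


def parse_health(text):
    """Return the pool state plus the raw disabled/suspect rank strings."""
    info = {"state": None, "disabled": None, "suspect": None}
    match = STATE.search(text)
    if match:
        info["state"] = match.group(1)
    for kind, ranks in RANKS_LINE.findall(text):
        info[kind.lower()] = ranks
    return info


if __name__ == "__main__":
    print(parse_health(SAMPLE))
```

A key left as `None` means the corresponding line was absent from the text that was parsed.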

Features: DmgPoolQueryRanks
Required-githooks: true

Signed-off-by: Wang Shilong <shilong.wang@intel.com>

github-actions bot commented Nov 6, 2024

Ticket title is 'Provide admin interface to query hanging engines after massive failure'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-16477

@wangshilong wangshilong marked this pull request as ready for review November 7, 2024 00:47
@wangshilong wangshilong requested review from a team as code owners November 7, 2024 00:47
@wangshilong
Contributor Author

To reviewers: This PR landed before but was reverted because of conflicts with the MD-on-SSD phase 2 PR. I refreshed the PR against master and removed a workaround in the pool query tests (the rebuild bug has since been fixed).

liw
liw previously approved these changes Nov 7, 2024
Contributor

@liw liw left a comment

C changes look good to me.

tanabarr
tanabarr previously approved these changes Nov 7, 2024
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/1/execution/node/1478/log

Contributor

@phender phender left a comment

There is at least one fix needed. I also have a couple questions about the intent of some changes.

Outdated, resolved review comments on:
src/tests/ftest/control/dmg_pool_query_ranks.py (7 comments)
src/tests/ftest/util/server_utils_params.py (1 comment)
Required-githooks: true
Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
@wangshilong wangshilong dismissed stale reviews from tanabarr and liw via 92c6882 November 11, 2024 03:51
@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15458/2/testReport/

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/2/execution/node/1196/log

@wangshilong
Contributor Author

@phender, could you help fix the CI environment issue? Thanks!

Features: DmgPoolQueryRanks

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
"CRT_EVENT_DELAY=1",
"COVFILE=/tmp/test.cov"],
EnvironmentalVariable("D_LOG_FILE_APPEND_PID", "1"),
EnvironmentalVariable("DAOS_POOL_RF", "4", True),
Contributor

Why is this necessary? Couldn't we just let all of these be overridden and then we don't need the new class EnvironmentalVariable at all?
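
To make the trade-off in this question concrete, here is a rough sketch of the two approaches. It is not the code from this PR: `EnvVarDefault`, `merge_env`, and the `overridable` flag are hypothetical names, and reading the third positional argument (`True` in the snippet above) as "a test may override this default" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvVarDefault:
    """Hypothetical stand-in for the EnvironmentalVariable class under review."""

    name: str
    value: str
    overridable: bool = False  # assumed meaning of the third argument


def merge_env(defaults, test_env):
    """Merge default variables with test-supplied "NAME=value" strings.

    Only defaults flagged as overridable give way to the test's value.
    """
    requested = dict(item.split("=", 1) for item in test_env)
    merged = {}
    for var in defaults:
        if var.overridable and var.name in requested:
            merged[var.name] = requested[var.name]
        else:
            merged[var.name] = var.value
    # Variables the test adds that have no default are passed through.
    for name, value in requested.items():
        merged.setdefault(name, value)
    return [f"{name}={value}" for name, value in merged.items()]


if __name__ == "__main__":
    defaults = [
        EnvVarDefault("D_LOG_FILE_APPEND_PID", "1"),
        EnvVarDefault("DAOS_POOL_RF", "4", True),
    ]
    print(merge_env(defaults, ["DAOS_POOL_RF=2", "D_LOG_FILE_APPEND_PID=0"]))
```

If every default were overridable, as the question proposes, the per-variable flag would be unnecessary and the merge would reduce to a plain dictionary update of the defaults with the test's values.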

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15458/3/testReport/

@daosbuild1
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/3/execution/node/1150/log

Contributor

@daltonbohning daltonbohning left a comment

ftest LGTM

Contributor

@kjacque kjacque left a comment

C and Go changes LGTM. Comment is minor.

Comment on lines +1429 to +1430
flags |= DEFAULT_QUERY_BITS;
assert_int_equal(ds_mgmt_pool_query_info_in.pi_bits, DEFAULT_QUERY_BITS | flags);
Contributor

Redundant bitwise OR

Suggested change
- flags |= DEFAULT_QUERY_BITS;
- assert_int_equal(ds_mgmt_pool_query_info_in.pi_bits, DEFAULT_QUERY_BITS | flags);
+ assert_int_equal(ds_mgmt_pool_query_info_in.pi_bits, DEFAULT_QUERY_BITS | flags);
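
The redundancy follows from bitwise OR being idempotent: once the default bits are folded into `flags`, OR-ing them in again changes nothing. A small sketch of that check, written in Python for brevity; the value given to `DEFAULT_QUERY_BITS` here is made up, and the two helper functions are hypothetical.

```python
DEFAULT_QUERY_BITS = 0b0011  # made-up value, for illustration only


def expected_two_step(flags):
    """Mirror the original test: fold the defaults in, then OR them again."""
    flags |= DEFAULT_QUERY_BITS
    return DEFAULT_QUERY_BITS | flags


def expected_one_step(flags):
    """Mirror the suggestion: a single OR with the defaults."""
    return DEFAULT_QUERY_BITS | flags


if __name__ == "__main__":
    # Both forms agree for every 8-bit flag combination.
    assert all(expected_two_step(f) == expected_one_step(f) for f in range(256))
    print("both forms produce the same expected value")
```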

@wangshilong wangshilong requested a review from a team November 16, 2024 03:07
@gnailzenh gnailzenh merged commit f512f48 into master Nov 16, 2024
53 of 54 checks passed
@gnailzenh gnailzenh deleted the shilongw/DAOS-16477_1 branch November 16, 2024 06:18
wangshilong pushed a commit that referenced this pull request Nov 18, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 20, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 21, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 22, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 22, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 22, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 25, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Nov 25, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
jolivier23 pushed a commit that referenced this pull request Dec 9, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
jolivier23 pushed a commit that referenced this pull request Dec 9, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
jolivier23 pushed a commit that referenced this pull request Dec 10, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
jolivier23 pushed a commit that referenced this pull request Dec 11, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
jolivier23 pushed a commit that referenced this pull request Dec 11, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
wangshilong pushed a commit that referenced this pull request Dec 12, 2024
* DAOS-16477 mgmt: return suspect engines for pool healthy query

Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
mjmac pushed a commit that referenced this pull request Dec 12, 2024
Required-githooks: true

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
@mjmac mjmac mentioned this pull request Dec 12, 2024
mjmac pushed a commit that referenced this pull request Dec 12, 2024
Required-githooks: true

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
mchaarawi pushed a commit that referenced this pull request Jan 9, 2025
… (#15512)

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>