DAOS-1946 md: keep service up on metadata full condition #2077

Merged: 5 commits into release/0.9 from kccain/daos_1946_0p9 on Mar 18, 2020

@kccain kccain commented Mar 10, 2020

Cherry-picked commit a614cb1 (PR #1956) and commit
0bb952e808906c1575e33a345a6d95a4c3f5bc2 (PR #2057) from the daos
master branch to the release/0.9 branch.

Change the rdb_raft state-checking code so that a -DER_NOSPACE
condition observed while appending to the raft log is handled
like the -DER_NOMEM case (become follower, step down). Also trigger
rdb log compaction aggressively to reclaim space. Before this
change, stopping the service could leave it "dead", which affected
subsequent resource destroy operations (e.g., pool destroy).
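
As a rough illustration of the handling described above, here is a toy Python model. It is not the DAOS C code: the `Err` and `Role` enums, the `ToyReplica` class, and the `check_state` method are hypothetical names invented for this sketch. It also assumes that any other append error still stops the replica and that compaction is requested only in the -DER_NOSPACE case.

```python
from enum import Enum


class Err(Enum):
    """Hypothetical stand-ins for DAOS return codes (not the real values)."""
    OK = 0
    NOMEM = 1      # models -DER_NOMEM
    NOSPACE = 2    # models -DER_NOSPACE
    OTHER = 3      # models any other append failure


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"


class ToyReplica:
    """Toy model of one service replica reacting to a raft log append result."""

    def __init__(self):
        self.role = Role.LEADER
        self.running = True
        self.compaction_requested = False

    def check_state(self, append_rc):
        """Return True if the replica is still running after the check."""
        if append_rc in (Err.NOMEM, Err.NOSPACE):
            # Resource exhaustion: step down and become a follower,
            # but keep the service up instead of stopping it.
            self.role = Role.FOLLOWER
            if append_rc is Err.NOSPACE:
                # Assumption: ask for log compaction to try to reclaim space.
                self.compaction_requested = True
        elif append_rc is not Err.OK:
            # Assumption: other errors still stop the replica.
            self.running = False
        return self.running


replica = ToyReplica()
still_up = replica.check_state(Err.NOSPACE)
print(still_up, replica.role.value, replica.compaction_requested)
```

The point of the sketch is that a full-metadata condition leaves the replica running as a follower, so later operations such as pool destroy still have a live service to talk to.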

The metadatafill test is being disabled again. While it passes
frequently with the above change to DAOS, intermittently it fails
with different symptoms when metadata storage is exhausted.

@kccain kccain force-pushed the kccain/daos_1946_0p9 branch from 044e8bb to 7794dbd Compare March 10, 2020 20:11
@daosbuild1
Collaborator

Test stage Functional_Hardware_Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-2077/2/execution/node/632/log

@kccain kccain force-pushed the kccain/daos_1946_0p9 branch 2 times, most recently from e3b6262 to 5cb4940 Compare March 13, 2020 17:13
@daosbuild1
Collaborator

Test stage Functional_Hardware_Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-2077/6/testReport/(root)/

@kccain
Contributor Author

kccain commented Mar 16, 2020

One functional test failure, could be DAOS-4302. I've updated that ticket with a tarball of the artifacts from this build #7 run.

@kccain kccain requested a review from a team March 16, 2020 13:01
kccain added 4 commits March 18, 2020 10:27
Cherry picked commit a614cb1 PR #1956
from daos master branch to release/0.9 branch.

Change the rdb_raft state-checking code so that a -DER_NOSPACE
condition observed while appending to the raft log is handled
like the -DER_NOMEM case (become follower, step down). Also trigger
rdb log compaction aggressively to reclaim space. Before this
change, stopping the service could leave it "dead", which affected
subsequent resource destroy operations (e.g., pool destroy).

Re-enable the metadatafill test and run it with multiple (4) servers
and pool service replicas (3). Adjust the maximum number of containers
to approximately 98% of what can be accommodated in a metadata
capacity of 128MB.

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>

Conflicts:
	src/tests/ftest/server/metadata.py
Cherry picked:
commit a614cb1 PR #1956 and
commit 0bb952e808906c1575e33a345a6d95a4c3f5bc2 PR #2057
from daos master branch to release/0.9 branch.

Change the rdb_raft state-checking code so that a -DER_NOSPACE
condition observed while appending to the raft log is handled
like the -DER_NOMEM case (become follower, step down). Also trigger
rdb log compaction aggressively to reclaim space. Before this
change, stopping the service could leave it "dead", which affected
subsequent resource destroy operations (e.g., pool destroy).

The metadatafill test is being disabled again. While it passes
frequently with the above change to DAOS, intermittently it fails
with different symptoms when metadata storage is exhausted.

Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
this PR but blocking its progress in CI.

Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
that occurs in metadata_add_remove (tag metadata_free_space) test.
Separate PR being prepared to address the issue on master.

Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
@kccain kccain force-pushed the kccain/daos_1946_0p9 branch from 8f433de to 7e71e2f Compare March 18, 2020 14:54
Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
@kccain kccain force-pushed the kccain/daos_1946_0p9 branch from 7e71e2f to 7430fa5 Compare March 18, 2020 14:56
@daosbuild1
Collaborator

Test stage Test CentOS 7 RPMs completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-2077/12/execution/node/507/log

@kccain
Contributor Author

kccain commented Mar 18, 2020

@daos-stack/daos-gatekeeper this one is now ready for review/landing, after a rebase, undoing the Dockerfile.leap.15 change that had been affecting multiple developers but was solved by a different PR, and retesting successfully.

@jolivier23 jolivier23 merged commit 1d6e80e into release/0.9 Mar 18, 2020
@jolivier23 jolivier23 deleted the kccain/daos_1946_0p9 branch March 18, 2020 21:51