DAOS-1946 md: keep service up on metadata full condition #2077
044e8bb to 7794dbd
Test stage Functional_Hardware_Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-2077/2/execution/node/632/log
e3b6262 to 5cb4940
Test stage Functional_Hardware_Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-2077/6/testReport/(root)/
One functional test failure, could be DAOS-4302. I've updated that ticket with a tarball of the artifacts from this build #7 run.
Cherry picked commit a614cb1 PR #1956 from daos master branch to release/0.9 branch.
Change rdb_raft state checking code so that when -DER_NOSPACE condition is observed when appending to the raft log, it is handled like the -DER_NOMEM case (become follower, step down). Also trigger rdb log compaction aggressively seeking to reclaim space. Before this change, stopping the service may leave it "dead" impacting subsequent resource destroy operations (e.g., pool destroy).
Re-enable the metadatafill test and run it with multiple (4) servers and pool service replicas (3). Adjust the maximum number of containers to approximately 98% of what can be accommodated in a metadata capacity of 128MB.
Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
Conflicts: src/tests/ftest/server/metadata.py
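The container limit described in the commit message above can be illustrated with a small calculation. This is only a sketch: the function name `max_containers` and the per-container metadata cost are hypothetical (the commit message does not state a per-container size), and 128MB is assumed here to mean 128 MiB.

```python
def max_containers(capacity_bytes, bytes_per_container, percent=98):
    """Return `percent` of the container count that fits in the capacity.

    bytes_per_container is a hypothetical per-container metadata cost;
    the real value depends on the DAOS build and is not given in this PR.
    Integer arithmetic is used so the result is exact.
    """
    if bytes_per_container <= 0:
        raise ValueError("bytes_per_container must be positive")
    fits = capacity_bytes // bytes_per_container
    return fits * percent // 100


# Assumption: 128MB is interpreted as 128 MiB.
CAPACITY = 128 * 1024 * 1024
# Assumption: 32 KiB of metadata per container, chosen only for illustration.
limit = max_containers(CAPACITY, 32 * 1024)
```

With these illustrative numbers, 4096 containers would fit and the test limit would be 4014. A test sized this way stays just under the point where metadata storage is exhausted.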
Cherry picked: commit a614cb1 PR #1956 and commit 0bb952e808906c1575e33a345a6d95a4c3f5bc2 PR #2057 from daos master branch to release/0.9 branch.
Change rdb_raft state checking code so that when -DER_NOSPACE condition is observed when appending to the raft log, it is handled like the -DER_NOMEM case (become follower, step down). Also trigger rdb log compaction aggressively seeking to reclaim space. Before this change, stopping the service may leave it "dead" impacting subsequent resource destroy operations (e.g., pool destroy).
The metadatafill test is being disabled again. While it passes frequently with the above change to DAOS, intermittently it fails with different symptoms when metadata storage is exhausted.
Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space
Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
this PR but blocking its progress in CI.
Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space
Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
that occurs in metadata_add_remove (tag metadata_free_space) test. Separate PR being prepared to address the issue on master.
Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space
Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
8f433de to 7e71e2f
Test-tag-hw-large: pr,hw,large metadatafill metadata_free_space Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
7e71e2f to 7430fa5
Test stage Test CentOS 7 RPMs completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-2077/12/execution/node/507/log
@daos-stack/daos-gatekeeper this one is now ready for review/landing, after a rebase, undoing the Dockerfile.leap.15 change (which had been affecting multiple developers but was solved with a different PR), and retesting successfully.
Cherry picked commit a614cb1 PR #1956
and commit 0bb952e808906c1575e33a345a6d95a4c3f5bc2 PR #2057 from daos
master branch to release/0.9 branch.
Change the rdb_raft state checking code so that a -DER_NOSPACE
condition observed while appending to the raft log is handled
like the -DER_NOMEM case (become follower, step down). Also trigger
rdb log compaction aggressively to reclaim space. Before this
change, stopping the service could leave it "dead", impacting subsequent
resource destroy operations (e.g., pool destroy).
The metadatafill test is being disabled again. While it passes
frequently with the above change to DAOS, intermittently it fails
with different symptoms when metadata storage is exhausted.
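The error-handling change described above can be modeled as a small decision function. This is a Python sketch of the logic only, not the actual C code in rdb_raft: the names `Action` and `handle_append_result` are hypothetical, the error codes are represented as strings rather than real DER_* values, and the choices to request compaction only on the no-space path and to stop the service on other errors are assumptions.

```python
from enum import Enum

# Hypothetical stand-ins for the DAOS error codes named in the description.
DER_NOMEM = "DER_NOMEM"
DER_NOSPACE = "DER_NOSPACE"


class Action(Enum):
    CONTINUE = 1      # append succeeded, keep leading
    STEP_DOWN = 2     # become follower, service stays up
    STOP_SERVICE = 3  # assumed fallback for other errors


def handle_append_result(rc):
    """Map a raft log append result to (action, request_compaction).

    rc is None on success, otherwise one of the error names above
    or any other error string.
    """
    if rc is None:
        return Action.CONTINUE, False
    if rc == DER_NOMEM:
        return Action.STEP_DOWN, False
    if rc == DER_NOSPACE:
        # After this change: treated like DER_NOMEM, and log compaction
        # is requested to try to reclaim space.
        return Action.STEP_DOWN, True
    # Assumption: any other error still stops the service.
    return Action.STOP_SERVICE, False
```

The point of the sketch is that a full metadata store leads to a step-down rather than a stopped service, so later operations such as pool destroy can still be served once a leader is available again.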