-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-6967 vos: Data loss on updates after uncommitted punch #4899
Conversation
If a punch of a parent object/key is uncommited, a new incarnation log entry needs to be inserted. This reguressed with the addition of minor epochs. Added a regression test that exercises the problematic case. Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
* Modify aggregate to bypass aggregation if child tree was aborted * Fix an issue with incarnation log. We don't need to abort aggregation if the uncommitted entry is after the aggregation period so move the check. Note, there is an issue with single value still because it doesn't abort the aggregation under such cases so this is a WIP Skip-test: true Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
btr_check_availability for single value but I think this is best left for 2.0 Modify the reset to also reset the ta_reprobe flags Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Note, I have ideas for lots more VOS level testing....but it doesn't seem practical for a 1.2 patch at this point |
Failure is due to DAOS-7016 |
src/common/btree.c
Outdated
@@ -3658,6 +3665,10 @@ dbtree_iter_delete(daos_handle_t ih, void *args) | |||
if (rc != 0) | |||
return rc; | |||
|
|||
/** The probed entry is not committed so return -DER_TX_BUSY */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why an aborted entry can't be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. This needs to be for in-progress entries. We need another way to determine if an entry is aborted. I don't think the dtx_status stuff is working for single value
src/vos/vos_aggregate.c
Outdated
@@ -409,6 +413,16 @@ vos_agg_sv(daos_handle_t ih, vos_iter_entry_t *entry, | |||
|
|||
delete: | |||
rc = agg_del_entry(ih, agg_param->ap_umm, entry, acts); | |||
if (rc == -DER_TX_BUSY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the uncommitted check done above (by checking the entry->ie_dtx_state)?
src/vos/vos_aggregate.c
Outdated
@@ -180,7 +183,8 @@ agg_del_entry(daos_handle_t ih, struct umem_instance *umm, | |||
rc = umem_tx_commit(umm); | |||
|
|||
if (rc) { | |||
D_ERROR("Failed to delete entry: "DF_RC"\n", DP_RC(rc)); | |||
D_CDEBUG(rc == -DER_TX_BUSY, DB_TRACE, DLOG_ERR, | |||
"Failed to delete entry: "DF_RC"\n", DP_RC(rc)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think current aggregation can ensure not deleting prepared entries, no?
* moving forward. | ||
* moving forward. We do, however, need to ensure we do not | ||
* aggregate anything in the parent path. Otherwise, we could | ||
* orphan the current entry due to incarnation log semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite see the "incarnation log semantics" here, could you explain a bit?
I don't see how this "ap_skip_obj/dkey/akey" works as well, the intention is to set the skip flag for current epoch range aggregation? Or for all epoch ranges aggregation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with you offline but for everyone's benefit, the issue is that the incarnation log doesn't contain enough information to know at the key level that there are in-progress updates that we may orphan by aggregating the ilog. The example I used is this one
- committed update at 1
- in progress update at 2 - this will not generate an incarnation log entry
- punch at 3
From the ilog perspective, we can delete everything but the subtree will not be empty due to the inprogress update.
may need re-probe. Evtree reprobe does the sort over again anyway. Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
Test stage NLT completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-4899/7/display/redirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
break; | ||
D_DEBUG(DB_EPC, "Hit uncommitted single value at epoch:" | ||
DF_X64"\n", entry->ie_epoch); | ||
return -DER_TX_BUSY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments above need be revised, the original design is to skip the uncommitted epoch and continue on the next lower epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will fix that.
if (acts & VOS_ITER_CB_RESTART) { | ||
daos_anchor_set_zero(anchor); | ||
probe_anchor = NULL; | ||
goto probe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply re-probe won't work for sorted evtree iteration, you need to re-prepare the sorted entries on restart, since the deletion before restart have invalidated the visibility of evtree entries.
I think a simpler way is to rely on next round of aggregation to aggregate the same epoch range again. (we just need to skip updating HAE when aggregation aborted on hitting aborted or uncommitted entries.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to re-prepare. For sorted iterator, probe is what re-reads all the entries and sorts them so it works to simply re-probe in this case. See evt_iter_probe_sorted. I suppose it may be a bit weird for upper levels but it's only really used by evtree so maybe it's fine. I originally did re-prepare but this actually requires a reprobe of the parent iterator and is more complicated.
We could do as you say and just abort but what is the advantage of doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was thinking reprobe won't re-read all entries.
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
return true; | ||
|
||
return minor_epc <= agg_arg->aa_punched_minor; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (entry->ie_id.id_punch_minor_eph > entry->ie_id.id_update_minor_eph &&
entry->ie_id.id_punch_minor_eph > agg_arg->aa_punched_minor &&
entry->ie_id.id_epoch == agg_arg->aa_punched)
Then it is a punched entry or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the entry is punched (e.g. punch_minor > update_minor) then we will determine later that the entry is punched. But this check is only about whether it is punched entirely by the parent.
@@ -683,11 +681,18 @@ vos_iterate_internal(vos_iter_param_t *param, vos_iter_type_t type, | |||
if (acts & VOS_ITER_CB_ABORT) | |||
break; | |||
|
|||
if (acts & VOS_ITER_CB_RESTART) { | |||
daos_anchor_set_zero(anchor); | |||
probe_anchor = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set "probe_anchor" as NULL may cause some potential trouble in our current implementation, please check evt_iter_probe_sorted(), it assumes that the input parameter "rect" and "anchor" will not be both NULL. But that is not true in your patch. I am not quite sure for that yet, but need some detailed check/test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anchor is NULL, it passes EVT_ITER_FIRST to evt_iter_probe. In this case, it doesn't reach the scenario your speak of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
src/vos/vos_aggregate.c
Outdated
*acts |= VOS_ITER_CB_ABORT; | ||
if (rc == -DER_CSUM) | ||
agg_param->ap_csum_err = true; | ||
if (rc == -DER_TX_RESTART) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will be always false since you have checked "if (rc == -DER_CSUM || rc == -DER_TX_BUSY)" at L1834.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I guess I never actually hit this case because the case that was causing issues was that aborted entries were getting ignored entirely. I'll fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked this a bit
assert_memory_equal(buf, "xxxxx", 5); | ||
|
||
memset(buf, 'x', sizeof(buf)); | ||
epoch += 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unnecessary to bump the epoch again for another fetch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True but it also doesn't hurt anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not hurt anything, just redundant.
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
assert_memory_equal(buf, "xxxxx", 5); | ||
|
||
memset(buf, 'x', sizeof(buf)); | ||
epoch += 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not hurt anything, just redundant.
@@ -683,11 +681,18 @@ vos_iterate_internal(vos_iter_param_t *param, vos_iter_type_t type, | |||
if (acts & VOS_ITER_CB_ABORT) | |||
break; | |||
|
|||
if (acts & VOS_ITER_CB_RESTART) { | |||
daos_anchor_set_zero(anchor); | |||
probe_anchor = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
The main issue we are trying to prevent is removal of a subtree that isn't empty. This typically means that we have a bug in aggregation or in the incarnation log This patch does the following: Modifies the btree code so that it returns aborted entries during aggregation. Modify ilog_aggregate so it takes a minor epoch into account. Previously, if we had a punch of a parent, it would remove the child at the same major epoch even if it had updates afterward. Modify the incarnation log to ensure we take the punch of a parent into account when doing an update 4.Modify aggregation so that we skip aggregation of parent objects if a child iterator hits -DER_TX_BUSY. This avoids removing orphaned subtrees. Aggregation of the incarnation log was returning -DER_TX_BUSY for entries that are after the aggregated range. Moved the check to after the range check. Modify evtree aggregation to restart the current tree in case of removing an aborted entry rather than returning -DER_TX_BUSY which would cause upper layer to abort too. Fixed a few other issues with aggregation. Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
…5059) The main issue we are trying to prevent is removal of a subtree that isn't empty. This typically means that we have a bug in aggregation or in the incarnation log This patch does the following: Modifies the btree code so that it returns aborted entries during aggregation. Modify ilog_aggregate so it takes a minor epoch into account. Previously, if we had a punch of a parent, it would remove the child at the same major epoch even if it had updates afterward. Modify the incarnation log to ensure we take the punch of a parent into account when doing an update 4.Modify aggregation so that we skip aggregation of parent objects if a child iterator hits -DER_TX_BUSY. This avoids removing orphaned subtrees. Aggregation of the incarnation log was returning -DER_TX_BUSY for entries that are after the aggregated range. Moved the check to after the range check. Modify evtree aggregation to restart the current tree in case of removing an aborted entry rather than returning -DER_TX_BUSY which would cause upper layer to abort too. Fixed a few other issues with aggregation. Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
The main issue we are trying to prevent is removal of a subtree that isn't empty.
This typically means that we have a bug in aggregation or in the incarnation log
This patch does the following:
4.Modify aggregation so that we skip aggregation of parent objects if a child iterator hits -DER_TX_BUSY. This avoids removing orphaned subtrees.
Signed-off-by: Jeff Olivier jeffrey.v.olivier@intel.com