fix(localstore): stale push index items #2534
Conversation
if err != nil {
	return true, err
}
if !has {
Actually, the biggest reason I've seen for this to happen is that the timestamp on the retrievalDataIndex item differs from the timestamp stored in the pushIndex. When the pusher gets the chunk, it doesn't carry the pushIndex's own timestamp, so when it later tries to remove the entry from the pushIndex it cannot find it, because it rebuilds the key using the StoreTimestamp from the retrievalDataIndex item.
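To illustrate that failure mode, here is a minimal, self-contained sketch (not bee's actual shed code), assuming for illustration only that the push index key is an 8-byte big-endian StoreTimestamp followed by the chunk address:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pushKey mimics a push-index key built from the StoreTimestamp plus the
// chunk address (an assumed layout, for illustration only).
func pushKey(storeTimestamp int64, addr []byte) []byte {
	key := make([]byte, 8, 8+len(addr))
	binary.BigEndian.PutUint64(key, uint64(storeTimestamp))
	return append(key, addr...)
}

func main() {
	addr := []byte("chunk-address")

	// Key written when the chunk first entered the push index.
	stored := pushKey(1000, addr)

	// Key rebuilt at delete time from the retrievalDataIndex item, whose
	// StoreTimestamp differs from the one the push index was written with.
	rebuilt := pushKey(2000, addr)

	// The delete targets a key that was never written, so the stale push
	// index entry survives and the chunk keeps being pushed.
	fmt.Println("keys equal:", bytes.Equal(stored, rebuilt)) // keys equal: false
}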
But yes, I've also seen cases where the item is in the push index but no longer in the retrieval index. I fixed both of these issues in the push index iterator: during iteration it double-checks the retrievalDataIndex for existence and matching timestamps, then removes or updates the push index entry accordingly.
can you prove that the timestamps are different? i have never seen such a case
Good find @ldeffenb, you are probably referring to localstore's db.setSync (mode_set.go:138), where the deletion will fail because the item is not in the retrievalDataIndex and StoreTimestamp is 0, which leads to an invalid db.pushIndex.DeleteInBatch on that line. That is correct, and it is the best effort in that situation, as no timestamp can be found. We could iterate over the pushIndex to find the address in the key and delete that key (see the sketch below); that is potentially expensive and I did not want to do it, but it is one solution. I think this PR will remove the push index entries whose retrievalDataIndex item is missing, at least in the current state of the code. Adding the check in SubscribePush, as you did, is another nice safety measure.
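For reference, a rough sketch of that "iterate over the pushIndex" fallback, written against the shed-style calls that appear in the snippets in this thread; the exact Iterate/Delete signatures and import paths are assumptions, and this is not tested code:

package localstore // sketch only: would sit next to the existing localstore code

import (
	"bytes"

	"github.com/ethersphere/bee/pkg/shed"
	"github.com/ethersphere/bee/pkg/swarm"
)

// deleteAllForAddress walks the whole push index and removes every entry that
// carries the given address, whatever StoreTimestamp it was keyed with. This
// is the "potentially expensive" option: a full index scan per chunk.
func deleteAllForAddress(pushIndex shed.Index, addr swarm.Address) error {
	return pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		if bytes.Equal(item.Address, addr.Bytes()) {
			if err := pushIndex.Delete(item); err != nil {
				return true, err
			}
		}
		return false, nil
	}, nil)
}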
Here's my code from SubscribePush, just inside the if err != nil from the db.retrievalDataIndex.Get(item):
	if errors.Is(err, leveldb.ErrNotFound) {
		// stale entry: the chunk is gone from retrievalDataIndex, so drop it
		// from the push index instead of aborting the whole iteration
		db.logger.Debugf("SubscribePush:chunk %s err %v", swarm.NewAddress(item.Address), err)
		err2 := db.pushIndex.Delete(item)
		if err2 != nil {
			db.logger.Debugf("SubscribePush:chunk %s delete err %v", swarm.NewAddress(item.Address), err2)
		}
	}
	return false, nil
	//return true, err
}
if item.StoreTimestamp != dataItem.StoreTimestamp {
	// timestamp mismatch: re-key the push index entry with the timestamp from
	// retrievalDataIndex so later deletes can find it
	db.logger.Debugf("SubscribePush:chunk %s itemStore %d != %d dataItem.Store now %d", swarm.NewAddress(item.Address), item.StoreTimestamp, dataItem.StoreTimestamp, now())
	err2 := db.pushIndex.Delete(item)
	if err2 != nil {
		db.logger.Debugf("SubscribePush:chunk %s pushIndex.Delete err %v", swarm.NewAddress(item.Address), err2)
	} else {
		item.StoreTimestamp = dataItem.StoreTimestamp
		err3 := db.pushIndex.Put(item)
		if err3 != nil {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex.Put err %v", swarm.NewAddress(item.Address), err3)
		} else {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex corrected to StoreTimestamp %d", swarm.NewAddress(item.Address), item.StoreTimestamp)
		}
	}
	return false, nil
}
And yes, I've seen the debug log showing that the timestamps are different. It seems to happen mostly when a bee node crashes while something is actively pushing data into it and then re-uploads the same data after the node restarts.
It is very worrisome that such a problem shows up on crash, as leveldb Batch should provide atomicity for the operations we do in leveldb. We should investigate data inconsistencies on crashes. I do not recall that something similar was done earlier, but I may be wrong about that. Thanks for a valuable insight and for sharing the solution.
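For context, the atomicity being relied on here is the standard goleveldb batch write: everything in one batch is applied together or not at all, while durability of the most recent batches still depends on the Sync write option. A minimal standalone example:

package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/batch-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Both writes below land in one batch, so after a crash either both are
	// visible or neither is; a retrievalDataIndex entry and its pushIndex
	// entry written this way should not be able to drift apart.
	batch := new(leveldb.Batch)
	batch.Put([]byte("retrieval/chunk1"), []byte("data+timestamp"))
	batch.Put([]byte("push/timestamp+chunk1"), nil)
	if err := db.Write(batch, nil); err != nil {
		log.Fatal(err)
	}
}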
I've thought about writing an index consistency checker that would run at startup and at least report any discrepancies it finds (a rough sketch follows below). Such a thing could also be used to "prime" several metrics that are rather misleading as they are now. bee_pusher_total_to_push comes to mind: it only counts NEW chunks and does not include any that were already in the pusher queue at startup.
So many ideas, so little time!
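A rough sketch of what such a startup checker could look like, reusing the shed-style calls from the snippets above; names such as pushIndex and retrievalDataIndex follow this thread, and the exact signatures are assumptions:

// checkPushIndex reports how many push index entries exist, how many have no
// backing chunk in retrievalDataIndex, and how many are keyed with a stale
// StoreTimestamp. The total could also prime a startup gauge such as
// bee_pusher_total_to_push.
func (db *DB) checkPushIndex() (total, orphaned, mismatched int, err error) {
	err = db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		total++
		dataItem, err := db.retrievalDataIndex.Get(item)
		if errors.Is(err, leveldb.ErrNotFound) {
			orphaned++ // entry points at a chunk that is no longer stored
			return false, nil
		}
		if err != nil {
			return true, err
		}
		if dataItem.StoreTimestamp != item.StoreTimestamp {
			mismatched++ // entry keyed with a timestamp the data index no longer has
		}
		return false, nil
	}, nil)
	return total, orphaned, mismatched, err
}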
can you prove that the timestamps are different? i have never seen such a case
The easiest way to tell this has happened is the following log in setSync, which comes after the chunk is successfully pushed. The same chunk then pushes over and over again, which is how I first discovered it.
i, err = db.pushIndex.Get(item)
if err != nil {
	if errors.Is(err, leveldb.ErrNotFound) {
		// we handle this error internally, since this is an internal inconsistency of the indices
		// this error can happen if the chunk is put with ModePutRequest or ModePutSync
		// but this function is called with ModeSetSync
		db.logger.Debugf("localstore: chunk with address %s timestamp %d not found in push index", addr, item.StoreTimestamp)
	} else {
		return 0, err
	}
}
I added the timestamp to this log in my fork. The comment before the log was already present.
Thanks @ldeffenb for the valuable details. It is very strange that the store timestamp changes, as every Put call explicitly checks whether the chunk is already present in the database before storing it, and that check is done under the batchMu mutex lock to ensure there are no data races even at the database level. It is very interesting how retrievalDataIndex and pushIndex can end up with different store timestamps, as that would mean either that the chunk is overwritten or that there is a more subtle bug in localstore's store timestamp handling.
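To make the expected invariant concrete, here is a simplified, self-contained illustration of the check-then-store pattern being described; it is not the actual localstore Put, just the property it is meant to preserve:

package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	data           []byte
	storeTimestamp int64
}

type store struct {
	mu        sync.Mutex       // plays the role of batchMu
	retrieval map[string]entry // stands in for retrievalDataIndex
	push      map[string]int64 // stands in for the timestamp used in the push key
}

func (s *store) put(addr string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.retrieval[addr]; ok {
		// chunk already present: keep the original timestamp, do not re-key
		return
	}
	ts := time.Now().UnixNano()
	s.retrieval[addr] = entry{data: data, storeTimestamp: ts}
	s.push[addr] = ts // both "indexes" see the same timestamp
}

func main() {
	s := &store{retrieval: map[string]entry{}, push: map[string]int64{}}
	s.put("chunk1", []byte("payload"))
	s.put("chunk1", []byte("payload")) // re-upload is a no-op
	fmt.Println(s.retrieval["chunk1"].storeTimestamp == s.push["chunk1"]) // true
}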
I agree that it should not happen; I only know that it DOES happen under some rare condition, always when there is heavy uploading activity going on. I have not searched through the original store/put logic to see if I could figure it out, though.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @acud and @mrekucci)
Reviewed 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @acud)
Looks good. Thanks.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
GC is not removing the pushIndex items, resulting in a corrupt index where SubscribePush cannot find the chunks that pop out of the index. A migration is added to remove the stale entries.
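In terms of code, the migration described here amounts to something like the following sketch; the actual implementation in this PR may differ, and the signatures follow the snippets quoted in the conversation above:

// migrateStalePushIndex walks the push index and drops entries whose chunk is
// no longer in retrievalDataIndex, so SubscribePush stops popping chunks that
// cannot be found. Sketch only.
func (db *DB) migrateStalePushIndex() error {
	return db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		_, err = db.retrievalDataIndex.Get(item)
		if errors.Is(err, leveldb.ErrNotFound) {
			// stale entry left behind: remove it from the push index
			return false, db.pushIndex.Delete(item)
		}
		if err != nil {
			return true, err
		}
		return false, nil
	}, nil)
}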