
fix(localstore): stale push index items #2534

Merged
merged 1 commit into master from gc-rm-push on Sep 24, 2021
Conversation

Member

@acud acud commented Sep 22, 2021

GC was not removing the push index items, resulting in a corrupt index: subscribe push cannot find the chunks that pop out of the index. A migration is added to remove the stale entries.
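
For orientation, here is a minimal sketch of what such a cleanup migration could look like, assuming the shed-style index helpers used in localstore (Iterate, Has, DeleteInBatch) and the usual goleveldb batch; migrateRemoveStalePushIndexItems and db.shed.WriteBatch are assumed names for this illustration, not a quote of the code merged in this PR:

// sketch: drop push index entries whose chunk is no longer in retrievalDataIndex
func migrateRemoveStalePushIndexItems(db *DB) error {
	batch := new(leveldb.Batch)
	err := db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		has, err := db.retrievalDataIndex.Has(item)
		if err != nil {
			return true, err
		}
		if !has {
			// chunk was garbage-collected but its push index entry survived
			if err := db.pushIndex.DeleteInBatch(batch, item); err != nil {
				return true, err
			}
		}
		return false, nil
	}, nil)
	if err != nil {
		return err
	}
	return db.shed.WriteBatch(batch)
}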



@acud acud force-pushed the gc-rm-push branch 3 times, most recently from df9d66f to f79dcd3 on September 22, 2021 16:42
@acud acud self-assigned this Sep 22, 2021
@acud acud requested review from janos and mrekucci September 22, 2021 16:44
@acud acud changed the title fix: gc remove from push index too fix(localstore): stale push index items Sep 22, 2021
if err != nil {
	return true, err
}
if !has {

Collaborator

Actually, the biggest reason I've seen for this to happen is that the timestamp on the retrievalDataIndex item is different from the stored timestamp in the pushIndex. When the pusher gets the item from the index it doesn't get the index's timestamp, so when it tries to remove it from the index it cannot find the entry, because it uses the stored timestamp from the retrievalDataIndex item.
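
For context: the mismatch matters because the push index key is derived from the store timestamp followed by the chunk address, so an item carrying a different StoreTimestamp resolves to a different key and the delete simply misses the stored entry. A rough sketch of that keying (pushIndexKey is an illustrative helper, and the field widths are assumptions, not the exact localstore encoding):

// illustrative only: a push-index-style key that prefixes the chunk address
// with its store timestamp; a mismatched timestamp produces a different key
func pushIndexKey(storeTimestamp int64, address []byte) []byte {
	key := make([]byte, 8+len(address))
	binary.BigEndian.PutUint64(key[:8], uint64(storeTimestamp))
	copy(key[8:], address)
	return key
}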

Collaborator

But yes, I've also seen it happen that the item is in the push index but no longer in the retrieval index. I fixed both of these issues in the push index iterator by checking the existence and the timestamps and fixing them appropriately: removing or updating the push index entry based on double-checks against the retrievalDataIndex during the push index iteration.

Member Author

Can you prove that the timestamps are different? I have never seen such a case.

Member

Good find @ldeffenb, you are probably referring to localstore's db.setSync (mode_set.go:138), where the deletion will fail because the item is not in the retrievalDataIndex and StoreTimestamp is 0, which leads to an invalid db.pushIndex.DeleteInBatch on that line. That is correct, and it is the best effort in that situation to try to delete, as no timestamp can be found. We could iterate over the pushIndex to find the address in the key and delete that key; this is potentially expensive and I did not want to do it, but it is one solution. I think that this PR will remove the stale entry for the missing retrievalDataIndex item, at least at the current state of the code. Adding the check in SubscribePush, as you did, is another nice safety measure.
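
As a rough illustration of the full-scan fallback mentioned above, and of why it is costly: since the push index key starts with the unknown timestamp, the address cannot be looked up directly and the whole index has to be walked. deleteStalePushEntryByAddress is a hypothetical helper sketched against the shed iterator API, not existing code:

// hypothetical: scan the whole push index for a key carrying this address,
// because the key is prefixed by the (unknown) store timestamp
func deleteStalePushEntryByAddress(db *DB, addr swarm.Address) error {
	return db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		if bytes.Equal(item.Address, addr.Bytes()) {
			// found the orphaned entry; delete it and stop iterating
			return true, db.pushIndex.Delete(item)
		}
		return false, nil
	}, nil)
}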

Collaborator
Here's my code from SubscribePush, just inside the if err != nil from the db.retrievalDataIndex.Get(item):

dataItem, err := db.retrievalDataIndex.Get(item)
if err != nil {
	if errors.Is(err, leveldb.ErrNotFound) {
		// chunk is gone from the retrieval index: drop the stale push index entry
		db.logger.Debugf("SubscribePush:chunk %s err %v", swarm.NewAddress(item.Address), err)
		err2 := db.pushIndex.Delete(item)
		if err2 != nil {
			db.logger.Debugf("SubscribePush:chunk %s delete err %v", swarm.NewAddress(item.Address), err2)
		}
	}
	return false, nil
	//return true, err
}
if item.StoreTimestamp != dataItem.StoreTimestamp {
	// timestamps disagree: re-key the push index entry using the retrieval index timestamp
	db.logger.Debugf("SubscribePush:chunk %s itemStore %d != %d dataItem.Store now %d", swarm.NewAddress(item.Address), item.StoreTimestamp, dataItem.StoreTimestamp, now())
	err2 := db.pushIndex.Delete(item)
	if err2 != nil {
		db.logger.Debugf("SubscribePush:chunk %s pushIndex.Delete err %v", swarm.NewAddress(item.Address), err2)
	} else {
		item.StoreTimestamp = dataItem.StoreTimestamp
		err3 := db.pushIndex.Put(item)
		if err3 != nil {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex.Put err %v", swarm.NewAddress(item.Address), err3)
		} else {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex corrected to StoreTimestamp %d", swarm.NewAddress(item.Address), item.StoreTimestamp)
		}
	}
	return false, nil
}

And yes, I've seen the debug log showing that the timestamps are different. It seems to happen mostly when a bee node crashes while something is actively pushing data into it and then re-uploads the same data after the node restarts.

Member

It is very worrisome that such a problem is present on crash, as leveldb Batch should provide atomicity for the operations that we do in leveldb. We should investigate data inconsistencies on crashes. I do not recall that something similar was performed earlier; I may be wrong about that. Thanks for the valuable insight and for sharing the solution.

Collaborator

I've thought about writing an index consistency checker that would run at startup and at least report any discrepancies it finds. Such a thing could also be used to "prime" several metrics that are rather misleading as they are now; bee_pusher_total_to_push comes to mind, which only includes NEW chunks but does not count any that were already in the pusher queue at startup.

So many ideas, so little time!
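
A bare-bones sketch of that checker idea, counting healthy versus orphaned push index entries at startup; checkPushIndexConsistency and the counter names are assumptions, and the metric wiring is left out:

// hypothetical startup check: report push index entries whose chunk data is
// missing, and the genuine backlog that a gauge could be primed with
func checkPushIndexConsistency(db *DB) (backlog, orphaned int, err error) {
	err = db.pushIndex.Iterate(func(item shed.Item) (stop bool, iterErr error) {
		has, iterErr := db.retrievalDataIndex.Has(item)
		if iterErr != nil {
			return true, iterErr
		}
		if !has {
			orphaned++ // push entry whose chunk data is gone
		} else {
			backlog++ // chunk still waiting to be pushed at startup
		}
		return false, nil
	}, nil)
	return backlog, orphaned, err
}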

Collaborator

> Can you prove that the timestamps are different? I have never seen such a case.

The easiest way to tell this has happened is the following log sequence in setSync, which comes after the chunk is successfully pushed. The same chunk then pushes over and over and over again, which is how I first discovered it.

	i, err = db.pushIndex.Get(item)
	if err != nil {
		if errors.Is(err, leveldb.ErrNotFound) {
			// we handle this error internally, since this is an internal inconsistency of the indices
			// this error can happen if the chunk is put with ModePutRequest or ModePutSync
			// but this function is called with ModeSetSync
			db.logger.Debugf("localstore: chunk with address %s timestamp %d not found in push index", addr, item.StoreTimestamp)
		} else {
			return 0, err
		}
	}

I added the timestamp to this log in my fork. The comment before the log was already present.

Member

Thanks @ldeffenb for the valuable details. It is very strange that the storage time changes, as every Put call explicitly checks whether the chunk is already present in the database before it stores it, and that check is done under the batchMu mutex lock to ensure that there are no data races even at the database level. It is very interesting how it is possible to have different store timestamps in retrievalDataIndex and pushIndex, as that would mean that the chunk is overwritten, or that there is a more subtle bug related to store timestamp handling in localstore.
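
To make the described invariant concrete, here is a paraphrased sketch of that put pattern, with assumed names (putSketch and the PutInBatch wiring) rather than the actual ModePut code: presence is checked and the store timestamp assigned under the same batchMu lock, so two puts of the same chunk should not end up with different timestamps.

func putSketch(db *DB, item shed.Item) error {
	db.batchMu.Lock()
	defer db.batchMu.Unlock()

	has, err := db.retrievalDataIndex.Has(item)
	if err != nil {
		return err
	}
	if has {
		// already stored: keep the existing StoreTimestamp
		return nil
	}
	item.StoreTimestamp = now()
	batch := new(leveldb.Batch)
	if err := db.retrievalDataIndex.PutInBatch(batch, item); err != nil {
		return err
	}
	if err := db.pushIndex.PutInBatch(batch, item); err != nil {
		return err
	}
	return db.shed.WriteBatch(batch)
}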

Collaborator

I agree that it should not happen; I only know that it DOES happen in some rare condition, but always when there is heavy uploading activity going on. I have not searched through the original store/put logic to see if I could figure it out, though.

Contributor

@mrekucci mrekucci left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @acud and @mrekucci)

@mrekucci mrekucci self-requested a review September 22, 2021 16:55

Contributor

@mrekucci mrekucci left a comment

Reviewed 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @acud)

Member

@janos janos left a comment

Looks good. Thanks.

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 1 (A)

No coverage information
Duplication: 50.8%

@acud acud merged commit df75bef into master Sep 24, 2021
@acud acud deleted the gc-rm-push branch September 24, 2021 19:38
@acud acud added this to the v1.2.0 milestone Sep 30, 2021