
fix(localstore): stale push index items #2534

Merged
merged 1 commit into master from gc-rm-push on Sep 24, 2021
Conversation

Member

@acud acud commented Sep 22, 2021

GC was not removing the push index items, resulting in a corrupt index: subscribe push cannot find the chunks that pop out of the index. A migration is added to remove the stale entries.
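
For orientation, here is a minimal sketch of what such a cleanup migration could look like, assuming the shed-style index helpers used in localstore (Iterate, Has, DeleteInBatch) and the usual goleveldb batch; migrateRemoveStalePushIndexItems and db.shed.WriteBatch are assumed names for this illustration, not a quote of the code merged in this PR:

// sketch: drop push index entries whose chunk is no longer in retrievalDataIndex
func migrateRemoveStalePushIndexItems(db *DB) error {
	batch := new(leveldb.Batch)
	err := db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		has, err := db.retrievalDataIndex.Has(item)
		if err != nil {
			return true, err
		}
		if !has {
			// chunk was garbage-collected but its push index entry survived
			if err := db.pushIndex.DeleteInBatch(batch, item); err != nil {
				return true, err
			}
		}
		return false, nil
	}, nil)
	if err != nil {
		return err
	}
	return db.shed.WriteBatch(batch)
}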



@acud acud force-pushed the gc-rm-push branch 3 times, most recently from df9d66f to f79dcd3 on September 22, 2021 16:42
@acud acud self-assigned this Sep 22, 2021
@acud acud requested review from janos and mrekucci September 22, 2021 16:44
@acud acud changed the title fix: gc remove from push index too fix(localstore): stale push index items Sep 22, 2021
if err != nil {
	return true, err
}
if !has {

Collaborator

Actually, the biggest reason I've seen for this to happen is that the timestamp on the retrievalDataIndex item is different from the stored timestamp in the pushIndex. When the pusher gets the item from the index it doesn't get the index's timestamp, so when it tries to remove it from the index it cannot find the entry, because it uses the stored timestamp from the retrievalDataIndex item.
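
For context: the mismatch matters because the push index key is derived from the store timestamp followed by the chunk address, so an item carrying a different StoreTimestamp resolves to a different key and the delete simply misses the stored entry. A rough sketch of that keying (pushIndexKey is an illustrative helper, and the field widths are assumptions, not the exact localstore encoding):

// illustrative only: a push-index-style key that prefixes the chunk address
// with its store timestamp; a mismatched timestamp produces a different key
func pushIndexKey(storeTimestamp int64, address []byte) []byte {
	key := make([]byte, 8+len(address))
	binary.BigEndian.PutUint64(key[:8], uint64(storeTimestamp))
	copy(key[8:], address)
	return key
}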

Collaborator

But yes, I've also seen it happen that the item is in the push index but no longer in the retrieval index. I fixed both of these issues in the push index iterator by checking the existence and the timestamps and fixing them appropriately: removing or updating the push index entry based on double-checks against the retrievalDataIndex during the push index iteration.

Member Author

Can you prove that the timestamps are different? I have never seen such a case.

Member

Good find @ldeffenb, you are probably referring to localstore's db.setSync (mode_set.go:138), where the deletion will fail because the item is not in the retrievalDataIndex and StoreTimestamp is 0, which leads to an invalid db.pushIndex.DeleteInBatch on that line. That is correct, and it is the best effort in that situation to try to delete, as no timestamp can be found. We could iterate over the pushIndex to find the address in the key and delete that key; this is potentially expensive and I did not want to do it, but it is one solution. I think that this PR will remove the stale entry for the missing retrievalDataIndex item, at least at the current state of the code. Adding the check in SubscribePush, as you did, is another nice safety measure.
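
As a rough illustration of the full-scan fallback mentioned above, and of why it is costly: since the push index key starts with the unknown timestamp, the address cannot be looked up directly and the whole index has to be walked. deleteStalePushEntryByAddress is a hypothetical helper sketched against the shed iterator API, not existing code:

// hypothetical: scan the whole push index for a key carrying this address,
// because the key is prefixed by the (unknown) store timestamp
func deleteStalePushEntryByAddress(db *DB, addr swarm.Address) error {
	return db.pushIndex.Iterate(func(item shed.Item) (stop bool, err error) {
		if bytes.Equal(item.Address, addr.Bytes()) {
			// found the orphaned entry; delete it and stop iterating
			return true, db.pushIndex.Delete(item)
		}
		return false, nil
	}, nil)
}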

Collaborator
Here's my code from SubscribePush, just inside the if err != nil from the db.retrievalDataIndex.Get(item):

dataItem, err := db.retrievalDataIndex.Get(item)
if err != nil {
	if errors.Is(err, leveldb.ErrNotFound) {
		// chunk is gone from the retrieval index: drop the stale push index entry
		db.logger.Debugf("SubscribePush:chunk %s err %v", swarm.NewAddress(item.Address), err)
		err2 := db.pushIndex.Delete(item)
		if err2 != nil {
			db.logger.Debugf("SubscribePush:chunk %s delete err %v", swarm.NewAddress(item.Address), err2)
		}
	}
	return false, nil
	//return true, err
}
if item.StoreTimestamp != dataItem.StoreTimestamp {
	// timestamps disagree: re-key the push index entry using the retrieval index timestamp
	db.logger.Debugf("SubscribePush:chunk %s itemStore %d != %d dataItem.Store now %d", swarm.NewAddress(item.Address), item.StoreTimestamp, dataItem.StoreTimestamp, now())
	err2 := db.pushIndex.Delete(item)
	if err2 != nil {
		db.logger.Debugf("SubscribePush:chunk %s pushIndex.Delete err %v", swarm.NewAddress(item.Address), err2)
	} else {
		item.StoreTimestamp = dataItem.StoreTimestamp
		err3 := db.pushIndex.Put(item)
		if err3 != nil {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex.Put err %v", swarm.NewAddress(item.Address), err3)
		} else {
			db.logger.Debugf("SubscribePush:chunk %s pushIndex corrected to StoreTimestamp %d", swarm.NewAddress(item.Address), item.StoreTimestamp)
		}
	}
	return false, nil
}

And yes, I've seen the debug log showing that the timestamps are different. It seems to happen mostly when a bee node crashes while something is actively pushing data into it and then re-uploads the same data after the node restarts.

Member

It is very worrisome that such a problem is present on crash, as leveldb Batch should provide atomicity for the operations that we do in leveldb. We should investigate data inconsistencies on crashes. I do not recall that something similar was performed earlier; I may be wrong about that. Thanks for the valuable insight and for sharing the solution.

Collaborator

I've thought about writing an index consistency checker that would run at startup and at least report any discrepancies it finds. Such a thing could also be used to "prime" several metrics that are rather misleading as they are now; bee_pusher_total_to_push comes to mind, which only includes NEW chunks but does not count any that were already in the pusher queue at startup.

So many ideas, so little time!
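
A bare-bones sketch of that checker idea, counting healthy versus orphaned push index entries at startup; checkPushIndexConsistency and the counter names are assumptions, and the metric wiring is left out:

// hypothetical startup check: report push index entries whose chunk data is
// missing, and the genuine backlog that a gauge could be primed with
func checkPushIndexConsistency(db *DB) (backlog, orphaned int, err error) {
	err = db.pushIndex.Iterate(func(item shed.Item) (stop bool, iterErr error) {
		has, iterErr := db.retrievalDataIndex.Has(item)
		if iterErr != nil {
			return true, iterErr
		}
		if !has {
			orphaned++ // push entry whose chunk data is gone
		} else {
			backlog++ // chunk still waiting to be pushed at startup
		}
		return false, nil
	}, nil)
	return backlog, orphaned, err
}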

Collaborator

> Can you prove that the timestamps are different? I have never seen such a case.

The easiest way to tell this has happened is the following log sequence in setSync, which comes after the chunk is successfully pushed. The same chunk then pushes over and over and over again, which is how I first discovered it.

	i, err = db.pushIndex.Get(item)
	if err != nil {
		if errors.Is(err, leveldb.ErrNotFound) {
			// we handle this error internally, since this is an internal inconsistency of the indices
			// this error can happen if the chunk is put with ModePutRequest or ModePutSync
			// but this function is called with ModeSetSync
			db.logger.Debugf("localstore: chunk with address %s timestamp %d not found in push index", addr, item.StoreTimestamp)
		} else {
			return 0, err
		}
	}

I added the timestamp to this log in my fork. The comment before the log was already present.

Member

Thanks @ldeffenb for the valuable details. It is very strange that the storage time changes, as every Put call explicitly checks whether the chunk is already present in the database before it stores it, and that check is done under the batchMu mutex lock to ensure that there are no data races even at the database level. It is very interesting how it is possible to have different store timestamps in retrievalDataIndex and pushIndex, as that would mean that the chunk is overwritten, or that there is a more subtle bug related to store timestamp handling in localstore.
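
To make the described invariant concrete, here is a paraphrased sketch of that put pattern, with assumed names (putSketch and the PutInBatch wiring) rather than the actual ModePut code: presence is checked and the store timestamp assigned under the same batchMu lock, so two puts of the same chunk should not end up with different timestamps.

func putSketch(db *DB, item shed.Item) error {
	db.batchMu.Lock()
	defer db.batchMu.Unlock()

	has, err := db.retrievalDataIndex.Has(item)
	if err != nil {
		return err
	}
	if has {
		// already stored: keep the existing StoreTimestamp
		return nil
	}
	item.StoreTimestamp = now()
	batch := new(leveldb.Batch)
	if err := db.retrievalDataIndex.PutInBatch(batch, item); err != nil {
		return err
	}
	if err := db.pushIndex.PutInBatch(batch, item); err != nil {
		return err
	}
	return db.shed.WriteBatch(batch)
}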

Collaborator

I agree that it should not happen; I only know that it DOES happen in some rare condition, but always when there is heavy uploading activity going on. I have not searched through the original store/put logic to see if I could figure it out, though.

Contributor

@mrekucci mrekucci left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @acud and @mrekucci)

@mrekucci mrekucci self-requested a review September 22, 2021 16:55

Contributor

@mrekucci mrekucci left a comment

Reviewed 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @acud)

Member

@janos janos left a comment

Looks good. Thanks.

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 1 (A)

No coverage information
Duplication: 50.8%

@acud acud merged commit df75bef into master Sep 24, 2021
@acud acud deleted the gc-rm-push branch September 24, 2021 19:38
@acud acud added this to the v1.2.0 milestone Sep 30, 2021