Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trie refcount bug #2568

Closed
mikhailOK opened this issue Apr 29, 2020 · 7 comments · Fixed by #2638
Closed

Trie refcount bug #2568

mikhailOK opened this issue Apr 29, 2020 · 7 comments · Fixed by #2638
Assignees
Labels
A-storage Area: storage and databases P-critical Priority: critical

Comments

@mikhailOK
Copy link
Contributor

  • Trie nodes are stored in storage with refcounts
  • TrieChanges stores a delta of refcount
  • StoreUpdate stores a database transaction with exact writes of refcount
  • Applying TrieChanges to storage reads refcount from storage, increments it by delta and creates a write for StoreUpdate
  • Writing multiple TrieChanges into StoreUpdate in a single transaction will always read refcount before the transaction, so instead of the sum of increments only one is applied.

This is most likely what causes #2500 #2544 #2556

@mikhailOK mikhailOK self-assigned this Apr 29, 2020
@mikhailOK mikhailOK added P-critical Priority: critical A-storage Area: storage and databases labels Apr 29, 2020
@mikhailOK
Copy link
Contributor Author

The plan is to fix this by separating trie nodes from different shards in storage - this will also
allow us to solve #2569.

@mikhailOK
Copy link
Contributor Author

Turns out separating shards is not quite enough, because gc code wants to delete multiple blocks in one StoreUpdate.
The new plan is to either change gc code, or make the alternative fix (refcount-aware StoreUpdate). Discussing with @Kouprin

@bowenwang1996
Copy link
Collaborator

I don't understand why this cannot happen on a single shard. It seems that

Writing multiple TrieChanges into StoreUpdate in a single transaction will always read refcount before the transaction, so instead of the sum of increments only one is applied.

is not related to the number of shards.

@bowenwang1996
Copy link
Collaborator

bowenwang1996 commented May 8, 2020

On betanet I have observed a couple times that rpc nodes crashed on this line https://github.com/nearprotocol/nearcore/blob/b2d469e3da338bf933219ef351cc050075febcd0/neard/src/runtime.rs#L883 However, I have not observed crashing of validator nodes myself, even though one such incident was reported to me by one of the validators.

@mikhailOK
Copy link
Contributor Author

@bowenwang1996 TrieUpdate is created for a chunk. Multiple chunks means either multiple shards (fixed by separating tries) or a multi-block StoreUpdate (#2613)

@bowenwang1996
Copy link
Collaborator

@mikhailOK is it true that #2613 is possible in a single shard?

@mikhailOK
Copy link
Contributor Author

#2613 only leads to gc not deleting data. It's blocking because the fix to this task adds an assert that fails on #2613.

mikhailOK added a commit that referenced this issue May 13, 2020
Separate tries from different shards.
Each shard has a Trie with its own prefix in storage and its own
cache.

Fixes #2568

Test plan
---------
debug_assert in StoreUpdate::commit
mikhailOK added a commit that referenced this issue May 17, 2020
* fix(storage): fix refcount bug

Separate tries from different shards.
Each shard has a Trie with its own prefix in storage and its own
cache.

Fixes #2568

Test plan
---------
debug_assert in StoreUpdate::commit

* rename trie -> tries

* gc tests with 1 or 2 shards

* get_shard_id_and_hash

* refactoring

* some renames
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Area: storage and databases P-critical Priority: critical
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants