feat(resharding) - Resharding mapping state update #12232

Merged: 8 commits, Oct 18, 2024

@staffik staffik commented Oct 16, 2024

The previous PR introduced mapping for read operations.

This PR extends that functionality to write operations and adds some testing for State mapping.

Following the Zulip discussion, we decided to implement a panic inside the TrieStoreUpdateAdapter methods. Other strategies considered were:

  1. Propagating the error instead of panicking: This was rejected because the error would need to be propagated through multiple layers that currently don't expect errors. Additionally, an error here would indicate a misconfiguration in the database, justifying the use of panic.
  2. Performing the mapping later in TrieStoreUpdateAdapter::commit(): This would require iterating through all DBOps, parsing each operation, extracting the shard_uid from the database key, mapping it, and re-encoding. This approach would make TrieStoreUpdateAdapter dependent on the internal workings of DBTransaction. Also, StoreUpdate::merge() makes me feel uneasy.
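The trade-off in option 1 can be illustrated with a minimal sketch. It assumes a toy in-memory store in which a child shard maps to its parent's prefix; `ToyStore`, the `u32`-based `ShardUId`, the `u64` hash, and the `CorruptedMapping` variant are hypothetical stand-ins, not the real nearcore types. The read path returns a `Result`, so a bad mapping is propagated; the write path returns `()`, so the same failure becomes a panic.

```rust
use std::collections::HashMap;

/// Toy stand-in for the real shard identifier (hypothetical layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardUId(pub u32);

#[derive(Debug, PartialEq)]
pub enum StorageError {
    /// The mapping column holds a value that cannot be decoded (hypothetical variant).
    CorruptedMapping,
}

#[derive(Default)]
pub struct ToyStore {
    /// Stand-in for the shard UID mapping column: raw bytes keyed by shard.
    pub mapping: HashMap<ShardUId, Vec<u8>>,
    /// Stand-in for the State column, keyed by (mapped shard, node hash).
    pub state: HashMap<(ShardUId, u64), Vec<u8>>,
}

impl ToyStore {
    /// A missing entry means "no mapping": the shard maps to itself.
    pub fn mapped_shard_uid(&self, shard_uid: ShardUId) -> Result<ShardUId, StorageError> {
        match self.mapping.get(&shard_uid).map(Vec::as_slice) {
            None => Ok(shard_uid),
            Some(&[a, b, c, d]) => Ok(ShardUId(u32::from_le_bytes([a, b, c, d]))),
            Some(_) => Err(StorageError::CorruptedMapping),
        }
    }

    /// Read path: callers already handle `Result`, so the error is propagated.
    pub fn get(&self, shard_uid: ShardUId, hash: u64) -> Result<Option<&Vec<u8>>, StorageError> {
        let mapped = self.mapped_shard_uid(shard_uid)?;
        Ok(self.state.get(&(mapped, hash)))
    }

    /// Write path: the signature returns `()`, so a bad mapping panics instead.
    pub fn set(&mut self, shard_uid: ShardUId, hash: u64, value: Vec<u8>) {
        let mapped = self.mapped_shard_uid(shard_uid).expect("set failed: corrupted mapping");
        self.state.insert((mapped, hash), value);
    }
}
```

Changing `set` to return `Result` would push the error type into every caller of the write path, which is the propagation cost the description refers to.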

@staffik staffik requested a review from a team as a code owner October 16, 2024 08:39
@staffik staffik force-pushed the resharding-mapping-state-update branch from d404027 to 198e73e Compare October 16, 2024 08:42

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 78.04878% with 27 lines in your changes missing coverage. Please review.

Project coverage is 71.64%. Comparing base (ba6c707) to head (8c86921).
Report is 4 commits behind head on master.

Files with missing lines               Patch %   Lines
core/store/src/adapter/trie_store.rs   86.66%    14 Missing ⚠️
nearcore/src/entity_debug.rs           0.00%     13 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12232      +/-   ##
==========================================
+ Coverage   71.62%   71.64%   +0.01%     
==========================================
  Files         837      837              
  Lines      167105   167189      +84     
  Branches   167105   167189      +84     
==========================================
+ Hits       119696   119783      +87     
- Misses      42180    42184       +4     
+ Partials     5229     5222       -7     
Flag                     Coverage Δ
backward-compatibility   0.16% <0.00%> (-0.01%) ⬇️
db-migration             0.16% <0.00%> (-0.01%) ⬇️
genesis-check            1.25% <0.00%> (-0.01%) ⬇️
integration-tests        38.85% <18.69%> (-0.03%) ⬇️
linux                    71.26% <78.04%> (+0.01%) ⬆️
linux-nightly            71.21% <78.04%> (+0.01%) ⬆️
macos                    54.33% <78.04%> (+0.02%) ⬆️
pytests                  1.57% <0.00%> (-0.01%) ⬇️
sanity-checks            1.37% <0.00%> (-0.01%) ⬇️
unittests                65.45% <78.04%> (+0.02%) ⬆️
upgradability            0.21% <0.00%> (-0.01%) ⬇️

@wacban wacban left a comment

looks good but I'm concerned about the storage clone - see comments

    &self,
    shard_uid: ShardUId,
) -> Result<ShardUId, StorageError> {
    let store = TrieStoreAdapter::new(Store::new(self.store_update.storage.clone()));

Contributor:

It seems like a bad idea to clone the storage every time. Is there any way to do it by reference? What's the typical way to access data from store in here? Can we do it the same way?

Contributor:

We can just create a helper function here for accessing stuff from raw store and just use that instead?

Contributor Author @staffik, Oct 17, 2024:

> What's the typical way to access data from store in here

From what I see elsewhere in the code we do not access data from within StoreUpdate.

> accessing stuff from raw store

Do you mean using storage: Arc<dyn Database> member of StoreUpdate directly?
StoreUpdate does not contain Store.

I would like to add Store to TrieStoreUpdateAdapter.
But it is problematic because of this:

impl<'a> TrieStoreUpdateAdapter<'a> {
    pub fn new(store_update: &'a mut StoreUpdate) -> Self {
        Self { store_update: StoreUpdateHolder::Reference(store_update) }
    }

The easiest way (changing 4 lines of code) would be to replace:

pub struct StoreUpdate {
    transaction: DBTransaction,
    storage: Arc<dyn Database>,
}

with

pub struct StoreUpdate {
    transaction: DBTransaction,
    store: Store,
}

where Store is just a wrapper for storage:

pub struct Store {
    storage: Arc<dyn Database>,
}

@wacban @shreyan-gupta Wdyt?

Contributor Author:

I applied the above change in 09a5ca7 so you can see it.
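For readers following the thread, the effect of that field swap can be sketched as follows. This is a simplified illustration, not the code from 09a5ca7: the `Database` trait is reduced to one method, and `MemDb`, `StoreUpdateBefore`, and `StoreUpdateAfter` are hypothetical names. With only the raw `Arc` available, reading through a `Store` means building a new wrapper (and cloning the `Arc`) on each call; once the update holds a `Store`, it can simply be borrowed.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Minimal stand-in for the database trait; the real one has many more methods.
pub trait Database {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Hypothetical in-memory database used only for this sketch.
pub struct MemDb(pub HashMap<Vec<u8>, Vec<u8>>);

impl Database for MemDb {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
}

/// Thin wrapper around the shared database handle.
#[derive(Clone)]
pub struct Store {
    storage: Arc<dyn Database>,
}

impl Store {
    pub fn new(storage: Arc<dyn Database>) -> Self {
        Self { storage }
    }
    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.storage.get(key)
    }
}

/// Before: only the raw handle is stored.
pub struct StoreUpdateBefore {
    storage: Arc<dyn Database>,
}

impl StoreUpdateBefore {
    pub fn new(storage: Arc<dyn Database>) -> Self {
        Self { storage }
    }
    pub fn read(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Each call clones the `Arc`, an atomic increment and later decrement.
        Store::new(self.storage.clone()).get(key)
    }
}

/// After: the update owns a `Store`, so helpers can borrow it.
pub struct StoreUpdateAfter {
    store: Store,
}

impl StoreUpdateAfter {
    pub fn new(store: Store) -> Self {
        Self { store }
    }
    pub fn read(&self, key: &[u8]) -> Option<Vec<u8>> {
        // No clone: the existing `Store` is used by reference.
        self.store.get(key)
    }
}
```

Both variants return the same data; the difference is only whether a temporary `Store` is constructed per read.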

// The data is now visible at both `parent_shard` and `child_shard`.
assert_eq!(*store.get(child_shard, &dummy_hash).unwrap(), [0]);
assert_eq!(*store.get(parent_shard, &dummy_hash).unwrap(), [0]);
// Remove the data using `parent_shard` UId.

Contributor:

Can you also add a case where you remove it by using child_shard UId?

Contributor:

I'm wondering if we should allow decreasing refcounts using parent shard at all?

Contributor Author:

Once resharded, we should not use the parent shard.
This test shows what the current behavior is.
Do you mean we should detect such a situation and panic, and reflect the panic in this test?

Contributor:

The garbage collection will keep using the parent shard id for a few epochs after resharding. Then it will start using the child shard id. Both should work.
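The behavior discussed in this thread (data visible under both UIds, and removable through either) can be sketched with a toy reference-counted map. Everything here is a simplified assumption: `RefcountedState`, the `u64` hash, and the in-memory refcount are hypothetical, the real store handles refcounts differently, and `NonZeroU32` is used in place of the `NonZero<u32>` seen in the PR.

```rust
use std::collections::HashMap;
use std::num::NonZeroU32;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardUId(pub u32);

#[derive(Default)]
pub struct RefcountedState {
    /// Child shard -> shard whose prefix actually holds the data (assumed layout).
    pub mapping: HashMap<ShardUId, ShardUId>,
    /// (mapped shard, hash) -> (value, refcount).
    pub nodes: HashMap<(ShardUId, u64), (Vec<u8>, u32)>,
}

impl RefcountedState {
    /// Unmapped shards (such as the parent) resolve to themselves.
    fn mapped(&self, shard_uid: ShardUId) -> ShardUId {
        self.mapping.get(&shard_uid).copied().unwrap_or(shard_uid)
    }

    pub fn increment_refcount(&mut self, shard_uid: ShardUId, hash: u64, value: &[u8]) {
        let key = (self.mapped(shard_uid), hash);
        let entry = self.nodes.entry(key).or_insert_with(|| (value.to_vec(), 0));
        entry.1 += 1;
    }

    pub fn decrement_refcount_by(&mut self, shard_uid: ShardUId, hash: u64, by: NonZeroU32) {
        let key = (self.mapped(shard_uid), hash);
        // Decrement first, then drop the entry once the count reaches zero.
        let remove = match self.nodes.get_mut(&key) {
            Some(entry) => {
                entry.1 = entry.1.saturating_sub(by.get());
                entry.1 == 0
            }
            None => false,
        };
        if remove {
            self.nodes.remove(&key);
        }
    }

    pub fn get(&self, shard_uid: ShardUId, hash: u64) -> Option<&[u8]> {
        self.nodes.get(&(self.mapped(shard_uid), hash)).map(|(v, _)| v.as_slice())
    }
}
```

Because both UIds resolve to the same underlying key, a decrement issued with the parent UId and one issued with the child UId act on the same counter, which matches the garbage-collection scenario described above.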

@staffik staffik requested a review from wacban October 17, 2024 10:28
 pub fn decrement_refcount_by(
     &mut self,
     shard_uid: ShardUId,
     hash: &CryptoHash,
     decrement: NonZero<u32>,
 ) {
     let mapped_shard_uid =
-        self.read_shard_uid_mapping_from_db(shard_uid).expect("decrement_refcount_by failed");
+        read_shard_uid_mapping_from_db(&self.store_update().store, shard_uid);

Contributor:

Can you keep it as a method on self?

Contributor Author @staffik, Oct 17, 2024:

@wacban Does ac1751f do what you mean?

If we go this way, then I think there is one more thing to do:

impl<'a> TrieStoreUpdateAdapter<'a> {
    fn get_key_from_shard_uid_and_hash(&self, shard_uid: ShardUId, hash: &CryptoHash) -> [u8; 40] {
        self.store_update.store.trie_store().get_key_from_shard_uid_and_hash(shard_uid, hash)
    }

Here, calling trie_store() indirectly clones the store. I would like to create a StoreHolder that would allow passing the store by reference, analogous to how StoreUpdateHolder works. @shreyan-gupta wdyt?

Contributor:

Ah that's getting more complicated instead of less. Feel free to keep as is to keep it simple.

Contributor Author:

Avoided cloning in 432a96d
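The `StoreHolder` idea floated earlier in this thread can be sketched as an enum that either borrows or owns a store. This illustrates the pattern only; it is not a description of what 432a96d does. The `Store` struct, the `Owned` variant, and the `borrowed`, `owned`, and `store_name` helpers are hypothetical.

```rust
use std::ops::Deref;

/// Placeholder store type for this sketch; the real `Store` wraps a database handle.
#[derive(Clone, Debug, PartialEq)]
pub struct Store {
    pub name: String,
}

/// Hypothetical `StoreHolder`: either borrows a `Store` or owns one.
pub enum StoreHolder<'a> {
    Reference(&'a Store),
    Owned(Store),
}

impl<'a> Deref for StoreHolder<'a> {
    type Target = Store;

    fn deref(&self) -> &Store {
        match self {
            StoreHolder::Reference(store) => *store,
            StoreHolder::Owned(store) => store,
        }
    }
}

pub struct TrieStoreAdapter<'a> {
    store: StoreHolder<'a>,
}

impl<'a> TrieStoreAdapter<'a> {
    /// Wraps an existing store without cloning it.
    pub fn borrowed(store: &'a Store) -> Self {
        Self { store: StoreHolder::Reference(store) }
    }

    /// Takes ownership when the caller has no longer-lived store to lend.
    pub fn owned(store: Store) -> Self {
        Self { store: StoreHolder::Owned(store) }
    }

    /// Methods go through `Deref`, so they do not care which variant is held.
    pub fn store_name(&self) -> &str {
        &self.store.name
    }
}
```

The cost is an extra lifetime parameter on every adapter type, which is the added complexity the reviewer was wary of.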

@@ -459,7 +459,7 @@ impl Store {
 /// Keeps track of current changes to the database and can commit all of them to the database.
 pub struct StoreUpdate {
     transaction: DBTransaction,
-    storage: Arc<dyn Database>,
+    store: Store,

Contributor:

I have no clue what sort of abstraction layers we are breaking here. I guess if it compiles and all tests pass it should be fine. 🙏 In compiler we trust. 🙏

@wacban wacban left a comment

LGTM, good stuff!

 /// Replaces shard_uid prefix with a mapped value according to mapping strategy in Resharding V3.
 /// For this, it does extra read from `DBCol::StateShardUIdMapping`.
 pub fn get(&self, shard_uid: ShardUId, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> {
-    let mapped_shard_uid = self.read_shard_uid_mapping_from_db(shard_uid)?;
+    let mapped_shard_uid = get_mapped_shard_uid(&self.store, shard_uid);

Contributor:

Now it makes me wonder if we can just move this to inside of get_key_from_shard_uid_and_hash :)

Contributor Author:

Nice! I like it

Contributor Author:

I went a step further in ac1751f.
Now get_key_from_shard_uid_and_hash does not need to be exposed at all.
Wdyt?

Contributor:

looks good!
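The suggestion in this thread, doing the mapping inside the key-building helper so that callers cannot forget it, can be sketched as below. This is an illustration under assumptions: the 40-byte key is taken to be an 8-byte shard UID prefix followed by a 32-byte hash (consistent with the `[u8; 40]` return type quoted earlier), the little-endian encoding and the two-field `ShardUId` are assumed, and the in-memory `TrieStore` is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

impl ShardUId {
    /// Assumed encoding: two little-endian u32 values, 8 bytes in total.
    pub fn to_bytes(self) -> [u8; 8] {
        let mut out = [0u8; 8];
        out[..4].copy_from_slice(&self.version.to_le_bytes());
        out[4..].copy_from_slice(&self.shard_id.to_le_bytes());
        out
    }
}

#[derive(Default)]
pub struct TrieStore {
    /// Child shard -> shard whose prefix is used for State keys.
    pub mapping: HashMap<ShardUId, ShardUId>,
    /// Stand-in for the State column.
    pub nodes: HashMap<[u8; 40], Vec<u8>>,
}

impl TrieStore {
    /// Private: the only place where a State key is built, so the mapping
    /// is applied uniformly to every read and write.
    fn get_key_from_shard_uid_and_hash(&self, shard_uid: ShardUId, hash: &[u8; 32]) -> [u8; 40] {
        let mapped = self.mapping.get(&shard_uid).copied().unwrap_or(shard_uid);
        let mut key = [0u8; 40];
        key[..8].copy_from_slice(&mapped.to_bytes());
        key[8..].copy_from_slice(hash);
        key
    }

    pub fn get(&self, shard_uid: ShardUId, hash: &[u8; 32]) -> Option<&Vec<u8>> {
        self.nodes.get(&self.get_key_from_shard_uid_and_hash(shard_uid, hash))
    }

    pub fn set(&mut self, shard_uid: ShardUId, hash: &[u8; 32], value: Vec<u8>) {
        let key = self.get_key_from_shard_uid_and_hash(shard_uid, hash);
        self.nodes.insert(key, value);
    }
}
```

With the helper private, no caller outside the adapter can construct an unmapped key by accident.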

@staffik staffik added this pull request to the merge queue Oct 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 18, 2024

staffik commented Oct 18, 2024

sync_empty_state failed on nayduck. Weird 🤔 looking into it.

@staffik staffik force-pushed the resharding-mapping-state-update branch from 432a96d to 8c86921 Compare October 18, 2024 09:40

staffik commented Oct 18, 2024

> sync_empty_state failed on nayduck. Weird 🤔 looking into it.

Weird, it passes locally (I ran it 10 times).
I triggered nayduck manually again and this time it passed.
Looking at the sync_empty_state nayduck history, it does not seem flaky either.
Judging by the error message, it looks like a random network issue.
I will ignore it for now and make a second attempt to merge, especially since this test seems unrelated to this PR.

@staffik staffik added this pull request to the merge queue Oct 18, 2024
Merged via the queue into master with commit 31ad13e Oct 18, 2024
28 of 30 checks passed
@staffik staffik deleted the resharding-mapping-state-update branch October 18, 2024 11:05