Runtime hits unknown performance cliff in the estimator #4771

Closed
matklad opened this issue Sep 1, 2021 · 43 comments
Labels
A-transaction-runtime Area: transaction runtime (transaction and receipts processing, state transition, etc)

Comments

@matklad
Contributor

matklad commented Sep 1, 2021

See

matklad@80b44a8

In that commit, I add a benchmark which sends 150 blocks, where each
block contains 100 deploy-code transactions.

EDIT: #4771 (comment) is the modernized version.

The benchmark is implemented in two ways:

  • on top of parameter estimator testbed
  • on top of runtime tester

In the first case, I observe a very strange behavior, where at some point
block processing time severely degrades.

To reproduce, check out this commit and run

 $ cargo test --release --package runtime-params-estimator --lib --features required -- v2::fast --exact --nocapture
 $ cargo test --release --package runtime-params-estimator --lib --features required -- v2::slow --exact --nocapture

Look for Runtime::apply lines in the output, they tell you how
long it takes to process one apply.

In the fast case, every apply takes about 16ms for me. In the slow
case, applies start at about 30ms, but then abruptly jump to 600ms.

Would be great to understand why the time suddenly balloons!

@matklad matklad added A-transaction-runtime Area: transaction runtime (transaction and receipts processing, state transition, etc) T-core Team: issues relevant to the core team and removed T-core Team: issues relevant to the core team labels Sep 1, 2021
@bowenwang1996 bowenwang1996 added the T-node Team: issues relevant to the node experience team label Sep 1, 2021
matklad added a commit to matklad/nearcore that referenced this issue Sep 2, 2021
When debugging near#4771, I had to copy-paste tracing_timings module to get
a useful profiling output, which isn't ideal. So I improved and
published that code to crates.io as `tracing-span-tree`.

In particular, this new implementation prints the trace in the correct
order, and it also has an `.aggregate(true)` mode for cases where you
run the profiled code in a loop and want a summary.
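
For reference, a minimal usage sketch of that crate (assuming the API described above; the span names and timings are purely illustrative):

```rust
// Illustrative dependencies: tracing = "0.1", tracing-span-tree = "0.1"
use std::time::Duration;

fn main() {
    // Install the subscriber; `.aggregate(true)` sums repeated spans into one
    // line instead of printing every iteration separately.
    tracing_span_tree::span_tree().aggregate(true).enable();

    for _ in 0..3 {
        // Anything executed inside a span shows up as a node in the printed tree.
        tracing::info_span!("Runtime::apply").in_scope(|| {
            tracing::info_span!("TrieUpdate::get").in_scope(|| {
                std::thread::sleep(Duration::from_millis(5));
            });
        });
    }
}
```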
@mina86
Contributor

mina86 commented Sep 6, 2021

Quick look indicates that the slowdown comes from the storage. I’ve
added some more tracing and here’s output with a fast
process_transaction invocation followed by a slow one (they are
within the same apply call):

147.38µs         Runtime::process_transaction
  145.70µs         verify_and_charge_transaction
    48.12µs          validate_transaction
    46.70µs          get_account
      46.15µs          TrieUpdate::get
    49.21µs          get_access_key
      48.65µs          TrieUpdate::get
…
2.64ms           Runtime::process_transaction
  2.64ms           verify_and_charge_transaction
    49.03µs          validate_transaction
    1.30ms           get_account
      1.30ms           TrieUpdate::get
    1.29ms           get_access_key
      1.29ms           TrieUpdate::get

TrieUpdate::get is suddenly 25 times slower and it’s called quite
a bit so it adds up.

@bowenwang1996
Collaborator

@mina86 is it because the value is not cached and you have to do a trie traversal?

@mina86
Contributor

mina86 commented Sep 6, 2021

It doesn’t look like a caching issue. I haven’t measured the cache
hit ratio or counted how often we hit the storage, but just by looking
at the logs the numbers seem similar for the fast and slow cases. On the
other hand, Store::get becomes ten times slower all of a sudden:

      74.73µs          Trie::Update::get
        68.40µs          Trie::lookup
          7.45µs           Store::get
          3.50µs           Store::get
          3.16µs           Store::get
          2.10µs           Store::get
          2.82µs           Store::get
          2.79µs           Store::get
          3.38µs           Store::get
          4.09µs           Store::get
          3.77µs           Store::get
          3.07µs           Store::get
          2.45µs           Store::get
          3.23µs           Store::get
          2.91µs           Store::get
          2.31µs           Store::get
          2.42µs           Store::get
          2.63µs           Store::get
          2.59µs           Store::get
        4.83µs           Store::get

vs.

      1.31ms           Trie::Update::get
        1.24ms           Trie::lookup
          86.48µs          Store::get
          72.27µs          Store::get
          70.31µs          Store::get
          69.17µs          Store::get
          68.85µs          Store::get
          68.40µs          Store::get
          68.75µs          Store::get
          69.13µs          Store::get
          69.77µs          Store::get
          68.91µs          Store::get
          68.82µs          Store::get
          69.21µs          Store::get
          68.84µs          Store::get
          78.35µs          Store::get
          75.67µs          Store::get
          76.41µs          Store::get
          75.47µs          Store::get
        69.45µs          Store::get

So I’m now looking at that.

@mina86
Contributor

mina86 commented Sep 6, 2021

The difference is caused by how the signer of the transaction is chosen.
In the fast test, signers are chosen in sequence from all the available
accounts. In the slow test, signers are chosen at random from unused
accounts. Making the latter use sequential accounts fixes the issue:

--- a/runtime/runtime-params-estimator/src/v2/support.rs
+++ b/runtime/runtime-params-estimator/src/v2/support.rs
@@ -202,12 +202,12 @@ impl<'a, 'c> TransactionBuilder<'a, 'c> {
         self.testbed.accounts[account_index].clone()
     }
     pub(crate) fn random_unused_account(&mut self) -> AccountId {
-        loop {
-            let account = self.random_account();
+        for account in &self.testbed.accounts {
             if self.testbed.used_accounts.insert(account.clone()) {
-                return account;
+                return account.clone();
             }
         }
+        panic!("Not enough accounts");
     }
     pub(crate) fn random_account_pair(&mut self) -> (AccountId, AccountId) {
         let first = self.random_account();

I didn’t look into how exactly this affects the database to cause the slowdown.

@mina86
Contributor

mina86 commented Sep 6, 2021

Perhaps I’ve spoken too soon. The above is correct and changing the
random_unused_account implementation does speed up the slow test, but
the opposite change does not slow down the fast test:

--- a/runtime/runtime-params-estimator/src/v2.rs
+++ b/runtime/runtime-params-estimator/src/v2.rs
@@ -449,10 +449,20 @@ fn fast() {
         use_in_memory_store: false,
     };

-    let mut i = 0;
+    let mut used = std::collections::HashSet::new();
+    let mut rng = rand::thread_rng();
+
     for h in 0..n_blocks {
         let mut block = BlockConfig::at_height((h + 1) as u64);
         for _ in 0..block_size {
+
+            let i = loop {
+                let i = rng.gen_range(0, num_accounts as u32);
+                if used.insert(i) {
+                    break i as usize;
+                }
+            };
+
             let signer_id = accounts[i].clone();
             let receiver_id = signer_id.clone();
             let signer =
@@ -464,7 +474,6 @@ fn fast() {
                 signer,
                 actions: vec![Action::DeployContract(DeployContractAction { code: code.clone() })],
             });
-            i += 1;
         }
         s.blocks.push(block)
     }

It looks like it’s something else calling random_unused_account that
is the culprit.

@matklad
Contributor Author

matklad commented Sep 7, 2021

Confirmed that the order of accounts matters here. Iterating accounts by one up or down finishes fast. Iterating accounts by eleven shows the same slowdown:

diff --git a/runtime/runtime-params-estimator/src/v2.rs b/runtime/runtime-params-estimator/src/v2.rs
index 6a844fcd9..597da1c01 100644
--- a/runtime/runtime-params-estimator/src/v2.rs
+++ b/runtime/runtime-params-estimator/src/v2.rs
@@ -27,7 +27,7 @@ static ALL_COSTS: &[(Cost, fn(&mut Ctx) -> GasCost)] = &[
     (Cost::ActionReceiptCreation, action_receipt_creation),
     (Cost::ActionSirReceiptCreation, action_sir_receipt_creation),
     (Cost::ActionTransfer, action_transfer),
-    (Cost::ActionCreateAccount, action_create_account),
+    // (Cost::ActionCreateAccount, action_create_account),
     (Cost::ActionDeleteAccount, action_delete_account),
     (Cost::ActionAddFullAccessKey, action_add_full_access_key),
     (Cost::ActionAddFunctionAccessKeyBase, action_add_function_access_key_base),
@@ -118,28 +118,28 @@ fn action_transfer(ctx: &mut Ctx) -> GasCost {
     total_cost - base_cost
 }
 
-fn action_create_account(ctx: &mut Ctx) -> GasCost {
-    let total_cost = {
-        let mut testbed = ctx.test_bed();
+// fn action_create_account(ctx: &mut Ctx) -> GasCost {
+//     let total_cost = {
+//         let mut testbed = ctx.test_bed();
 
-        let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
-            let sender = tb.random_account();
-            let new_account =
-                AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
-
-            let actions = vec![
-                Action::CreateAccount(CreateAccountAction {}),
-                Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
-            ];
-            tb.transaction_from_actions(sender, new_account, actions)
-        };
-        testbed.average_transaction_cost(&mut make_transaction)
-    };
+//         let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
+//             let sender = tb.random_account();
+//             let new_account =
+//                 AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
 
-    let base_cost = action_receipt_creation(ctx);
+//             let actions = vec![
+//                 Action::CreateAccount(CreateAccountAction {}),
+//                 Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
+//             ];
+//             tb.transaction_from_actions(sender, new_account, actions)
+//         };
+//         testbed.average_transaction_cost(&mut make_transaction)
+//     };
 
-    total_cost - base_cost
-}
+//     let base_cost = action_receipt_creation(ctx);
+
+//     total_cost - base_cost
+// }
 
 fn action_delete_account(ctx: &mut Ctx) -> GasCost {
     let total_cost = {
diff --git a/runtime/runtime-params-estimator/src/v2/support.rs b/runtime/runtime-params-estimator/src/v2/support.rs
index 6a635eb34..9b62e6fb2 100644
--- a/runtime/runtime-params-estimator/src/v2/support.rs
+++ b/runtime/runtime-params-estimator/src/v2/support.rs
@@ -7,7 +7,7 @@ use near_crypto::{InMemorySigner, KeyType};
 use near_primitives::hash::CryptoHash;
 use near_primitives::transaction::{Action, DeployContractAction, SignedTransaction};
 use near_primitives::types::{AccountId, Gas};
-use rand::prelude::ThreadRng;
+use rand::prelude::{SliceRandom, ThreadRng};
 use rand::Rng;
 
 use crate::cases::ratio_to_gas;
@@ -50,6 +50,7 @@ impl<'c> Ctx<'c> {
             accounts: (0..self.config.active_accounts).map(get_account_id).collect(),
             nonces: HashMap::new(),
             used_accounts: HashSet::new(),
+            idx: 0,
         }
     }
 
@@ -80,6 +81,7 @@ impl<'c> Ctx<'c> {
             accounts: proto.accounts.clone(),
             nonces: proto.nonces.clone(),
             used_accounts: HashSet::new(),
+            idx: 0,
         }
     }
 
@@ -117,6 +119,7 @@ pub(crate) struct TestBed<'c> {
     accounts: Vec<AccountId>,
     nonces: HashMap<AccountId, u64>,
     used_accounts: HashSet<AccountId>,
+    idx: usize,
 }
 
 impl<'c> TestBed<'c> {
@@ -190,24 +193,26 @@ impl<'a, 'c> TransactionBuilder<'a, 'c> {
         )
     }
 
-    pub(crate) fn rng(&mut self) -> ThreadRng {
-        rand::thread_rng()
-    }
+    // pub(crate) fn rng(&mut self) -> ThreadRng {
+    //     rand::thread_rng()
+    // }
 
     pub(crate) fn account(&mut self, account_index: usize) -> AccountId {
         get_account_id(account_index)
     }
     pub(crate) fn random_account(&mut self) -> AccountId {
-        let account_index = self.rng().gen_range(0, self.testbed.accounts.len());
+        self.testbed.idx = (self.testbed.idx + 11) % self.testbed.accounts.len();
+        let account_index = self.testbed.idx;
         self.testbed.accounts[account_index].clone()
     }
     pub(crate) fn random_unused_account(&mut self) -> AccountId {
-        loop {
+        for i in 0..1000 {
             let account = self.random_account();
             if self.testbed.used_accounts.insert(account.clone()) {
                 return account;
             }
         }
+        panic!("no unused accounts")
     }
     pub(crate) fn random_account_pair(&mut self) -> (AccountId, AccountId) {
         let first = self.random_account();

@matklad
Contributor Author

matklad commented Sep 7, 2021

Ok, it took me just about a week to realize I can throw flamegraphs at it 😅

Stepping by 11 accounts:

[flamegraph screenshot]

Stepping by 1:

[flamegraph screenshot]

Profiles recorded with `perf record -F 9999 --call-graph dwarf target/release/deps/runtime_params_estimator-0328c9c022713977 v2::slow --exact --nocapture` and rendered with https://github.com/KDAB/hotspot

The rocksdb::VerifyBlockChecksum looks very suspicious!

I'll stop at this for the time being.

@matklad
Contributor Author

matklad commented Sep 7, 2021

(meta -- does anyone know of a nice way to share perf profiling results? The above is me screenshotting the hotspot UI, which doesn't seem ideal)

@mina86
Contributor

mina86 commented Sep 7, 2021

It’s still not clear to me why changing the fast code to use random accounts doesn’t slow it down though. There are some differences in how the accounts are set up in the two tests, so that could be the explanation. In the fast test, test### accounts are added to genesis and transactions are executed using those. In the slow test, near_###_### accounts are added by the GenesisBuilder. Though it’s not clear to me why it would make a difference.

@bowenwang1996
Collaborator

In the fast test signers are chosen in sequence from all the available
accounts. In the slow tests, signers are chosen at random from unused
accounts

@mina86 what are the available accounts? I wonder whether this affects the trie traversal in the sense that if we go over the accounts sequentially, some trie nodes will be cached between the reads (if not by nearcore, then on the OS level the page may be cached), while accessing them randomly makes it harder to cache anything.

@mina86
Contributor

mina86 commented Sep 7, 2021

The tests create 200k accounts and all transactions are signed by one of them. The test uses each account only once. Sequential reads indeed seem like a plausible explanation, though at this moment I think this is affecting caching in RocksDB or on the OS level rather than in our code. (I still need to add counters or profile each of the Trie::* and Store::* methods to be sure.)
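
A minimal sketch of the "just add counters" idea, using a made-up wrapper type rather than the real near_store/Trie API:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical counting wrapper; `HashMap` stands in for the real storage backend.
struct CountingStore {
    inner: HashMap<Vec<u8>, Vec<u8>>,
    gets: AtomicUsize,
    misses: AtomicUsize,
}

impl CountingStore {
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.gets.fetch_add(1, Ordering::Relaxed);
        let value = self.inner.get(key);
        if value.is_none() {
            self.misses.fetch_add(1, Ordering::Relaxed);
        }
        value
    }
}

fn main() {
    let mut inner = HashMap::new();
    inner.insert(b"test0".to_vec(), b"account-data".to_vec());
    let store = CountingStore {
        inner,
        gets: AtomicUsize::new(0),
        misses: AtomicUsize::new(0),
    };

    store.get(b"test0");
    store.get(b"test1");

    println!(
        "gets: {}, misses: {}",
        store.gets.load(Ordering::Relaxed),
        store.misses.load(Ordering::Relaxed)
    );
}
```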

near-bulldozer bot pushed a commit that referenced this issue Sep 8, 2021
When debugging #4771, I had to copy-paste tracing_timings module to get
a useful profiling output, which isn't ideal. So I improved and
published that code to crates.io as `tracing-span-tree`.

In particular, this new implementation prints the trace in the correct
order, and it also has an `.aggregate(true)` mode for cases where you
run the profiled code in a loop and want a summary.

Example without aggregation:

```
  328.25ms         vm_script_smoke_test
    306.34ms         run_vm
      306.32ms         run_wasmer
        37.16µs          get_key
        301.50ms         compile_and_serialize_wasmer
        4.44ms           run_method
          2.53ms           run_method/instantiate
          128.17µs         run_method/call
          1.74ms           run_method/drop_instance
    4.50ms           run_vm
      4.49ms           run_wasmer
        31.81µs          get_key
        4.25ms           run_method
          2.45ms           run_method/instantiate
          45.46µs          run_method/call
          1.72ms           run_method/drop_instance
    4.43ms           run_vm
      4.42ms           run_wasmer
        22.43µs          get_key
        4.21ms           run_method
          2.41ms           run_method/instantiate
          42.84µs          run_method/call
          1.72ms           run_method/drop_instance
    4.81ms           run_vm
      4.80ms           run_wasmer
        22.09µs          get_key
        4.58ms           run_method
          2.40ms           run_method/instantiate
          419.57µs         run_method/call
          1.73ms           run_method/drop_instance
```

Example with aggregation (unit-less number is the number of aggregated calls):

```
  333.98ms         vm_script_smoke_test
    325.54ms  4      run_vm
      325.48ms  4      run_wasmer
        307.07ms         compile_and_serialize_wasmer
        129.19µs  4      get_key
        17.32ms   4      run_method
          612.46µs  4      run_method/call
          6.63ms    4      run_method/drop_instance
          9.94ms    4      run_method/instantiate
```

Test Plan
---------

This is a debugging utility which is not enabled in production, so I just manually verified that the output makes sense.
mina86 pushed a commit that referenced this issue Sep 9, 2021
When debugging #4771, I had to copy-paste tracing_timings module to get
a useful profiling output, which isn't ideal. So I improved and
published that code to crates.io as `tracing-span-tree`.

In particular, this new implementation prints the trace in the correct
order, and it also has an `.aggregate(true)` mode for cases where you
run the profiled code in a loop and want a summary.
@mina86
Contributor

mina86 commented Sep 17, 2021

Turns out

diff --git a/genesis-tools/genesis-populate/src/lib.rs b/genesis-tools/genesis-populate/src/lib.rs
index d3a9793d..3eceddfa 100644
--- a/genesis-tools/genesis-populate/src/lib.rs
+++ b/genesis-tools/genesis-populate/src/lib.rs
@@ -29,7 +29,7 @@ use near_store::{
 use nearcore::{get_store_path, NightshadeRuntime};
 
 fn get_account_id(account_index: u64) -> AccountId {
-    AccountId::try_from(format!("near_{}_{}", account_index, account_index)).unwrap()
+    AccountId::try_from(format!("test{}", account_index)).unwrap()
 }
 
 pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;
diff --git a/runtime/runtime-params-estimator/src/testbed_runners.rs b/runtime/runtime-params-estimator/src/testbed_runners.rs
index 2ede73d9..ab316ae6 100644
--- a/runtime/runtime-params-estimator/src/testbed_runners.rs
+++ b/runtime/runtime-params-estimator/src/testbed_runners.rs
@@ -18,7 +18,7 @@ use std::time::Instant;
 
 /// Get account id from its index.
 pub fn get_account_id(account_index: usize) -> AccountId {
-    AccountId::try_from(format!("near_{}_{}", account_index, account_index)).unwrap()
+    AccountId::try_from(format!("test{}", account_index)).unwrap()
 }
 
 /// Total number of transactions that we need to prepare.

makes the test go fast. The inverse (changing all instances of test{} to near_{}_{} in the code base) plus adding random order of accounts makes the fast test slow.

@matklad
Contributor Author

matklad commented Sep 17, 2021

Excellent find! I confirm that the fast test starts going slowly with near_{}_{} accounts, so this removes any param-estimator related weirdness from the equation. I also had to replace test{} with near_{}_{} across the codebase, and not just in a test, which felt odd. Filed #4835

@bowenwang1996
Collaborator

It feels like this is indeed related to the shape of the trie. @mina86 is it true that changing the names from test{} to near_{}_{} actually changes the number of database reads we do?

@mina86
Contributor

mina86 commented Sep 20, 2021

There’s only a small difference:

                             test{}      near_{}_{}
Trie::retrieve_raw_bytes   : 1832596  →  1944114
TrieCachingStorage         : 2118923  →  2232747
TrieCachingStorage (cached):       2  →        0
TrieCachingStorage (missed): 2118921  →  2232747
RocksDB::get               : 2322764  →  2436590

@bowenwang1996
Collaborator

That is weird. Do we see any significant difference in the amount of time spent on RocksDB::get?

mina86 added a commit to mina86/nearcore that referenced this issue Sep 23, 2021
When debugging a test it may be desired to be able to see the state of
the storage after the test finishes.  Currently, scenarios which use
on-disk storage will use a temporary directory which will be deleted
after the scenario runs.  This of course is correct behaviour but it
makes it impossible to investigate the storage of a failing (or
otherwise misbehaving) test.

Provide a Scenario::home_dir field which specifies where the store
will be located.  Importantly, Scenario::run will not delete that
directory after executing the scenario thus letting the user see
the end result.

Issue: near#4771
mina86 added a commit to mina86/nearcore that referenced this issue Sep 23, 2021
When debugging a test it may be desired to be able to see the state of
the storage after the test finishes.  Currently, scenarios which use
on-disk storage will use a temporary directory which will be deleted
after the scenario runs.  This of course is correct behaviour but it
makes it impossible to investigate the storage of a failing (or
otherwise misbehaving) test.

Provide a Scenario::home_dir field which specifies where the store
will be located.  Importantly, Scenario::run will not delete that
directory after executing the scenario thus letting the user see
the end result.

Issue: near#4771
@pmnoxx
Contributor

pmnoxx commented Nov 17, 2021

It's also likely that something like compaction not being configured right may be causing huge latency spikes, like the one I saw in the benchmark.

@bowenwang1996
Collaborator

@Longarithm please take a look at @pmnoxx's benchmark

@bowenwang1996 bowenwang1996 assigned Longarithm and unassigned mina86 Nov 17, 2021
@pmnoxx
Contributor

pmnoxx commented Nov 17, 2021

@bowenwang1996 @Longarithm

In the test I write 100k or 1m or 10m keys to the database, but I'm only able to read 48.5% of them back successfully. Something seems broken.

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

     Running unittests (/home/pmnoxx/repos/nearcore.master/target/release/deps/ibf-0ff475f5b700c766)

running 1 test
test benchmark_db ... 

took on avg 2.646µs op per sec 377805 got 4879485/10000000
took on avg 2.739µs op per sec 365073 got 4879481/10000000
took on avg 2.696µs op per sec 370818 got 4876333/10000000
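
For context, the shape of such a write-then-read benchmark could look roughly like the sketch below (using the rocksdb, tempfile and rand crates; this is not the actual benchmark_db code, and the key count and value sizes are illustrative):

```rust
use std::time::Instant;
use rand::Rng;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fresh DB in a temporary directory, deleted when `dir` is dropped.
    let dir = tempfile::tempdir()?;
    let db = rocksdb::DB::open_default(dir.path())?;

    // Pre-generate random keys so key generation doesn't skew the timings.
    let mut rng = rand::thread_rng();
    let keys: Vec<[u8; 32]> = (0..100_000).map(|_| rng.gen()).collect();

    for key in &keys {
        db.put(key, b"value")?;
    }

    // Read everything back, counting how many keys are actually found.
    let start = Instant::now();
    let found = keys.iter().filter(|k| db.get(k).unwrap().is_some()).count();
    let per_op = start.elapsed() / keys.len() as u32;
    println!("took on avg {:?} per op, got {}/{}", per_op, found, keys.len());
    Ok(())
}
```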

near-bulldozer bot pushed a commit that referenced this issue Nov 24, 2021
Add benchmark for reading / writing from store.

See #4771

```
cd chain/network/benches
rm -rf /tmp/xxxxx ; cargo bench benchmark_db
```

TODO:
- [x] Add tempdir
- [x] Pregenerate keys
- [x] Add stats to see number of entries read successfully
- [x] Why we can only read 48.5% entries back when we write them to DB?
- [x] Move code to a better place
- [x] Document what the benchmark does
pmnoxx added a commit that referenced this issue Nov 25, 2021
Add benchmark for reading / writing from store.

See #4771

```
cd chain/network/benches
rm -rf /tmp/xxxxx ; cargo bench benchmark_db
```

TODO:
- [x] Add tempdir
- [x] Pregenerate keys
- [x] Add stats to see number of entries read successfully
- [x] Why we can only read 48.5% entries back when we write them to DB?
- [x] Move code to a better place
- [x] Document what the benchmark does
@Longarithm
Member

I hit this problem in another situation.

I wanted to iterate over Trie contracts; because of the structure, the iterator at first went over all accounts and then over all contracts. Around 15 Dec, it finished in ~1 minute, but now it takes an extremely long time. In the example below, it takes ~2 seconds to iterate over 100 accounts, which is not reasonable:

Jan 12 14:26:32.545  INFO neard: 84100
Jan 12 14:26:34.481  INFO neard: 84200
Jan 12 14:26:36.506  INFO neard: 84300
Jan 12 14:26:38.448  INFO neard: 84400
Jan 12 14:26:40.233  INFO neard: 84500
Jan 12 14:26:42.175  INFO neard: 84600
Jan 12 14:26:44.122  INFO neard: 84700
Jan 12 14:26:45.462  INFO neard: 84800
Jan 12 14:26:47.045  INFO neard: 84900
Jan 12 14:26:49.073  INFO neard: 85000

e7e9fbc...c0078ea - this code stops on the 84000th account, but it doesn't matter.

Also, I believe at some moment the code became fast again, and if I iterate over N accounts in some previous run, then in the next run these accounts are traversed much faster. This makes the issue much harder to debug, and makes me think that some RocksDB caching steps in here. The profiler shows that rocksdb::BlockBasedTable::Get is the bottleneck: https://share.firefox.dev/3K6VIC8

Though it can be avoided for my particular task, I feel more worried about DB performance now.

@stale

stale bot commented Apr 12, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@stale stale bot added the S-stale label Apr 12, 2022
@exalate-issue-sync exalate-issue-sync bot removed the T-node Team: issues relevant to the node experience team label May 9, 2022
@akhi3030 akhi3030 removed the S-stale label Jul 8, 2022
@jakmeier
Contributor

I am investigating a similar issue to this one. I think I might have found the explanation: In the slow case, RocksDB is reading state from disk and in the fast case it's all in memory.

My test scenario

In my tests, I create 6000 new accounts with names format!("{hash(i)}_near"), giving much more randomly distributed accounts within the trie structure than any of the tests above.

Reading 400 of those accounts right after their creation from the estimator, I can usually get an account from RocksDB within ~100us. This is remarkably fast because the estimator avoids shard and chunk caches and it clears RocksDB write-back buffers. I also reduced the RocksDB block cache to 1 kiB but the access is still fast.
With all those caches disabled, the expectation is that reading an account will trigger a trie traversal with every node being read from disk. But 100us is at most one SSD access; more likely it is in-memory only.
I checked that requests to the DB are sent as expected, and that seems to be the case. But I noticed State column access times of merely ~10us. Looks like RocksDB can serve these requests from memory.

So then, if I drop OS caches between creating the accounts and reading them, all of a sudden the State column access time goes up to 7-8ms. Retrieving an account takes at least one such access but could be up to ~20 for a trie traversal. With the shard cache enabled, it's usually just 2 new nodes per account (extension + leaf), which gives ~15ms per account. This is perfectly in line with 1-2 seconds to iterate over 100 accounts as has been mentioned in older comments.

My conclusion

So, in my case I can trigger disk lookups by dropping the OS page cache. And I see the performance degradation equal to that described in this issue. It could be that the scenarios analysed in this issue also hit the OS page cache.

The Linux page cache is interesting in that it takes all the available unused RAM, which can be many GBs or it can be close to non-existent if the machine's memory is used otherwise.

My guess is that the performance cliff is due to a complicated combination of available page cache size, how accounts are distributed over SST files, and in which level. (Remember that SST files in level 0 are not sorted while all others are, so in level 0 insertion order matters more than global key order.) Depending on how many SST files need to be fetched, how much fits in the RocksDB block cache, and how much of it fits in the page cache, there will be almost no disk accesses (fast case) or there is a new SST file read from disk every dozen or so accounts (slow case) because those accounts are in different SST files in level 0.
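
For completeness, a small sketch of how the "drop OS page caches" step can be done programmatically on Linux (requires root; roughly equivalent to `sync; echo 3 > /proc/sys/vm/drop_caches`):

```rust
use std::fs;
use std::process::Command;

/// Flush dirty pages, then ask the kernel to drop the page cache, dentries
/// and inodes, so that subsequent reads really have to go to disk.
fn drop_os_caches() -> std::io::Result<()> {
    let status = Command::new("sync").status()?;
    assert!(status.success(), "sync failed");
    fs::write("/proc/sys/vm/drop_caches", "3\n")
}

fn main() -> std::io::Result<()> {
    drop_os_caches()
}
```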

@EdvardD
Contributor

EdvardD commented Aug 17, 2022

The rocksdb::VerifyBlockChecksum looks very suspicious!

I'm actually already working on that. See the PR with the fork fix: #7418 and the fix in rocksdb: https://github.com/facebook/rocksdb/pull/10427/files

The thing is that rocksdb doesn't properly propagate the ReadOptions::verify_checksums flag for meta block reads.

@jakmeier
Contributor

Applying the #7418 changes does not resolve my test case. It's still taking milliseconds to read single values. There seems to be a good improvement, but not the 1000x we are looking for. I guess this is not too surprising, as my test is only slow if I reset OS page caches, which clearly points to this being disk bound and not CPU bound.

But also, my test case is no longer perfectly consistent, which it was for the previous two days. So maybe I was a bit too confident in my previous claims...

@jakmeier
Contributor

TL;DR: I found out quite a bit more by now. The slow case observed seems to be mostly due to a bad SST file layout, whereas the fast case was all served from in-memory write-back buffers. The truth is somewhere in between, probably.


The first big discovery was that instead of clearing OS page caches, which usually took about 2 seconds, I could also do thread::sleep(Duration::from_millis(2000)) to get the same result. Without the pause, I get the fast case. With the pause, I get almost 1000x slower performance.

Then I managed to get a case where a loop starts with a couple of fast iterations and then switches over to slow DB reads. At this very moment, a RocksDB flush of the State column family kicked in. This happens in the background if I wait long enough, but it seems it can be delayed if we are putting enough load on RocksDB otherwise.

The problem is, there is already code to specifically force a flush before each measurement to avoid this problem. Turns out, the forced flush I added some time ago is buggy. See #7484 for the fix. With that fix in, things get really slow. QEMU-based estimations showed a steep increase in read bytes, and time-based estimations sometimes go up by 10x. Seems that the flushing of memtables never really worked as intended.

Now the problem was that things are way too slow. With a 40GB database set up on my local SSD, I managed to get estimations >1s per empty action receipt. About 10'000 times slower than before. Or, when measured with QEMU, more than 1GB was being read per empty action receipt! With smaller DBs, everything is fast and read throughput is small.

The final puzzle piece is that RocksDB will produce quite unfortunate SST file layouts when it is forced to quickly store a lot of data and doesn't get time to clean up. The estimator creates a new DB from a state dump every time. In other words, all data is written at once right before starting the estimation. Now with the additional force-flush in between. Exactly the recipe for a bad SST file layout.

The solution is to run compaction on the newly created DB. With that simple change, I can go back and forth between >1s and 650us per empty action receipt. This is somewhere between the slowest (1s+) and fastest (~100us) cases I could reliably reproduce.
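
A rough sketch of what "run compaction on the newly created DB" can look like with the rust rocksdb crate (the path and column-family name below are placeholders, not the actual estimator change; see the commits referenced further down):

```rust
use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    let opts = Options::default();
    // A real nearcore store has many column families; all of them have to be
    // listed when opening. "State" and the path are illustrative only.
    let db = DB::open_cf(&opts, "/tmp/estimator-home/data", ["State"])?;
    if let Some(cf) = db.cf_handle("State") {
        // Compact the full key range so the freshly bulk-loaded SST files are
        // rewritten into a sorted, stable layout before any measurements.
        db.compact_range_cf(cf, None::<&[u8]>, None::<&[u8]>);
    }
    Ok(())
}
```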

Note that all numbers so far are without dropping the OS page cache between measurements. It is not required to explain the huge differences reported in this GitHub issue. But when I do drop the page cache with all fixes applied, the measurement goes from 650us up to 3ms, so maybe 5x slower. This is a completely reasonable time for a workload that has to go to disk dozens of times.

Now that all of this is explained and written down, I will close this issue. I will post more updates in #5995 with respect to necessary changes to the estimator.

@matklad
Contributor Author

matklad commented Aug 26, 2022

I managed to get estimations >1s per empty action receipt. About 10'000 times slower than before.

There should be a Russian-reversal joke about how at NEAR we are working hard on making the slowest possible code!

Kudos for getting to the bottom of this and not letting this bug celebrate its first birthday!

jakmeier added a commit to jakmeier/nearcore that referenced this issue Aug 30, 2022
Run compaction on the DB after inserting all records from the state
dump. This brings the layout of SST files to a more stable and optimal
layout. See near#4771 for more details.
near-bulldozer bot pushed a commit that referenced this issue Aug 30, 2022
Run compaction on the DB after inserting all records from the state
dump. This brings the layout of SST files to a more stable and optimal
layout. See #4771 for more details.

This is an estimator only change.
nikurt pushed a commit that referenced this issue Nov 9, 2022
Run compaction on the DB after inserting all records from the state
dump. This brings the layout of SST files to a more stable and optimal
layout. See #4771 for more details.

This is an estimator only change.