Runtime hits unknown perfomance cliff in the estimator #4771

matklad · 2021-09-01T14:16:40Z

See

matklad@80b44a8

In that commit, I add a benchmark which sends 150 blocks, where each
block contains 100 deploy code transaction.

EDIT: #4771 (comment) is modernized version.

The benchmark is implemented in two ways:

on top of parameter estimator testbed
on top of runtime tester

In the first case, I observe a very stage behavior, where at some point
block processsing times severely degrades.

To reproduce, checkout this commit and run

 $ cargo test --release --package runtime-params-estimator --lib --features required – v2::fast --exact --nocapture
 $ cargo test --release --package runtime-params-estimator --lib --features required – v2::slow --exact --nocapture

Look for Runtime::apply lines in the output, they they tell you how
long it takes to process one apply.

In the fast case, every apply takes about 16ms for me. In the slow
case, apply start at about 30ms, but then abruptly go to 600ms.

Would be great to understand why the time suddenly balloons!

The text was updated successfully, but these errors were encountered:

When debugging near#4771, I had to copy-paste tracing_timings module to get a useful profiling output, which isn't ideal. So I improved and published that code to crates.io as `tracing-span-tree`. In particular, this new implementation prints the trace in the correct order, and it also has an `.aggregate(true)` mode for cases where you run the profiled code in a loop and want a summary.

mina86 · 2021-09-06T15:31:57Z

Quick look indicates that the slowdown comes from the storage. I’ve
added some more tracing and here’s output with a fast
process_transaction invocation followed by a slow one (they are
within the same apply call):

147.38µs         Runtime::process_transaction
  145.70µs         verify_and_charge_transaction
    48.12µs          validate_transaction
    46.70µs          get_account
      46.15µs          TrieUpdate::get
    49.21µs          get_access_key
      48.65µs          TrieUpdate::get
…
2.64ms           Runtime::process_transaction
  2.64ms           verify_and_charge_transaction
    49.03µs          validate_transaction
    1.30ms           get_account
      1.30ms           TrieUpdate::get
    1.29ms           get_access_key
      1.29ms           TrieUpdate::get

TrieUpdate::get is suddenly 25 times slower and it’s called quite
a bit so it adds up.

bowenwang1996 · 2021-09-06T17:20:50Z

@mina86 is it because the value is not cached and you have to do a trie traversal?

mina86 · 2021-09-06T18:59:46Z

It doesn’t look like cache issues. I haven’t measured cache
hit ratio or counted how often we hit the storage, but just by looking
at logs the numbers seem similar for fast and slow cases. On the
other hand, Store::get becomes ten times slower all of a sudden:

      74.73µs          Trie::Update::get
        68.40µs          Trie::lookup
          7.45µs           Store::get
          3.50µs           Store::get
          3.16µs           Store::get
          2.10µs           Store::get
          2.82µs           Store::get
          2.79µs           Store::get
          3.38µs           Store::get
          4.09µs           Store::get
          3.77µs           Store::get
          3.07µs           Store::get
          2.45µs           Store::get
          3.23µs           Store::get
          2.91µs           Store::get
          2.31µs           Store::get
          2.42µs           Store::get
          2.63µs           Store::get
          2.59µs           Store::get
        4.83µs           Store::get

vs.

      1.31ms           Trie::Update::get
        1.24ms           Trie::lookup
          86.48µs          Store::get
          72.27µs          Store::get
          70.31µs          Store::get
          69.17µs          Store::get
          68.85µs          Store::get
          68.40µs          Store::get
          68.75µs          Store::get
          69.13µs          Store::get
          69.77µs          Store::get
          68.91µs          Store::get
          68.82µs          Store::get
          69.21µs          Store::get
          68.84µs          Store::get
          78.35µs          Store::get
          75.67µs          Store::get
          76.41µs          Store::get
          75.47µs          Store::get
        69.45µs          Store::get

So I’m now looking at that.

mina86 · 2021-09-06T22:46:45Z

The difference is caused by how signer of the transaction is chosen.
In the fast test signers are chosen in sequence from all the available
accounts. In the slow tests, signers are chosen at random from unused
accounts. Making the latter use sequential accounts fixes the issue:

--- a/runtime/runtime-params-estimator/src/v2/support.rs
+++ b/runtime/runtime-params-estimator/src/v2/support.rs
@@ -202,12 +202,12 @@ impl<'a, 'c> TransactionBuilder<'a, 'c> {
         self.testbed.accounts[account_index].clone()
     }
     pub(crate) fn random_unused_account(&mut self) -> AccountId {
-        loop {
-            let account = self.random_account();
+        for account in &self.testbed.accounts {
             if self.testbed.used_accounts.insert(account.clone()) {
-                return account;
+                return account.clone();
             }
         }
+        panic!("Not enough accounts");
     }
     pub(crate) fn random_account_pair(&mut self) -> (AccountId, AccountId) {
         let first = self.random_account();

I didn’t look into how exactly this affects the database to cause the slowdown.

mina86 · 2021-09-06T22:56:13Z

Perhaps I’ve spoken too soon. The above is correct and changing
random_unused_account implementation does speed up the slow test but
the opposite does not slow down the fast test:

--- a/runtime/runtime-params-estimator/src/v2.rs
+++ b/runtime/runtime-params-estimator/src/v2.rs
@@ -449,10 +449,20 @@ fn fast() {
         use_in_memory_store: false,
     };

-    let mut i = 0;
+    let mut used = std::collections::HashSet::new();
+    let mut rng = rand::thread_rng();
+
     for h in 0..n_blocks {
         let mut block = BlockConfig::at_height((h + 1) as u64);
         for _ in 0..block_size {
+
+            let i = loop {
+                let i = rng.gen_range(0, num_accounts as u32);
+                if used.insert(i) {
+                    break i as usize;
+                }
+            };
+
             let signer_id = accounts[i].clone();
             let receiver_id = signer_id.clone();
             let signer =
@@ -464,7 +474,6 @@ fn fast() {
                 signer,
                 actions: vec![Action::DeployContract(DeployContractAction { code: code.clone() })],
             });
-            i += 1;
         }
         s.blocks.push(block)
     }

It looks like it’s something else calling random_unused_account that
is the culprit.

matklad · 2021-09-07T13:39:11Z

Confirmed that the order of accounts matters here. Iterating accounts by one up or down finished fast. Iterting accounts by eleven shows the same slowdown

diff

diff --git a/runtime/runtime-params-estimator/src/v2.rs b/runtime/runtime-params-estimator/src/v2.rs
index 6a844fcd9..597da1c01 100644
--- a/runtime/runtime-params-estimator/src/v2.rs
+++ b/runtime/runtime-params-estimator/src/v2.rs
@@ -27,7 +27,7 @@ static ALL_COSTS: &[(Cost, fn(&mut Ctx) -> GasCost)] = &[
     (Cost::ActionReceiptCreation, action_receipt_creation),
     (Cost::ActionSirReceiptCreation, action_sir_receipt_creation),
     (Cost::ActionTransfer, action_transfer),
-    (Cost::ActionCreateAccount, action_create_account),
+    // (Cost::ActionCreateAccount, action_create_account),
     (Cost::ActionDeleteAccount, action_delete_account),
     (Cost::ActionAddFullAccessKey, action_add_full_access_key),
     (Cost::ActionAddFunctionAccessKeyBase, action_add_function_access_key_base),
@@ -118,28 +118,28 @@ fn action_transfer(ctx: &mut Ctx) -> GasCost {
     total_cost - base_cost
 }
 
-fn action_create_account(ctx: &mut Ctx) -> GasCost {
-    let total_cost = {
-        let mut testbed = ctx.test_bed();
+// fn action_create_account(ctx: &mut Ctx) -> GasCost {
+//     let total_cost = {
+//         let mut testbed = ctx.test_bed();
 
-        let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
-            let sender = tb.random_account();
-            let new_account =
-                AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
-
-            let actions = vec![
-                Action::CreateAccount(CreateAccountAction {}),
-                Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
-            ];
-            tb.transaction_from_actions(sender, new_account, actions)
-        };
-        testbed.average_transaction_cost(&mut make_transaction)
-    };
+//         let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
+//             let sender = tb.random_account();
+//             let new_account =
+//                 AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
 
-    let base_cost = action_receipt_creation(ctx);
+//             let actions = vec![
+//                 Action::CreateAccount(CreateAccountAction {}),
+//                 Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
+//             ];
+//             tb.transaction_from_actions(sender, new_account, actions)
+//         };
+//         testbed.average_transaction_cost(&mut make_transaction)
+//     };
 
-    total_cost - base_cost
-}
+//     let base_cost = action_receipt_creation(ctx);
+
+//     total_cost - base_cost
+// }
 
 fn action_delete_account(ctx: &mut Ctx) -> GasCost {
     let total_cost = {
diff --git a/runtime/runtime-params-estimator/src/v2/support.rs b/runtime/runtime-params-estimator/src/v2/support.rs
index 6a635eb34..9b62e6fb2 100644
--- a/runtime/runtime-params-estimator/src/v2/support.rs
+++ b/runtime/runtime-params-estimator/src/v2/support.rs
@@ -7,7 +7,7 @@ use near_crypto::{InMemorySigner, KeyType};
 use near_primitives::hash::CryptoHash;
 use near_primitives::transaction::{Action, DeployContractAction, SignedTransaction};
 use near_primitives::types::{AccountId, Gas};
-use rand::prelude::ThreadRng;
+use rand::prelude::{SliceRandom, ThreadRng};
 use rand::Rng;
 
 use crate::cases::ratio_to_gas;
@@ -50,6 +50,7 @@ impl<'c> Ctx<'c> {
             accounts: (0..self.config.active_accounts).map(get_account_id).collect(),
             nonces: HashMap::new(),
             used_accounts: HashSet::new(),
+            idx: 0,
         }
     }
 
@@ -80,6 +81,7 @@ impl<'c> Ctx<'c> {
             accounts: proto.accounts.clone(),
             nonces: proto.nonces.clone(),
             used_accounts: HashSet::new(),
+            idx: 0,
         }
     }
 
@@ -117,6 +119,7 @@ pub(crate) struct TestBed<'c> {
     accounts: Vec<AccountId>,
     nonces: HashMap<AccountId, u64>,
     used_accounts: HashSet<AccountId>,
+    idx: usize,
 }
 
 impl<'c> TestBed<'c> {
@@ -190,24 +193,26 @@ impl<'a, 'c> TransactionBuilder<'a, 'c> {
         )
     }
 
-    pub(crate) fn rng(&mut self) -> ThreadRng {
-        rand::thread_rng()
-    }
+    // pub(crate) fn rng(&mut self) -> ThreadRng {
+    //     rand::thread_rng()
+    // }
 
     pub(crate) fn account(&mut self, account_index: usize) -> AccountId {
         get_account_id(account_index)
     }
     pub(crate) fn random_account(&mut self) -> AccountId {
-        let account_index = self.rng().gen_range(0, self.testbed.accounts.len());
+        self.testbed.idx = (self.testbed.idx + 11) % self.testbed.accounts.len();
+        let account_index = self.testbed.idx;
         self.testbed.accounts[account_index].clone()
     }
     pub(crate) fn random_unused_account(&mut self) -> AccountId {
-        loop {
+        for i in 0..1000 {
             let account = self.random_account();
             if self.testbed.used_accounts.insert(account.clone()) {
                 return account;
             }
         }
+        panic!("no unused accounts")
     }
     pub(crate) fn random_account_pair(&mut self) -> (AccountId, AccountId) {
         let first = self.random_account();

matklad · 2021-09-07T14:02:31Z

Ok, it took me just about a week to realize I can throw flamegrapsh at it 😅

stepping by 11 acconts:

stepping by 1:

profilnes recoreded with perf record -F 9999 --call-graph dwarf target/release/deps/runtime_params_estimator-0328c9c022713977 v2::slow --exact --nocapture and rendered with https://github.com/KDAB/hotspot

The rocksdb::VerifyBlockChecksum looks very suspicious!

I'll stop at this for the time being.

matklad · 2021-09-07T14:03:22Z

(meta -- does anyone know of a nice way to share per profileing results? The above is me scrennshotting hotspot ui, which doesn's seem ideal)

mina86 · 2021-09-07T15:38:27Z

It’s still not clear to me why changing the fast code to use random accounts doesn’t slow it down though. There are some differences how the accounts are set up in both tests so that could be the explanation. In fast test test### accounts are added to genesis and transactions are executed using those. In slow test near_###_### accounts are added by the GenesisBuilder. Though it’s not clear to me why it would make a difference.

bowenwang1996 · 2021-09-07T17:46:16Z

In the fast test signers are chosen in sequence from all the available
accounts. In the slow tests, signers are chosen at random from unused
accounts

@mina86 what are the available accounts? I wonder whether this affects the trie traversal in the sense that if we go over the accounts sequentially, some trie node will be cached between the reads (if not by nearcore then on the OS level the page may be cached) while accessing them randomly makes it harder to cache anything.

mina86 · 2021-09-07T17:57:46Z

The tests create 200k accounts and all transactions are signed by one of them. The test uses each account only once. Sequential reads indeed seem like a plausible explanation though at this moment I think this is affecting caching in RocksDB or on OS level rather than in our code. (I still need to just add counters or profile each of the Trie::* and Store::* methods to be sure).

When debugging #4771, I had to copy-paste tracing_timings module to get a useful profiling output, which isn't ideal. So I improved and published that code to crates.io as `tracing-span-tree`. In particular, this new implementation prints the trace in the correct order, and it also has an `.aggregate(true)` mode for cases where you run the profiled code in a loop and want a summary. Example without aggregation: ``` 328.25ms vm_script_smoke_test 306.34ms run_vm 306.32ms run_wasmer 37.16µs get_key 301.50ms compile_and_serialize_wasmer 4.44ms run_method 2.53ms run_method/instantiate 128.17µs run_method/call 1.74ms run_method/drop_instance 4.50ms run_vm 4.49ms run_wasmer 31.81µs get_key 4.25ms run_method 2.45ms run_method/instantiate 45.46µs run_method/call 1.72ms run_method/drop_instance 4.43ms run_vm 4.42ms run_wasmer 22.43µs get_key 4.21ms run_method 2.41ms run_method/instantiate 42.84µs run_method/call 1.72ms run_method/drop_instance 4.81ms run_vm 4.80ms run_wasmer 22.09µs get_key 4.58ms run_method 2.40ms run_method/instantiate 419.57µs run_method/call 1.73ms run_method/drop_instance ``` Example with aggregation (unit-less number is the number of aggregated calls): ``` 333.98ms vm_script_smoke_test 325.54ms 4 run_vm 325.48ms 4 run_wasmer 307.07ms compile_and_serialize_wasmer 129.19µs 4 get_key 17.32ms 4 run_method 612.46µs 4 run_method/call 6.63ms 4 run_method/drop_instance 9.94ms 4 run_method/instantiate ``` Test Plan --------- This is a debugging utility which is not enabled in production, so I just manually verified that the output makes sense.

When debugging #4771, I had to copy-paste tracing_timings module to get a useful profiling output, which isn't ideal. So I improved and published that code to crates.io as `tracing-span-tree`. In particular, this new implementation prints the trace in the correct order, and it also has an `.aggregate(true)` mode for cases where you run the profiled code in a loop and want a summary.

mina86 · 2021-09-17T01:16:58Z

Turns out

diff --git a/genesis-tools/genesis-populate/src/lib.rs b/genesis-tools/genesis-populate/src/lib.rs
index d3a9793d..3eceddfa 100644
--- a/genesis-tools/genesis-populate/src/lib.rs
+++ b/genesis-tools/genesis-populate/src/lib.rs
@@ -29,7 +29,7 @@ use near_store::{
 use nearcore::{get_store_path, NightshadeRuntime};
 
 fn get_account_id(account_index: u64) -> AccountId {
-    AccountId::try_from(format!("near_{}_{}", account_index, account_index)).unwrap()
+    AccountId::try_from(format!("test{}", account_index)).unwrap()
 }
 
 pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;
diff --git a/runtime/runtime-params-estimator/src/testbed_runners.rs b/runtime/runtime-params-estimator/src/testbed_runners.rs
index 2ede73d9..ab316ae6 100644
--- a/runtime/runtime-params-estimator/src/testbed_runners.rs
+++ b/runtime/runtime-params-estimator/src/testbed_runners.rs
@@ -18,7 +18,7 @@ use std::time::Instant;
 
 /// Get account id from its index.
 pub fn get_account_id(account_index: usize) -> AccountId {
-    AccountId::try_from(format!("near_{}_{}", account_index, account_index)).unwrap()
+    AccountId::try_from(format!("test{}", account_index)).unwrap()
 }
 
 /// Total number of transactions that we need to prepare.

makes the test go fast. The inverse (changing all instances of test{} to near_{}_{} in the code base) plus adding random order of accounts makes the fast test slow.

matklad · 2021-09-17T13:42:01Z

Excellent find! I confirm that the fast test starts going slowly with near_{}_{} accounts, so this removes any param-estimator related weirdness from the equation. I also had to replace test{} with near_{}_{} across the codebase, and not just in a test, which felt odd. Filed #4835

bowenwang1996 · 2021-09-17T16:56:44Z

It feels that this is indeed related to the shape of the trie. @mina86 is it true that changing the names from test{} to near_{}_{} actually changes the number of database reads we do?

mina86 · 2021-09-20T17:45:50Z

There’s only a small difference:

                             test{}      near_{}_{}
Trie::retrieve_raw_bytes   : 1832596  →  1944114
TrieCachingStorage         : 2118923  →  2232747
TrieCachingStorage (cached):       2  →        0
TrieCachingStorage (missed): 2118921  →  2232747
RocksDB::get               : 2322764  →  2436590

bowenwang1996 · 2021-09-21T02:27:49Z

That is weird. Do we see any significant difference in the amount of time spent on RocksDB::get?

When debugging a test it may be desired to be able to see the state of the storage after the tests finishes. Currently, scenarios which use on-disk storage will use a temporary directory which will be deleted after the scenario runs. This of course is correct behaviour but it makes it impossible to investigate the storage of a failing (or otherwise misbehaving) test. Provide a Scenario::home_dir field which specifies where the store will be located. Importantly, Secnario::run will not delete that directory after executing the scenario thus letting the user see the end result. Issue: near#4771

pmnoxx · 2021-11-17T08:28:05Z

It's also likely that something like compaction not being configured right, may be causing huge latency spikes, likes one I saw in benchmark.

bowenwang1996 · 2021-11-17T21:13:16Z

@Longarithm please take a look at @pmnoxx's benchmark

pmnoxx · 2021-11-17T21:28:46Z

@bowenwang1996 @Longarithm

In the test I write 100k or 1m or 10m keys to the database, but I'm only able to read 48.5% of them back successfully. Something seems broken.

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

     Running unittests (/home/pmnoxx/repos/nearcore.master/target/release/deps/ibf-0ff475f5b700c766)

running 1 test
test benchmark_db ... 

took on avg 2.646µs op per sec 377805 got 4879485/10000000
took on avg 2.739µs op per sec 365073 got 4879481/10000000
took on avg 2.696µs op per sec 370818 got 4876333/10000000

Add benchmark for reading / writing from store. See #4771 ``` cd chain/network/benches rm -rf /tmp/xxxxx ; cargo bench benchmark_db ``` TODO: - [x] Add tempdir - [x] Pregenerate keys - [x] Add stats to see number of entries read successfully - [x] Why we can only read 48.5% entries back when we write them to DB? - [x] Move code to a better place - [x] Document what the benchmark does

Longarithm · 2022-01-12T14:55:52Z

I hit this problem in another situation.

I wanted to iterate over Trie contracts, because of the structure the iterator at first went over all accounts, and then over all contracts. Around 15 Dec, it finished in ~1 minute, but now it takes extremely long time. In the example below, it takes ~2 seconds to iterate over 100 accounts, which is not reasonable:

Jan 12 14:26:32.545  INFO neard: 84100
Jan 12 14:26:34.481  INFO neard: 84200
Jan 12 14:26:36.506  INFO neard: 84300
Jan 12 14:26:38.448  INFO neard: 84400
Jan 12 14:26:40.233  INFO neard: 84500
Jan 12 14:26:42.175  INFO neard: 84600
Jan 12 14:26:44.122  INFO neard: 84700
Jan 12 14:26:45.462  INFO neard: 84800
Jan 12 14:26:47.045  INFO neard: 84900
Jan 12 14:26:49.073  INFO neard: 85000

e7e9fbc...c0078ea - this code stops on 84000th account, but it doesn't matter.

Also I believe at some moment the code became fast again, and also if I iterate over N accounts in some previous run, then in the next run these accounts are traversed much faster. This makes the issue much harder to debug, and makes me think that some rocksdb caching steps in here. Profiler shows that rocksdb::BlockBasedTable::Get is the bottleneck: https://share.firefox.dev/3K6VIC8

Though it can be avoided for my particular task, I feel more worried about db performance now.

stale · 2022-04-12T16:23:51Z

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

jakmeier · 2022-08-16T20:34:15Z

I am investigating a similar issue to this one. I think I might have found the explanation: In the slow case, RocksDB is reading state from disk and in the fast case it's all in memory.

My test scenario

In my tests, I create 6000 new accounts with names format!("{hash(i)}_near"), giving much more randomly distributed accounts within the trie structure than any of the tests above.

Reading 400 of those accounts right after their creation from the estimator, I can usually get an account from RocksDB within ~100us. This is remarkably fast because the estimator avoids shard and chunk caches and it clears RocksDB write-back buffers. I also reduced RocksDB block cache to1 kiB but the access is still fast.
With all those caches disabled, the expectation is that reading an account will trigger a trie traversal with every node being read from disk. But 100us is at most one SSD access, more likely it is in-memory only.
I checked that requests to the DB are sent as expected and it seems to be the case. But I noticed that State column access times of merely ~10us. Looks like RocksDB can serve these requests from memory.

So then if I drop OS caches between creating the accounts and reading them, and all of a sudden the State column access time goes up to 7-8ms. Retrieving an account takes at least one such access but could be up to ~20 for trie traversal. With shard cache enabled, it's usually just 2 new nodes per account (extension + leaf), which gives ~15ms per account. This is perfectly in line with 1-2 seconds to iterate over 100 accounts as has been mentioned in older comments.

My conclusion

So, in my case I can trigger disk lookups by dropping the OS page cache. And I see the performance degradation equal to that described in this issue. It could be that the scenarios analysed in this issue also hit the OS page cache.

The Linux page cache is interesting in that it takes all the available unused RAM, which can be many GBs or it can be close to non-existent if the machine's memory is used otherwise.

My guess is that the performance cliff is due to a complicated combination of available page cache size, how accounts are distributed over SST files and in which layer. (Remember that SST files in level 0 are not sorted while all other are. So in layer 0, insertion order matters more than global key order.) Depending on how many SST files need to be fetched, how much fits in the RocksDB block cache, and how much of it fits in the Page Cache, there will be almost no disk accesses (fast case) or there is a new SST file read from disk every dozen or so accounts (slow case) because those accounts are in different SST files in level 0.

EdvardD · 2022-08-17T12:23:45Z

The rocksdb::VerifyBlockChecksum looks very suspicious!

I'm actually already working on that. See the PR with the fork-fix: #7418 and fix in the rocksdb: https://github.com/facebook/rocksdb/pull/10427/files

The thing is that rocksdb doesn't propagate well ReadOptions::very_checksums flag specifically for meta block reads.

jakmeier · 2022-08-17T15:41:43Z

Applying #7418 changes does not resolve my testcase. It's still taking milliseconds to read single values. There seems to be good improvement but not the 1000x we are looking for. I guess this it not too surprising, as my test is only slow if I reset OS page caches, which clearly points to this being disk bound and not CPU bound.

But also, my test case is no longer perfectly consistent anymore, which it was the previous two days. So maybe I was a bit too confident in my previous claims...

jakmeier · 2022-08-26T17:08:37Z

TLDR I found out quite a bit more by now. The slow case observed seems to be mostly due to a bad SST file layout. Whereas the fast case was all served from in-memory write-back buffers. The truth is somewhere in between, probably.

The first big discovery was that instead of clearing OS page caches, which took usually about 2 seconds, I could also do thread::sleep(Duration::from_millis(2000)) to get the same result. Without pause, I get the fast case. With pause, I get almost 1000x slower performance.

Then I managed to get a case where a loop starts with a couple fast iterations and then goes over to slow DB reads. At this very moment, a RocksDB flush of the State column family kicked in. This happens in the background if I wait for long enough but it seems it can be delayed if we are putting enough load on RocksDB otherwise.

The problem is, there is already code to specifically force a flush before each measurement to avoid this problem. Turns out, the forced flush I added some time ago is buggy. See #7484 for the fix. With that fix in, things get really slow. Qemu based estimations showed a steep increase in read bytes and time based estimations sometimes go up by 10x. Seems that the flushing of memtables never really worked as intended.

Now the problem was that things are way too slow. With a 40GB database setup on my local SSD, I managed to get estimations >1s per empty action receipt. About 10'000 times slower than before. Or when measured with qemu, more than 1GB was being read per empty action receipt! With smaller DBs, everything is fast and small read throughput.

The final puzzle peace is that RocksDB will produce quite unfortunate SST file layouts when it is forced to quickly store a lot of data and doesn't get time to clean up. The estimator creates a new DB from a state dump everytime. In other words, all data is written at once right before starting the estimation. Now with the additional force-flush in between. Exactly the recipe for bad SST file layout.

The solution it to run compaction on the newly created DB. With that simply change, I can go back and forth between >1s and 650us per empty action receipt. This is somewhere between the slowest (1s+) and fastest (~100us) cases I could reliably reproduce.

Note that all numbers so far are without dropping OS page cache between measurement. It is not required to explain the huge differences reported in this github issue. But when I do drop the page cache with all fixes applied, the measurement goes from 650us up to 3ms, so maybe 5x times slower. This is a completely reasonable time for a workload that has to go to disk dozens of times.

Now that all of this is explained and written down, I will close this issue. I will post more updates in #5995 with respect to necessary changes to the estimator.

matklad · 2022-08-26T17:27:03Z

I managed to get estimations >1s per empty action receipt. About 10'000 times slower than before.

There should be a Russian-reversal joke how at NEAR we are working hard on making the slowest possible code!

Kudos for getting to the bottom of this and not letting this bug to celebrate its first birthday!

Run compaction on the DB after inserting all records from the state dump. This brings the layout of SST files to a more stable and optimal layout. See near#4771 for more details.

Run compaction on the DB after inserting all records from the state dump. This brings the layout of SST files to a more stable and optimal layout. See #4771 for more details. This is an estimator only change.

matklad added A-transaction-runtime Area: transaction runtime (transaction and receipts processing, state transition, etc) T-core Team: issues relevant to the core team and removed T-core Team: issues relevant to the core team labels Sep 1, 2021

bowenwang1996 added the T-node Team: issues relevant to the node experience team label Sep 1, 2021

matklad mentioned this issue Sep 2, 2021

refactor: use external crate for displaying tracing spans #4781

Merged

matklad mentioned this issue Sep 17, 2021

runtime-tester is sensitive to account names for some reason #4835

Closed

This was referenced Sep 21, 2021

Consider using RuntimeTester to test runtime #4854

Open

Introduce TestEnvBuilder structure #4855

Merged

bowenwang1996 assigned Longarithm and unassigned mina86 Nov 17, 2021

jakmeier mentioned this issue Dec 10, 2021

Fee estimator using QEMU instrumentation: Read I/O bytes seem to be wrong #5678

Closed

stale bot added the S-stale label Apr 12, 2022

exalate-issue-sync bot removed the T-node Team: issues relevant to the node experience team label May 9, 2022

akhi3030 removed the S-stale label Jul 8, 2022

jakmeier mentioned this issue Aug 23, 2022

Robust IO cost measurements in estimator #5995

Open

11 tasks

jakmeier closed this as completed Aug 26, 2022

jakmeier mentioned this issue Aug 30, 2022

feat(estimator): compact RocksDB in testbed setup #7502

Merged

Longarithm mentioned this issue May 26, 2023

Time to process transactions increases over time #3967

Closed

This was referenced Oct 18, 2023

Runtime hits unknown perfomance cliff in the estimator #9933

Closed

Consider using RuntimeTester to test runtime #9957

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runtime hits unknown perfomance cliff in the estimator #4771

Runtime hits unknown perfomance cliff in the estimator #4771

matklad commented Sep 1, 2021 •

edited by exalate-issue-sync bot

Loading

mina86 commented Sep 6, 2021

bowenwang1996 commented Sep 6, 2021

mina86 commented Sep 6, 2021

mina86 commented Sep 6, 2021

mina86 commented Sep 6, 2021

matklad commented Sep 7, 2021 •

edited

Loading

matklad commented Sep 7, 2021

matklad commented Sep 7, 2021

mina86 commented Sep 7, 2021

bowenwang1996 commented Sep 7, 2021

mina86 commented Sep 7, 2021

mina86 commented Sep 17, 2021

matklad commented Sep 17, 2021

bowenwang1996 commented Sep 17, 2021

mina86 commented Sep 20, 2021

bowenwang1996 commented Sep 21, 2021

pmnoxx commented Nov 17, 2021

bowenwang1996 commented Nov 17, 2021

pmnoxx commented Nov 17, 2021 •

edited

Loading

Longarithm commented Jan 12, 2022

stale bot commented Apr 12, 2022

jakmeier commented Aug 16, 2022

EdvardD commented Aug 17, 2022 •

edited

Loading

jakmeier commented Aug 17, 2022

jakmeier commented Aug 26, 2022

matklad commented Aug 26, 2022

Runtime hits unknown perfomance cliff in the estimator #4771

Runtime hits unknown perfomance cliff in the estimator #4771

Comments

matklad commented Sep 1, 2021 • edited by exalate-issue-sync bot Loading

mina86 commented Sep 6, 2021

bowenwang1996 commented Sep 6, 2021

mina86 commented Sep 6, 2021

mina86 commented Sep 6, 2021

mina86 commented Sep 6, 2021

matklad commented Sep 7, 2021 • edited Loading

matklad commented Sep 7, 2021

matklad commented Sep 7, 2021

mina86 commented Sep 7, 2021

bowenwang1996 commented Sep 7, 2021

mina86 commented Sep 7, 2021

mina86 commented Sep 17, 2021

matklad commented Sep 17, 2021

bowenwang1996 commented Sep 17, 2021

mina86 commented Sep 20, 2021

bowenwang1996 commented Sep 21, 2021

pmnoxx commented Nov 17, 2021

bowenwang1996 commented Nov 17, 2021

pmnoxx commented Nov 17, 2021 • edited Loading

Longarithm commented Jan 12, 2022

stale bot commented Apr 12, 2022

jakmeier commented Aug 16, 2022

My test scenario

My conclusion

EdvardD commented Aug 17, 2022 • edited Loading

jakmeier commented Aug 17, 2022

jakmeier commented Aug 26, 2022

matklad commented Aug 26, 2022

matklad commented Sep 1, 2021 •

edited by exalate-issue-sync bot

Loading

matklad commented Sep 7, 2021 •

edited

Loading

pmnoxx commented Nov 17, 2021 •

edited

Loading

EdvardD commented Aug 17, 2022 •

edited

Loading