Runtime hits unknown performance cliff in the estimator #4771
Comments
Quick look indicates that the slowdown comes from the storage. I’ve
@mina86 is it because the value is not cached and you have to do a trie traversal?
It doesn’t look like cache issues. I haven’t measured cache
vs.
So I’m now looking at that.
The difference is caused by how the signer of the transaction is chosen.
I didn’t look into how exactly this affects the database to cause the slowdown.
Perhaps I’ve spoken too soon. The above is correct and changing
It looks like it’s something else calling
Confirmed that the order of accounts matters here. Iterating accounts by one up or down finishes fast. Iterating accounts by eleven shows the same slowdown:

```diff
diff --git a/runtime/runtime-params-estimator/src/v2.rs b/runtime/runtime-params-estimator/src/v2.rs
index 6a844fcd9..597da1c01 100644
--- a/runtime/runtime-params-estimator/src/v2.rs
+++ b/runtime/runtime-params-estimator/src/v2.rs
@@ -27,7 +27,7 @@ static ALL_COSTS: &[(Cost, fn(&mut Ctx) -> GasCost)] = &[
(Cost::ActionReceiptCreation, action_receipt_creation),
(Cost::ActionSirReceiptCreation, action_sir_receipt_creation),
(Cost::ActionTransfer, action_transfer),
- (Cost::ActionCreateAccount, action_create_account),
+ // (Cost::ActionCreateAccount, action_create_account),
(Cost::ActionDeleteAccount, action_delete_account),
(Cost::ActionAddFullAccessKey, action_add_full_access_key),
(Cost::ActionAddFunctionAccessKeyBase, action_add_function_access_key_base),
@@ -118,28 +118,28 @@ fn action_transfer(ctx: &mut Ctx) -> GasCost {
total_cost - base_cost
}
-fn action_create_account(ctx: &mut Ctx) -> GasCost {
- let total_cost = {
- let mut testbed = ctx.test_bed();
+// fn action_create_account(ctx: &mut Ctx) -> GasCost {
+// let total_cost = {
+// let mut testbed = ctx.test_bed();
- let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
- let sender = tb.random_account();
- let new_account =
- AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
-
- let actions = vec![
- Action::CreateAccount(CreateAccountAction {}),
- Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
- ];
- tb.transaction_from_actions(sender, new_account, actions)
- };
- testbed.average_transaction_cost(&mut make_transaction)
- };
+// let mut make_transaction = |mut tb: TransactionBuilder<'_, '_>| -> SignedTransaction {
+// let sender = tb.random_account();
+// let new_account =
+// AccountId::try_from(format!("{}_{}", sender, tb.rng().gen::<u64>())).unwrap();
- let base_cost = action_receipt_creation(ctx);
+// let actions = vec![
+// Action::CreateAccount(CreateAccountAction {}),
+// Action::Transfer(TransferAction { deposit: 10u128.pow(26) }),
+// ];
+// tb.transaction_from_actions(sender, new_account, actions)
+// };
+// testbed.average_transaction_cost(&mut make_transaction)
+// };
- total_cost - base_cost
-}
+// let base_cost = action_receipt_creation(ctx);
+
+// total_cost - base_cost
+// }
fn action_delete_account(ctx: &mut Ctx) -> GasCost {
let total_cost = {
diff --git a/runtime/runtime-params-estimator/src/v2/support.rs b/runtime/runtime-params-estimator/src/v2/support.rs
index 6a635eb34..9b62e6fb2 100644
--- a/runtime/runtime-params-estimator/src/v2/support.rs
+++ b/runtime/runtime-params-estimator/src/v2/support.rs
@@ -7,7 +7,7 @@ use near_crypto::{InMemorySigner, KeyType};
use near_primitives::hash::CryptoHash;
use near_primitives::transaction::{Action, DeployContractAction, SignedTransaction};
use near_primitives::types::{AccountId, Gas};
-use rand::prelude::ThreadRng;
+use rand::prelude::{SliceRandom, ThreadRng};
use rand::Rng;
use crate::cases::ratio_to_gas;
@@ -50,6 +50,7 @@ impl<'c> Ctx<'c> {
accounts: (0..self.config.active_accounts).map(get_account_id).collect(),
nonces: HashMap::new(),
used_accounts: HashSet::new(),
+ idx: 0,
}
}
@@ -80,6 +81,7 @@ impl<'c> Ctx<'c> {
accounts: proto.accounts.clone(),
nonces: proto.nonces.clone(),
used_accounts: HashSet::new(),
+ idx: 0,
}
}
@@ -117,6 +119,7 @@ pub(crate) struct TestBed<'c> {
accounts: Vec<AccountId>,
nonces: HashMap<AccountId, u64>,
used_accounts: HashSet<AccountId>,
+ idx: usize,
}
impl<'c> TestBed<'c> {
@@ -190,24 +193,26 @@ impl<'a, 'c> TransactionBuilder<'a, 'c> {
)
}
- pub(crate) fn rng(&mut self) -> ThreadRng {
- rand::thread_rng()
- }
+ // pub(crate) fn rng(&mut self) -> ThreadRng {
+ // rand::thread_rng()
+ // }
pub(crate) fn account(&mut self, account_index: usize) -> AccountId {
get_account_id(account_index)
}
pub(crate) fn random_account(&mut self) -> AccountId {
- let account_index = self.rng().gen_range(0, self.testbed.accounts.len());
+ self.testbed.idx = (self.testbed.idx + 11) % self.testbed.accounts.len();
+ let account_index = self.testbed.idx;
self.testbed.accounts[account_index].clone()
}
pub(crate) fn random_unused_account(&mut self) -> AccountId {
- loop {
+ for i in 0..1000 {
let account = self.random_account();
if self.testbed.used_accounts.insert(account.clone()) {
return account;
}
}
+ panic!("no unused accounts")
}
pub(crate) fn random_account_pair(&mut self) -> (AccountId, AccountId) {
let first = self.random_account(); |
Ok, it took me just about a week to realize I can throw flamegraphs at it 😅

Stepping by 11 accounts: (flamegraph screenshot)

Stepping by 1: (flamegraph screenshot)

Profiles recorded with
The
I'll stop at this for the time being.
(meta -- does anyone know of a nice way to share perf profiling results? The above is me screenshotting the hotspot UI, which doesn't seem ideal)
It’s still not clear to me why changing the fast code to use random accounts doesn’t slow it down, though. There are some differences in how the accounts are set up in the two tests, so that could be the explanation. In the fast test
@mina86 what are the available accounts? I wonder whether this affects the trie traversal, in the sense that if we go over the accounts sequentially, some trie node will be cached between the reads (if not by nearcore then on the OS level the page may be cached), while accessing them randomly makes it harder to cache anything.
The tests create 200k accounts and all transactions are signed by one of them. The test uses each account only once. Sequential reads indeed seem like a plausible explanation, though at this moment I think this is affecting caching in RocksDB or on the OS level rather than in our code. (I still need to just add counters or profile each of the Trie::* and Store::* methods to be sure.)
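A sketch of the "just add counters" idea under stated assumptions: a process-wide read counter that the Store/Trie read paths would bump (the hook points are hypothetical), reported once per measured block:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter; the idea is to bump it from every Store/Trie read
// path (hypothetical hook points) and dump it once per measured block.
static DB_READS: AtomicUsize = AtomicUsize::new(0);

fn record_read() {
    DB_READS.fetch_add(1, Ordering::Relaxed);
}

fn report_and_reset() -> usize {
    DB_READS.swap(0, Ordering::Relaxed)
}

fn main() {
    record_read();
    record_read();
    println!("db reads this block: {}", report_and_reset());
}
```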
When debugging #4771, I had to copy-paste the tracing_timings module to get useful profiling output, which isn't ideal. So I improved and published that code to crates.io as `tracing-span-tree`. In particular, this new implementation prints the trace in the correct order, and it also has an `.aggregate(true)` mode for cases where you run the profiled code in a loop and want a summary.

Example without aggregation:

```
328.25ms vm_script_smoke_test
  306.34ms run_vm
    306.32ms run_wasmer
      37.16µs get_key
      301.50ms compile_and_serialize_wasmer
      4.44ms run_method
        2.53ms run_method/instantiate
        128.17µs run_method/call
        1.74ms run_method/drop_instance
  4.50ms run_vm
    4.49ms run_wasmer
      31.81µs get_key
      4.25ms run_method
        2.45ms run_method/instantiate
        45.46µs run_method/call
        1.72ms run_method/drop_instance
  4.43ms run_vm
    4.42ms run_wasmer
      22.43µs get_key
      4.21ms run_method
        2.41ms run_method/instantiate
        42.84µs run_method/call
        1.72ms run_method/drop_instance
  4.81ms run_vm
    4.80ms run_wasmer
      22.09µs get_key
      4.58ms run_method
        2.40ms run_method/instantiate
        419.57µs run_method/call
        1.73ms run_method/drop_instance
```

Example with aggregation (the unit-less number is the number of aggregated calls):

```
333.98ms vm_script_smoke_test
  325.54ms 4 run_vm
    325.48ms 4 run_wasmer
      307.07ms compile_and_serialize_wasmer
      129.19µs 4 get_key
      17.32ms 4 run_method
        612.46µs 4 run_method/call
        6.63ms 4 run_method/drop_instance
        9.94ms 4 run_method/instantiate
```

Test Plan
---------
This is a debugging utility which is not enabled in production, so I just manually verified that the output makes sense.
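For reference, wiring this up is a one-liner; a minimal sketch (the `span_tree().enable()` entry point is assumed from the crate's docs, `.aggregate(true)` is quoted from the description above):

```rust
use tracing::instrument;

#[instrument] // each call becomes a node in the printed span tree
fn run_method() {
    // ... profiled work ...
}

fn main() {
    // Install the tracing-span-tree subscriber; with aggregation, repeated
    // spans are merged and prefixed with a call count, as in the output above.
    tracing_span_tree::span_tree().aggregate(true).enable();
    for _ in 0..4 {
        run_method();
    }
}
```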
Turns out
makes the test go fast. The inverse (changing all instances of
Excellent find! I confirm that the
It feels that this is indeed related to the shape of the trie. @mina86 is it true that changing the names from
There’s only a small difference:
That is weird. Do we see any significant difference in the amount of time spent on
When debugging a test it may be desirable to see the state of the storage after the test finishes. Currently, scenarios which use on-disk storage use a temporary directory which is deleted after the scenario runs. This is of course correct behaviour, but it makes it impossible to investigate the storage of a failing (or otherwise misbehaving) test.

Provide a Scenario::home_dir field which specifies where the store will be located. Importantly, Scenario::run will not delete that directory after executing the scenario, thus letting the user see the end result.

Issue: near#4771
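A hypothetical usage sketch, based only on the `Scenario::home_dir` field and `Scenario::run` method named above; the import path, construction, field type, and file format are all assumptions:

```rust
// use runtime_tester::Scenario; // assumed import path
fn main() -> anyhow::Result<()> {
    let json = std::fs::read_to_string("scenario.json")?;
    let mut scenario: Scenario = serde_json::from_str(&json)?;
    // Point the on-disk store at a directory that survives the run
    // (field assumed to be an optional path).
    scenario.home_dir = Some("/tmp/estimator-debug".into());
    scenario.run()?; // inspect /tmp/estimator-debug afterwards
    Ok(())
}
```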
It's also likely that something like compaction not being configured right may be causing huge latency spikes, like the one I saw in the benchmark.
@Longarithm please take a look at @pmnoxx's benchmark
In the test I write 100k or 1m or 10m keys to the database, but I'm only able to read 48.5% of them back successfully. Something seems broken.
Add benchmark for reading / writing from store. See #4771

```
cd chain/network/benches
rm -rf /tmp/xxxxx ; cargo bench benchmark_db
```

TODO:

- [x] Add tempdir
- [x] Pregenerate keys
- [x] Add stats to see number of entries read successfully
- [x] Why we can only read 48.5% entries back when we write them to DB?
- [x] Move code to a better place
- [x] Document what the benchmark does
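A self-contained sketch of the same write-then-read-back check (not the PR's code; the `rocksdb` and `tempfile` crates are assumed):

```rust
use rocksdb::DB;

fn main() {
    let dir = tempfile::tempdir().unwrap();
    let db = DB::open_default(dir.path()).unwrap();

    // Pregenerate keys so key construction isn't part of the measurement.
    let keys: Vec<[u8; 8]> = (0..100_000u64).map(|i| i.to_le_bytes()).collect();
    for k in &keys {
        db.put(k, b"value").unwrap();
    }

    // Count how many entries can be read back successfully; anything below
    // 100% (like the 48.5% above) points at a bug in the benchmark or setup.
    let found = keys.iter().filter(|k| db.get(k).unwrap().is_some()).count();
    println!("read back {}/{} entries", found, keys.len());
}
```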
I hit this problem in another situation. I wanted to iterate over Trie contracts; because of the structure, the iterator first went over all accounts, and then over all contracts. Around 15 Dec, it finished in ~1 minute, but now it takes an extremely long time. In the example below, it takes ~2 seconds to iterate over 100 accounts, which is not reasonable:
e7e9fbc...c0078ea - this code stops on

Also, I believe at some moment the code became fast again; and if I iterate over N accounts in one run, then in the next run these accounts are traversed much faster. This makes the issue much harder to debug, and makes me think that some RocksDB caching steps in here.

Profiler shows that

Though it can be avoided for my particular task, I feel more worried about DB performance now.
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
I am investigating a similar issue to this one. I think I might have found the explanation: in the slow case, RocksDB is reading state from disk, and in the fast case it's all in memory.

**My test scenario**

In my tests, I create 6000 new accounts with names
Reading 400 of those accounts right after their creation from the estimator, I can usually get an account from RocksDB within ~100us. This is remarkably fast because the estimator avoids shard and chunk caches and it clears RocksDB write-back buffers. I also reduced the RocksDB block cache to 1 kiB, but the access is still fast.

So then I drop OS caches between creating the accounts and reading them, and all of a sudden the

**My conclusion**

So, in my case I can trigger disk lookups by dropping the OS page cache. And I see performance degradation equal to that described in this issue. It could be that the scenarios analysed in this issue also hit the OS page cache. The Linux page cache is interesting in that it takes all the available unused RAM, which can be many GBs, or it can be close to non-existent if the machine's memory is used otherwise.

My guess is that the performance cliff is due to a complicated combination of available page cache size, and how accounts are distributed over SST files and in which level. (Remember that SST files in level 0 are not sorted while all others are. So in level 0, insertion order matters more than global key order.) Depending on how many SST files need to be fetched, how much fits in the RocksDB block cache, and how much of it fits in the page cache, there will be almost no disk accesses (fast case), or there is a new SST file read from disk every dozen or so accounts (slow case) because those accounts are in different SST files in level 0.
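For reference, dropping the Linux page cache between runs (what "drop OS caches" refers to above) goes through procfs; a minimal sketch (requires root):

```rust
use std::fs;
use std::process::Command;

/// Drop the Linux page cache via the standard procfs mechanism. `sync` first
/// so dirty pages are written back before the cache is discarded.
fn drop_page_cache() -> std::io::Result<()> {
    Command::new("sync").status()?;
    fs::write("/proc/sys/vm/drop_caches", "3")
}

fn main() {
    drop_page_cache().expect("dropping caches requires root");
}
```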
I'm actually already working on that. See the PR with the fork-fix: #7418 and the fix in RocksDB: https://github.com/facebook/rocksdb/pull/10427/files The thing is that RocksDB doesn't properly propagate the ReadOptions::verify_checksums flag, specifically for meta-block reads.
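For reference, this is the flag as exposed by the Rust `rocksdb` crate (`ReadOptions::set_verify_checksums`); a minimal sketch:

```rust
use rocksdb::{DB, ReadOptions};

fn main() {
    let db = DB::open_default("/tmp/example-db").unwrap();
    let mut ro = ReadOptions::default();
    // The flag in question: with `false`, RocksDB should skip checksum
    // verification on reads -- but per the linked fix this was not honoured
    // for meta-block reads.
    ro.set_verify_checksums(false);
    let _ = db.get_opt(b"some-key", &ro).unwrap();
}
```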
Applying the #7418 changes does not resolve my testcase. It's still taking milliseconds to read single values. There seems to be good improvement, but not the 1000x we are looking for. I guess this is not too surprising, as my test is only slow if I reset OS page caches, which clearly points to this being disk-bound and not CPU-bound. But also, my test case is no longer perfectly consistent, which it was the previous two days. So maybe I was a bit too confident in my previous claims...
TL;DR: I found out quite a bit more by now. The slow case observed seems to be mostly due to a bad SST file layout, whereas the fast case was all served from in-memory write-back buffers. The truth is somewhere in between, probably.

The first big discovery was that instead of clearing OS page caches, which usually took about 2 seconds, I could also do

Then I managed to get a case where a loop starts with a couple of fast iterations and then goes over to slow DB reads. At this very moment, a RocksDB flush of the

The problem is, there is already code to specifically force a flush before each measurement to avoid this problem. Turns out, the forced flush I added some time ago is buggy. See #7484 for the fix. With that fix in, things get really slow. Qemu-based estimations showed a steep increase in read bytes, and time-based estimations sometimes go up by 10x. It seems that the flushing of memtables never really worked as intended.

Now the problem was that things are way too slow. With a 40GB database set up on my local SSD, I managed to get estimations >1s per empty action receipt, about 10'000 times slower than before. Or, when measured with qemu, more than 1GB was being read per empty action receipt! With smaller DBs, everything is fast and read throughput is small.

The final puzzle piece is that RocksDB will produce quite unfortunate SST file layouts when it is forced to quickly store a lot of data and doesn't get time to clean up. The estimator creates a new DB from a state dump every time. In other words, all data is written at once right before starting the estimation, now with the additional force-flush in between. Exactly the recipe for a bad SST file layout.

The solution is to run compaction on the newly created DB. With that simple change, I can go back and forth between >1s and 650us per empty action receipt. This is somewhere between the slowest (1s+) and fastest (~100us) cases I could reliably reproduce.

Note that all numbers so far are without dropping the OS page cache between measurements. It is not required to explain the huge differences reported in this github issue. But when I do drop the page cache with all fixes applied, the measurement goes from 650us up to 3ms, so maybe 5x slower. This is a completely reasonable time for a workload that has to go to disk dozens of times.

Now that all of this is explained and written down, I will close this issue. I will post more updates in #5995 with respect to necessary changes to the estimator.
There should be a Russian-reversal joke about how at NEAR we are working hard on making the slowest possible code! Kudos for getting to the bottom of this and not letting this bug celebrate its first birthday!
Run compaction on the DB after inserting all records from the state dump. This brings the SST files to a more stable and optimal layout. See #4771 for more details. This is an estimator-only change.
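For illustration, a full-range compaction with the Rust `rocksdb` crate looks like this (a sketch; the estimator actually goes through nearcore's store wrapper):

```rust
use rocksdb::DB;

fn main() {
    let db = DB::open_default("/tmp/estimator-db").unwrap();
    // Compact the entire key range so the freshly bulk-loaded and
    // force-flushed SST files are rewritten into a stable, sorted layout
    // before any measurement starts.
    db.compact_range::<&[u8], &[u8]>(None, None);
}
```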
See matklad@80b44a8
In that commit, I add a benchmark which sends 150 blocks, where each block contains 100 deploy-code transactions.
EDIT: #4771 (comment) is a modernized version.
The benchmark is implemented in two ways:
In the first case, I observe a very strange behavior, where at some point block processing time severely degrades.
To reproduce, check out this commit and run
Look for `Runtime::apply` lines in the output; they tell you how long it takes to process one apply.
In the `fast` case, every apply takes about 16ms for me. In the `slow` case, applies start at about 30ms, but then abruptly go to 600ms.
Would be great to understand why the time suddenly balloons!