Robust IO cost measurements in estimator #5995
The work that @Longarithm did on visualizing the trie could be helpful here.
Some thoughts with regard to background threads:
Yeah, I think that makes sense.
I measured RocksDB insertion performance for sequential keys with 1kB and 100kB sized values.
The cases with forced compaction include a manually triggered compaction after the insertions, adding to the total time. The local machine uses a consumer NVMe SSD and an AMD Ryzen 5 3600 6-core processor with a CPU clock range of 2200 - 4703 MHz.

Statistical considerations

Reported are the arithmetic mean (avg), minimum, and maximum measurements for a 1GB iteration. However, the distributions are definitely not Gaussian, judging by the histograms I looked at. The right tail is much longer, so take the average with a grain of salt; the median would generally be quite a bit below the average. On the other hand, there are no gaps in the distribution, i.e. all values between min and max have some chance to be observed, and the probability decreases almost monotonically towards the larger times.

Source code for rerunning

The source code of the benchmark is on a separate branch so far. Example command:

# Assumes data_dump_0001.bin has been generated and should be reused for all iterations
for run in {1..100}; do cargo run --release -p runtime-params-estimator --features required -- --home ~/.near --metric time --costs RocksDbInsertValueKByte --pr-data-path data_dump_0001.bin --debug rocksdb; done >> rocksdb_insert_value_kb.txt
awk -f summary.awk rocksdb_insert_value_kb.txt

AWK script to format output:

# summary.awk
/setup/{c=0};
/Cost/{
# Convert seconds and ms to common base
if($3~/ms/){ c = substr($3,1,length($3)-2) }
else{ c = 1000 * substr($3,1,length($3)-1) }
};
# Record the LSM tree shape per level, before (lb) and after (la) the measurement
/level/{if (c==0){lb[$5]=$1;} else {la[$5]=$1}}
/Finished/{
counts++;
sums+=c;
if(max<c) { max=c; }
if(min>c || min==0) { min=c; }
}
END {
printf "%3d | %7.3f s | %7.3f s | %7.3f s |\n", counts, sums/counts/1000, min/1000, max/1000
}

Interpretation

We shouldn't read too much into the exact numbers here, since these will vary with workload and hardware. But to gain some intuition: each second reported here corresponds to 1 microsecond per 1kB, or 1ms per 1MB. If we were to translate this to gas cost, the 33.224s measurement would lead to 3.3224 TGas per MB inserted. Now compare that to what we currently charge for 10x 100kB insertions:
Total: 31.66 TGas per MB

This is around 10 times more than my measurements showed even in the absolute worst case, which seems like a good place to be right now. But it also hints that we can maybe afford to lower storage write costs, even when considering overhead costs and including some safety margin.

Key Insights
Additional Details

What is reported in this section shouldn't be relevant for the conclusions drawn above. However, for the extra curious reader (and my own benefit, to sort my thoughts), I wrote down some more of my analysis on this topic in the spoilers below.

LSMT shape analysis

It turns out that a large part of the variance in the measurements is correlated with differences in the shape of the underlying LSM tree structure within RocksDB. Example for 100kB local, forced compaction:
The above table shows a very small difference in the shape of the tree after the setup, just one fewer file in level 2 of the LSM tree. (And an equally small difference at the end of the benchmark.) This was enough to bring the minimum in the smaller sample down to 6.2s, whereas the larger sample never finished before 9.7s. On the other hand, the average is nevertheless ~15% higher in the second sample. (Commenting on the max is difficult: because of the small sample size in the second group, a less extreme max is expected to some degree.)

There are many more different shapes when looking at the results without manual compaction. Example for 100kB local:
The inserted data has been exactly the same, for the setup and the actual measurements. Thus, for now, I can only assume these differences have to do with varying CPU and SSD timings. There is a trend that groups where the final tree has more files in L2 (e.g. 32) finish faster than those with fewer (e.g. 28). But it is not clear at all whether the different shapes are the reason for this difference in performance, or the other way around.

Looking at the same data on the Google Cloud VM, the shapes are much more consistent even without compaction. Example for 100kB gcloud, forced compaction:
Example for 100kB gcloud:
My next steps

This first step was mostly for me to learn and intuitively understand what I can expect from RocksDB performance measurements. Next, I will use what I learned here to integrate robust IO measurements into the parameter estimator.
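For reference, the measurement pattern described above (insert sequential keys, optionally force a flush and compaction, and time the whole run) can be sketched roughly as follows. This assumes the rust-rocksdb crate and uses placeholder paths and sizes; it is not the code on the benchmark branch.

```rust
// Minimal sketch of the insertion benchmark with forced compaction.
// Assumes the `rocksdb` crate; path and sizes are placeholders.
use std::time::Instant;

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = rocksdb::Options::default();
    opts.create_if_missing(true);
    let db = rocksdb::DB::open(&opts, "/tmp/rocksdb-insert-bench")?;

    // Insert ~1 GB as sequential keys with 100 kB values.
    let value = vec![0u8; 100 * 1024];
    let n: u64 = 10_000;
    let start = Instant::now();
    for i in 0..n {
        // Big-endian counters keep the keys in sequential sort order.
        db.put(i.to_be_bytes(), &value)?;
    }

    // "Forced compaction" variant: flush memtables and compact the full key
    // range, so the cleanup work is included in the measured time.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);

    println!("inserted {} values in {:?}", n, start.elapsed());
    Ok(())
}
```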
@jakmeier What about
Good point, I should have mentioned that. In this instance, all keys have fewer than 10 bytes. The associated cost is less than 10 * 70'482'867 and thus completely negligible for the purpose of this high-level comparison of raw RocksDB performance vs. current gas cost.
An interesting find regarding read performance: writes that have not been flushed & compacted negatively impact read performance. Here are two plots produced using the script rocksdb_read_benchmark.sh: rocksdb_read_benchmark.pdf

Clearly, forcing a flush and compaction before the read measurement starts is beneficial. In other words, cleaning up the RocksDB structure does measurably affect read performance. Which raises the question: is this overhead, caused by earlier writes, something that should be accounted for in the gas cost for reads? For now, the new estimations But the discussion on how we want to treat storage costs in the future is ongoing.

Personally, I think we should enforce compaction at a certain interval (e.g. after every block), add the compaction cost to the write gas cost, and assume reads are always on a clean state. This should be more stable and reliable than what we have today.
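To make the comparison concrete: the two variants differ only in whether the store is cleaned up before timing the reads. A minimal sketch of that idea, assuming the rust-rocksdb crate (the actual measurements come from rocksdb_read_benchmark.sh, not this code):

```rust
// Sketch of a read measurement on a "clean" RocksDB state.
// Assumes the `rocksdb` crate and that keys were written earlier as
// big-endian integers; not the actual benchmark script.
use std::time::Instant;

fn main() -> Result<(), rocksdb::Error> {
    let db = rocksdb::DB::open_default("/tmp/rocksdb-read-bench")?;

    // Clean-state variant: flush memtables and compact the full key range
    // before the read measurement starts. Skip these two calls to measure
    // reads on top of unflushed/uncompacted writes instead.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);

    let n: u64 = 10_000;
    let start = Instant::now();
    for i in 0..n {
        let _ = db.get(i.to_be_bytes())?;
    }
    println!("read {} values in {:?}", n, start.elapsed());
    Ok(())
}
```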
@jakmeier FYI, there is some plan to have a separate hashtable-based storage for reads that would allow for better performance.
Another issue I discovered in the last few weeks is that the estimator currently cannot run on large DB stores. If I insert 8 million accounts, the OOM killer kicks in on a machine with 16GB RAM. I think we load the entire DB into memory for the test setup, which seems generally bad for storage estimation.
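As an illustration of the alternative (not a concrete proposal for nearcore's state dump format), the setup could stream records into RocksDB in bounded batches instead of materializing everything in memory first. The record iterator and the batch size below are hypothetical; only the batching pattern is the point. Sketch assuming the rust-rocksdb crate:

```rust
// Sketch of a streaming store setup, assuming the `rocksdb` crate.
// `records` and its (key, value) format are hypothetical; the point is only
// that data is written in bounded batches instead of being held in memory.
fn populate_store(
    db: &rocksdb::DB,
    records: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<(), rocksdb::Error> {
    let mut batch = rocksdb::WriteBatch::default();
    let mut batch_bytes = 0usize;
    for (key, value) in records {
        batch_bytes += key.len() + value.len();
        batch.put(&key, &value);
        // Write out the batch every ~64 MB so memory use stays bounded.
        if batch_bytes >= 64 * 1024 * 1024 {
            db.write(std::mem::take(&mut batch))?;
            batch_bytes = 0;
        }
    }
    db.write(batch)?;
    Ok(())
}
```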
Indeed. Could we fix that?
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
I looked into this a bit more today. With a large DB, creating a new store from a state dump often takes the majority of the time for estimations. Unfortunately, my attempts to optimize this have failed so far.
In conclusion, it would ultimately be better to avoid the need for huge databases in the estimator. Creating huge DBs has not reliably produced worst-case access times anyway. For example, #4771 shows that even small amounts of data can produce really bad cases, far worse than what a DB with 50M accounts produces. My current plan is to either find specific bad cases with small data, or otherwise rely on counting DB operations and assuming fixed latencies. The next step forward for this is to implement #7058.
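To spell out what "counting DB operations and assuming fixed latencies" could look like, here is a simplified sketch (not the design of #7058; the latency numbers are made-up placeholders):

```rust
// Simplified sketch of estimating IO cost from operation counts.
// The latency constants in main() are made-up placeholders, not measurements.
#[derive(Default)]
struct DbOpCounter {
    reads: u64,
    writes: u64,
}

impl DbOpCounter {
    fn record_read(&mut self) {
        self.reads += 1;
    }
    fn record_write(&mut self) {
        self.writes += 1;
    }
    /// Translate counted operations into an estimated time, assuming a fixed
    /// latency per read and per write.
    fn estimated_nanos(&self, read_latency_ns: u64, write_latency_ns: u64) -> u64 {
        self.reads * read_latency_ns + self.writes * write_latency_ns
    }
}

fn main() {
    let mut counter = DbOpCounter::default();
    // In the estimator, the storage layer would increment these counters.
    counter.record_read();
    counter.record_read();
    counter.record_write();
    // Placeholder latencies: 20 µs per read, 50 µs per write.
    println!("estimated IO time: {} ns", counter.estimated_nanos(20_000, 50_000));
}
```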
I am happy to declare that I have finally reached the point where "all worst-case assumptions" proves to be too pessimistic. Combining these worst known cases leads to a processing time of more than 1s for an empty action receipt. Clearly, this is going too far: there is no way this could be reproduced in production, and yet in the estimator we can reproduce it. See #4771.

I think the way forward is to reconsider what a reasonably pessimistic assumption is. I would go with enabled caches, an (almost) fully compacted RocksDB, and only selective cache clearing where it makes sense. I will post more updates once I have had the chance to play around with these ideas.
We have to make sure our IO measurements for the parameter estimator are:
To get there, I think I need to first understand our storage layers better and we should probably reflect on what the goal of the measurement is.
Todos
Background
When running the parameter estimator inside QEMU with the ICount metric, it reports IO bytes read and written. This is done by intercepting syscalls, which should give the most accurate picture of how much actual IO is triggered.
However, there are a number of challenges with this approach. Simply looking at the counters before and after the execution of some code does not give an accurate picture. There are at least 3 effects we should take into account.
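As a side note on the raw counter-delta pattern itself: outside of QEMU, a similar picture can be obtained from Linux's /proc/self/io, and the same caveats apply. This is not how the ICount metric works internally; it is only a minimal sketch of the before/after measurement discussed here, in Rust:

```rust
// Sketch of the naive before/after counter pattern, using Linux's
// /proc/self/io instead of the QEMU syscall interception the estimator uses.
// This only illustrates why point-in-time counter deltas can mislead.
use std::fs;

fn io_bytes() -> (u64, u64) {
    // /proc/self/io contains lines like "read_bytes: 123" and "write_bytes: 456".
    let stats = fs::read_to_string("/proc/self/io").expect("only available on Linux");
    let field = |name: &str| -> u64 {
        stats
            .lines()
            .find(|l| l.starts_with(name))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|v| v.parse().ok())
            .unwrap_or(0)
    };
    (field("read_bytes:"), field("write_bytes:"))
}

fn main() {
    let (read_before, write_before) = io_bytes();
    // ... run the code under measurement here ...
    let (read_after, write_after) = io_bytes();
    println!(
        "read: {} B, written: {} B",
        read_after - read_before,
        write_after - write_before
    );
}
```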
Prior Status Quo
Our IO measurements already disable (some) caches in an attempt to address point 1. Point 2 is somewhat addressed by using the same state every time, which at least makes the estimations consistent. Point 3 is not well understood right now, especially in the context of the estimator (at least by me).
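For what it's worth, "disable (some) caches" at the RocksDB level boils down to something like the sketch below, assuming the rust-rocksdb crate; the estimator's actual store setup lives in near-store and may differ. Note that this only bypasses RocksDB's own block cache; the OS page cache is untouched, which is one reason point 1 is only partly addressed.

```rust
// Sketch: open RocksDB with the block cache disabled, so repeated reads are
// not served from RocksDB's own cache. Assumes the `rocksdb` crate; this is
// not the estimator's actual store configuration.
fn open_without_block_cache(path: &str) -> Result<rocksdb::DB, rocksdb::Error> {
    let mut block_opts = rocksdb::BlockBasedOptions::default();
    block_opts.disable_cache();

    let mut opts = rocksdb::Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);
    rocksdb::DB::open(&opts, path)
}
```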
Work items
Proposal / ideas