Simple IO cost estimation based on SSD specs #7060
Comments
My replayer tool currently parses the output described in #7053. But the idea is that with such a replayer, it is very easy to extract such statistics and more. It already allows me to analyse the trie node caches (shard and chunk). See the output it generates in one of its analysis options; it shows two receipts executed in the same block.
We can see a bunch of interesting things here.
(Just as an example; actual analysis of this data should be done in the storage-optimization context.)
Here are some statistics for DB operations per storage operation. They are based on our current estimator runs and the new instrumentation I am working on. First, the assumption is that 1 DB operation = 1 disk access. Then I assume an IO operation latency of 143 µs. This is derived from a measured 7000 IOPS on gcloud with a full NVMe SSD and iodepth = 1: inverting the 7000 IOPS gives 1/7000 s ≈ 143 µs, and since iodepth = 1 makes the accesses sequential, this should be roughly what we can expect for sequential calls to the disk.
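As a minimal sketch of that arithmetic: the 7000 IOPS figure is the measurement quoted above, while the ~1 Tgas per millisecond mapping used to turn latency into a gas figure is my own assumption here, not an estimator output.

```rust
// Back-of-the-envelope conversion from measured IOPS to a per-access latency
// and a gas figure. Assumptions: 7000 IOPS measured with iodepth = 1 on a
// gcloud NVMe SSD (as quoted above), 1 DB operation = 1 disk access, and
// roughly 1 Tgas per millisecond of compute (an assumed mapping).
fn main() {
    let measured_iops: f64 = 7_000.0; // sequential accesses, iodepth = 1
    let latency_us = 1e6 / measured_iops; // ~143 µs per disk access

    let gas_per_ms: f64 = 1e12; // ~1 Tgas per millisecond (assumption)
    let gas_per_db_op = (latency_us / 1_000.0) * gas_per_ms;

    println!("latency per IOP: {latency_us:.0} us");
    println!("gas per DB op (1 op = 1 IOP): {gas_per_db_op:.2e} gas");
}
```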
Conclusion
There are two ways of looking at this:
Instrumentation
I collected these numbers by running first
and then
It shows the total DB operations, grouped by block, as measured in the estimator. This is not perfect yet, as it requires manual effort to find the important measurements and to divide by the appropriate number, as is also done in the estimation. But for now, I just wanted to get the numbers and see if they are useful before putting too much effort into automation.
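To illustrate the kind of manual normalization this involves, here is a hypothetical post-processing step; the struct and field names are made up for illustration and do not correspond to the estimator's actual output format.

```rust
use std::collections::HashMap;

// Hypothetical shape of the per-block counts the instrumentation prints.
#[derive(Default)]
struct BlockDbOps {
    reads: u64,
    writes: u64,
}

/// Divide the raw per-block counts by the number of repetitions the estimator
/// ran, mirroring the manual "divide by the appropriate number" step above.
fn per_iteration(raw: &HashMap<u64, BlockDbOps>, iterations: u64) -> HashMap<u64, (f64, f64)> {
    raw.iter()
        .map(|(block_height, ops)| {
            let n = iterations as f64;
            (*block_height, (ops.reads as f64 / n, ops.writes as f64 / n))
        })
        .collect()
}
```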
For other storage parameters, looking at actual DB calls in traces is not useful.
Looking at it from that perspective, using the same SSD assumptions from above, we get a baseline like this:
Trie Nodes
Per Byte storage cost
Again, the baseline shows higher numbers than the current gas costs. As before, this might be okay since the DB is more efficient than doing one IOP per DB operation, especially since we haven't succeeded in creating benchmarks that are reproducibly slower to execute than what the gas costs cover. But this should motivate us to try again to find such benchmarks.
Finally, for all remaining parameters, I am running estimations now to check how many IO operations they actually cause. The only remaining work here is to pull out the numbers. I will post here when I have them.
Getting a correct and deterministic gas estimation of IO costs going will take some time. (See #5995 and #7053 + #7058 + #7059 for current efforts in that direction.)
In the meantime, we want an alternative approach that serves as a sanity-checking tool. Results only have to be approximate. We would only use it to validate that a gas parameter is roughly in line with back-of-the-envelope calculations.
One approach I want to explore is the following:
Here I would assume the implementation is "perfect" in the sense that no complicated data structures need to be traversed and the in-memory cache is large enough to fit all data for a few blocks. At the same time, assume that no pre-fetching happens, speculative or otherwise.
In other words, reading N unique values from disk is assumed to take exactly N IOPs, regardless of data locality, as long as each value is smaller than the disk block size of 4 kB; larger values require at least one IOP per 4 kB block.
This should give us a rough estimate of what performance we can expect from a client implementation. Let's do that for all parameters directly related to storage and see if we get similar gas numbers to what we have today.
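A minimal sketch of this idealized model, using the 143 µs per-IOP latency from the comment above; the function names and the example workload are made up for illustration.

```rust
const DISK_BLOCK_SIZE: u64 = 4 * 1024; // 4 kB
const IOP_LATENCY_US: f64 = 143.0; // from 7000 IOPS at iodepth = 1

/// Each unique value costs at least one IOP, plus one per additional 4 kB
/// block it spans; data locality is deliberately ignored.
fn iops_for_value(value_size: u64) -> u64 {
    ((value_size + DISK_BLOCK_SIZE - 1) / DISK_BLOCK_SIZE).max(1)
}

/// Estimated time to read a set of unique values under the "perfect
/// implementation" assumptions: no data-structure traversal, caches large
/// enough for a few blocks of data, and no speculative pre-fetching.
fn estimated_read_time_us(value_sizes: &[u64]) -> f64 {
    let total_iops: u64 = value_sizes.iter().map(|&s| iops_for_value(s)).sum();
    total_iops as f64 * IOP_LATENCY_US
}

fn main() {
    // Example workload: 100 small trie nodes (~100 B each) plus one 1 MiB value.
    let mut sizes = vec![100u64; 100];
    sizes.push(1024 * 1024);
    println!("estimated read time: {:.0} us", estimated_read_time_us(&sizes));
}
```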