Robust IO cost measurements in estimator #5995
The work that @Longarithm did on visualizing the trie could be helpful here.
Some thoughts with regard to background threads:
Yeah, I think that makes sense.
I measured RocksDB insertion performance for sequential keys with 1kB and 100kB sized values.
The cases with forced compaction include a manually triggered compaction after the insertions, adding to the total time. The local machine uses a consumer NVMe SSD and an AMD Ryzen 5 3600 6-core processor with a CPU clock range of 2200 - 4703 MHz.

Statistical considerations

Reported are the arithmetic mean (avg), minimum, and maximum measurements for a 1GB iteration. However, the distributions are definitely not Gaussian, judging by the histograms I looked at. The right tail is much longer, so take the average with a grain of salt; the median would generally be quite a bit below the average. On the other hand, there are no gaps in the distribution, i.e. all values between min and max have some chance to be observed, and the probability decreases almost monotonically towards the larger times.

Source code for rerunning

The source code of the benchmark is on a separate branch so far. Example command:

# Assumes data_dump_0001.bin has been generated and should be reused for all iterations
for run in {1..100}; do cargo run --release -p runtime-params-estimator --features required -- --home ~/.near --metric time --costs RocksDbInsertValueKByte --pr-data-path data_dump_0001.bin --debug rocksdb; done >> rocksdb_insert_value_kb.txt
awk -f summary.awk rocksdb_insert_value_kb.txt

AWK script to format output:

# summary.awk
/setup/{c=0};
/Cost/{
# Convert seconds and ms to common base
if($3~/ms/){ c = substr($3,1,length($3)-2) }
else{ c = 1000 * substr($3,1,length($3)-1) }
};
# Record the LSM tree shape per level, before (lb) and after (la) the measurement
/level/{if (c==0){lb[$5]=$1;} else {la[$5]=$1}}
/Finished/{
counts++;
sums+=c;
if(max<c) { max=c; }
if(min>c || min==0) { min=c; }
}
END {
printf "%3d | %7.3f s | %7.3f s | %7.3f s |\n", counts, sums/counts/1000, min/1000, max/1000
}

Interpretation

We shouldn't read too much into the exact numbers here, since these will vary with workload and hardware. But to gain some intuition: each second reported here corresponds to 1 microsecond per 1kB, or 1ms per 1MB. If we were to translate this to gas cost, the 33.224s measurement would lead to 3.3224 TGas per MB inserted. Now compare that to what we currently charge for 10x 100kB insertions:
Total: 31.66 TGas per MB

This is around 10 times more than my measurements showed even in the absolute worst case, which seems like a good place to be right now. But it also hints that we can maybe afford to lower storage write costs, even when considering overhead costs and including some safety margin.

Key Insights
Additional Details

What is reported in this section shouldn't be relevant for the conclusions drawn above. However, for the extra curious reader (and my own benefit, to sort my thoughts), I wrote down some more of my analysis on this topic in the spoilers below.

LSMT shape analysis

It turns out that a large part of the variance in the measurements is correlated with differences in the shape of the underlying LSM tree structure within RocksDB. Example for 100kB local, forced compaction:
The above table shows a very small difference in the shape of the tree after the setup, just one fewer file in level 2 of the LSM tree. (And an equally small difference at the end of the benchmark.) This was enough to bring the minimum in the smaller sample down to 6.2s, whereas the larger sample never finished before 9.7s. On the other hand, the average is nevertheless ~15% higher in the second sample. (Commenting on the max is difficult: because of the small sample size in the second group, a less extreme max is expected to some degree.)

There are many more different shapes when looking at the results without manual compaction. Example for 100kB local:
The inserted data has been exactly the same, for the setup and the actual measurements. Thus, for now, I can only assume these differences have to do with varying CPU and SSD timings. There is a trend that groups where the final tree has more files in L2 (e.g. 32) finish faster than those with fewer (e.g. 28). But it is not clear at all whether the different shapes are the reason for this difference in performance, or the other way around.

Looking at the same data on the Google Cloud VM, the shapes are much more consistent even without compaction. Example for 100kB gcloud, forced compaction:
Example for 100kB gcloud:
My next steps

This first step was mostly for me to learn and intuitively understand what I can expect from RocksDB performance measurements. Next, I will use what I learned here to integrate robust IO measurements into the parameter estimator.
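For reference, the measurement pattern described above (insert sequential keys, optionally force a flush and compaction, and time the whole run) can be sketched roughly as follows. This assumes the rust-rocksdb crate and uses placeholder paths and sizes; it is not the code on the benchmark branch.

```rust
// Minimal sketch of the insertion benchmark with forced compaction.
// Assumes the `rocksdb` crate; path and sizes are placeholders.
use std::time::Instant;

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = rocksdb::Options::default();
    opts.create_if_missing(true);
    let db = rocksdb::DB::open(&opts, "/tmp/rocksdb-insert-bench")?;

    // Insert ~1 GB as sequential keys with 100 kB values.
    let value = vec![0u8; 100 * 1024];
    let n: u64 = 10_000;
    let start = Instant::now();
    for i in 0..n {
        // Big-endian counters keep the keys in sequential sort order.
        db.put(i.to_be_bytes(), &value)?;
    }

    // "Forced compaction" variant: flush memtables and compact the full key
    // range, so the cleanup work is included in the measured time.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);

    println!("inserted {} values in {:?}", n, start.elapsed());
    Ok(())
}
```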
@jakmeier What about
Good point, I should have mentioned that. In this instance, all keys have fewer than 10 bytes. The associated cost is less than 10 * 70'482'867 and thus completely negligible for the purpose of this high-level comparison of raw RocksDB performance vs. current gas cost.
An interesting find regarding read performance: writes that have not been flushed & compacted negatively impact read performance. Here are two plots produced using the script rocksdb_read_benchmark.sh: rocksdb_read_benchmark.pdf

Clearly, forcing a flush and compaction before the read measurement starts is beneficial. In other words, cleaning up the RocksDB structure does measurably affect read performance. Which raises the question: is this overhead, caused by earlier writes, something that should be accounted for in the gas cost for reads? For now, the new estimations But the discussion on how we want to treat storage costs in the future is ongoing.

Personally, I think we should enforce compaction at a certain interval (e.g. after every block), add the compaction cost to the write gas cost, and assume reads are always on a clean state. This should be more stable and reliable than what we have today.
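To make the comparison concrete: the two variants differ only in whether the store is cleaned up before timing the reads. A minimal sketch of that idea, assuming the rust-rocksdb crate (the actual measurements come from rocksdb_read_benchmark.sh, not this code):

```rust
// Sketch of a read measurement on a "clean" RocksDB state.
// Assumes the `rocksdb` crate and that keys were written earlier as
// big-endian integers; not the actual benchmark script.
use std::time::Instant;

fn main() -> Result<(), rocksdb::Error> {
    let db = rocksdb::DB::open_default("/tmp/rocksdb-read-bench")?;

    // Clean-state variant: flush memtables and compact the full key range
    // before the read measurement starts. Skip these two calls to measure
    // reads on top of unflushed/uncompacted writes instead.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);

    let n: u64 = 10_000;
    let start = Instant::now();
    for i in 0..n {
        let _ = db.get(i.to_be_bytes())?;
    }
    println!("read {} values in {:?}", n, start.elapsed());
    Ok(())
}
```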
@jakmeier FYI, there is some plan to have a separate hashtable-based storage for reads that would allow for better performance.
Another issue I discovered in the last few weeks is that the estimator currently cannot run on large DB stores. If I insert 8 million accounts, the OOM killer kicks in on a machine with 16GB RAM. I think we load the entire DB into memory for the test setup, which seems generally bad for storage estimation.
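As an illustration of the alternative (not a concrete proposal for nearcore's state dump format), the setup could stream records into RocksDB in bounded batches instead of materializing everything in memory first. The record iterator and the batch size below are hypothetical; only the batching pattern is the point. Sketch assuming the rust-rocksdb crate:

```rust
// Sketch of a streaming store setup, assuming the `rocksdb` crate.
// `records` and its (key, value) format are hypothetical; the point is only
// that data is written in bounded batches instead of being held in memory.
fn populate_store(
    db: &rocksdb::DB,
    records: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
) -> Result<(), rocksdb::Error> {
    let mut batch = rocksdb::WriteBatch::default();
    let mut batch_bytes = 0usize;
    for (key, value) in records {
        batch_bytes += key.len() + value.len();
        batch.put(&key, &value);
        // Write out the batch every ~64 MB so memory use stays bounded.
        if batch_bytes >= 64 * 1024 * 1024 {
            db.write(std::mem::take(&mut batch))?;
            batch_bytes = 0;
        }
    }
    db.write(batch)?;
    Ok(())
}
```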
Indeed. Could we fix that?
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
I looked into this a bit more today. With a large DB, creating a new store from a state dump often takes the majority of the time for estimations. Unfortunately, my attempts to optimize this have failed so far.
In conclusion, it would ultimately be better to avoid the need for huge databases in the estimator. Creating huge DBs has not reliably produced worst-case access times anyway. For example, #4771 shows that even small amounts of data can produce really bad cases, far worse than what a DB with 50M accounts produces. My current plan is to either find specific bad cases with small data, or otherwise rely on counting DB operations and assuming fixed latencies. The next step forward for this is to implement #7058.
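To spell out what "counting DB operations and assuming fixed latencies" could look like, here is a simplified sketch (not the design of #7058; the latency numbers are made-up placeholders):

```rust
// Simplified sketch of estimating IO cost from operation counts.
// The latency constants in main() are made-up placeholders, not measurements.
#[derive(Default)]
struct DbOpCounter {
    reads: u64,
    writes: u64,
}

impl DbOpCounter {
    fn record_read(&mut self) {
        self.reads += 1;
    }
    fn record_write(&mut self) {
        self.writes += 1;
    }
    /// Translate counted operations into an estimated time, assuming a fixed
    /// latency per read and per write.
    fn estimated_nanos(&self, read_latency_ns: u64, write_latency_ns: u64) -> u64 {
        self.reads * read_latency_ns + self.writes * write_latency_ns
    }
}

fn main() {
    let mut counter = DbOpCounter::default();
    // In the estimator, the storage layer would increment these counters.
    counter.record_read();
    counter.record_read();
    counter.record_write();
    // Placeholder latencies: 20 µs per read, 50 µs per write.
    println!("estimated IO time: {} ns", counter.estimated_nanos(20_000, 50_000));
}
```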
I am happy to declare that I have finally reached the point where "all worst-case assumptions" proves to be too pessimistic. Combining these worst known cases leads to a processing time of more than 1s for an empty action receipt. Clearly, this is going too far: there is no way this could be reproduced in production, and yet in the estimator we can reproduce it. See #4771.

I think the way forward is to reconsider what a reasonably pessimistic assumption is. I would go with enabled caches, an (almost) fully compacted RocksDB, and only selective cache clearing where it makes sense. I will post more updates once I have had the chance to play around with these ideas.
We have to make sure our IO measurements for the parameter estimator are:
To get there, I think I need to first understand our storage layers better and we should probably reflect on what the goal of the measurement is.
Todos
Background
When running the parameter estimator inside QEMU with the ICount metric, it reports IO bytes read and written. This is done by intercepting syscalls, which should give the most accurate picture of how much actual IO is triggered.
However, there are a number of challenges with this approach. Simply looking at the counters before and after the execution of some code does not give an accurate picture. There are at least 3 effects we should take into account.
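As a side note on the raw counter-delta pattern itself: outside of QEMU, a similar picture can be obtained from Linux's /proc/self/io, and the same caveats apply. This is not how the ICount metric works internally; it is only a minimal sketch of the before/after measurement discussed here, in Rust:

```rust
// Sketch of the naive before/after counter pattern, using Linux's
// /proc/self/io instead of the QEMU syscall interception the estimator uses.
// This only illustrates why point-in-time counter deltas can mislead.
use std::fs;

fn io_bytes() -> (u64, u64) {
    // /proc/self/io contains lines like "read_bytes: 123" and "write_bytes: 456".
    let stats = fs::read_to_string("/proc/self/io").expect("only available on Linux");
    let field = |name: &str| -> u64 {
        stats
            .lines()
            .find(|l| l.starts_with(name))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|v| v.parse().ok())
            .unwrap_or(0)
    };
    (field("read_bytes:"), field("write_bytes:"))
}

fn main() {
    let (read_before, write_before) = io_bytes();
    // ... run the code under measurement here ...
    let (read_after, write_after) = io_bytes();
    println!(
        "read: {} B, written: {} B",
        read_after - read_before,
        write_after - write_before
    );
}
```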
Prior Status Quo
Our IO measurements already disable (some) caches in an attempt to address point 1. Point 2 is somewhat addressed by using the same state every time, which at least makes the estimations consistent. Point 3 is not well understood right now, especially in the context of the estimator (at least by me).
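For what it's worth, "disable (some) caches" at the RocksDB level boils down to something like the sketch below, assuming the rust-rocksdb crate; the estimator's actual store setup lives in near-store and may differ. Note that this only bypasses RocksDB's own block cache; the OS page cache is untouched, which is one reason point 1 is only partly addressed.

```rust
// Sketch: open RocksDB with the block cache disabled, so repeated reads are
// not served from RocksDB's own cache. Assumes the `rocksdb` crate; this is
// not the estimator's actual store configuration.
fn open_without_block_cache(path: &str) -> Result<rocksdb::DB, rocksdb::Error> {
    let mut block_opts = rocksdb::BlockBasedOptions::default();
    block_opts.disable_cache();

    let mut opts = rocksdb::Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);
    rocksdb::DB::open(&opts, path)
}
```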
Work items
Proposal / ideas