
[Do not merge] Reduce cache delay by borsh #3847

Closed · wants to merge 8 commits

Conversation

@ailisp (Member) commented Jan 23, 2021

DO NOT MERGE: the disk cache is being reverted in the upcoming release.

This PR reduces deserialize_wasmer significantly, from 23ms to 10.5ms, on an e2-standard-n2 instance.
deserialize_wasmer is the major bottleneck in https://github.com/near/nearcore/security/advisories/GHSA-q39c-4p3r-qpv3.

Test Plan

Deploy a sufficiently large contract, such as rainbow-bridge's eth_client.wasm, and call the contract multiple times. Add a delay detector with the delay threshold set to 5ms instead of 50ms, and observe the time spent in deserialize_wasmer. (A minimal sketch of what such a detector does follows.)
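
For illustration, here is a minimal sketch of what such a delay detector can look like. nearcore's actual implementation lives in its delay_detector crate; this type, its API, and the message format are assumptions for illustration, not the real code:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for nearcore's delay detector: time a scope and
// log only when it exceeds the configured threshold (5ms per the test plan).
struct DelayDetector {
    label: &'static str,
    threshold: Duration,
    start: Instant,
}

impl DelayDetector {
    fn new(label: &'static str, threshold: Duration) -> Self {
        Self { label, threshold, start: Instant::now() }
    }
}

impl Drop for DelayDetector {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed();
        if elapsed > self.threshold {
            eprintln!("INFO delay_detector: Took {:?} processing {}", elapsed, self.label);
        }
    }
}

fn main() {
    let _d = DelayDetector::new("deserialize_wasmer get cache", Duration::from_millis(5));
    // ... work being measured; the detector logs on drop if it was too slow ...
}
```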

@bowenwang1996 (Collaborator)

185ms still feels long to me. This means we cannot fit more than 5 calls to contracts of ~400 KB in one block, which is quite suboptimal. cc @olonho

@bowenwang1996 (Collaborator)

> This was ~466ms on my machine

Did you use --release? 466ms feels longer than the time it takes to compile the contract.

@ailisp (Member Author) commented Jan 25, 2021

> This was ~466ms on my machine
>
> Did you use --release? 466ms feels longer than the time it takes to compile the contract.

No. Trying now.

@ailisp (Member Author) commented Jan 25, 2021

@bowenwang1996 The difference in the debug build was significant (466ms vs 185ms); however, it is not a big difference in the release build. I did a clean build and ran each build a few times to make sure I didn't build and run the incorrect version.

before:
Jan 25 14:28:13.501 INFO delay_detector: Took 16.466911ms processing in deserialize_wasmer get cache

after:
Jan 25 13:27:17.814 INFO delay_detector: Took 12.179094ms processing in deserialize_wasmer get cache

The way I tested it: obtain eth_client.wasm from the rainbow-bridge repo, then

export NEAR_ENV=local
near deploy test.near ~/eth_client.wasm
near call test.near initialized --accountId test.near --args ''
near call test.near initialized --accountId test.near --args ''
near call test.near initialized --accountId test.near --args '' # a few times

@ailisp (Member Author) commented Jan 25, 2021

@bowenwang1996 which machine and contract did you use to get:

Jan 16 02:58:27.875  INFO delay_detector: Took 128.962978ms processing deserialize_wasmer

@ailisp (Member Author) commented Jan 25, 2021

Oh, I see you also have this one:

Jan 16 02:58:27.554  INFO delay_detector: Took 54.682015ms processing deserialize_wasmer

So that's very volatile; we should probably rely on #3841 to measure. So far my results show the debug build gains performance, but in the release build the increase is trivial and probably won't fix the long cache delay issue. Also, on my machine deserialize_wasmer is quite fast in the release build both before and after this PR. Maybe we should test on the cloud? Which machine can I play with?

@ailisp (Member Author) commented Jan 26, 2021

I tried on a cloud instance (probably faster than our prod nodes, but slower than my local machine); again, the performance increase is minor for the release build.

Before: 38ms, 23ms, 21ms
This PR: 20ms, 18ms, 18ms

@bowenwang1996 (Collaborator)

> Maybe we should test on the cloud? Which machine can I play with?

The machine I used is e2-standard-4 on gcloud.

@bowenwang1996 (Collaborator)

@ailisp you might be hitting the RocksDB cache. Can you try disabling the cache for that column? I think setting this line to 0 should do it:

let cache_size = 1024 * 1024 * 32;
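
A hedged sketch of what that change amounts to, using block-cache options with method names as in recent rust-rocksdb releases; the function and the zero-means-disabled convention are illustrative assumptions, not nearcore's actual store code:

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Illustrative only: build column-family options with the block cache either
// sized or fully disabled, so reads for that column always go to disk.
fn col_options(cache_size: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    if cache_size == 0 {
        block_opts.disable_cache(); // cache_size = 0: bypass the block cache
    } else {
        block_opts.set_block_cache(&Cache::new_lru_cache(cache_size));
    }
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```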

@ailisp (Member Author) commented Jan 26, 2021

@bowenwang1996 cache=0 doesn't help; the numbers are almost the same as with cache != 0: 16/14/14ms before, 13/11/12ms with this PR.

@ailisp (Member Author) commented Jan 26, 2021

Just synced with Max. He suggested that for a complex structure, borsh vs bincode won't save much time; a simpler structure does. We also identified that the BTreeMap may be the bottleneck of deserialization (to be confirmed): Rust's std BTreeMap has no "create from sorted vector" constructor, so entries are inserted one by one, which makes deserializing it slow (see the sketch after this list). Max said wasmer may not actually need a BTreeMap; a sorted vec/hashmap/heap may work. Max's approach would need more investigation and optimization on the wasmer side, so I plan to first:

  • identify the borsh deserialization bottleneck in wasmer's CacheImage
  • if it's really the BTreeMap, see whether it can be accelerated with a third-party btreemap that has a "create from sorted vector" method
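
To illustrate the suspected bottleneck, this is the shape of what a borsh-style map deserializer has to do today (illustrative code, not borsh's actual implementation): with no bulk constructor on std's BTreeMap, each entry costs a tree insertion even though the decoded input is already in key order.

```rust
use std::collections::BTreeMap;

// Illustrative: decode pairs, then insert one by one. Building the map this
// way is O(n log n); a bulk build from already-sorted input could be O(n).
fn deserialize_map(pairs: Vec<(u32, u64)>) -> BTreeMap<u32, u64> {
    let mut map = BTreeMap::new();
    for (k, v) in pairs {
        map.insert(k, v); // each insert re-walks the tree
    }
    map
}
```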

@ailisp (Member Author) commented Jan 27, 2021

The observations from Max and me are correct: the majority of the time is spent deserializing the HashMap and BTreeMap in wasmer's CacheImage. If we replace every HashMap and BTreeMap<X, Y> with a Vec<(X, Y)> (they are serialized the same way in borsh), a release build deserializes CacheImage in 3.36ms instead of 10ms. I would expect a similar percentage gain on our server. So borsh's deserialization of HashMap and BTreeMap is too slow: it first deserializes a vector and then inserts elements one by one. Given that Rust's std HashMap/BTreeMap has no batch insert/initialization, we should consider replacing them with a more performant alternative, or some operation on raw C pointers like hash_map.to_raw_bytes() / unsafe hash_map.from_raw_bytes().
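
A minimal sketch of why the Vec<(X, Y)> swap is byte-compatible, using the borsh 0.x API (try_to_vec / try_from_slice): borsh encodes a map as a u32 length followed by its (key, value) pairs in key order, which is exactly the encoding of a Vec of the same sorted pairs.

```rust
use borsh::{BorshDeserialize, BorshSerialize};
use std::collections::BTreeMap;

fn main() {
    let mut map = BTreeMap::new();
    map.insert(1u32, 10u64);
    map.insert(2u32, 20u64);

    // Serialize the map, then decode the same bytes as the cheaper type,
    // skipping the per-entry tree insertions entirely.
    let bytes = map.try_to_vec().unwrap();
    let pairs: Vec<(u32, u64)> = BorshDeserialize::try_from_slice(&bytes).unwrap();
    assert_eq!(pairs, vec![(1, 10), (2, 20)]);
}
```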

@MaksymZavershynskyi (Contributor)

Good job investigating. I don't think we can batch initialize hash_map with raw data, since it probably does not occupy a contiguous area in memory.

@ailisp (Member Author) commented Jan 28, 2021

> Good job investigating. I don't think we can batch initialize hash_map with raw data, since it probably does not occupy a contiguous area in memory.

I looked further into this: HashMap (see its Clone implementation) does use contiguous memory as long as K and V are Copy, which is the case here. BTreeMap doesn't, but batch insertion from a sorted array would still reduce the time from O(n log n) to O(n).

@evgenykuzyakov suggested a faster fix, both in execution time and in effort: don't serialize that hashmap at all. (This is okay because we don't need to know the kind of wasmer exception; we always return a WasmerUnknownError in that case.) It brings the entire deserialize_wasmer from ~15ms down to ~6ms, which is probably good enough, so this PR is done.
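
A hedged sketch of that fix, assuming borsh 0.x's #[borsh_skip] field attribute; the struct and field names are hypothetical, not wasmer's actual CacheImage layout:

```rust
use borsh::{BorshDeserialize, BorshSerialize};
use std::collections::HashMap;

// Hypothetical layout: the exception map is neither written nor read; on
// deserialization it falls back to Default (an empty map), and any miss
// is reported to the caller as a generic WasmerUnknownError.
#[derive(BorshSerialize, BorshDeserialize, Default)]
struct CacheImageSketch {
    compiled_code: Vec<u8>,
    #[borsh_skip]
    exception_table: HashMap<u32, u32>,
}
```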

@ailisp ailisp changed the title Reduce cache delay by borsh [Do not merge] Reduce cache delay by borsh Jan 28, 2021
@ailisp ailisp marked this pull request as ready for review January 28, 2021 22:55
stale bot commented Jul 1, 2021

This PR has been automatically marked as stale because it has not had recent activity in the last 2 weeks.
It will be closed in 3 days if no further activity occurs.
Thank you for your contributions.

@stale stale bot added the S-stale label Jul 1, 2021
@bowenwang1996 (Collaborator)

@ailisp what is the status of this PR?

@stale stale bot removed the S-stale label Jul 1, 2021
@ailisp (Member Author) commented Jul 5, 2021

@bowenwang1996 superseded by #4448, which adds backward-compatible changes.

@ailisp ailisp closed this Jul 5, 2021
@ailisp ailisp deleted the debug-cache-delay branch July 5, 2021 01:54
near-bulldozer bot pushed a commit that referenced this pull request Jul 26, 2021
Use borsh to deserialize the wasmer 0.x cache, which saves about 30% of the time on several of wasmer's test contract benchmarks and 80% on the near evm contract; other contract benchmarks show a 25%-50% improvement on a cloud instance: #3847. This uses a pre-release wasmer-near; it will be replaced with a formally released wasmer-near after it's released.

Test Plan
---------
- CI
- wasmer's bench code, which covers the code path that serializes/deserializes the cache with borsh
- vm-standalone test evm_slow_deserialize_repro, run with the no_cache feature (i.e. no in-memory cache), which does a full serialize to the disk cache, then a load, a borsh deserialize, and a contract call