
Reason for using Mutex instead of RWMutex in querier #6048

Closed
yun-yeo opened this issue Feb 4, 2021 · 14 comments
Labels
T:perf Type: Performance T:question Type: Question

Comments

@yun-yeo

yun-yeo commented Feb 4, 2021

Is there any reason to use a plain Mutex instead of an RWMutex?

Queries would be much faster with an RWMutex (read lock on queries, write lock everywhere else):

mtx *tmsync.Mutex

If we can ensure these queries do not perform any write operations, we can change them to take a read lock and support concurrent queries.
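
For illustration, here is a minimal sketch of the proposed scheme, assuming a simplified wrapper around the ABCI application: read-only calls share a read lock, state-mutating calls take the write lock. The Application interface and lockedApp type below are hypothetical placeholders, not Tendermint's actual local client.

package abciwrap

import "sync"

// Application is a hypothetical, simplified stand-in for the ABCI application
// that the querier's mutex protects.
type Application interface {
	Query(path string, data []byte) []byte
	DeliverTx(tx []byte)
}

// lockedApp serializes access to the application with an RWMutex:
// read-only calls share the read lock, mutating calls take the write lock.
type lockedApp struct {
	mtx sync.RWMutex
	app Application
}

// Query only reads state, so concurrent queries can proceed in parallel.
func (l *lockedApp) Query(path string, data []byte) []byte {
	l.mtx.RLock()
	defer l.mtx.RUnlock()
	return l.app.Query(path, data)
}

// DeliverTx mutates state, so it still needs exclusive access.
func (l *lockedApp) DeliverTx(tx []byte) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	l.app.DeliverTx(tx)
}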

@melekes
Contributor

melekes commented Feb 8, 2021

No reason. Do you want to make some benchmarks and show us the difference?

@melekes melekes added T:perf Type: Performance T:question Type: Question labels Feb 8, 2021
@hanjukim

hanjukim commented Feb 8, 2021

We tested this with a large number of simultaneous valid queries (ones that return results) while syncing blocks. With Mutex, one read query blocks all other queries. RWMutex mitigates this by allowing multiple reads at the same time. Using RWMutex also lets the program utilize more CPU, which is good for cost efficiency: you can serve more clients with the same machine.

With Mutex, 16% (0.49/3.02) of the total time is spent in Wait:
https://gist.github.com/hanjukim/644842107beba4a619b3f56f2c9a62c8/raw/659d50a42068cfa107e10724c69487043a0e7cf4/pprof_block_Mutex.svg

With RWMutex:
https://gist.githubusercontent.com/hanjukim/644842107beba4a619b3f56f2c9a62c8/raw/659d50a42068cfa107e10724c69487043a0e7cf4/pprof_block_RWMutex.svg
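
For anyone who wants to reproduce this kind of measurement, a block profile like the ones above can be collected by enabling Go's block profiler and exposing the standard net/http/pprof endpoints. This is a generic sketch, not the exact setup used to produce the gists; the listen address is arbitrary.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every blocking event (mutex/RWMutex waits, channel waits, ...).
	runtime.SetBlockProfileRate(1)

	// The block profile can then be rendered as SVG with:
	//   go tool pprof -svg http://localhost:6060/debug/pprof/block
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}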

@melekes
Contributor

melekes commented Feb 8, 2021

Cool! Do you want to submit a PR?

@hanjukim

hanjukim commented Feb 9, 2021

Cool! Do you want to submit a PR?

We will make a PR once it has proven stable in real use.

@RiccardoM
Contributor

@hanjukim Do you have any update on this? It looks like this is now causing significant slowness across the entire Cosmos Hub as well. If you don't have time, I can open a PR implementing your solution.

@melekes melekes self-assigned this Feb 23, 2021
@melekes
Contributor

melekes commented Feb 23, 2021

It may be possible to completely remove mtx from Info and Query calls.

@melekes melekes removed their assignment Feb 23, 2021
@melekes melekes mentioned this issue Feb 23, 2021
@alexanderbez
Contributor

alexanderbez commented Feb 23, 2021

Curious why Info and Query need to be serialized to begin with?

@hanjukim

hanjukim commented Feb 24, 2021

@hanjukim Do you have any update on this? It looks like this is now causing significant slowness across the entire Cosmos Hub as well. If you don't have time, I can open a PR implementing your solution.

Yes, we have some progress here.

  1. We've successfully applied RWMutex to our public nodes, but we had to make some tweaks so that the cosmwasm module takes the write lock. You can see the changes in https://github.com/terra-project/tendermint/blob/rw-lock3/abci/client/local_client.go
  2. Switching the Mutex brought some improvement. However:
  3. You can only have one iterator at a time when accessing LevelDB, which means access has to be synchronous, and the iterator blocks everything else until it finishes (see the sketch after this list).
  4. There is a major blocking function in the staking module of the Cosmos SDK: https://github.com/terra-project/cosmos-sdk/blob/v0.39.2/x/staking/keeper/delegation.go#L52-L65
  5. Since changes to the KV store change the state, we couldn't find an easy way to make this compatible without a hard/soft fork (i.e. making it possible to query with a height parameter).
  6. We've decided to disable the blocking endpoints on our public nodes until we have a solution.
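
Regarding point 3, here is a hedged sketch of why an iterator-backed query undermines the read-lock approach: the iterator has to stay valid for the whole loop, so the lock is held, exclusively if the iterator cannot tolerate concurrent writes, until the iteration finishes. The kvStore and Iterator interfaces below are hypothetical stand-ins, not the actual LevelDB/IAVL wrappers.

package storewrap

import "sync"

// Iterator and kvStore are hypothetical stand-ins for the underlying store;
// assume the iterator is not safe to use concurrently with writes.
type Iterator interface {
	Valid() bool
	Next()
	Value() []byte
	Close()
}

type kvStore interface {
	Iterator(start, end []byte) Iterator
}

// collectDelegations mimics an iterator-backed query such as the staking
// module's delegation lookup: the lock is held for the entire iteration,
// so every other query is blocked until it finishes.
func collectDelegations(mtx *sync.RWMutex, store kvStore, start, end []byte) [][]byte {
	mtx.Lock() // exclusive: the iterator must not observe concurrent writes
	defer mtx.Unlock()

	var out [][]byte
	it := store.Iterator(start, end)
	defer it.Close()
	for ; it.Valid(); it.Next() {
		out = append(out, it.Value())
	}
	return out
}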

@hanjukim

hanjukim commented Feb 24, 2021

One more thing:

We have a special project called mantle-sdk, together with mantle, which wraps Tendermint and the Cosmos SDK and serves data via GraphQL. We tried hitting the same blocking path, custom/staking/validatorDelegations, against the Mantle server, and the result is striking:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.18ms   48.73ms 519.16ms   96.84%
    Req/Sec     1.24k   176.31     1.54k    84.50%
  49384 requests in 10.03s, 45.26MB read
Requests/sec:   4923.65

We are guessing from this result that the problem is probably related to the IAVL tree (a huge amount of rebalancing between blocks?).

@alexanderbez
Contributor

Yeah sounds like it!

@kjessec

kjessec commented Feb 26, 2021

We found that this could potentially cause a Cosmos application to panic.

https://github.com/cosmos/cosmos-sdk/blob/v0.39.1/x/staking/keeper/validator.go#L45

In the GetValidator function, the cosmos-sdk caches validators in a map, and, as we might already know, concurrent writes to a map in Go lead to a panic.

Technically this is not related to Tendermint itself, but given that most applications use the Tendermint + Cosmos stack, we should be careful before merging this.

FYI: the map cache mentioned above has been removed on the current master of cosmos-sdk (Stargate) -- so this may be safe in the future?
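
As an aside, the hazard described above is easy to demonstrate in plain Go: unsynchronized concurrent writes to a map crash the runtime with "concurrent map writes", so any cache that suddenly sits on a concurrent query path needs its own guard. A minimal, illustrative example, not SDK code:

package main

import "sync"

// validatorCache illustrates the hazard: a bare map written from multiple
// goroutines panics, so a cache on a now-concurrent query path needs its own
// synchronization (an RWMutex here; sync.Map is another option).
type validatorCache struct {
	mtx  sync.RWMutex
	vals map[string][]byte
}

func (c *validatorCache) Get(addr string) ([]byte, bool) {
	c.mtx.RLock()
	defer c.mtx.RUnlock()
	v, ok := c.vals[addr]
	return v, ok
}

func (c *validatorCache) Set(addr string, v []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.vals[addr] = v
}

func main() {
	c := &validatorCache{vals: make(map[string][]byte)}
	var wg sync.WaitGroup
	// With the RWMutex in place this is safe; writing to an unguarded map
	// from both goroutines would instead crash with "concurrent map writes".
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Set("validator", []byte{byte(i)})
				c.Get("validator")
			}
		}(i)
	}
	wg.Wait()
}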

@RiccardoM
Contributor

RiccardoM commented Feb 26, 2021

FYI: the map cache mentioned above has been removed on the current master of cosmos-sdk (Stargate) -- so this may be safe in the future?

I can confirm that this has been removed with cosmos/cosmos-sdk#8546, as it was making gRPC queries crash and was a lot slower anyway (cosmos/cosmos-sdk#8545). I don't think it will ever come back, since it was a tricky hack to avoid high deserialization costs due to Amino (which is now gone). For this reason, I think it shouldn't be considered an issue anymore.

@reuvenpo

reuvenpo commented Mar 25, 2021

Has there been any progress on this issue since last month?
I'd love to see this work merged and propagated upstream to the next Cosmos SDK version if the improvements are so significant :)
If there's any specific work I can do to help push this forward, please let me know how I can contribute.

@tychoish
Contributor

tychoish commented Apr 5, 2021

See this commit: 1c4dbe3

I believe this is covered by the above change (sorry for not annotating the commit message appropriately!). I think the current view is that while changing the locking strategy will help relieve some contention, a lot of the performance bottlenecks are rooted deeper in the client implementations (e.g. IAVL related), which changing the locks won't help with.

I think it makes sense to close this issue given that the change has landed, but let me know if I've missed something or if it makes sense to keep this open for another reason.
