
Reason for using Mutex instead of RWMutex in querier #6048

Closed
yun-yeo opened this issue Feb 4, 2021 · 14 comments
Labels
T:perf Type: Performance T:question Type: Question

Comments

@yun-yeo

yun-yeo commented Feb 4, 2021

Is there any reason to use a plain Mutex instead of an RWMutex?

Queries would be much faster with an RWMutex (read lock on queries, write lock everywhere else):

mtx *tmsync.Mutex

If we can ensure these queries do not perform any write operations, we can change them to take a read lock and support concurrent queries.
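
For illustration, here is a minimal sketch of the proposed scheme, assuming a simplified wrapper around the ABCI application: read-only calls share a read lock, state-mutating calls take the write lock. The Application interface and lockedApp type below are hypothetical placeholders, not Tendermint's actual local client.

package abciwrap

import "sync"

// Application is a hypothetical, simplified stand-in for the ABCI application
// that the querier's mutex protects.
type Application interface {
	Query(path string, data []byte) []byte
	DeliverTx(tx []byte)
}

// lockedApp serializes access to the application with an RWMutex:
// read-only calls share the read lock, mutating calls take the write lock.
type lockedApp struct {
	mtx sync.RWMutex
	app Application
}

// Query only reads state, so concurrent queries can proceed in parallel.
func (l *lockedApp) Query(path string, data []byte) []byte {
	l.mtx.RLock()
	defer l.mtx.RUnlock()
	return l.app.Query(path, data)
}

// DeliverTx mutates state, so it still needs exclusive access.
func (l *lockedApp) DeliverTx(tx []byte) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	l.app.DeliverTx(tx)
}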

@melekes
Contributor

melekes commented Feb 8, 2021

No reason. Do you want to make some benchmarks and show us the difference?

@melekes melekes added T:perf Type: Performance T:question Type: Question labels Feb 8, 2021
@hanjukim

hanjukim commented Feb 8, 2021

We tested this with a large number of simultaneous valid queries (ones that return results) while syncing blocks. With Mutex, one read query blocks all other queries. RWMutex mitigates this by allowing multiple reads at the same time. Using RWMutex also lets the program utilize more CPU, which is good for cost efficiency: you can serve more clients with the same machine.

With Mutex, 16% (0.49/3.02) of the total time is spent in Wait:
https://gist.github.com/hanjukim/644842107beba4a619b3f56f2c9a62c8/raw/659d50a42068cfa107e10724c69487043a0e7cf4/pprof_block_Mutex.svg

With RWMutex:
https://gist.githubusercontent.com/hanjukim/644842107beba4a619b3f56f2c9a62c8/raw/659d50a42068cfa107e10724c69487043a0e7cf4/pprof_block_RWMutex.svg
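
For anyone who wants to reproduce this kind of measurement, a block profile like the ones above can be collected by enabling Go's block profiler and exposing the standard net/http/pprof endpoints. This is a generic sketch, not the exact setup used to produce the gists; the listen address is arbitrary.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every blocking event (mutex/RWMutex waits, channel waits, ...).
	runtime.SetBlockProfileRate(1)

	// The block profile can then be rendered as SVG with:
	//   go tool pprof -svg http://localhost:6060/debug/pprof/block
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}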

@melekes
Contributor

melekes commented Feb 8, 2021

Cool! Do you want to submit a PR?

@hanjukim

hanjukim commented Feb 9, 2021

Cool! Do you want to submit a PR?

We will make a PR once it has proven stable in real use.

@RiccardoM
Contributor

@hanjukim Do you have any update on this? It looks like this is now causing significant slowness across the entire Cosmos Hub as well. If you don't have time, I can open a PR implementing your solution.

@melekes melekes self-assigned this Feb 23, 2021
@melekes
Contributor

melekes commented Feb 23, 2021

It may be possible to completely remove mtx from Info and Query calls.

@melekes melekes removed their assignment Feb 23, 2021
@melekes melekes mentioned this issue Feb 23, 2021
@alexanderbez
Contributor

alexanderbez commented Feb 23, 2021

Curious why Info and Query need to be serialized to begin with?

@hanjukim

hanjukim commented Feb 24, 2021

@hanjukim Do you have any update on this? It looks like this is now causing significant slowness across the entire Cosmos Hub as well. If you don't have time, I can open a PR implementing your solution.

Yes, we have some progress here.

  1. We've successfully applied RWMutex to our public nodes, but we had to make some tweaks so that the cosmwasm module takes the write lock. You can see the changes in https://github.com/terra-project/tendermint/blob/rw-lock3/abci/client/local_client.go
  2. Switching the Mutex brought some improvement. However:
  3. You can only have one iterator at a time when accessing LevelDB, which means access has to be synchronous, and the iterator blocks everything else until it finishes (see the sketch after this list).
  4. There is a major blocking function in the staking module of the Cosmos SDK: https://github.com/terra-project/cosmos-sdk/blob/v0.39.2/x/staking/keeper/delegation.go#L52-L65
  5. Since changes to the KV store change the state, we couldn't find an easy way to make this compatible without a hard/soft fork (i.e. making it possible to query with a height parameter).
  6. We've decided to disable the blocking endpoints on our public nodes until we have a solution.
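
Regarding point 3, here is a hedged sketch of why an iterator-backed query undermines the read-lock approach: the iterator has to stay valid for the whole loop, so the lock is held, exclusively if the iterator cannot tolerate concurrent writes, until the iteration finishes. The kvStore and Iterator interfaces below are hypothetical stand-ins, not the actual LevelDB/IAVL wrappers.

package storewrap

import "sync"

// Iterator and kvStore are hypothetical stand-ins for the underlying store;
// assume the iterator is not safe to use concurrently with writes.
type Iterator interface {
	Valid() bool
	Next()
	Value() []byte
	Close()
}

type kvStore interface {
	Iterator(start, end []byte) Iterator
}

// collectDelegations mimics an iterator-backed query such as the staking
// module's delegation lookup: the lock is held for the entire iteration,
// so every other query is blocked until it finishes.
func collectDelegations(mtx *sync.RWMutex, store kvStore, start, end []byte) [][]byte {
	mtx.Lock() // exclusive: the iterator must not observe concurrent writes
	defer mtx.Unlock()

	var out [][]byte
	it := store.Iterator(start, end)
	defer it.Close()
	for ; it.Valid(); it.Next() {
		out = append(out, it.Value())
	}
	return out
}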

@hanjukim

hanjukim commented Feb 24, 2021

One more thing:

We have a special project called mantle-sdk, together with mantle, which wraps Tendermint and the Cosmos SDK and serves data via GraphQL. We tried hitting the same blocking path, custom/staking/validatorDelegations, against the Mantle server, and the result is striking:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.18ms   48.73ms 519.16ms   96.84%
    Req/Sec     1.24k   176.31     1.54k    84.50%
  49384 requests in 10.03s, 45.26MB read
Requests/sec:   4923.65

We are guessing from this result that the problem is probably related to the IAVL tree (a huge amount of rebalancing between blocks?).

@alexanderbez
Contributor

Yeah sounds like it!

@kjessec

kjessec commented Feb 26, 2021

We found that this could potentially cause a Cosmos application to panic.

https://github.com/cosmos/cosmos-sdk/blob/v0.39.1/x/staking/keeper/validator.go#L45

In the GetValidator function, the cosmos-sdk caches validators in a map, and, as we might already know, concurrent writes to a map in Go lead to a panic.

Technically this is not related to Tendermint itself, but given that most applications use the Tendermint + Cosmos stack, we should be careful before merging this.

FYI: the map cache mentioned above has been removed on the current master of cosmos-sdk (Stargate) -- so this may be safe in the future?
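
As an aside, the hazard described above is easy to demonstrate in plain Go: unsynchronized concurrent writes to a map crash the runtime with "concurrent map writes", so any cache that suddenly sits on a concurrent query path needs its own guard. A minimal, illustrative example, not SDK code:

package main

import "sync"

// validatorCache illustrates the hazard: a bare map written from multiple
// goroutines panics, so a cache on a now-concurrent query path needs its own
// synchronization (an RWMutex here; sync.Map is another option).
type validatorCache struct {
	mtx  sync.RWMutex
	vals map[string][]byte
}

func (c *validatorCache) Get(addr string) ([]byte, bool) {
	c.mtx.RLock()
	defer c.mtx.RUnlock()
	v, ok := c.vals[addr]
	return v, ok
}

func (c *validatorCache) Set(addr string, v []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.vals[addr] = v
}

func main() {
	c := &validatorCache{vals: make(map[string][]byte)}
	var wg sync.WaitGroup
	// With the RWMutex in place this is safe; writing to an unguarded map
	// from both goroutines would instead crash with "concurrent map writes".
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Set("validator", []byte{byte(i)})
				c.Get("validator")
			}
		}(i)
	}
	wg.Wait()
}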

@RiccardoM
Contributor

RiccardoM commented Feb 26, 2021

FYI: the map cache mentioned above has been removed on the current master of cosmos-sdk (Stargate) -- so this may be safe in the future?

I can confirm that this has been removed with cosmos/cosmos-sdk#8546, as it was making gRPC queries crash and was a lot slower anyway (cosmos/cosmos-sdk#8545). I don't think it will ever come back, since it was a tricky hack to avoid high deserialization costs due to Amino (which is now gone). For this reason, I think it shouldn't be considered an issue anymore.

@reuvenpo

reuvenpo commented Mar 25, 2021

Has there been any progress on this issue since last month?
I'd love to see this work merged and propagated upstream to the next Cosmos SDK version if the improvements are so significant :)
If there's any specific work I can do to help push this forward, please let me know how I can contribute.

@tychoish
Contributor

tychoish commented Apr 5, 2021

See this commit: 1c4dbe3

I believe this is covered by the above change (sorry for not annotating the commit message appropriately!). I think the current view is that while changing the locking strategy will help relieve some contention, a lot of the performance bottlenecks are rooted deeper in the client implementations (e.g. IAVL related), which changing the locks won't help with.

I think it makes sense to close this issue given that the change has landed, but let me know if I've missed something or if it makes sense to keep this open for another reason.
