Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: Add experimental config params to allow for subscription buffer size control (tm v0.34.x) #7230

Merged
merged 9 commits into from
Nov 9, 2021

Conversation

thanethomson
Copy link
Contributor

@thanethomson thanethomson commented Nov 3, 2021

Follows from #7188 and related to #6729.

The aim here is to provide a short-term workaround for #6729 in the v0.34.x series of Tendermint that allows for tuning of the internal buffer sizes relating to event subscription. The following configuration parameters are introduced in the [rpc] section:

[rpc]

# Experimental parameter to specify the maximum number of events a node will
# buffer, per subscription, before returning an error and closing the
# subscription. Must be set to at least 100, but higher values will accommodate
# higher event throughput rates (and will use more memory).
experimental_subscription_buffer_size = 1000

# Experimental parameter to specify the maximum number of RPC responses that
# can be buffered per WebSocket client. If clients cannot read from the
# WebSocket endpoint fast enough, they will be disconnected, so increasing this
# parameter may reduce the chances of them being disconnected (but will cause
# the node to use more memory).
#
# Must be at least the same as "experimental_subscription_buffer_size",
# otherwise connections could be dropped unnecessarily. This value should
# ideally be somewhat higher than "experimental_subscription_buffer_size" to
# accommodate non-subscription-related RPC responses.
experimental_websocket_write_buffer_size = 1000

# If a WebSocket client cannot read fast enough, at present we may
# silently drop events instead of generating an error or disconnecting the
# client.
#
# Enabling this experimental parameter will cause the WebSocket connection to
# be closed instead if it cannot read fast enough, allowing for greater
# predictability in subscription behaviour.
experimental_close_on_slow_client = false

@thanethomson
Copy link
Contributor Author

Seems like the new linting changes in 8a2dcba are picking up a whole bunch of spelling mistakes throughout the codebase 🙂

@tychoish
Copy link
Contributor

tychoish commented Nov 4, 2021

Seems like the new linting changes in 8a2dcba are picking up a whole bunch of spelling mistakes throughout the codebase 

They're less mistakes and more conventions :p

Anyway, those should be fixed if you rebase/etc.

Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
@thanethomson
Copy link
Contributor Author

So we've now managed to confirm that:

  1. With the latest version of Hermes, we can still reproduce the RPC disconnection with the official Gaia v4.2.1 release (Tendermint v0.34.9).
  2. We can reproduce the RPC disconnection with Gaia v4.2.1 built with Tendermint from the v0.34.x branch.
  3. We can reproduce the RPC disconnection with Gaia v4.2.1 built with Tendermint from this PR's branch (thane/6729-websocket-bufsize2) with experimental buffer sizes set to 100.
  4. Using the same setup as in (3), but setting the buffer sizes to 1000 results in us not being able to reproduce the RPC disconnection. We do not get disconnected from Gaia, even when submitting 10,000 transactions as quickly as we can with Hermes (all processes running on the same machine).

Beyond enabling these experimental config parameters, I'd like to propose we increase the default subscription buffer size to 1000 to better handle the loads we're seeing in live networks.

Copy link
Contributor

@creachadair creachadair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this change seems mechanically fine. My one main concern is this: If we add these experimental settings only to the v0.34 line, it becomes an upgrade blocker for v0.35.

The biggest users are the ones most likely to need this knob, and there are enough other quality of life improvements in the v0.35 series that we'd prefer not to add any additional impediments to their upgrading.

It looks to me like the config plumbing parts of this PR could be safely added to v0.35.x for the next point release, which would avert that. I think it'd be fine to land them separately rather than as a backport from v0.35, though.

Your thoughts?

config/config.go Outdated Show resolved Hide resolved
rpc/core/events.go Show resolved Hide resolved
@creachadair
Copy link
Contributor

Overall this change seems mechanically fine. My one main concern is this: If we add these experimental settings only to the v0.34 line, it becomes an upgrade blocker for v0.35.

The biggest users are the ones most likely to need this knob, and there are enough other quality of life improvements in the v0.35 series that we'd prefer not to add any additional impediments to their upgrading.

It looks to me like the config plumbing parts of this PR could be safely added to v0.35.x for the next point release, which would avert that. I think it'd be fine to land them separately rather than as a backport from v0.35, though.

Your thoughts?

Just to clarify: I'm not objecting to merging this change into v0.34 either way, I am just thinking about the path for someone who needs to upgrade.

@thanethomson
Copy link
Contributor Author

It looks to me like the config plumbing parts of this PR could be safely added to v0.35.x for the next point release, which would avert that. I think it'd be fine to land them separately rather than as a backport from v0.35, though.

I'd be more than happy to submit a separate PR to introduce similar changes into v0.35.

Signed-off-by: Thane Thomson <connect@thanethomson.com>
@thanethomson
Copy link
Contributor Author

Are the coverage tests known to be non-deterministic? It seems like each time I re-run the tests, a different set of coverage tests fail.

@creachadair
Copy link
Contributor

Are the coverage tests known to be non-deterministic? It seems like each time I re-run the tests, a different set of coverage tests fail.

Some of the end-to-end tests are nondeterministic, but the unit tests should usually not be. This looks like a bug in the tests, though it is quite possible it's a bug we fixed already on trunk. Over the past few months we've cleaned up various issues with goroutine lifetime.

If it doesn't fail consistently, I don't think we should block over it.

@creachadair creachadair merged commit 12e3419 into v0.34.x Nov 9, 2021
@creachadair creachadair deleted the thane/6729-websocket-bufsize2 branch November 9, 2021 20:35
creachadair pushed a commit that referenced this pull request Nov 13, 2021
…(tm v0.35.x) (#7276)

* Update error message to correspond to changes in v0.34.x
* Add buffer size and client-close config parameters

Signed-off-by: Thane Thomson <connect@thanethomson.com>
lklimek referenced this pull request in dashpay/tenderdash Mar 25, 2022
…(tm v0.35.x) (#7276)

* Update error message to correspond to changes in v0.34.x
* Add buffer size and client-close config parameters

Signed-off-by: Thane Thomson <connect@thanethomson.com>
(cherry picked from commit 035da42)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants