Change the default buffering timers #9433

Open
FancyFane opened this issue Dec 22, 2021 · 0 comments
Labels
Component: Cluster management, Type: Enhancement (Logical improvement, somewhere between a bug and feature)

Comments

@FancyFane (Collaborator)

Buffering Timeout Problems

While working on buffering documentation, a simulation was run to see how buffering would behave if a PlannedReparentShard (PRS) command were to fail. During this scenario it was discovered that the failing PRS command takes about 40-50 seconds to complete:

$ time vtctlclient -server localhost:15999 PlannedReparentShard -keyspace_shard=commerce/0

PlannedReparentShard Error: rpc error: code = Unknown desc = primary-elect tablet zone1-0000000101 failed to catch up with replication MySQL56/4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186: rpc error: code = Unknown desc = TabletManager.WaitForPosition on zone1-0000000101 error: timed out waiting for position 4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186: timed out waiting for position 4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186
E1221 20:04:41.042580  205304 main.go:76] remote error: rpc error: code = Unknown desc = primary-elect tablet zone1-0000000101 failed to catch up with replication MySQL56/4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186: rpc error: code = Unknown desc = TabletManager.WaitForPosition on zone1-0000000101 error: timed out waiting for position 4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186: timed out waiting for position 4fb7c72c-62c8-11ec-8287-8cae4cdeeda4:1-16186

real	0m41.734s
user	0m0.008s
sys	0m0.015s

Looking over the buffering results in this situation, several buffered requests expired because they exceeded the -buffer_window default of 10 seconds, as shown below:

curl -s localhost:15001/metrics | grep -v '^#' | grep buffer_requests
vtgate_buffer_requests_buffered{keyspace="commerce",shard_name="0"} 30
vtgate_buffer_requests_buffered_dry_run{keyspace="commerce",shard_name="0"} 0
vtgate_buffer_requests_drained{keyspace="commerce",shard_name="0"} 15
vtgate_buffer_requests_evicted{keyspace="commerce",reason="BufferFull",shard_name="0"} 0
vtgate_buffer_requests_evicted{keyspace="commerce",reason="ContextDone",shard_name="0"} 0
vtgate_buffer_requests_evicted{keyspace="commerce",reason="WindowExceeded",shard_name="0"} 15
vtgate_buffer_requests_skipped{keyspace="commerce",reason="BufferFull",shard_name="0"} 0
vtgate_buffer_requests_skipped{keyspace="commerce",reason="Disabled",shard_name="0"} 0
vtgate_buffer_requests_skipped{keyspace="commerce",reason="LastFailoverTooRecent",shard_name="0"} 50
vtgate_buffer_requests_skipped{keyspace="commerce",reason="LastReparentTooRecent",shard_name="0"} 0
vtgate_buffer_requests_skipped{keyspace="commerce",reason="Shutdown",shard_name="0"} 0

NOTE: All 15 of the connections utilized in this scenario failed due to WindowExceeded
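
For reference, one simple way to keep an eye on these counters while a reparent is in flight is to poll the same endpoint in a loop. This is a minimal sketch, assuming vtgate exposes /metrics on localhost:15001 as in the example above:

watch -n 5 "curl -s localhost:15001/metrics | grep -v '^#' | grep buffer_requests"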

Proposed Solution

To better handle the failed PRS scenario, I would like to propose changing the default buffering timers to the following values (an example vtgate invocation with these values follows the list):

-buffer_max_failover_duration=1m  (current default 20s)
-buffer_min_time_between_failovers=2m (current default 1m)
-buffer_window=1m (current default 10s)
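
For illustration, a vtgate started with the proposed values set explicitly might look like the following. This is a sketch only: buffering still has to be turned on with -enable_buffer, and all other topo/cell/port flags are omitted:

vtgate \
  -enable_buffer \
  -buffer_max_failover_duration=1m \
  -buffer_min_time_between_failovers=2m \
  -buffer_window=1m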

In follow-up buffering tests, it was found that these values spanned the PRS failure and prevented errors from being sent to applications issuing query requests during this period. This was tested on main, and I would propose these changes to complement the changes made in the "Change default Vitess buffering implementation" issue #9359.
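
As a rough way to reproduce the follow-up test, one can run a steady stream of queries against the commerce keyspace through vtgate while the PRS runs in another terminal, then confirm that no errors are returned and that requests_drained increases. This sketch assumes the local example cluster, with vtgate's MySQL listener on 127.0.0.1:15306 and the corder table from the examples (adjust host, port, and query for your setup):

while true; do
  # queries to the primary are buffered by vtgate during the failover window
  mysql -h 127.0.0.1 -P 15306 -D commerce -e 'select count(*) from corder' > /dev/null || echo "query failed"
  sleep 1
done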

@FancyFane self-assigned this Dec 22, 2021
FancyFane added a commit to planetscale/vitess that referenced this issue Dec 22, 2021
Signed-off-by: FancyFane <fane@planetscale.com>
vmg pushed a commit to planetscale/vitess that referenced this issue Jan 11, 2022
Signed-off-by: FancyFane <fane@planetscale.com>
@ajm188 added the Type: Enhancement (Logical improvement, somewhere between a bug and feature) and Component: Cluster management labels Jun 21, 2022