Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout Fixes and VTOrc Improvement #11881

Merged
merged 24 commits into from
Dec 16, 2022
Merged
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
31ba4d4
refactor: move tests out of newfeaturestest so that they run on upgra…
GuptaManan100 Dec 2, 2022
b9417b8
feat: add failing ers test for handling multiple vttablet failures wi…
GuptaManan100 Dec 2, 2022
def9aec
feat: add a new lock-timeout flag and use that instead of remote-oper…
GuptaManan100 Dec 2, 2022
0273581
feat: augment DownPrimary test to reproduce the issue of VTOrc not ha…
GuptaManan100 Dec 2, 2022
6c449c0
feat: remove LockShardTimeout configuration from VTOrc and add parall…
GuptaManan100 Dec 5, 2022
d288571
log: add more logging lines around ers in vtorc
GuptaManan100 Dec 5, 2022
e1b23da
test: get the test to work
GuptaManan100 Dec 5, 2022
c2b6a0d
feat: fix usage of wait for replicas timeout
GuptaManan100 Dec 5, 2022
033083c
test: fix flags expected output
GuptaManan100 Dec 5, 2022
5f03e90
test: fix race in test now that the function is called in parallel mu…
GuptaManan100 Dec 5, 2022
c2aa9fd
feat: fix default of onCloseTimeout to 1 second
GuptaManan100 Dec 13, 2022
2097817
test: add failing unit test to refreshTabletsInKeyspaceShard
GuptaManan100 Dec 13, 2022
27bf296
feat: fix vtorc to not forget a tablet which has been deleted
GuptaManan100 Dec 13, 2022
e3790c9
feat: fix backward compatibility, add tests and release notes docs
GuptaManan100 Dec 13, 2022
97c7eef
test: fix flags output
GuptaManan100 Dec 13, 2022
e31e7c5
test: use disable-replication-manager instead of disable-active-repar…
GuptaManan100 Dec 14, 2022
ac83323
test: fix flaky test by not checking for an error
GuptaManan100 Dec 14, 2022
6b0db02
feat: handle the case of empty hostname in tablet initialization
GuptaManan100 Dec 14, 2022
1f1e8ac
feat: update onclose timeout to 10 seconds
GuptaManan100 Dec 14, 2022
2411cea
test: fix unit test
GuptaManan100 Dec 14, 2022
ea18935
feat: address review comments
GuptaManan100 Dec 15, 2022
740f2fa
docs: add comments explaining the test functions
GuptaManan100 Dec 16, 2022
24a4f40
Merge remote-tracking branch 'upstream/main' into timeout-fixes
GuptaManan100 Dec 16, 2022
d827acc
feat: add summary docs for 'lock-shard-timeout' deprecation
GuptaManan100 Dec 16, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions doc/releasenotes/16_0_0_summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,14 @@ In [PR #11097](https://github.com/vitessio/vitess/pull/11097) we introduced nati
Orchestrator integration in `vttablet` was deprecated in the previous release and is deleted in this release.
Consider using `VTOrc` instead of `Orchestrator`.

#### `lock-timeout` and `remote_operation_timeout` Changes
Earlier, the shard and keyspace locks used to be capped by the `remote_operation_timeout`. This is no longer the case and instead a new flag called `lock-timeout` is introduced.
For backward compatibility, if `lock-timeout` is unspecified and `remote_operation_timeout` flag is provided, then its value will also be used for `lock-timeout` as well.
The default value for `remote_operation_timeout` has also changed from 30 seconds to 15 seconds. The default for the new flag `lock-timeout` is 45 seconds.

During upgrades, if the users want to preserve the same behaviour as previous releases, then they should provide the `remote_operation_timeout` flag explicitly before upgrading.
After the upgrade, they should then alter their configuration to also specify `lock-timeout` explicitly.
Comment on lines +70 to +71
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying both remote_operation_timeout and lock-timeout to 30s will expose users to the bug that is being fixed in this PR.
Instead we should recommend that if (and only if) they get timeouts during PRS or ERS, they may need to increase remote_operation_timeout and lock-timeout at the same time while making sure that lock-timeout is at least 15 seconds more than remote_operation_timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the defaults are 15 and 45 seconds, and we don't want lock-timeout to be at least 15 seconds more than remote_operation_timeout, we want it to be atleaset 2 times it and a littlle more.

Well in the previous release this bug exists, and only after upgrading and specifying both the flags (or specifying neither) can they really get rid of the bug. So the recommended process is to specify the remote-operation_timeout first so that while upgrading their behaviour remains the same. Once ugpraded they can choose whatever values they want.


### New command line flags and behavior

#### VTGate: Support query timeout --query-timeout
Expand Down
2 changes: 1 addition & 1 deletion go/flags/endtoend/mysqlctl.txt
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,7 @@ Global flags:
--mysqlctl_client_protocol string the protocol to use to talk to the mysqlctl server (default "grpc")
--mysqlctl_mycnf_template string template file to use for generating the my.cnf file during server init
--mysqlctl_socket string socket file to use for remote mysqlctl actions (empty for local actions)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 1ns)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout duration wait no more than this for OnTermSync handlers before stopping (default 10s)
--pid_file string If set, the process will write its pid to the named file, and delete it on graceful shutdown.
--pool_hostname_resolve_interval duration if set force an update to all hostnames and reconnect if changed, defaults to 0 (disabled)
Expand Down
2 changes: 1 addition & 1 deletion go/flags/endtoend/mysqlctld.txt
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ Usage of mysqlctld:
--mysql_socket string Path to the mysqld socket file
--mysqlctl_mycnf_template string template file to use for generating the my.cnf file during server init
--mysqlctl_socket string socket file to use for remote mysqlctl actions (empty for local actions)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 1ns)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout duration wait no more than this for OnTermSync handlers before stopping (default 10s)
--pid_file string If set, the process will write its pid to the named file, and delete it on graceful shutdown.
--pool_hostname_resolve_interval duration if set force an update to all hostnames and reconnect if changed, defaults to 0 (disabled)
Expand Down
3 changes: 2 additions & 1 deletion go/flags/endtoend/vtbackup.txt
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ Usage of vtbackup:
--keep-alive-timeout duration Wait until timeout elapses after a successful backup before shutting down.
--keep_logs duration keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime duration keep logs for this long (using mtime) (zero to keep forever)
--lock-timeout duration Maximum time for which a shard/keyspace lock can be acquired for (default 45s)
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
--log_err_stacks log stack traces for errors
Expand Down Expand Up @@ -119,7 +120,7 @@ Usage of vtbackup:
--port int port for the server
--pprof strings enable profiling
--purge_logs_interval duration how often try to remove old logs (default 1h0m0s)
--remote_operation_timeout duration time to wait for a remote operation (default 30s)
--remote_operation_timeout duration time to wait for a remote operation (default 15s)
--restart_before_backup Perform a mysqld clean/full restart after applying binlogs, but before taking the backup. Only makes sense to work around xtrabackup bugs.
--s3_backup_aws_endpoint string endpoint of the S3 backend (region must be provided).
--s3_backup_aws_region string AWS region to use. (default "us-east-1")
Expand Down
7 changes: 4 additions & 3 deletions go/flags/endtoend/vtctld.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
Usage of vtctld:
--action_timeout duration time to wait for an action before resorting to force (default 2m0s)
--action_timeout duration time to wait for an action before resorting to force (default 1m0s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changing? Can you also check the blame so that we can see whether it changed recently?
There's an action_timeout that defaults to 1h on the vtctlclient and vtctldclient binaries. How does this relate to those?

Copy link
Member Author

@GuptaManan100 GuptaManan100 Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a effect of

DefaultActionTimeout = topo.RemoteOperationTimeout * 4
and
actionTimeout = wrangler.DefaultActionTimeout

Apparently there are two action_timeouts 🤣. That action_timeout is a parameter for vtctldclient and vtctlclient. This one is taken by vtctld and applies only to ApplyKeyspaceAction, ApplyShardAction and ApplyTabletAction. This is the commit that introduced it - facdb18

--alsologtostderr log to standard error as well as files
--azblob_backup_account_key_file string Path to a file containing the Azure Storage account key; if this flag is unset, the environment variable VT_AZBLOB_ACCOUNT_KEY will be used as the key itself (NOT a file path).
--azblob_backup_account_name string Azure Storage Account name for backups; if this flag is unset, the environment variable VT_AZBLOB_ACCOUNT_NAME will be used.
Expand Down Expand Up @@ -60,21 +60,22 @@ Usage of vtctld:
--keep_logs duration keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime duration keep logs for this long (using mtime) (zero to keep forever)
--lameduck-period duration keep running at least this long after SIGTERM before stopping (default 50ms)
--lock-timeout duration Maximum time for which a shard/keyspace lock can be acquired for (default 45s)
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
--log_err_stacks log stack traces for errors
--log_rotate_max_size uint size in bytes at which logs are rotated (glog.MaxSize) (default 1887436800)
--logtostderr log to standard error instead of files
--max-stack-size int configure the maximum stack size in bytes (default 67108864)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 1ns)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout duration wait no more than this for OnTermSync handlers before stopping (default 10s)
--opentsdb_uri string URI of opentsdb /api/put method
--pid_file string If set, the process will write its pid to the named file, and delete it on graceful shutdown.
--port int port for the server
--pprof strings enable profiling
--proxy_tablets Setting this true will make vtctld proxy the tablet status instead of redirecting to them
--purge_logs_interval duration how often try to remove old logs (default 1h0m0s)
--remote_operation_timeout duration time to wait for a remote operation (default 30s)
--remote_operation_timeout duration time to wait for a remote operation (default 15s)
--s3_backup_aws_endpoint string endpoint of the S3 backend (region must be provided).
--s3_backup_aws_region string AWS region to use. (default "us-east-1")
--s3_backup_aws_retries int AWS request retries. (default -1)
Expand Down
5 changes: 3 additions & 2 deletions go/flags/endtoend/vtgate.txt
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@ Usage of vtgate:
--keyspaces_to_watch strings Specifies which keyspaces this vtgate should have access to while routing queries or accessing the vschema.
--lameduck-period duration keep running at least this long after SIGTERM before stopping (default 50ms)
--legacy_replication_lag_algorithm Use the legacy algorithm when selecting vttablets for serving. (default true)
--lock-timeout duration Maximum time for which a shard/keyspace lock can be acquired for (default 45s)
--lock_heartbeat_time duration If there is lock function used. This will keep the lock connection active by using this heartbeat (default 5s)
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
Expand Down Expand Up @@ -119,7 +120,7 @@ Usage of vtgate:
--mysql_tcp_version string Select tcp, tcp4, or tcp6 to control the socket type. (default "tcp")
--no_scatter when set to true, the planner will fail instead of producing a plan that includes scatter queries
--normalize_queries Rewrite queries with bind vars. Turn this off if the app itself sends normalized queries with bind vars. (default true)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 1ns)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout duration wait no more than this for OnTermSync handlers before stopping (default 10s)
--opentsdb_uri string URI of opentsdb /api/put method
--pid_file string If set, the process will write its pid to the named file, and delete it on graceful shutdown.
Expand All @@ -134,7 +135,7 @@ Usage of vtgate:
--querylog-format string format for query logs ("text" or "json") (default "text")
--querylog-row-threshold uint Number of rows a query has to return or affect before being logged; not useful for streaming queries. 0 means all queries will be logged.
--redact-debug-ui-queries redact full queries and bind variables from debug UI
--remote_operation_timeout duration time to wait for a remote operation (default 30s)
--remote_operation_timeout duration time to wait for a remote operation (default 15s)
--retry-count int retry count (default 2)
--schema_change_signal Enable the schema tracker; requires queryserver-config-schema-change-signal to be enabled on the underlying vttablets for this to work (default true)
--schema_change_signal_user string User to be used to send down query to vttablet to retrieve schema changes
Expand Down
3 changes: 2 additions & 1 deletion go/flags/endtoend/vtgr.txt
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ Usage of vtgr:
-h, --help display usage and exit
--keep_logs duration keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime duration keep logs for this long (using mtime) (zero to keep forever)
--lock-timeout duration Maximum time for which a shard/keyspace lock can be acquired for (default 45s)
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
--log_err_stacks log stack traces for errors
Expand All @@ -31,7 +32,7 @@ Usage of vtgr:
--pprof strings enable profiling
--purge_logs_interval duration how often try to remove old logs (default 1h0m0s)
--refresh_interval duration Refresh interval to load tablets. (default 10s)
--remote_operation_timeout duration time to wait for a remote operation (default 30s)
--remote_operation_timeout duration time to wait for a remote operation (default 15s)
--scan_interval duration Scan interval to diagnose and repair. (default 3s)
--scan_repair_timeout duration Time to wait for a Diagnose and repair operation. (default 3s)
--security_policy string the name of a registered security policy to use for controlling access to URLs - empty means allow all for anyone (built-in policies: deny-all, read-only)
Expand Down
6 changes: 3 additions & 3 deletions go/flags/endtoend/vtorc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -23,14 +23,14 @@ Usage of vtorc:
--keep_logs duration keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime duration keep logs for this long (using mtime) (zero to keep forever)
--lameduck-period duration keep running at least this long after SIGTERM before stopping (default 50ms)
--lock-shard-timeout duration Duration for which a shard lock is held when running a recovery (default 30s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to deprecate this flag?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deprecated it -

fs.Duration("lock-shard-timeout", 30*time.Second, "Duration for which a shard lock is held when running a recovery")
_ = fs.MarkDeprecated("lock-shard-timeout", "Please use lock-timeout instead.")

Deprecated flags don't show up in the flags output either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the summary too.

--lock-timeout duration Maximum time for which a shard/keyspace lock can be acquired for (default 45s)
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
--log_err_stacks log stack traces for errors
--log_rotate_max_size uint size in bytes at which logs are rotated (glog.MaxSize) (default 1887436800)
--logtostderr log to standard error instead of files
--max-stack-size int configure the maximum stack size in bytes (default 67108864)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 1ns)
--onclose_timeout duration wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout duration wait no more than this for OnTermSync handlers before stopping (default 10s)
--pid_file string If set, the process will write its pid to the named file, and delete it on graceful shutdown.
--port int port for the server
Expand All @@ -40,7 +40,7 @@ Usage of vtorc:
--reasonable-replication-lag duration Maximum replication lag on replicas which is deemed to be acceptable (default 10s)
--recovery-period-block-duration duration Duration for which a new recovery is blocked on an instance after running a recovery (default 30s)
--recovery-poll-duration duration Timer duration on which VTOrc polls its database to run a recovery (default 1s)
--remote_operation_timeout duration time to wait for a remote operation (default 30s)
--remote_operation_timeout duration time to wait for a remote operation (default 15s)
--security_policy string the name of a registered security policy to use for controlling access to URLs - empty means allow all for anyone (built-in policies: deny-all, read-only)
--shutdown_wait_time duration Maximum time to wait for VTOrc to release all the locks that it is holding before shutting down on SIGTERM (default 30s)
--snapshot-topology-interval duration Timer duration on which VTOrc takes a snapshot of the current MySQL information it has in the database. Should be in multiple of hours
Expand Down
Loading