Failover must change priority based on ping #483
Comments
Serpentian added a commit to Serpentian/vshard that referenced this issue on Aug 16, 2024
Previously, the prioritized replica was changed only if it was disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, a connection shown as 'connected' does not necessarily work. The connection must be pingable in order to be operational.

This commit makes failover temporarily lower a replica's priority if FAILOVER_DOWN_SEQUENTIAL_FAIL requests to it fail in a row. All vshard internal requests (including the failover ping) and all user calls affect the number of sequentially failed requests. Note that a request is considered failed only when the net.box connection is not operational (conn.call cannot be made, e.g. the connection is not yet established or the timeout is reached); user functions throwing errors do not affect the prioritized replica.

The behavior of failover after this commit is the following:

1. Failover pings all prioritized replicas. If a ping doesn't succeed, the connection is recreated. This is needed when a user function returns too large a value: no other request can be served over the connection until that value is returned. A failed ping counts towards the number of sequentially failed requests.

2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or the number of sequentially failed requests is >= FAILOVER_DOWN_SEQUENTIAL_FAIL, then we take a replica with lower priority as the main one.

3. If failover hasn't tried to use the more prioritized replica (according to weights) for more than FAILOVER_UP_TIMEOUT, then we try to set a new replica as the prioritized one. Note that we don't set it if the ping to it didn't succeed during the ping round in (1).

Closes tarantool#483

NO_DOC=bugfix
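The step-down and step-up rules in this commit message can be modeled roughly as below. This is only an illustrative sketch in Python, not vshard's Lua implementation: the `ReplicaState` fields, both function names, and the timeout values are hypothetical, and only the three constant names and the comparisons come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values for illustration only. The commit message names
# these constants but does not state their values.
FAILOVER_DOWN_TIMEOUT = 1.0          # seconds (assumed)
FAILOVER_UP_TIMEOUT = 5.0            # seconds (assumed)
FAILOVER_DOWN_SEQUENTIAL_FAIL = 3    # failed requests in a row (assumed)


@dataclass
class ReplicaState:
    """Hypothetical per-replica bookkeeping kept by the failover loop."""
    name: str
    down_since: Optional[float] = None  # when the connection went down
    sequential_fails: int = 0           # requests failed in a row
    last_ping_ok: bool = True           # result of the latest ping round


def needs_step_down(replica: ReplicaState, now: float) -> bool:
    """Rule 2: leave the current replica if it has been down too long
    or too many requests to it failed in a row."""
    down_too_long = (
        replica.down_since is not None
        and now - replica.down_since >= FAILOVER_DOWN_TIMEOUT
    )
    too_many_fails = replica.sequential_fails >= FAILOVER_DOWN_SEQUENTIAL_FAIL
    return down_too_long or too_many_fails


def may_step_up(candidate: ReplicaState, last_up_attempt: float,
                now: float) -> bool:
    """Rule 3: retry a more prioritized replica only after more than
    FAILOVER_UP_TIMEOUT, and only if its latest ping succeeded."""
    if now - last_up_attempt <= FAILOVER_UP_TIMEOUT:
        return False
    return candidate.last_ping_ok
```

For example, a replica that still reports itself as connected but has `sequential_fails == 3` makes `needs_step_down` return true, which is the case the old timeout-only check missed.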
Serpentian added a commit that referenced this issue on Aug 16, 2024
Currently failover pings all nodes on every step. However, this ping doesn't affect instance priority at all; it only resets connections if they "hang". This was done because if a user returns too large a value from a `call`, the connection cannot serve any other requests until that value is returned.

Current behavior of the failover fiber:

1. Every `FAILOVER_UP_TIMEOUT`, the failover fiber tries to connect to the replica with higher priority.
2. If the current replica is disconnected for `FAILOVER_DOWN_TIMEOUT`, then we take another one and connect to it.

The major problem here is the assumption that if a net.box connection `is_connected`, then everything is all right. In real life that is not the case. When we cannot ping a replica, we should temporarily lower its priority. This may be done as follows:

If a user's `call` or failover's `ping` fails with an error which indicates that the connection is dead (some `net.box` error or `TimeOut`), then we increase the counter of failed requests to this replica. For this counter we introduce a constant, which will be 3 for now. If 3 consecutive requests fail, then we temporarily decrease the priority of that replica.