Failover must change priority based on ping #483

Serpentian · 2024-08-05T09:33:38Z

Currently, failover pings all nodes on every step. However, this ping doesn't affect instance priority at all; it only resets connections that "hang". This was done because if a user returns too large a value from a call, the connection cannot serve any other requests until that value is returned.

Current behavior of failover fiber:

Increase replica priority: every FAILOVER_UP_TIMEOUT, the failover fiber tries to connect to a replica with higher priority.
Decrease replica priority: if we have not been connected to the prioritized replica for more than FAILOVER_DOWN_TIMEOUT, we take another one and connect to it.

The major problem here is the assumption that if a net.box connection is_connected, then everything is all right. In real life this is not the case. When we cannot ping a replica, we should temporarily lower its priority. This may be done as follows:

If a user's call or failover's ping fails with an error indicating that the connection is dead (some net.box error or TimeOut), we increase the counter of failed requests to this replica. For this counter we introduce a constant, which will be 3 for now. If 3 consecutive requests fail, we temporarily decrease the priority of that replica.
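The counter described above could look roughly like the sketch below. It is written in Python for illustration only and is not vshard code: the class and method names are hypothetical, and leaving the counter untouched on a user-function error is an assumption, since the proposal only says that connection-level failures are counted.

```python
from dataclasses import dataclass

# Threshold from the proposal ("3 for now"); the constant name follows
# the commit message that references this issue.
FAILOVER_DOWN_SEQUENTIAL_FAIL = 3


@dataclass
class ReplicaHealth:
    """Hypothetical per-replica bookkeeping of consecutive failed requests."""

    sequential_fails: int = 0

    def record(self, ok: bool, connection_error: bool = False) -> None:
        if ok:
            # Any successful request breaks the streak of failures.
            self.sequential_fails = 0
        elif connection_error:
            # net.box error or timeout: the connection looks dead.
            self.sequential_fails += 1
        # Assumption: an error thrown by a user function changes nothing.

    def should_lower_priority(self) -> bool:
        return self.sequential_fails >= FAILOVER_DOWN_SEQUENTIAL_FAIL
```

Both user calls and failover pings would report into the same `record` call, so either kind of request can push the replica over the threshold.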


Previously, the prioritized replica was changed only if it was disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, if a connection is shown as 'connected', it doesn't mean that the connection actually works. The connection must be pingable in order to be operational.

This commit makes failover temporarily lower a replica's priority if FAILOVER_DOWN_SEQUENTIAL_FAIL requests to it fail. All vshard internal requests (including the failover ping) and all user calls affect the number of sequentially failed requests. Note that we consider a request failed when the net.box connection is not operational (cannot make conn.call, e.g. the connection is not yet established or the timeout is reached). User functions throwing errors won't affect the prioritized replica.

The behavior of failover is the following after this commit:

1. Failover pings all prioritized replicas. If a ping doesn't succeed, the connection is recreated. This is needed when a user returns too large a value from a function: in that case no other request can be served until the value is returned. A failed ping affects the number of sequentially failed requests.

2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or if the number of sequentially failed requests is >= FAILOVER_DOWN_SEQUENTIAL_FAIL, then we take a replica with lower priority as the main one.

3. If failover didn't try to use the more prioritized replica (according to weights) for more than FAILOVER_UP_TIMEOUT, then we try to set a new replica as the prioritized one. Note that we don't set it if the ping to it didn't succeed during the ping round in (1).

Closes tarantool#483

NO_DOC=bugfix
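Steps 2 and 3 of the commit message above can be read as a single decision function. The following Python sketch is only an illustration of that reading, not the vshard implementation: the `Replica` fields, the `failover_step` signature, the timeout values, and the convention that a smaller weight means a more preferred replica are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

FAILOVER_DOWN_TIMEOUT = 1.0        # seconds, placeholder value
FAILOVER_UP_TIMEOUT = 5.0          # seconds, placeholder value
FAILOVER_DOWN_SEQUENTIAL_FAIL = 3  # threshold proposed in the issue


@dataclass
class Replica:
    name: str
    weight: int                          # assumption: smaller is preferred
    down_since: Optional[float] = None   # None while the connection is up
    sequential_fails: int = 0
    ping_ok: bool = True                 # result of the latest ping round


def is_unhealthy(r: Replica, now: float) -> bool:
    down_too_long = (r.down_since is not None
                     and now - r.down_since >= FAILOVER_DOWN_TIMEOUT)
    return down_too_long or r.sequential_fails >= FAILOVER_DOWN_SEQUENTIAL_FAIL


def failover_step(replicas: List[Replica], current: Replica,
                  now: float, last_up_attempt: float) -> Replica:
    """Return the replica to treat as prioritized after this step."""
    ordered = sorted(replicas, key=lambda r: r.weight)
    # Step 2: move down when the current replica is down or keeps failing.
    if is_unhealthy(current, now):
        for candidate in ordered[ordered.index(current) + 1:]:
            if not is_unhealthy(candidate, now):
                return candidate
        return current
    # Step 3: periodically try to move back up, but only to a replica
    # whose ping succeeded in the latest ping round.
    if now - last_up_attempt >= FAILOVER_UP_TIMEOUT:
        for candidate in ordered:
            if candidate is current:
                break
            if candidate.ping_ok:
                return candidate
    return current
```

In this sketch the caller is expected to run the ping round (step 1) first and fill in `ping_ok` and `sequential_fails` before calling `failover_step`.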

Serpentian added the bug and router labels Aug 5, 2024

Serpentian self-assigned this Aug 5, 2024

Serpentian mentioned this issue Aug 5, 2024

Router sends requests to dead instances #480

Closed

Serpentian closed this as completed Aug 16, 2024
