Failover must change priority based on ping #483

Closed
Serpentian opened this issue Aug 5, 2024 · 0 comments

Serpentian commented Aug 5, 2024

Currently, failover pings all nodes on every step. However, this ping doesn't affect instance priority at all; it only resets connections that "hang". This was done because, if a user returns too large a value from a call, the connection cannot serve any other requests until that value is returned.

Current behavior of failover fiber:

  1. Increase replica's priority: every FAILOVER_UP_TIMEOUT, the failover fiber tries to connect to a replica with higher priority.
  2. Decrease replica's priority: if we have not been connected to the prioritized replica for more than FAILOVER_DOWN_TIMEOUT, we take another one and connect to it.

The major problem here is the assumption that if a net.box connection reports is_connected, then everything is all right. In real life that is not the case. When we cannot ping a replica, we should temporarily lower its priority. This may be done as follows:

If a user's call or failover's ping fails with an error indicating that the connection is dead (some net.box error or a timeout), we increase the counter of failed requests to this replica. The limit for this counter is a new constant, which will be 3 for now. If 3 consecutive requests fail, we temporarily decrease the priority of that replica.
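
The counter proposed above could look roughly like the following. This is a minimal Python sketch of the idea, not vshard code: `ReplicaHealth`, `tracked_call`, and `SEQUENTIAL_FAIL_LIMIT` are hypothetical names, Python's `ConnectionError` and `TimeoutError` stand in for net.box errors and timeouts, and resetting the counter on any request that reaches the replica is an assumption about what "consecutive" means here.

```python
from dataclasses import dataclass

# Hypothetical name; the issue proposes a limit of 3 "for now".
SEQUENTIAL_FAIL_LIMIT = 3


@dataclass
class ReplicaHealth:
    """Tracks consecutive connection-level failures for one replica."""
    sequential_fails: int = 0

    def on_request_done(self, connection_failed: bool) -> None:
        if connection_failed:
            self.sequential_fails += 1
        else:
            # Assumption: any request that reached the replica breaks the streak.
            self.sequential_fails = 0

    def should_lower_priority(self) -> bool:
        return self.sequential_fails >= SEQUENTIAL_FAIL_LIMIT


def tracked_call(health: ReplicaHealth, func, *args):
    """Run func and update the failure counter based on the outcome."""
    try:
        result = func(*args)
    except (ConnectionError, TimeoutError):
        # Stand-ins for "connection is dead" errors: these count as failures.
        health.on_request_done(connection_failed=True)
        raise
    except Exception:
        # The user's function failed, but the connection itself worked.
        health.on_request_done(connection_failed=False)
        raise
    health.on_request_done(connection_failed=False)
    return result
```

Both user calls and the failover ping would go through the same wrapper, so either kind of request can push the replica over the limit.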

@Serpentian Serpentian added the bug and router labels Aug 5, 2024
@Serpentian Serpentian self-assigned this Aug 5, 2024
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 16, 2024
Previously, the prioritized replica was changed only if it had been
disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, a connection
shown as 'connected' does not necessarily work. The connection must be
pingable in order to be operational.

This commit makes failover temporarily lower a replica's priority if
FAILOVER_DOWN_SEQUENTIAL_FAIL requests to it fail in a row. All vshard
internal requests (including the failover ping) and all user calls
affect the number of sequentially failed requests. Note that a request
is considered failed only when the net.box connection is not
operational (conn.call cannot be made, e.g. the connection is not yet
established or the timeout is reached); user functions throwing errors
don't affect the prioritized replica.

The behavior of failover is the following after this commit:

1. Failover pings all prioritized replicas. If a ping doesn't succeed,
   the connection is recreated. This is needed when a user returns too
   large a value from a function: in that case no other request can be
   served until the value is returned. A failed ping affects the number
   of sequentially failed requests.

2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or if the
   number of sequentially failed requests is >=
   FAILOVER_DOWN_SEQUENTIAL_FAIL, then we take a replica with lower
   priority as the main one.

3. If failover hasn't tried to use a more prioritized replica (according
   to weights) for more than FAILOVER_UP_TIMEOUT, then we try to set a
   new replica as the prioritized one. Note that we don't set it if the
   ping to it didn't succeed during the ping round in (1).

Closes tarantool#483

NO_DOC=bugfix
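
The three steps described in the commit message can be sketched as a single decision function. This is an illustrative Python model under stated assumptions, not the actual vshard implementation: the `Replica` class, `record_ping`, `must_step_down`, and `failover_step` are hypothetical, the timeout values are made up, the replica list is assumed to be sorted best-first by weight, and picking the first healthy lower-priority replica in step 2 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Constant names follow the commit message; the values are made up.
FAILOVER_DOWN_TIMEOUT = 5.0
FAILOVER_UP_TIMEOUT = 5.0
FAILOVER_DOWN_SEQUENTIAL_FAIL = 3


@dataclass
class Replica:
    name: str
    down_since: Optional[float] = None  # time the connection went down, if any
    sequential_fails: int = 0
    ping_ok: bool = True                # result of the last ping round


def record_ping(replica: Replica, ok: bool) -> None:
    """Step 1: a ping result also feeds the sequential failure counter."""
    replica.ping_ok = ok
    replica.sequential_fails = 0 if ok else replica.sequential_fails + 1


def must_step_down(replica: Replica, now: float) -> bool:
    """Step 2 condition: down for too long, or too many failures in a row."""
    down_too_long = (replica.down_since is not None
                     and now - replica.down_since >= FAILOVER_DOWN_TIMEOUT)
    too_many_fails = replica.sequential_fails >= FAILOVER_DOWN_SEQUENTIAL_FAIL
    return down_too_long or too_many_fails


def failover_step(replicas: List[Replica], current: Replica,
                  now: float, last_up_attempt: float) -> Replica:
    """Return the replica to use after one step; replicas are sorted best-first."""
    idx = replicas.index(current)
    # Step 2: fall back to a lower-priority replica if the current one is unhealthy.
    if must_step_down(current, now):
        for cand in replicas[idx + 1:]:
            if not must_step_down(cand, now):
                return cand
        return current
    # Step 3: periodically try to move up, but only to a replica that answered ping.
    if now - last_up_attempt > FAILOVER_UP_TIMEOUT:
        for cand in replicas[:idx]:
            if cand.ping_ok and not must_step_down(cand, now):
                return cand
    return current
```

In this model a replica that stays "connected" but never answers pings is demoted after three failed rounds, and it is not promoted again until a ping to it succeeds.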
Serpentian added a commit that referenced this issue Aug 16, 2024