
Allow to make several requests with hanging connection #484

Closed
Serpentian opened this issue Aug 5, 2024 · 1 comment
Labels: bug (Something isn't working), router

Comments


Serpentian commented Aug 5, 2024

Currently it's impossible to make several requests within one call when the connection to an instance has hung. The router must slightly balance between replicas even during callro; we'll call this stateless balancing.


The first problem is that a single request uses all of the remaining time while hoping for an answer from the instance:

opts.timeout = tend - fiber_clock()
local storage_call_status, call_status, call_error =
replicaset[call](replicaset, 'vshard.storage.call',
{bucket_id, mode, func, args}, opts)

However, if the connection is not alive but is still shown as connected, the request fails with a timeout error, and we cannot make another one here, as we have no remaining time.

I propose to introduce a new option for router calls and name it something like request_timeout. It specifies how much time a single request has, and it must always be <= the timeout. This option will be exported to crud, and users will be able to control its value so that they have no failed requests, which is important in mission-critical projects.

The question here is whether we should make this value equal to e.g. request_timeout = timeout / <number of replicas in rs>, or leave it as it is now (just request_timeout = timeout). @Gerold103
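A hedged sketch of how the proposed option might look at the call site. `request_timeout` is the name suggested in this issue, not a released vshard option; `bucket_id` and `my_func` are placeholders:

```lua
-- Hypothetical usage of the proposed option (a sketch of this proposal,
-- not an existing API).
local vshard = require('vshard')

local res, err = vshard.router.callro(bucket_id, 'my_func', {}, {
    timeout = 5,         -- total budget for the whole call
    request_timeout = 1, -- budget for each individual request;
                         -- must be <= timeout, allowing up to 5 attempts
})
```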


The second problem we have is that the replicaset module itself doesn't change the priority of replicas. So even if request_timeout is less than timeout, we'll just make several requests to the same dead replica (if the failover fiber doesn't wake up between the requests).

if err then
return nil, err
else
return nil, lerror.timeout()
end

I suppose that if a request failed, we should unconditionally lower the replica's priority. This will cause constant priority changes on dead replicasets, when e.g. privileges are configured incorrectly (however, that case is covered by the backoff procedure), but it will make requests much more solid: they'll fail much more rarely. @Gerold103, your opinion on this?
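The two fixes together could be sketched as a retry loop. This is an illustrative sketch only, not the actual vshard implementation; `next_replica` is a hypothetical helper, and `table.copy`, `fiber_clock`, and `lerror` are assumed from the Tarantool/vshard environment:

```lua
-- Sketch only: cap each attempt by request_timeout and retry on another
-- replica while the overall deadline allows.
local function call_with_retries(replicaset, call, bucket_id, mode,
                                 func, args, opts)
    local tend = fiber_clock() + opts.timeout
    local err
    repeat
        local attempt_opts = table.copy(opts)
        attempt_opts.timeout =
            math.min(opts.request_timeout, tend - fiber_clock())
        local ok, res
        ok, res, err = replicaset[call](replicaset, 'vshard.storage.call',
                                        {bucket_id, mode, func, args},
                                        attempt_opts)
        if ok then
            return res
        end
        -- Hypothetical: switch to the next replica after a failure.
        next_replica(replicaset)
    until fiber_clock() >= tend
    return nil, err or lerror.timeout()
end
```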

Gerold103 (Collaborator) commented:

> request_timeout

That looks good.

> users will be able to control its value so that they have no failed requests

They will have failed requests anyway, unfortunately. The router can become isolated from the storages, or have wrong credentials, or all the storages can be down, or the timeout just might not be enough. My point is that the app code must handle errors gracefully and not assume that failures never happen.

> leave as it is now

Yes. If you made it timeout / replica count by default, then existing apps could break, since their actual request timeout would be a third or less of what they specify (assuming people have at least 3 replicas).
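To make the concern concrete (the numbers are chosen purely for illustration):

```lua
-- With timeout = 9 and 3 replicas, a default of
-- request_timeout = timeout / replica_count would silently cut each
-- attempt to a third of what the app asked for.
local timeout = 9
local replica_count = 3
local request_timeout = timeout / replica_count
assert(request_timeout == 3)
```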

> we should unconditionally lower its priority

Absolutely no 🥺. If the user specified a faulty replica as the highest prio in their config, then we should still try to use it. We can't permanently bend the config as we like. I.e. if the replica with the highest prio in the config is alive, it must be used with the highest prio, 100%. But we can change how exactly we try to use it depending on the situation. Right now if a replica fails, we keep retrying it. The alternative would be to mark it as faulty after some number of failures and stop using it for user requests; the failover fiber would then try to ping it in the background. When the ping succeeds, we make the replica usable again with its original prio. But until the failover ping succeeds, we don't use the connection for user requests.

As the ping you can use a literal ping, or maybe our vshard.storage._call('info'). That way we can test whether the vshard storage is activated, not just bare iproto.

Speaking of the backoff: if you are talking about the built-in backoff of the timeouts, it is not relevant here. It only grows and shrinks the timeout, assuming the replica is alive. It doesn't change the currently used most prioritized replica.
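The alternative described above might be sketched like this. The `fail_count` and `is_faulty` fields and the threshold value are hypothetical, not actual vshard fields:

```lua
-- Sketch of the discussed alternative: after N consecutive failures a
-- replica is skipped for user requests until a background failover ping
-- succeeds, after which it regains its original config priority.
local FAULT_THRESHOLD = 3 -- assumed value, not from vshard

local function on_request_fail(replica)
    replica.fail_count = (replica.fail_count or 0) + 1
    if replica.fail_count >= FAULT_THRESHOLD then
        replica.is_faulty = true -- excluded from user requests
    end
end

-- Called from the failover fiber when its ping succeeds.
local function on_ping_success(replica)
    replica.fail_count = 0
    replica.is_faulty = false -- usable again with its original prio
end
```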

Serpentian added the bug (Something isn't working) label and removed the feature (A new functionality) label on Aug 16, 2024
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 16, 2024
New option for call, callro, callre, callbro, callbre: `request_timeout`.
Note that for callrw (and call with 'write' mode) this option is a no-op.

The option allows specifying the maximum number of seconds a single
request within a call may take. By default it's equal to the `timeout`
option. It must always be <= timeout.

The router's read-only requests are retried when a recoverable error
happens (e.g. WRONG_BUCKET or a timeout). By default, in case of a
timeout error vshard's call exits with an error, as request_timeout =
timeout. However, if request_timeout < timeout, then several requests
will be made within the scope of one call. These requests are balanced
between replicas: within a single call we won't go back to the
non-responding one. Balancing is introduced in the following commit.

May be useful for mission-critical systems, where the number of failed
requests must be minimized.

Part of tarantool#484

NO_DOC=<ticket was already created>
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 16, 2024
This commit introduces balancing for requests where balance mode is not
set. The motivation for this change is that we should do our best to
make the call succeed, which is important for mission-critical systems.

When balance mode is set, replicas are balanced between requests:
consequent calls won't go to the same replica. Balancing is done
according to the round-robin strategy; weights don't affect such
balancing.

However, when balance is false, callro and callre are still balanced
now. But this balancing happens only within the scope of one call;
consequent calls still go first to the same, most prioritized replica.
A retry happens when a network error occurs (e.g. a timeout).

During such balancing the callro method doesn't distinguish between
master and replica. If the master has the highest priority according to
the config, then callro will go to the master first, and only after that
to a replica. The callre method, on the other hand, will first try to
access all replicas; only if all of those requests fail do we fall back
to the master.

Closes tarantool#484

NO_DOC=<ticket was already created>
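A minimal round-robin sketch of the balancing the commit describes (illustrative only: real vshard keeps per-replicaset state, and, as the message says, weights are ignored by this strategy):

```lua
-- Rotate through replicas in order, wrapping around, so consequent
-- attempts never hit the same replica twice in a row.
local replicas = {'replica_1', 'replica_2', 'replica_3'}
local next_idx = 1

local function next_replica()
    local r = replicas[next_idx]
    next_idx = next_idx % #replicas + 1
    return r
end

-- next_replica() yields replica_1, replica_2, replica_3, replica_1, ...
```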