The case when storage and router iproto fiber is cancelled #341

Closed
filonenko-mikhail opened this issue Jun 23, 2022 · 2 comments · Fixed by #356
Labels: teamS Scaling

@filonenko-mikhail

Hello

There is a case where something happens and the storage fiber is cancelled (e.g. by a cartridge hot reload or any other fiber killer).

A snippet that reproduces the problem:

vshard = require('vshard')
netbox = require('net.box')

cfg = {
    memtx_memory = 100 * 1024 * 1024,
    bucket_count = 3,
    rebalancer_disbalance_threshold = 10,
    rebalancer_max_receiving = 100,
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                    master = true
                },
            },
        },
    },
}

vshard.storage.cfg(cfg, '8a274925-a26d-47fc-9e1b-af88ce939412')
box.schema.user.grant('storage', 'super', nil, nil, {if_not_exists=true})

vshard.router.cfg(cfg)
vshard.router.bootstrap()

local log = require('log')
local fiber = require('fiber')
rc, err = vshard.router.callrw(1, 'box.info')
assert(rc ~= nil)
log.info(rc)
--log.info(fiber.info())

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('before netbox call')
log.info(c:call('box.info'))

for id, f in pairs(fiber.info()) do 
    if f.name:endswith('(net.box)') then
        fiber.kill(fiber.find(id))
    end
end

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)
log.info(rc)

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)

log.info('after netbox call')
-- The worker fiber of this connection was killed above, so the call is expected to fail.
local ok, res = pcall(c.call, c, 'box.info')
if not ok then
    log.info(res)
end

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('after netbox call with reloaded connection')
log.info(c:call('box.info'))

package.loaded['vshard'] = nil
local vshard = require('vshard')
rc, err = vshard.router.callrw(1, 'box.info')
log.info(rc)
assert(rc ~= nil, tostring(err))

require('console').start() os.exit(0)

The question is: how can the netbox connection under vshard.router be restarted? Or could this be handled on the vshard side?

@Serpentian
Contributor

Actually, router and storage are not reloaded when we do something like this:

package.loaded['vshard'] = nil
local vshard = require('vshard')

Since users expect everything to be reloaded, I suppose we should implement an atomic reload of the whole vshard.

As for restoring fibers after they are explicitly killed, we can do that in replicaset.rebind_replicasets. This will restore the connection when the router is reloaded. The other solution is to add a check for whether the connection's fiber is dead right here:

if not conn or conn.state == 'closed' then
    conn = netbox.connect(replica.uri, {
        reconnect_after = consts.RECONNECT_TIMEOUT,
        wait_connected = false
    })

Since this method is invoked in replicaset_master_call, the fibers will be restored too.

Serpentian self-assigned this Jul 11, 2022

@Gerold103
Collaborator

Most replicaset methods, like rebind_replicasets(), are internal; people shouldn't use them in their code. A proper fix is: 1) make the core netbox report the connection state as closed if its worker fiber is cancelled. I suspect it might currently be reported as error_reconnect or something similar, which is misleading, since it is not reconnecting anymore. Alternatively, make netbox spawn a new fiber if the current one is cancelled. 2) replicaset_connect_to_replica() can check whether state == error_reconnect (or whatever the name is) and then also check the fiber state somehow (I don't know if the worker fiber state is reachable at all); if it is dead/cancelled, create a new connection. Users shouldn't need to bother with that.
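
The decision described in this comment might look roughly like the sketch below. It is not vshard's actual implementation: `should_reconnect` and the `worker_is_dead` callback are hypothetical names introduced for illustration, and whether the worker fiber's state can be reached at all is exactly the open question raised above. The state names are the net.box connection states mentioned in this thread.

```lua
-- Hypothetical helper: decide whether a cached connection must be replaced.
-- `conn` is a net.box-like table with a `state` field (it may be nil).
-- `worker_is_dead` is an assumed callback reporting whether the connection's
-- worker fiber has been cancelled; net.box may not expose such a check.
local function should_reconnect(conn, worker_is_dead)
    if conn == nil or conn.state == 'closed' then
        -- There is no usable connection at all: always create a new one.
        return true
    end
    if conn.state == 'error_reconnect' or conn.state == 'error' then
        -- The state reports an error, but if the worker fiber is dead nobody
        -- will ever retry, so the connection has to be replaced.
        return worker_is_dead(conn)
    end
    -- Any other state is left alone.
    return false
end
```

Passing the liveness check in as a callback keeps the sketch independent of how (or whether) a given tarantool version lets the caller inspect the worker fiber.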

Serpentian added a commit to Serpentian/vshard that referenced this issue Jul 14, 2022
Currently, if we kill net.box's fibers, the connection goes into the
`error_reconnect` state. However, it's not reconnecting anymore.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is `dead` and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

So, after killing a fiber, wait until it's really dead and make a
request only after that.

Closes tarantool#341
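
The advice in the commit message above, waiting until a killed fiber is really dead before making a request, could be sketched as follows. `wait_until_dead` is a hypothetical helper, not part of vshard or tarantool. It only assumes that the fiber object provides `status()`, as tarantool fiber objects do, and it takes the sleep and clock functions as parameters.

```lua
-- Hypothetical helper: poll until a fiber reports the 'dead' status.
-- `f` must provide f:status(), as tarantool fiber objects do.
-- `sleep` and `clock` are injected; in tarantool they would typically be
-- fiber.sleep and fiber.clock.
local function wait_until_dead(f, timeout, sleep, clock)
    local deadline = clock() + timeout
    while f:status() ~= 'dead' do
        if clock() >= deadline then
            -- Gave up: the fiber is still alive after `timeout` seconds.
            return false
        end
        sleep(0.01)
    end
    return true
end
```

In the reproducer from this issue, this would mean calling `fiber.kill(f)` followed by something like `wait_until_dead(f, 1, fiber.sleep, fiber.clock)` before the next `vshard.router.callrw`.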
Serpentian added a commit to Serpentian/vshard that referenced this issue Jul 29, 2022
Currently, if we kill the worker fiber of a connection that was
initialized with the 'reconnect_after' option, the connection goes into
the 'error_reconnect' or 'error' state (depending on the tarantool
version). Reconnecting doesn't happen in either case, and the only way
for the user to return the router to working order is reloading or
manually restoring the connections.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is 'dead' and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

So, after killing a fiber, wait until it's really dead and make a
request only after that.

Closes tarantool#341
Serpentian added a commit to Serpentian/vshard that referenced this issue Jul 30, 2022
Currently, if we kill the worker fiber of a connection that was
initialized with the 'reconnect_after' option, the connection goes into
the 'error_reconnect' or 'error' state (depending on the tarantool
version). Reconnecting doesn't happen in either case, and the only way
for the user to return the router to working order is reloading or
manually restoring the connections.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is 'dead' and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

However, starting from 2.10.1, the worker fiber is invincible and cannot
be killed at all.

Closes tarantool#341
kyukhin removed the 1sp label Aug 3, 2022
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 11, 2022
Currently, if we kill the worker fiber of a connection that was
initialized with the 'reconnect_after' option, the connection goes into
the 'error_reconnect' or 'error' state (depending on the tarantool
version). Reconnecting doesn't happen in either case, and the only way
for the user to return the router to working order is reloading or
manually restoring the connections.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is 'dead' and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

However, reconnecting doesn't happen automatically in tarantool 2.10.0,
as there's no way to determine whether the fiber is dead other than
checking the error message of the connection, which is not good practice,
as this check can easily produce false positives.

Closes tarantool#341
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 11, 2022
The problem is that the recovery fiber wakes up earlier than we want it
to. This leads to test output that we don't expect.

Let's block the recovery fiber before making any changes to `_bucket`.
It'll start again as soon as the instance is restarted.

Needed for tarantool#341
Serpentian added a commit to Serpentian/vshard that referenced this issue Aug 13, 2022
Currently, if we kill the worker fiber of a connection that was
initialized with the 'reconnect_after' option, the connection goes into
the 'error_reconnect' or 'error' state (depending on the tarantool
version). Reconnecting doesn't happen in either case, and the only way
for the user to return the router to working order is reloading or
manually restoring the connections.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is 'dead' and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

Closes tarantool#341
Gerold103 pushed a commit that referenced this issue Aug 15, 2022
The problem is that the recovery fiber wakes up earlier than we want it
to. This leads to test output that we don't expect.

Let's block the recovery fiber before making any changes to `_bucket`.
It'll start again as soon as the instance is restarted.

Needed for #341
Gerold103 pushed a commit that referenced this issue Aug 15, 2022
Currently, if we kill the worker fiber of a connection that was
initialized with the 'reconnect_after' option, the connection goes into
the 'error_reconnect' or 'error' state (depending on the tarantool
version). Reconnecting doesn't happen in either case, and the only way
for the user to return the router to working order is reloading or
manually restoring the connections.

This patch introduces reconnecting in that case. It should be used
wisely, though. Killing a fiber doesn't happen instantly, and if the
user doesn't wait until the fiber's status is 'dead' and makes the
request immediately, an exception will probably be thrown, as the fiber
can die in the middle of the request.

Closes #341