The case when storage and router iproto fiber is cancelled #341

filonenko-mikhail · 2022-06-23T13:33:20Z

Hello

There is a case where something happens and a storage fiber gets cancelled (e.g. by cartridge hot reload or any other fiber killer).

A snippet that reproduces the problem:

vshard = require('vshard')
netbox = require('net.box')

cfg = {
    memtx_memory = 100 * 1024 * 1024,
    bucket_count = 3,
    rebalancer_disbalance_threshold = 10,
    rebalancer_max_receiving = 100,
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                    master = true
                },
            },
        },
    },
}

vshard.storage.cfg(cfg, '8a274925-a26d-47fc-9e1b-af88ce939412')
box.schema.user.grant('storage', 'super', nil, nil, {if_not_exists=true})

vshard.router.cfg(cfg)
vshard.router.bootstrap()

local log = require('log')
local fiber = require('fiber')
rc, err = vshard.router.callrw(1, 'box.info')
assert(rc ~= nil)
log.info(rc)
--log.info(fiber.info())

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('before netbox call')
log.info(c:call('box.info'))

for id, f in pairs(fiber.info()) do 
    if f.name:endswith('(net.box)') then
        fiber.kill(fiber.find(id))
    end
end

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)
log.info(rc)

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)

log.info('after netbox call')
-- Pass the function name as a string, as in the c:call('box.info') calls above.
local rc, res, err = pcall(c.call, c, 'box.info')
if rc ~= true then
    log.info(res)
end

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('after netbox call with reloaded connection')
log.info(c:call('box.info'))

package.loaded['vshard'] = nil
local vshard = require('vshard')
rc, err = vshard.router.callrw(1, 'box.info')
log.info(rc)
assert(rc ~= nil, tostring(err))

require('console').start() os.exit(0)

The question is: how can the netbox connection under vshard.router be restarted? Or could this be handled on the vshard side?

Serpentian · 2022-07-11T11:07:23Z

Actually, the router and storage are not reloaded when we do something like this:

package.loaded['vshard'] = nil
local vshard = require('vshard')

Since users expect everything to be reloaded, I suppose we should implement an atomic reload of the whole vshard.

As for restoring fibers after they have been killed explicitly, we can do that in replicaset.rebind_replicasets. This will restore the connection when the router is reloaded. The other solution is to add a check for whether the connection's fiber is dead right here:
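One plausible reason for this behaviour, assuming vshard's top-level module simply requires its submodules, is that `require` caches every module separately in `package.loaded`. Clearing only the parent entry re-runs the parent loader, while the submodules are still served from the cache. The sketch below illustrates this in plain Lua; the module names `demo` and `demo.child` are hypothetical stand-ins, not vshard modules.

```lua
-- Count how many times each (hypothetical) module loader runs.
local loads = {parent = 0, child = 0}

-- Stand-in for a submodule such as a router or storage module.
package.preload['demo.child'] = function()
    loads.child = loads.child + 1
    return {generation = loads.child}
end

-- Stand-in for a top-level module that only requires its submodule.
package.preload['demo'] = function()
    loads.parent = loads.parent + 1
    return {child = require('demo.child')}
end

local demo = require('demo')

-- Drop only the parent from the cache and require it again.
package.loaded['demo'] = nil
demo = require('demo')

-- The parent loader ran twice, but the child was served from
-- package.loaded both times, so its state is unchanged.
print(loads.parent, loads.child, demo.child.generation)
```

Under that assumption, a full reload would have to clear every submodule entry as well, which is one way to read the call for an atomic reload of the whole vshard.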

vshard/vshard/replicaset.lua

Lines 173 to 177 in dd70cfb

if not conn or conn.state == 'closed' then
    conn = netbox.connect(replica.uri, {
        reconnect_after = consts.RECONNECT_TIMEOUT,
        wait_connected = false
    })

Since this method is invoked in replicaset_master_call, the fibers will be restored too.

Gerold103 · 2022-07-11T19:40:22Z

Most of the replicaset methods like rebind_replicasets() are internal; people shouldn't use them in their code. A proper fix has two parts. 1) Make the core netbox report its worker fiber state as closed if the fiber is cancelled. I suspect it might currently be reported as error_reconnect or something similar, which is misleading, because it is not reconnecting anymore. Alternatively, make netbox spawn a new fiber if the current one is cancelled. 2) replicaset_connect_to_replica() can check whether the state == error_reconnect (or whatever the name is) and then also check the fiber state somehow (I don't know whether the worker fiber state is reachable at all). If the fiber is dead or cancelled, create a new connection. Users shouldn't need to bother with that.
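The second part of that proposal can be sketched as a small decision function. This is only an illustration under stated assumptions: `should_recreate_connection` is a hypothetical helper, not part of vshard or net.box, and the `worker_dead` flag is supplied by the caller because the thread leaves open how the worker fiber state can be read from a real connection. The 'error' state is included because the commit messages in this thread mention it for some tarantool versions.

```lua
-- Hypothetical helper, not part of vshard or net.box.
-- conn: a table with a `state` field, shaped like a net.box connection.
-- worker_dead: boolean provided by the caller (assumption: the caller has
-- some way to learn that the connection's worker fiber is dead).
local function should_recreate_connection(conn, worker_dead)
    -- Existing condition from replicaset.lua: no connection or a closed one.
    if conn == nil or conn.state == 'closed' then
        return true
    end
    -- Proposed extra condition: the state claims a reconnect is pending
    -- (or reports an error), but the worker fiber that would do it is gone.
    local stuck = conn.state == 'error_reconnect' or conn.state == 'error'
    return stuck and worker_dead == true
end

-- Usage with stub connection tables:
print(should_recreate_connection({state = 'error_reconnect'}, true))
print(should_recreate_connection({state = 'active'}, true))
```

Keeping the fiber check as a separate input keeps the sketch honest about what is unknown: only the state names come from the discussion, while the way to detect a dead worker is left to the caller.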

Currently, if we kill net.box's fibers, the connection goes into the `error_reconnect` state. However, it is not reconnecting anymore. This patch introduces reconnecting in that case. It should be used wisely, though. Killing a fiber doesn't happen instantly: if the user doesn't wait until the fiber's status is `dead` and makes a request immediately, an exception will probably be thrown, as the fiber can die in the middle of the request. So, after killing a fiber, wait until it's really dead and only then make a request. Closes tarantool#341
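The advice to wait until the killed fiber is really dead can be sketched as a polling helper. `wait_fiber_dead` is a hypothetical name, and the sleep function is injected so the helper can also be exercised outside Tarantool; under Tarantool one would presumably pass `fiber.sleep` and a fiber object obtained from `fiber.find(id)`.

```lua
-- Hypothetical helper: polls f:status() until it reports 'dead' or the
-- attempts run out. Returns true if the fiber was seen dead.
-- f: any object with a status() method (a Tarantool fiber object has one).
-- sleep: function(seconds); under Tarantool pass fiber.sleep.
local function wait_fiber_dead(f, sleep, attempts, step)
    attempts = attempts or 100
    step = step or 0.01
    for _ = 1, attempts do
        if f:status() == 'dead' then
            return true
        end
        sleep(step)
    end
    return f:status() == 'dead'
end

-- Stub fiber for illustration: reports 'dead' on the third status check.
local checks = 0
local stub_fiber = {
    status = function()
        checks = checks + 1
        return checks >= 3 and 'dead' or 'suspended'
    end,
}
local naps = 0
local died = wait_fiber_dead(stub_fiber, function() naps = naps + 1 end, 10, 0)
print(died, naps)
```

In the reproducer from this issue, such a helper would be called after `fiber.kill(...)` and before the next `vshard.router.callrw(...)`, so that the request is not issued while the worker is still dying.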

Currently, if we kill the worker fiber of a connection that was initialized with the 'reconnect_after' option, the connection goes into the 'error_reconnect' or 'error' state (depending on the tarantool version). Reconnecting doesn't happen in either case, and the only way for the user to return the router to working order is reloading or manually restoring the connections. This patch introduces reconnecting in that case. It should be used wisely, though. Killing a fiber doesn't happen instantly: if the user doesn't wait until the fiber's status is 'dead' and makes a request immediately, an exception will probably be thrown, as the fiber can die in the middle of the request. So, after killing a fiber, wait until it's really dead and only then make a request. Closes tarantool#341

Currently, if we kill the worker fiber of a connection that was initialized with the 'reconnect_after' option, the connection goes into the 'error_reconnect' or 'error' state (depending on the tarantool version). Reconnecting doesn't happen in either case, and the only way for the user to return the router to working order is reloading or manually restoring the connections. This patch introduces reconnecting in that case. It should be used wisely, though. Killing a fiber doesn't happen instantly: if the user doesn't wait until the fiber's status is 'dead' and makes a request immediately, an exception will probably be thrown, as the fiber can die in the middle of the request. However, starting from 2.10.1 the worker fiber is invincible and cannot be killed at all. Closes tarantool#341

Currently, if we kill the worker fiber of a connection that was initialized with the 'reconnect_after' option, the connection goes into the 'error_reconnect' or 'error' state (depending on the tarantool version). Reconnecting doesn't happen in either case, and the only way for the user to return the router to working order is reloading or manually restoring the connections. This patch introduces reconnecting in that case. It should be used wisely, though. Killing a fiber doesn't happen instantly: if the user doesn't wait until the fiber's status is 'dead' and makes a request immediately, an exception will probably be thrown, as the fiber can die in the middle of the request. However, reconnecting doesn't happen automatically in tarantool 2.10.0, as there's no way to determine whether the fiber is dead other than checking the error message of the connection, which is not good practice because this check can easily produce false positives. Closes tarantool#341

The problem is that the recovery fiber wakes up earlier than we want it to. This leads to test output that we don't expect. Let's block the recovery fiber before making any changes to `_bucket`. It'll start again as soon as the instance is restarted. Needed for tarantool#341

Currently, if we kill the worker fiber of a connection that was initialized with the 'reconnect_after' option, the connection goes into the 'error_reconnect' or 'error' state (depending on the tarantool version). Reconnecting doesn't happen in either case, and the only way for the user to return the router to working order is reloading or manually restoring the connections. This patch introduces reconnecting in that case. It should be used wisely, though. Killing a fiber doesn't happen instantly: if the user doesn't wait until the fiber's status is 'dead' and makes a request immediately, an exception will probably be thrown, as the fiber can die in the middle of the request. Closes tarantool#341

filonenko-mikhail mentioned this issue Jun 23, 2022

HotReload error in 2.7.4 tarantool/cartridge#1835

Closed

sergos added teamS Scaling 1sp labels Jul 8, 2022

Serpentian mentioned this issue Jul 8, 2022

flaky test: rebalancer/stress_add_remove_rs.test.lua (on memtx) hangs #309

Closed

Serpentian self-assigned this Jul 11, 2022

Serpentian mentioned this issue Jul 14, 2022

replicaset: reconnect after fiber kill #356

Merged

Gerold103 mentioned this issue Jul 20, 2022

net.box worker fiber death leaves "error_reconnect" state but does not reconnect tarantool/tarantool#7448

Closed

kyukhin removed the 1sp label Aug 3, 2022

Gerold103 closed this as completed in #356 Aug 15, 2022
