Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: worker not exited when executing quit or reload command #9909

Merged
merged 15 commits into from
Aug 17, 2023
32 changes: 31 additions & 1 deletion apisix/core/config_etcd.lua
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,9 @@ local constants = require("apisix.constants")
local health_check = require("resty.etcd.health_check")
local semaphore = require("ngx.semaphore")
local tablex = require("pl.tablex")
local ngx_thread_spawn = ngx.thread.spawn
local ngx_thread_kill = ngx.thread.kill
local ngx_thread_wait = ngx.thread.wait


local is_http = ngx.config.subsystem == "http"
Expand Down Expand Up @@ -124,7 +127,7 @@ local function produce_res(res, err)
end


local function run_watch(premature)
local function do_run_watch(premature)
if premature then
return
end
Expand Down Expand Up @@ -257,6 +260,30 @@ local function run_watch(premature)
end


local function run_watch(premature)
local run_watch_th = ngx_thread_spawn(do_run_watch, premature)

::restart::
local check_worker_th = ngx_thread_spawn(function ()
while not exiting() do
ngx_sleep(0.1)
end
end)

local ok, err = ngx_thread_wait(check_worker_th)
jiangfucheng marked this conversation as resolved.
Show resolved Hide resolved

if not ok then
log.error("check_worker thread terminates failed, retart checker, error: " .. err)
ngx_thread_kill(check_worker_th)
goto restart
end

ngx_thread_kill(run_watch_th)
-- notify child watchers
produce_res(nil, "worker exited")
end


local function init_watch_ctx(key)
if not watch_ctx then
watch_ctx = {
Expand Down Expand Up @@ -851,6 +878,9 @@ local function _automatic_fetch(premature, self)
.. backoff_duration .. "s")
end
end
elseif err == "worker exited" then
log.info("worker exited.")
return
elseif err ~= "timeout" and err ~= "Key not found"
and self.last_err ~= err then
log.error("failed to fetch data from etcd: ", err, ", ",
Expand Down
38 changes: 38 additions & 0 deletions t/cli/test_cmd.sh
Original file line number Diff line number Diff line change
Expand Up @@ -198,3 +198,41 @@ fi

rm conf/customized_config.yaml
echo "passed: test customized config successful"

# test quit command
bin/apisix start

if ! ps -ef | grep "apisix" | grep "master process" | grep -v "grep"; then
echo "apisix not started"
exit 1
fi

bin/apisix quit

sleep 0.5

if ps -ef | grep "worker process is shutting down" | grep -v "grep"; then
echo "all workers should exited"
exit 1
fi

echo "passed: test quit command successful"

# test reload command
bin/apisix start

if ! ps -ef | grep "apisix" | grep "master process" | grep -v "grep"; then
echo "apisix not started"
exit 1
fi

bin/apisix reload

sleep 0.5

if ps -ef | grep "worker process is shutting down" | grep -v "grep"; then
echo "old workers should exited"
exit 1
fi

echo "passed: test reload command successful"
10 changes: 0 additions & 10 deletions t/node/healthcheck-stop-checker.t
Original file line number Diff line number Diff line change
Expand Up @@ -165,15 +165,6 @@ create new checker: table: 0x
content_by_lua_block {
local t = require("lib.test_admin").test

-- release the clean handler of previous test
local code, _, body = t('/apisix/admin/routes/1', "DELETE")

if code > 300 then
ngx.status = code
ngx.say(body)
return
end

local code, _, body = t('/apisix/admin/upstreams/stopchecker',
"PUT",
[[{"type":"roundrobin","nodes":{"127.0.0.1:1980":1,"127.0.0.1:1981":1},"checks":{"active":{"http_path":"/status","healthy":{"interval":1,"successes":1},"unhealthy":{"interval":1,"http_failures":2}}}}]]
Expand Down Expand Up @@ -256,7 +247,6 @@ ok
--- grep_error_log eval
qr/create new checker: table: 0x|try to release checker: table: 0x/
--- grep_error_log_out
try to release checker: table: 0x
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be remove, the reasons are as follows:

1.healthchekcer will be remove after worker exit since the status store in memroy, that's why TEST 5 will not print try to release checker: table: 0x in first line.
2.Why the try to release checker: table: 0x will be print if we add sleep(1) before worker exited
- Because the log is printed by old worker. Before old worker be killed, the new worker will be created, at this moment, both old worker and new worker can receive events, it's easy to be proved through print debug log before execute fire_all_clean_handlers

            if pre_index then
            local pre_val = self.values[pre_index]
            log.info("sync_data: check pre_val: ", inspect(pre_val.clean_handlers), worker_id_str, " pid: ", ngx.worker.pid())
            if pre_val then
                config_util.fire_all_clean_handlers(pre_val)
            end

logs:

2023/08/08 18:04:31 [info] 77582#1004031: *312 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: { {
f = <function 1>,
id = 1
},
  _id = 2
} worker_id: 0 pid: 77582, context: ngx.timer

2023/08/08 18:04:31 [info] 77620#1004305: *447 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: {} worker_id: 0 pid: 77620, context: ngx.timer

We can see the etcd events be received with two wokrer 0, and there pid is different, we can easliy to confirm the worker is old worker which has check_handlers field

3.Why these test cases can passed in old version(before #9456 be merged)
- Because in the old version, the worker is not exit immediately too, it will exit after make quit/reload about 60s, so the reason is same as above.

create new checker: table: 0x
try to release checker: table: 0x
create new checker: table: 0x
Expand Down
Loading