Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: worker not exited when executing quit or reload command #9909

Merged
merged 15 commits into from
Aug 17, 2023

Conversation

jiangfucheng
Copy link
Member

@jiangfucheng jiangfucheng commented Jul 26, 2023

Description

Fixes #9802

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@monkeyDluffy6017
Copy link
Contributor

How long does the long connection prevent the quit operation?

@kingluo
Copy link
Contributor

kingluo commented Jul 27, 2023

@jiangfucheng Please show your steps to reproduce this issue.

@jiangfucheng
Copy link
Member Author

How long does the long connection prevent the quit operation?

Accoding my test, the worker will never exit.

@jiangfucheng
Copy link
Member Author

@jiangfucheng Please show your steps to reproduce this issue.

  1. make run
  2. make reload or make quit
  3. ps -ef |grep nginx check the worker if exited

apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
@kingluo
Copy link
Contributor

kingluo commented Aug 3, 2023

@jiangfucheng
The code is ok.
Please check the failed ci: t/node/healthcheck-stop-checker.t.

@monkeyDluffy6017 monkeyDluffy6017 added wait for update wait for the author's response in this issue/PR and removed discuss labels Aug 4, 2023
kingluo
kingluo previously approved these changes Aug 7, 2023
@jiangfucheng jiangfucheng marked this pull request as ready for review August 8, 2023 12:50
@@ -256,7 +247,6 @@ ok
--- grep_error_log eval
qr/create new checker: table: 0x|try to release checker: table: 0x/
--- grep_error_log_out
try to release checker: table: 0x
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be remove, the reasons are as follows:

1.healthchekcer will be remove after worker exit since the status store in memroy, that's why TEST 5 will not print try to release checker: table: 0x in first line.
2.Why the try to release checker: table: 0x will be print if we add sleep(1) before worker exited
- Because the log is printed by old worker. Before old worker be killed, the new worker will be created, at this moment, both old worker and new worker can receive events, it's easy to be proved through print debug log before execute fire_all_clean_handlers

            if pre_index then
            local pre_val = self.values[pre_index]
            log.info("sync_data: check pre_val: ", inspect(pre_val.clean_handlers), worker_id_str, " pid: ", ngx.worker.pid())
            if pre_val then
                config_util.fire_all_clean_handlers(pre_val)
            end

logs:

2023/08/08 18:04:31 [info] 77582#1004031: *312 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: { {
f = <function 1>,
id = 1
},
  _id = 2
} worker_id: 0 pid: 77582, context: ngx.timer

2023/08/08 18:04:31 [info] 77620#1004305: *447 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: {} worker_id: 0 pid: 77620, context: ngx.timer

We can see the etcd events be received with two wokrer 0, and there pid is different, we can easliy to confirm the worker is old worker which has check_handlers field

3.Why these test cases can passed in old version(before #9456 be merged)
- Because in the old version, the worker is not exit immediately too, it will exit after make quit/reload about 60s, so the reason is same as above.

@monkeyDluffy6017 monkeyDluffy6017 added approved and removed wait for update wait for the author's response in this issue/PR labels Aug 14, 2023
@@ -257,6 +260,30 @@ local function run_watch(premature)
end


local function run_watch(premature)
::restart::
Copy link
Contributor

@kingluo kingluo Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! The watch routine must not be restarted!
Because it's a stateful routine (e.g. last watch starts revision) and should start once and notify all child watchers that it started.

@jiangfucheng
Instead, you just wait for check_worker_th, if it exits, it means the worker process is exiting, then kill run_watch_th, and that's it.
Please make a change again on the code, thank you very much!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the thread do_run_watch crashes ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the thread do_run_watch crashes?

It's not supposed to crash.
In fact, almost all timers did not have a crash guard.
And even if run_watch crashes, you cannot fix it by restarting it, 'cause it's a stateful routine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, restarting it in this scenario is an over-design.

Copy link
Contributor

@monkeyDluffy6017 monkeyDluffy6017 Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @kingluo, restart is unnecessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thank you for your review.❤️

@kingluo
Copy link
Contributor

kingluo commented Aug 14, 2023

@jiangfucheng The code is ok. I will approve it after the ci finishes successfully.

@jiangfucheng
Copy link
Member Author

@monkeyDluffy6017 @kingluo Please take a look again, thanks.❤️

@moonming
Copy link
Member

@kingluo please take a look again

@moonming moonming merged commit 5d6bde2 into apache:master Aug 17, 2023
31 checks passed
hongbinhsu pushed a commit to fitphp/apix that referenced this pull request Aug 22, 2023
* upstream/master: (77 commits)
  docs: Update admin-api.md (apache#10056)
  ci: fix a bug that can not open nginx.pid (apache#10061)
  feat: remove rust dependency by rollback lua-resty-ldap on master (apache#9936)
  docs: fix grpc-transcode.md error (apache#10059)
  feat: upgrade lua dependencies (apache#10051)
  fix: rollback lua-resty-session to 3.10 (apache#10046)
  feat: upgrade resty-redis-cluster from  1.02-4->1.05-1 (apache#10041)
  feat: update lua library (apache#10037)
  fix: worker not exited when executing quit or reload command (apache#9909)
  fix: traffic split plugin not validating upstream_id (apache#10008)
  ci: update the timeout value in cli.yml (apache#10026)
  fix(tencent-cloud-cls): DNS parsing failure (apache#9843)
  chore(deps): bump actions/setup-node from 3.7.0 to 3.8.0 (apache#10025)
  feat(openid-connect): add proxy_opts attribute (apache#9948)
  perf(log-rotate): replace string.sub with string.byte (apache#9984)
  fix(ci): replace github action in update-labels.yml (apache#9987)
  fix: can't sync etcd data if key has special character (apache#9967)
  perf(aws-lambda): cache the index of the array (apache#9944)
  fix: add support for dependency installation on endeavouros (apache#9985)
  chore(ci): automate management of unresponded issues (apache#9927)
  ...
@jiangfucheng jiangfucheng deleted the worker_cant_exit branch August 23, 2023 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: APISIX executes reload, but the old process does not exit for a long time
4 participants