SIGUSR2: zero downtime restart #4624

Merged
merged 1 commit into from
Nov 28, 2024

daipom
Contributor

@daipom daipom commented Aug 30, 2024

Which issue(s) this PR fixes:

What this PR does / why we need it:
This replaces the current SIGUSR2 (#2716) with the new feature.
(Not supported on Windows).

  • Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as in_udp.
(Please see #4622.)

Specification:

  • 2 ways to trigger this feature (non-Windows):
    • Signal: SIGUSR2 to the supervisor.
    • RPC: /api/processes.zeroDowntimeRestart
      • Leave /api/config.gracefulReload for the traditional feature.
  • This starts the new supervisor and workers with zero downtime for some plugins.
    • Input plugins that support zero_downtime_restart work in parallel.
      • Supported input plugins:
        • in_tcp
        • in_udp
        • in_syslog
    • The old processes stop after 10s.
  • The new supervisor works in source-only mode (Add with-source-only feature #4661) until the old processes stop.
    • After the old processes stop, the data handled by the new processes are loaded and processed.
    • If needed, you can configure source_only_buffer (see Add with-source-only feature #4661).
  • Windows: Not affected at all. Remains the traditional GracefulReload.
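
For illustration, the two triggers could be driven from a small Ruby script, roughly as sketched below. The helper names and the RPC address `127.0.0.1:24444` are assumptions made for this example (the RPC server has to be enabled with `rpc_endpoint` in `<system>`); only the signal name and the API path come from the specification above.

```ruby
require "net/http"
require "uri"

# Path of the RPC API named in the specification above.
ZERO_DOWNTIME_RESTART_PATH = "/api/processes.zeroDowntimeRestart"

# Build the RPC URL. The default host and port are assumptions for this
# sketch; use whatever your rpc_endpoint setting actually is.
def zero_downtime_restart_uri(host = "127.0.0.1", port = 24444)
  URI::HTTP.build(host: host, port: port, path: ZERO_DOWNTIME_RESTART_PATH)
end

# Trigger via signal: send SIGUSR2 to the supervisor process (non-Windows).
# supervisor_pid is whatever PID your supervisor runs under.
def trigger_by_signal(supervisor_pid)
  Process.kill("USR2", supervisor_pid)
end

# Trigger via RPC: a plain HTTP GET against the RPC endpoint.
def trigger_by_rpc(uri = zero_downtime_restart_uri)
  Net::HTTP.get_response(uri)
end
```

Either `trigger_by_signal(pid)` or `trigger_by_rpc` would be called once per restart; the sketch does not wait for the old processes to stop.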

Mechanism

  1. The supervisor receives SIGUSR2.
  2. Spawn a new supervisor.
  3. Take over shared sockets.
  4. Launch new workers, and stop old processes in parallel.
    • Launch new workers with source-only mode
      • Limited to input plugins with zero_downtime_restart_ready? enabled
    • Send SIGTERM to the old supervisor after a 10s delay from step 3.
  5. The old supervisor stops and sends SIGWINCH to the new one.
  6. The new workers run fully.


Needs following:

Conditions under which zero_downtime_restart_ready? can be enabled:

  • Must be able to work in parallel with another Fluentd instance.
  • Notes:
    • The sockets provided by server helper are shared with the new Fluentd instance.
    • Input plugins managing a position such as in_tail should not enable its zero_downtime_restart_ready?.
      • Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
    • in_http and in_forward could also be supported. They are not supported this time simply because there was not enough time to consider them.
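
As a rough sketch of how the opt-in could look from a plugin's point of view: the classes below are hypothetical stand-ins, not Fluentd's real plugin classes, and the default of `false` in the base class is an assumption based on the wording "can be enabled". Only the method name `zero_downtime_restart_ready?` comes from this PR.

```ruby
# Hypothetical base class standing in for an input plugin.
class SketchInput
  # Assumed default: a plugin does not opt in.
  def zero_downtime_restart_ready?
    false
  end
end

# Server-based input (in the spirit of in_udp): its socket is shared with
# the new instance, so it can work in parallel and opts in.
class SketchUdpInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# Position-managing input (in the spirit of in_tail): it loses no data on
# restart, so it keeps the default and stays opted out.
class SketchTailInput < SketchInput
end

# Select the inputs that would run in the new process's source-only mode.
def source_only_inputs(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end
```

With these stand-ins, `source_only_inputs([SketchUdpInput.new, SketchTailInput.new])` keeps only the UDP-like input.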

The appropriateness of replacing the traditional SIGUSR2

There are the following reasons:

ETC

Docs Changes:
TODO

Release Note:

  • Add zero-downtime-restart feature for non-Windows (USR2 signal and /api/processes.zeroDowntimeRestart RPC API)

TODO:

  • Some implementation TODOs referred to in code comments.
  • Tests
  • Document

@daipom daipom changed the title Restart without downtime Update/Restart without downtime Aug 30, 2024
@daipom daipom changed the title Update/Restart without downtime Update/Reload without downtime Aug 30, 2024
@daipom daipom self-assigned this Aug 30, 2024
@daipom daipom force-pushed the restart-without-downtime branch from 41fd042 to d0f31e8 Compare October 3, 2024 01:58
@daipom daipom force-pushed the restart-without-downtime branch from d0f31e8 to 630f809 Compare October 11, 2024 00:59
@daipom
Contributor Author

daipom commented Oct 11, 2024

The basic implementation is done.
Some concept of #4654 is reflected. Thanks @Watson1978!

@daipom daipom force-pushed the restart-without-downtime branch from 630f809 to 1cbbd9a Compare October 11, 2024 01:12
@daipom daipom added this to the v1.18.0 milestone Oct 11, 2024
@daipom daipom force-pushed the restart-without-downtime branch from 1cbbd9a to f8755d0 Compare October 31, 2024 03:06
daipom added a commit to fluent/fluent-package-builder that referenced this pull request Oct 31, 2024
Use this to test this feature.

* fluent/fluentd#4624

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
kenhys pushed a commit to fluent/fluent-package-builder that referenced this pull request Nov 5, 2024
Use this to test this feature.

* fluent/fluentd#4624

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
kenhys pushed a commit to fluent/fluent-package-builder that referenced this pull request Nov 19, 2024
Use this to test this feature.

* fluent/fluentd#4624

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom daipom force-pushed the restart-without-downtime branch 10 times, most recently from 1a7f2e0 to feda2ea Compare November 25, 2024 10:12
@daipom daipom changed the title Update/Reload without downtime GracefulReload(SIGUSR2): Restart new process with zero downtime Nov 25, 2024
@daipom daipom force-pushed the restart-without-downtime branch 2 times, most recently from 9a2e188 to 873bf29 Compare November 25, 2024 16:08
@daipom daipom marked this pull request as ready for review November 25, 2024 16:13
@daipom daipom requested a review from Watson1978 November 26, 2024 01:48
@daipom daipom changed the title GracefulReload(SIGUSR2): Restart new process with zero downtime SIGUSR2: Restart new process with zero downtime Nov 26, 2024
@daipom daipom force-pushed the restart-without-downtime branch 2 times, most recently from 049b9f7 to c714d9c Compare November 26, 2024 03:14
@daipom daipom changed the title SIGUSR2: Restart new process with zero downtime SIGUSR2: zero downtime restart Nov 26, 2024
@daipom daipom requested a review from kenhys November 26, 2024 03:26
@daipom daipom force-pushed the restart-without-downtime branch 4 times, most recently from 1b52de8 to d7e68db Compare November 27, 2024 02:53
@daipom
Contributor Author

daipom commented Nov 27, 2024

Thanks for your review!

@kenhys
Contributor

kenhys commented Nov 27, 2024

During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?

@daipom daipom force-pushed the restart-without-downtime branch from d7e68db to e52d3bd Compare November 27, 2024 03:12
@daipom
Contributor Author

daipom commented Nov 27, 2024

During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?

Yes.
The old Fluentd should continue to work as is until it receives SIGTERM at step 4.
(Even if the new Fluentd does not work as expected).

The new Fluentd RPC starts at step 5, so there is no conflict.

If the old Fluentd receives /api/processes.killWorkers, it just causes a quick transition to step 5.

@daipom daipom force-pushed the restart-without-downtime branch from e52d3bd to 8e09c09 Compare November 27, 2024 03:53
This replaces the current `SIGUSR2` (#2716) with the new feature.
(Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd
without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional
      GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime
  for some plugins.
  * Input plugins that support `zero_downtime_restart` work in
    parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661)
  until the old processes stop.
  * After the old processes stop, the data handled by the new
    processes are loaded and processed.
  * If needed, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional
  GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limited to input plugins with zero_downtime_restart_ready?
       enabled
   * Send SIGTERM to the old supervisor after a 10s delay from
     step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the
    new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should
    not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so
      there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported.
    They are not supported this time simply because there was
    not enough time to consider them.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because applying it
       requires restarting (kill/spawn) the process.
    2. Plugins must not use class variables when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and without
  such limitations.
  * Although supported plugins are limited, that is not a
    problem for many plugins.
    (The problem is with server-based input plugins where the
    stop results in data loss).
* This new feature has a big advantage that it can also be used
  to update Fluentd.
  * In the future, fluent-package will use this feature to allow
    update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or
  directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <fujita@clear-code.com>
Co-authored-by: Kentaro Hayashi <hayashi@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom daipom force-pushed the restart-without-downtime branch from 8e09c09 to d7164dd Compare November 27, 2024 04:36
Contributor

@kenhys kenhys left a comment


LGTM.

@daipom daipom merged commit d102527 into master Nov 28, 2024
17 of 18 checks passed
@daipom daipom deleted the restart-without-downtime branch November 28, 2024 04:48
@daipom
Contributor Author

daipom commented Nov 28, 2024

Thanks for your review!

daipom added a commit to daipom/fluent-package-builder that referenced this pull request Nov 29, 2024
Use this to test this feature.

* fluent/fluentd#4624

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Development

Successfully merging this pull request may close these issues.

Update/Reload without downtime
3 participants