Update/Reload without downtime #4622

Closed
daipom opened this issue Aug 30, 2024 · 0 comments · Fixed by #4624
daipom commented Aug 30, 2024

Is your feature request related to a problem? Please describe.

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as in_udp, in_tcp, and in_syslog, cannot receive data during this time.
This means that the data sent by a client is lost during this time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in some cases.

Describe the solution you'd like

Add a new feature: Update/Reload without downtime.

For example, implement a mechanism similar to nginx's feature for upgrading on the fly.

The main problem is that two Fluentd instances cannot run in parallel with the same config.
(They would conflict over shared resources, such as buffer files.)

Because of this problem, it is very difficult to support all plugins.
However, it is possible to support only plugins that can run in parallel.

Based on the above, the following mechanism would be a good way to achieve this.

  1. The current supervisor receives a signal.
  2. The current supervisor sends signals to its workers, and the workers stop all plugins that cannot run in parallel.
  3. The current supervisor starts a new supervisor.
    • => Old processes and new processes run in parallel.
  4. After the new supervisor and its workers start to work, the current supervisor and its workers stop.
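
The ordering of the four steps can be checked with a small model. This is only a sketch: `Supervisor`, `Plugin`, `handover`, and the `parallel_safe` flag are hypothetical names invented for illustration, not Fluentd APIs, and `out_file_buffered` stands in for any plugin that cannot run in two processes at once.

```ruby
# Hypothetical model of the proposed handover; none of these names exist in Fluentd.
Plugin = Struct.new(:name, :parallel_safe, :running)

class Supervisor
  attr_reader :label

  def initialize(label, plugin_specs)
    @label = label
    @plugins = plugin_specs.map { |name, safe| Plugin.new(name, safe, false) }
  end

  def start_all
    @plugins.each { |p| p.running = true }
  end

  # Step 2: stop every plugin that must not run in two processes at once.
  def stop_non_parallel
    @plugins.each { |p| p.running = false unless p.parallel_safe }
  end

  def stop_all
    @plugins.each { |p| p.running = false }
  end

  def running_names
    @plugins.select(&:running).map(&:name)
  end
end

# Plugin name => whether it may run in the old and new process at the same time.
SPECS = { "in_udp" => true, "in_tcp" => true, "out_file_buffered" => false }.freeze

# Runs the four steps and records which plugins run in which process after each one.
def handover(old_sup, specs)
  timeline = []
  record = lambda do |step, sups|
    timeline << [step, sups.to_h { |s| [s.label, s.running_names] }]
  end

  record.call(1, [old_sup])             # 1. the current supervisor receives a signal
  old_sup.stop_non_parallel             # 2. stop plugins that cannot run in parallel
  record.call(2, [old_sup])
  new_sup = Supervisor.new(:new, specs) # 3. start the new supervisor
  new_sup.start_all
  record.call(3, [old_sup, new_sup])
  old_sup.stop_all                      # 4. the old supervisor and its workers stop
  record.call(4, [old_sup, new_sup])
  [new_sup, timeline]
end

old_sup = Supervisor.new(:old, SPECS)
old_sup.start_all
_new_sup, timeline = handover(old_sup, SPECS)
timeline.each { |step, state| puts "step #{step}: #{state}" }
```

In this model, `in_udp` and `in_tcp` are running in at least one process after every step, and the non-parallel plugin never runs in both processes at once. That is the property the ordering is meant to give.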

More specifically, it would be better to run only limited Input plugins in parallel, such as in_tcp, in_udp, and in_syslog.
Stop all plugins except those Input plugins, and prepare a dedicated file buffer for Output.
After the new workers start, they load the file buffer and route those events to the @ROOT label.
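
The dedicated buffer could look roughly like the following. This is a minimal sketch, assuming a JSON-lines file; the helper names `buffer_event` and `replay_buffer`, the file name, and the on-disk format are all hypothetical, and the real buffer format may differ. The block passed to `replay_buffer` stands in for routing to the @ROOT label.

```ruby
require "json"
require "tmpdir"

# Old worker side (hypothetical): append each event received during the
# handover to the dedicated buffer file, one JSON array per line.
def buffer_event(path, tag, time, record)
  File.open(path, "a") { |f| f.puts(JSON.generate([tag, time, record])) }
end

# New worker side (hypothetical): read the buffered events in order, pass each
# one to the given block, then delete the file so it is not replayed twice.
# Returns the number of replayed events.
def replay_buffer(path)
  return 0 unless File.exist?(path)

  count = 0
  File.foreach(path) do |line|
    tag, time, record = JSON.parse(line)
    yield tag, time, record
    count += 1
  end
  File.delete(path)
  count
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "handover.buffer")
  buffer_event(path, "syslog.auth", 1_725_000_000, { "message" => "login ok" })
  replay_buffer(path) { |tag, time, record| puts "#{tag} #{time} #{record}" }
end
```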

Describe alternatives you've considered

None.

Additional context

I have already started to create a PoC.

daipom added the enhancement and work-in-progress labels Aug 30, 2024
daipom self-assigned this Aug 30, 2024
daipom moved this to Work-In-Progress in Fluentd Kanban Aug 30, 2024
daipom added a commit to daipom/serverengine that referenced this issue Aug 30, 2024
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.take_over_another_server(path)

This starts a new server that has all UDP/TCP sockets of the
existing server.
It receives the sockets from the existing server and stops it
before starting the new server.

This may not be the primary use case assumed by ServerEngine, but
we need this feature to replace both the server and the workers
with a new process without downtime.
Currently, ServerEngine does not provide this feature for
network servers.

At the moment, I assume that the application side uses this
feature ad hoc, but, in the future, this could be used to support
live reload for entire network servers.

ref: fluent/fluentd#4622

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom added a commit to daipom/serverengine that referenced this issue Aug 30, 2024
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.take_over_another_server(path)

This starts a new server that has all UDP/TCP sockets of the
existing server.
It receives the sockets from the existing server and stops it
after starting the new server.

This may not be the primary use case assumed by ServerEngine, but
we need this feature to replace both the server and the workers
with a new process without downtime.
Currently, ServerEngine does not provide this feature for
network servers.

At the moment, I assume that the application side uses this
feature ad hoc, but, in the future, this could be used to support
live reload for entire network servers.

ref: fluent/fluentd#4622

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
ashie pushed a commit to daipom/serverengine that referenced this issue Sep 3, 2024
daipom added a commit to clear-code/serverengine that referenced this issue Oct 21, 2024
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.take_over_another_server(path)

This starts a new server that has all UDP/TCP sockets of the
existing server.
The old process should stop without removing the file for the
socket after the new process starts.

This may not be the primary use case assumed by ServerEngine, but
we need this feature to replace both the server and the workers
with a new process without downtime.
Currently, ServerEngine does not provide this feature for
network servers.

At the moment, I assume that the application side uses this
feature ad hoc, but, in the future, this could be used to support
live reload for entire network servers.

ref: fluent/fluentd#4622

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Co-authored-by: Shizuo Fujita <fujita@clear-code.com>
daipom added a commit to clear-code/serverengine that referenced this issue Oct 21, 2024
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.share_sockets_with_another_server(path)

This starts a new server that shares all UDP/TCP sockets with
the existing server.
The old process should stop without removing the file for the
socket after the new process starts.

This may not be the primary use case assumed by ServerEngine, but
we need this feature to replace both the server and the workers
with a new process without downtime.
Currently, ServerEngine does not provide this feature for
network servers.

At the moment, I assume that the application side uses this
feature ad hoc, but, in the future, this could be used to support
live reload for entire network servers.

ref: fluent/fluentd#4622

Limitation: This feature would not work well if the process
opens new TCP ports frequently.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Co-authored-by: Shizuo Fujita <fujita@clear-code.com>
daipom added a commit to clear-code/serverengine that referenced this issue Oct 21, 2024
This provides a live restart feature for network servers.
(The existing live restart feature does not support network
servers.)

Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.share_sockets_with_another_server(path)

This starts a new server that shares all UDP/TCP sockets with
the existing server.
The old process should stop without removing the file for the
socket after the new process starts.

ref: fluent/fluentd#4622

Limitation: This feature would not work well if the process
opens new TCP ports frequently.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Co-authored-by: Shizuo Fujita <fujita@clear-code.com>
daipom added a commit to clear-code/serverengine that referenced this issue Oct 21, 2024
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.share_sockets_with_another_server(path)

This starts a new server that shares all UDP/TCP sockets with
the existing server.
The old process should stop without removing the file for the
socket after the new process starts.

This allows us to replace both the server and the
workers with new processes without socket downtime.
(The existing live restart feature does not support network
servers. We can restart workers without downtime, but there is
no way to restart the network server without downtime.)

ref: fluent/fluentd#4622

Limitation: This feature would not work well if the process
opens new TCP ports frequently.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Co-authored-by: Shizuo Fujita <fujita@clear-code.com>
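
For readers unfamiliar with how a listening socket can outlive its original owner, here is a minimal sketch using only Ruby's standard `socket` library on a Unix-like OS. It is not ServerEngine's implementation. `handover_demo` is a hypothetical name, and a `UNIXSocket.pair` inside one process stands in for the socket file that two real processes would share. The descriptor is passed with `UNIXSocket#send_io` and `UNIXSocket#recv_io`.

```ruby
require "socket"

# Passes a listening TCP socket from an "old" side to a "new" side, closes the
# old copy, and shows that the port still accepts connections.
def handover_demo
  # Listening socket owned by the old side. Port 0 lets the OS pick a free port.
  old_listener = TCPServer.new("127.0.0.1", 0)
  port = old_listener.addr[1]

  # Stand-in for the UNIX socket file shared by the old and new processes.
  old_side, new_side = UNIXSocket.pair

  # The old side sends the descriptor; the new side receives its own descriptor
  # that refers to the same listening socket.
  old_side.send_io(old_listener)
  new_listener = new_side.recv_io(TCPServer)

  # The old side can now stop. The port stays open through the new descriptor.
  old_listener.close

  client = TCPSocket.new("127.0.0.1", port)
  client.write("hello\n")
  conn = new_listener.accept
  conn.gets
ensure
  [conn, client, new_listener, old_side, new_side].each { |io| io&.close }
end

puts handover_demo
```

Because the kernel keeps the listening socket open as long as any descriptor refers to it, clients connecting during the switch are queued rather than refused.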
daipom added a commit to clear-code/serverengine that referenced this issue Oct 22, 2024
github-project-automation bot moved this from Work-In-Progress to Done in Fluentd Kanban Nov 28, 2024