Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning #1579

Closed
winlinvip opened this issue Jan 19, 2020 · 7 comments
Labels
Feature It's a new feature. Kubernetes For K8s, Prometheus, APM and Grafana. TransByAI Translated by AI/GPT.

winlinvip commented Jan 19, 2020

Usage

SRS supports two signals:

  • SIGTERM: Fast quit. SRS actively closes its connections, cleans up quickly, and exits. K8s sends this signal when it terminates a Pod (after any preStop hook has run), and then sends SIGKILL to forcefully kill the Pod once the grace period times out. Setting force_grace_quit to on makes SRS treat SIGTERM as a graceful quit as well.
  • SIGQUIT: Graceful quit. SRS closes its listeners and waits for all clients to disconnect before exiting. While connections remain, SRS does not exit; in K8s, the maximum wait is set by terminationGracePeriodSeconds, after which the Pod is forcefully killed. Once no connections remain, SRS waits for grace_final_wait and then exits.

Note: SRS does not implement a maximum waiting time; it waits indefinitely for clients to disconnect and never forces an exit. Rely on the terminationGracePeriodSeconds setting that K8s uses to manage Pods instead: K8s sends SIGKILL to forcefully shut down SRS after that timeout.

Other

To keep the handling simple, SRS does not free in-memory objects when a stream stops, because the stream may be published again. Cleaning them up would require complex and careful handling of Source objects, which works against keeping the problem simple.

Not cleaning up Source objects causes memory to grow continuously. This may not be noticeable in scenarios with few publishers and many players. However, in scenarios with many publishers, such as monitoring and conferencing, cleaning up the streams becomes necessary. References:

  1. The Source cleanup PR submitted by Nobody2 (fix: clean up source and add publisher status #1568) discusses the various scenarios that require cleanup. Nobody2 did a great job with the PR, but the problem itself is too complex.
  2. Online reports indicate memory leaks and OOM caused by not cleaning up Source objects: Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). #1509; #1271; After stopping the stress test, the CPU and memory have remained consistently high. #1507.

Currently, partial optimizations have been implemented to alleviate this issue.

  • Resolve coroutine issue: link
  • Reduce memory usage in source: link

Meanwhile, we are looking for the most stable and simplest solution. One idea is to make SRS support smooth exit and smooth upgrade, roughly as follows:

  1. Disable exclusive access to the PID file, so that a new SRS can be started.
  2. Use REUSEPORT (the SO_REUSEPORT socket option) to start a new SRS, so that the old and new SRS both provide service while using the same PID file.
  3. The old SRS stops accepting new connections and closes its API port. After it finishes serving existing clients, or after a certain period such as 12 hours, the old SRS exits.

This way, the old SRS releases the Sources it created, and any other leaked memory, easily and safely by exiting. Users can smoothly upgrade and exit SRS during off-peak periods according to their business needs, minimizing the impact on their own users.
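As a minimal sketch of step 2, and not SRS's actual implementation, the following Python shows two listeners bound to the same port through the SO_REUSEPORT socket option. The helper name listen_reuseport and the loopback address are assumptions made for illustration. Both listeners live in one process here only to keep the sketch self-contained; the idea above uses two separate SRS processes.

```python
import socket


def listen_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Open a TCP listener that other SO_REUSEPORT listeners may share.

    Hypothetical helper for illustration. It needs a platform that
    defines SO_REUSEPORT (for example Linux 3.9 or later).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() on every socket sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


if __name__ == "__main__":
    old = listen_reuseport(0)           # the "old" server picks a free port
    port = old.getsockname()[1]
    new = listen_reuseport(port)        # the "new" server binds the same port
    old.close()                         # the old server stops accepting
    new.settimeout(5)
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    conn, _ = new.accept()              # only the new listener can accept now
    conn.close()
    client.close()
    new.close()
```

While both listeners are open, the kernel distributes incoming connections between them, which is why the old SRS has to close its listener for all new clients to reach the new SRS.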

The only issue is that while both the new and old SRS are providing service, the API is served by the new SRS alone. The system statistics are therefore inaccurate, because users served by the old SRS may not be counted.

Remark: In an origin cluster, the stream stays on the old SRS, so the stream may not be discoverable. In that case the stream must be forcibly disconnected, and the client must support retries for this to work smoothly. One solution is to place an Edge in front of the origin, so that the Edge handles the retries.

TRANS_BY_GPT3


winlinvip commented Jan 19, 2020

Users can choose:

  1. Depending on their business situation, users can smoothly upgrade and exit SRS during off-peak periods.
  2. If the memory usage of SRS grows significantly and exceeds the warning level, users can start an urgent smooth upgrade.
  3. Users can choose to upgrade in a planned manner after a new stable version is released, in order to avoid impacting a small number of existing users.


winlinvip changed the title from "支持热升级或平滑升级,Upgrade smoothly." to "支持热升级或平滑升级,Upgrade smoothly, Source清理" on Jan 26, 2020
winlinvip changed the title from "支持热升级或平滑升级,Upgrade smoothly, Source清理" to "支持热升级或平滑升级,Upgrade smoothly, Gracefully Upgrade, Source清理" on Feb 18, 2020

winlinvip commented Feb 18, 2020

When SRS is deployed on K8S, the service needs to support upgrades, rollbacks, and canary releases. The basic requirement for these mechanisms is that SRS supports Gracefully Quit/Upgrade. Only when SRS does its own part well can K8S or other release mechanisms meet the requirements of production-level releases.

An SRS cluster is divided into Origin and Edge clusters, and this issue can be considered separately for each.

  • The Origin cluster can be restarted directly, because the Edge cluster retries. The exit can still be improved, for example by not exiting abruptly all at once, since that may redirect the streams to another Origin server.
  • The Edge cluster cannot be restarted directly, because it serves clients directly. It can only exit after it has finished serving them. The requirement is the same for long and short connections; only the duration differs.

Therefore, we focus on Gracefully Quit in the Edge cluster, which can follow the mechanism used by Nginx:

  1. Update the Nginx binary.
  2. Send the SIGUSR2 signal to the Nginx master.
  3. The Nginx master renames its PID file to /var/run/nginx.pid.oldbin, allowing the new master to start. This file can also be used to send signals to the old master.
  4. The new master is started using execve and records its PID in /var/run/nginx.pid. The listening file descriptors are passed to the new master through an environment variable, so that both the old and new masters listen on the same ports.
  5. Send the SIGWINCH signal to the old master so that its workers shut down gracefully after serving their existing connections. Meanwhile, the new master is already running, and its workers serve new connections.
  6. Send the SIGQUIT signal to the old master to initiate a graceful shutdown.
  7. After a certain period, the old master can also be sent the SIGTERM signal to exit directly.

Because Nginx starts the new master using execve and has it inherit the listening file descriptors, the process is fairly complex. SRS can instead use REUSEPORT to start a new process that listens on the same ports directly, which makes the solution simpler.
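For contrast, here is a rough Python sketch of the inheritance approach: the parent keeps a listening socket open across process creation and tells the child which descriptor to adopt through an environment variable. The variable name LISTEN_FD, the helper name, and the tiny child script are all hypothetical; Nginx uses its own variable and format, which are not reproduced here, and the child in this sketch only reports the inherited port instead of serving on it.

```python
import os
import socket
import subprocess
import sys

# Hypothetical child program: adopt the inherited descriptor and report
# which port it is listening on.
CHILD_SOURCE = r"""
import os, socket
fd = int(os.environ["LISTEN_FD"])      # descriptor number chosen by the parent
listener = socket.socket(fileno=fd)    # wrap the already-listening socket
print(listener.getsockname()[1])
"""


def spawn_with_inherited_listener(listener: socket.socket) -> int:
    """Start a child that inherits `listener`; return the port it reports."""
    fd = listener.fileno()
    env = dict(os.environ, LISTEN_FD=str(fd))
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        env=env,
        pass_fds=(fd,),          # keep this descriptor open in the child
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    parent_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    parent_listener.bind(("127.0.0.1", 0))
    parent_listener.listen(16)
    child_port = spawn_with_inherited_listener(parent_listener)
    print("parent port:", parent_listener.getsockname()[1])
    print("child port: ", child_port)
    parent_listener.close()
```

The extra bookkeeping (choosing the descriptor, exporting it, and reconstructing the socket in the child) is the complexity that the REUSEPORT approach avoids.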

Additionally, SRS3 has been released with the following plans:

  • SRS3 supports some key features that require script coordination or K8S management.
  • SRS4 will provide improved support for Gracefully Upgrade and offer more comprehensive features.


winlinvip commented Feb 18, 2020

To support releases (updates, rollbacks, and gray releases), SRS must meet two main requirements:

  • Gracefully Quit: Smoothly exit by closing the listening port, no longer accepting new connections, and waiting for existing connections to end before quitting.
  • Gracefully Upgrade: Smoothly upgrade by starting a new SRS instance while the old one continues to run. The old instance will begin a Gracefully Quit process to smoothly exit.

The key to Gracefully Quit is to no longer accept new connections and wait for the existing connections to exit. We can achieve this by closing the listening file descriptor (fd) in SRS. Another approach is to remove the backend Pod from the SLB (Server Load Balancer), which will naturally prevent new fds from being created.
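The Gracefully Quit sequence described above can be sketched in Python as follows. This is an illustration under stated assumptions, not SRS code: the function name, the active_conns callback, and the one-second polling interval are hypothetical, and final_wait_ms mirrors the grace_final_wait setting mentioned in this thread. In line with the note in the first comment, the sketch has no maximum waiting time.

```python
import time
from typing import Callable, Iterable


def graceful_quit(
    listeners: Iterable,
    active_conns: Callable[[], int],
    final_wait_ms: int = 3200,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop accepting, drain existing connections, then wait once more.

    listeners:     objects with a close() method (for example sockets).
    active_conns:  hypothetical callback returning the number of
                   connections that are still open.
    final_wait_ms: extra wait after the last connection has gone.
    """
    # 1. Close every listener so that no new connection can be accepted.
    for listener in listeners:
        listener.close()

    # 2. Wait for existing connections to finish. There is deliberately
    #    no timeout here; an external supervisor enforces the upper bound.
    while active_conns() > 0:
        sleep(poll_s)

    # 3. Allow a final period for cleanup before the process exits.
    sleep(final_wait_ms / 1000.0)
```

Passing sleep as a parameter is only a convenience for exercising the sketch without real delays.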


winlinvip commented Feb 18, 2020

SRS adds a new signal: SIGQUIT, which stands for Gracefully QUIT. It allows for a smooth exit by closing the listening file descriptor (FD) and waiting for existing connections to finish before exiting.

Finally, SRS waits for a further period, 3.2 seconds by default, so that the final cleanup can complete. For example, if there are no connections, only the listeners need to be closed:


[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5698/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
[root@55233a151f96 trunk]# 

[2020-02-18 11:07:21.529][Trace][5698][700] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:07:21.530][Warn][5698][700][11] main cycle terminated, system quit normally.
[2020-02-18 11:07:26.740][Trace][5698][700] final wait for another 5200ms
[2020-02-18 11:07:26.740][Trace][5698][700] srs gracefully quit

When there are connections, SRS keeps waiting:

[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs 

[2020-02-18 11:09:57.356][Trace][5776][516] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:09:57.356][Warn][5776][516][11] main cycle terminated, system quit normally.
[2020-02-18 11:09:58.382][Trace][5776][516] wait for 1 conns to quit
[2020-02-18 11:10:00.459][Trace][5776][516] wait for 1 conns to quit

You can see that the listening sockets are closed, but the established connection that is being served remains open. SRS will only exit after this streaming connection is finished.

A new configuration item sets the waiting time before exiting, with a default value of 3.2 seconds:

# for gracefully quit, final wait for cleanup in milliseconds.
# default: 3200
grace_final_wait 3200;

Note: For K8S, force_grace_quit must also be enabled; please refer to force_grace_quit.


winlinvip commented Feb 18, 2020

Another configuration item is needed because K8S sends SIGTERM to SRS when it stops a Pod (after any preStop hook has run). SIGTERM is the fast quit signal, which causes SRS to exit quickly, and SRS handles it even while a Gracefully Quit is in progress. SRS therefore needs an option to treat SIGTERM as a gracefully quit signal:

# Whether force gracefully quit, never fast quit.
# By default, SIGTERM which means fast quit, is sent by K8S, so we need to
# force SRS to treat SIGTERM as gracefully quit for gray release or canary.
# default: off
force_grace_quit off;

This option is disabled by default, so SRS exits quickly when it receives SIGTERM. That suits general scenarios, such as origin servers or deployments that do not require smooth upgrades.
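The effect of force_grace_quit on the two signals can be summarized with a small Python sketch. The function name quit_mode and the string return values are hypothetical and only model the behavior described in this thread; they are not taken from the SRS source.

```python
import signal


def quit_mode(signum: int, force_grace_quit: bool = False) -> str:
    """Return "fast" or "grace" for a quit signal.

    Models the behavior described above: SIGQUIT is always a graceful
    quit, while SIGTERM is a fast quit unless force_grace_quit is on.
    """
    if signum == signal.SIGQUIT:
        return "grace"
    if signum == signal.SIGTERM:
        return "grace" if force_grace_quit else "fast"
    raise ValueError(f"not a quit signal: {signum}")


if __name__ == "__main__":
    # Default configuration: the SIGTERM sent by K8S is a fast quit.
    print(quit_mode(signal.SIGTERM))
    # With force_grace_quit on, the same signal becomes a graceful quit.
    print(quit_mode(signal.SIGTERM, force_grace_quit=True))
```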


winlinvip commented Feb 21, 2020

SRS3 already supports graceful shutdown. It can also support smooth upgrades in the K8S and SLB architectures. Please refer to: https://github.com/ossrs/srs/wiki/v4_CN_K8s#srs-cluster-update-rollback-gray-release-with-zero-downtime


winlinvip commented Dec 1, 2020

Only Source cleanup remains to be done, as described in other issues:

For more progress, please refer to: #413


winlinvip added the Kubernetes label (For K8s, Prometheus, APM and Grafana.) on Sep 1, 2022
winlinvip changed the title from "支持热升级或平滑升级,Upgrade smoothly, Gracefully Upgrade, Source清理" to "Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning" on Jul 28, 2023
winlinvip added the TransByAI label (Translated by AI/GPT.) on Jul 28, 2023