Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning #1579

winlinvip · 2020-01-19T06:20:07Z

Usage

SRS supports two signals:

SIGTERM: Fast exit. SRS actively closes its connections, cleans up quickly, and exits. K8s sends this signal when it terminates a Pod (after any preStop hook has run), and sends SIGKILL to force-kill the Pod if it has not exited within the timeout. Set force_grace_quit to make SRS treat SIGTERM as a Gracefully QUIT as well.
SIGQUIT: Graceful exit. SRS closes its listeners and waits for all clients to disconnect before exiting. While connections remain, SRS does not exit on its own; the upper bound is terminationGracePeriodSeconds in K8s, after which K8s forces the exit. Once no connections remain, SRS waits for grace_final_wait and then exits.

Note: SRS does not implement a maximum waiting time. It waits indefinitely for clients to disconnect and never forces an exit by itself. The timeout comes from K8s: terminationGracePeriodSeconds bounds the Pod shutdown, and K8s sends SIGKILL to forcefully shut down SRS once it expires.
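
The signal policy above can be sketched as follows. This is an illustration in Python, not SRS's C++ code: QuitMode and classify_signal are hypothetical names, and only the force_grace_quit option comes from the SRS configuration.

```python
import signal
from enum import Enum


class QuitMode(Enum):
    FAST = "fast"    # disconnect clients and exit right away
    GRACE = "grace"  # stop listening, then wait for clients to leave


def classify_signal(signum: int, force_grace_quit: bool = False) -> QuitMode:
    """Map a termination signal to a quit mode (illustrative, not SRS code)."""
    if signum == signal.SIGQUIT:
        return QuitMode.GRACE
    if signum == signal.SIGTERM:
        # With force_grace_quit on, SIGTERM (what K8s sends) is treated as graceful.
        return QuitMode.GRACE if force_grace_quit else QuitMode.FAST
    raise ValueError(f"not a quit signal: {signum}")
```

With the default force_grace_quit off, SIGTERM maps to a fast quit, so a K8s deployment that wants graceful behaviour would turn the option on.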

Other

To keep stream handling simple, SRS does not free the in-memory Source object when a stream stops, because the stream may be published again. Cleaning up Source objects would require complex and careful lifetime handling, which works against keeping the problem simple.

Not cleaning up Source objects makes memory grow continuously. This may not be noticeable in scenarios with few publishing streams and many players. However, in scenarios with many publishing streams, such as monitoring and conferencing, cleaning up the streams becomes necessary. Reference:

The PR for Source cleanup submitted by Nobody2 (fix: clean up source and add publisher status #1568) discussed various scenarios that require cleanup. Nobody2 did a great job with the PR, but the problem itself is too complex.
Reports from production (Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). #1509, #1271, After stopping the stress test, the CPU and memory have remained consistently high. #1507) show memory leaks and OOM caused by not cleaning up Source objects.

Currently, partial optimizations have been implemented to alleviate this issue.

Resolve coroutine issue: link
Reduce memory usage in source: link

At the same time, we are also considering the most stable and easiest solution. There is another idea to make SRS support smooth exit and smooth upgrade, roughly as follows:

Disable exclusive access to the PID file, allowing a new SRS to be started.
Use REUSEPORT to open a new SRS, allowing both the old and new SRS to provide services using the same PID file.
The old SRS will no longer accept new connections and the API port will be closed. After serving existing clients or after a certain period of time, such as 12 hours, the old SRS will exit.
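
A minimal sketch of the REUSEPORT step, assuming Linux semantics for SO_REUSEPORT. Both listeners live in one Python process here only to keep the example short; in the plan above they would belong to the old and the new SRS process. The helper name listen_reuseport is hypothetical.

```python
import socket


def listen_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Open a TCP listener with SO_REUSEPORT set (Linux 3.9+ and some BSDs)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set on every socket, before bind(), for the port to be shared.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


# The "old" server picks a free port; the "new" one binds the very same port.
old = listen_reuseport(0)
port = old.getsockname()[1]
new = listen_reuseport(port)
new.settimeout(2)

# The old server stops accepting by closing its listener; the new one keeps serving.
old.close()
client = socket.create_connection(("127.0.0.1", port), timeout=2)
conn, _ = new.accept()
conn.close()
client.close()
new.close()
```

While both listeners are open, the kernel distributes new connections between them, which is why the old process has to close its listener to stop receiving new clients.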

This way, the old SRS can easily and safely release the created sources and potential other memory issues. Users can smoothly upgrade and exit SRS during off-peak periods according to their business needs, minimizing the impact on users.

The only issue is that while both the new and the old SRS are serving, the API is provided only by the new SRS, so the system statistics are inaccurate: clients still served by the old SRS may not be counted.

Remark: In an origin cluster, a stream may still be on the old SRS, so the new SRS may fail to find it. In this case, the stream must be forcibly disconnected, and the client needs to support retries for this to be smooth. One solution is to place an Edge in front of the origin, so that the Edge handles the retries.

winlinvip · 2020-01-19T06:23:26Z

Users can choose:

Depending on their business, users can smoothly upgrade and exit SRS during off-peak hours.
If the memory usage of SRS grows significantly and exceeds the warning level, users can trigger an urgent smooth upgrade.
After a new stable version is released, users can upgrade in a planned manner, so that the small number of existing users is not affected.

winlinvip · 2020-02-18T01:40:54Z

When SRS supports K8S deployment, services need to support upgrades, rollbacks, and canary releases. The basic requirement for these mechanisms is that SRS needs to support Gracefully Quit/Upgrade. Only when SRS can do its part well, can K8S or other release mechanisms meet the requirements for production-level releases.

The SRS cluster is divided into Origin and Edge clusters, and this issue can be viewed separately.

The Origin cluster can be restarted directly, because the Edge cluster retries. The exit can still be improved, for example by not dropping everything at once, since an abrupt exit may shift the traffic to another Origin server.
The Edge cluster cannot be restarted directly, because it serves clients directly. It can only exit once it has finished serving them. This holds for both long-lived and short-lived connections; only the duration differs.

Therefore, we focus on Gracefully Quit in the Edge cluster, which can borrow from the Nginx binary-upgrade mechanism:

Update the Nginx binary.
Send the SIGUSR2 signal to Nginx.
The Nginx master renames its PID file to /var/run/nginx.pid.oldbin, so that the new master can start. This file can still be used to send signals to the old master.
The old master starts the new master using execve, and the new master writes its PID to /var/run/nginx.pid. The listening file descriptors are passed to the new master through the ENV, so both the old and new masters listen on the same ports.
Send the SIGWINCH signal to the old master to gracefully shut down its workers once they finish serving existing connections. Meanwhile, the new master is already running, and its workers are serving new connections.
Send the SIGQUIT signal to the old master to initiate a graceful shutdown.
After a certain period, the old master can also be sent the SIGTERM signal to exit directly.

Because Nginx starts the new master with execve and has it inherit the listening file descriptors, this process is relatively complex. SRS can instead use REUSEPORT to start a new process that listens on the same port, which makes the solution simpler.
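
For contrast, here is a rough sketch of the fd-inheritance approach, in Python rather than Nginx's C. The environment variable name LISTEN_FD is an assumption made for this example and is not the name Nginx uses. The parent keeps the listening descriptor open across exec and tells the child its number through the environment.

```python
import os
import socket
import subprocess
import sys

# Child program: rebuild the listener from the fd number found in the environment.
CHILD = r"""
import os, socket
fd = int(os.environ["LISTEN_FD"])   # hypothetical variable name
sock = socket.socket(fileno=fd)     # adopt the inherited descriptor
print(sock.getsockname()[1])
"""

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(128)
port = listener.getsockname()[1]

env = dict(os.environ, LISTEN_FD=str(listener.fileno()))
# pass_fds keeps the descriptor open, under the same number, in the child.
out = subprocess.run(
    [sys.executable, "-c", CHILD],
    env=env, pass_fds=[listener.fileno()],
    capture_output=True, text=True, check=True,
).stdout
child_port = int(out)  # the port the child sees on the inherited socket
listener.close()
```

The parent has to manage descriptor inheritance and agree on an environment protocol with the child, which is the extra complexity that REUSEPORT avoids.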

Additionally, SRS3 has been released with the following plans:

SRS3 supports some key features that require script coordination or K8S management.
SRS4 will provide improved support for Gracefully Upgrade and offer more comprehensive features.

winlinvip · 2020-02-18T06:49:14Z

To support rolling updates, rollbacks, and gray (canary) releases, SRS has two main requirements:

Gracefully Quit: Smoothly exit by closing the listening port, no longer accepting new connections, and waiting for existing connections to end before quitting.
Gracefully Upgrade: Smoothly upgrade by starting a new SRS instance while the old one continues to run. The old instance will begin a Gracefully Quit process to smoothly exit.

The key to Gracefully Quit is to no longer accept new connections and wait for the existing connections to exit. We can achieve this by closing the listening file descriptor (fd) in SRS. Another approach is to remove the backend Pod from the SLB (Server Load Balancer), which will naturally prevent new fds from being created.
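
The Gracefully Quit sequence can be sketched as below. This is a simplified model, not SRS code: graceful_quit and its callback parameters are hypothetical, the 1-second polling interval is an assumption, and the final wait corresponds to the grace_final_wait setting mentioned under Usage.

```python
import time
from typing import Callable


def graceful_quit(
    close_listeners: Callable[[], None],
    count_connections: Callable[[], int],
    final_wait_ms: int = 3200,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop accepting, wait for connections to drain, then wait a final period."""
    close_listeners()                 # no new connections from here on
    while count_connections() > 0:    # no deadline: waits as long as clients stay
        sleep(poll_interval_s)
    sleep(final_wait_ms / 1000.0)     # final wait for cleanup (grace_final_wait)
```

The loop has no deadline on purpose, matching the note that SRS never forces an exit; an external supervisor such as K8s is expected to enforce the upper bound.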

winlinvip · 2020-02-18T11:25:35Z

SRS adds a new signal: SIGQUIT, which stands for Gracefully QUIT. It allows for a smooth exit by closing the listening file descriptor (FD) and waiting for existing connections to finish before exiting.

Finally, SRS waits for a certain period, 3.2 seconds by default, so that the final cleanup can complete. For example, when there are no connections, only the listeners need to be closed.


[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5698/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
[root@55233a151f96 trunk]# 

[2020-02-18 11:07:21.529][Trace][5698][700] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:07:21.530][Warn][5698][700][11] main cycle terminated, system quit normally.
[2020-02-18 11:07:26.740][Trace][5698][700] final wait for another 5200ms
[2020-02-18 11:07:26.740][Trace][5698][700] srs gracefully quit

When there are connections, it will keep waiting.

[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs 

[2020-02-18 11:09:57.356][Trace][5776][516] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:09:57.356][Warn][5776][516][11] main cycle terminated, system quit normally.
[2020-02-18 11:09:58.382][Trace][5776][516] wait for 1 conns to quit
[2020-02-18 11:10:00.459][Trace][5776][516] wait for 1 conns to quit

You can see that the listeners are closed, but the established connection is still open. SRS exits only after this streaming connection finishes.

Add a new configuration for the waiting time before exiting, with a default value of 3.2 seconds.

# for gracefully quit, final wait for cleanup in milliseconds.
# default: 3200
grace_final_wait 3200;

Note: For K8S, force_grace_quit must also be enabled; see force_grace_quit.

winlinvip · 2020-02-18T14:03:56Z

We also need a configuration for K8S: when K8S terminates a Pod (after the preStop hook, if any), it sends SIGTERM to SRS. SIGTERM is the fast-quit signal, so SRS exits quickly, and it handles this signal even while a Gracefully Quit is in progress. Therefore, SRS must be configurable to treat SIGTERM as a gracefully quit signal.

# Whether force gracefully quit, never fast quit.
# By default, SIGTERM which means fast quit, is sent by K8S, so we need to
# force SRS to treat SIGTERM as gracefully quit for gray release or canary.
# default: off
force_grace_quit off;

By default, it is not enabled, which means that SRS will exit when it receives a SIGTERM signal. This is suitable for general scenarios, such as origin servers or situations where smooth upgrades are not required.

winlinvip · 2020-02-21T04:55:54Z

SRS3 already supports graceful shutdown. It can also support smooth upgrades in the K8S and SLB architectures. Please refer to: https://github.com/ossrs/srs/wiki/v4_CN_K8s#srs-cluster-update-rollback-gray-release-with-zero-downtime

winlinvip · 2020-12-01T05:36:18Z

Only the Source cleanup remains, and it is described in other issues:

Live: Source cleanup to free memory for multiple streams. #413 added support for Source cleaning, but it was reverted because of multiple issues. It will be improved and resolved in the future.
You can consider using Gracefully Quit to exit smoothly (Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning #1579 (comment)) and restart the service when there is no traffic, as a temporary workaround for this problem.

For more progress, please refer to: #413

winlinvip added the Feature It's a new feature. label Jan 19, 2020

winlinvip added this to the SRS 4.0 release milestone Jan 19, 2020

This was referenced Jan 19, 2020

fix: clean up source and add publisher status #1568

Closed

Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). #1509

Closed

winlinvip mentioned this issue Jan 19, 2020

After stopping the stress test, the CPU and memory have remained consistently high. #1507

Closed

winlinvip changed the title ~~Support hot upgrade or smooth upgrade, Upgrade smoothly.~~ Support hot upgrade or smooth upgrade, Upgrade smoothly, Source cleaning Jan 26, 2020

This was referenced Jan 26, 2020

SOURCE: The SRS server crashed after stopping the stream. #671

Closed

SrsSource destructor coredump #986

Closed

winlinvip changed the title ~~Support hot upgrade or smooth upgrade, Upgrade smoothly, Source cleaning~~ Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning Feb 18, 2020

This was referenced Feb 18, 2020

Cluster: Origin Cluster for Fault Tolarence and Load Balance. #464

Closed

Support docker and k8s in native #1595

Closed

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, define signals for fast/grace quit and upgrade

f4c7b88

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, support gracefully quit. 3.0.119

3c59754

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, support force gracefully quit. 3.0.120

58b4047

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, support gracefully quit and force to. 4.0.5

d87f58a

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, support start/final wait for gracefully quit. 3.0.121

dc0f804

winlinvip added a commit that referenced this issue Feb 18, 2020

For #1579, support start/final wait for gracefully quit. 4.0.6

ad3cfbf

winlinvip added a commit that referenced this issue Feb 19, 2020

For #1579, support rolling update of k8s. 4.0.7

1d01ef4

winlinvip added a commit that referenced this issue Feb 26, 2020

For #1579, refactor log for gracefully quit.

ea30579

winlinvip closed this as completed Dec 1, 2020

winlinvip mentioned this issue Jan 11, 2021

Live: Source cleanup to free memory for multiple streams. #413

Closed

winlinvip mentioned this issue Mar 4, 2021

srs occasional crash caused by cleaning up unused SrsSource mechanism #713

Closed

winlinvip self-assigned this Sep 5, 2021

winlinvip added the Kubernetes For K8s, Prometheus, APM and Grafana. label Sep 1, 2022

winlinvip changed the title ~~支持热升级或平滑升级，Upgrade smoothly, Gracefully Upgrade, Source清理~~ Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning Jul 28, 2023

winlinvip added the TransByAI Translated by AI/GPT. label Jul 28, 2023
