Source Cleanup: With a large number of streams, Source leakage causes OOM (Out of Memory). #1509
Comments
The reason is clearly stated in the log: 'Cannot allocate memory'.
|
My server had 60GB of free memory at that time, and the SRS process was using 3.4GB. Does "out of memory" refer to a memory limit on the current process? The issue occurred repeatedly while 500 streams were being published, closed, and restarted.
|
After checking the SrsSource objects in the pool, SrsSource::do_cycle_all reports a pool size of [38320]. I don't know whether this has any impact.
|
You have also used a considerable amount of memory for other things, so the available memory should be less than 60GB. How did you test it? Can you describe the process?
|
Mem: 109G total, 10G used, 61G free, 4.0G shared, 38G buff/cache, 94G available |
I publish 500 streams to SRS, play them with 500 clients, run for one hour, close all streams, and then repeat these steps. Each stream has a unique stream ID, so a new SrsSource object is created every time a stream is published. srs-librtmp is used for publishing and JMeter is used for playback, with the received data simply discarded. Now that the stress test is over, the CPU and memory usage of SRS remain high. I plan to add pool cleaning to the reload handling: when this situation occurs, I will send a signal to SRS to trigger a reload and release all SrsSource objects in the pool, to see whether CPU and memory usage drop.
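Roughly, the cleanup I have in mind looks like this. It is only a sketch that assumes a static pool map keyed by stream URL; is_inactive() is a hypothetical check for a source with no publisher and no consumers, not an existing SRS method:

```cpp
// Sketch only; SrsSource and srs_freep come from the SRS headers
// (e.g. srs_app_source.hpp, srs_kernel_error.hpp).
#include <map>
#include <string>

// Walk the source pool and free every idle source so that its
// coroutines and caches are released.
void cleanup_source_pool(std::map<std::string, SrsSource*>& pool)
{
    std::map<std::string, SrsSource*>::iterator it = pool.begin();
    while (it != pool.end()) {
        SrsSource* source = it->second;
        if (source->is_inactive()) {   // hypothetical: no publisher and no consumers
            srs_freep(source);         // SRS helper: delete and null the pointer
            pool.erase(it++);
        } else {
            ++it;
        }
    }
}
```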
|
Each SrsSource object starts two coroutines. At that time there were 38,320 SrsSource objects left in the pool, which amounts to more than 70,000 coroutines. Scheduling that many coroutines consumes significant CPU, so releasing the SrsSource objects reduces CPU usage noticeably.
|
Okay, I understand what you mean. It seems like the performance issue is caused by too many sources. I will try to reproduce it first. Thank you.
|
After I enabled HTTP-FLV, I created new sources with curl and repeatedly ran "killall curl" to reproduce this issue. Memory grows at roughly 1MB every 3 seconds.
I discovered that many coroutines are actually SrsAsyncCallWorker, which means they are coroutines for asynchronous callbacks.
I found that HLS and DVR are using this.
After running for about 16 minutes with 4,000 streams, SRS occupies 125MB of memory and CPU usage is 5%.
If this part is optimized, the number of coroutines may decrease, but it will not have a significant impact on the overall performance.
|
Modify HLS and DVR to start SrsAsyncCallWorker only when publishing and stop it when unpublishing, so a Source no longer keeps two persistent coroutines. With 4,000 streams the idle-Source test is slightly better than before: memory drops to 80MB while CPU stays at 5% (a sketch of this change is at the end of this comment).
The number of coroutines has already been reduced to 8.
We need to use profiling tools to find the cause.
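A minimal sketch of the change described above, assuming SrsHls holds its SrsAsyncCallWorker in a member named async with start()/stop() methods (the DVR side is analogous); this is not the exact SRS diff:

```cpp
// Sketch only; SrsHls, SrsAsyncCallWorker and the error helpers come from SRS.
srs_error_t SrsHls::on_publish()
{
    srs_error_t err = srs_success;

    // Start the async-call worker only while a publisher is active, instead of
    // keeping it running for the whole lifetime of the Source.
    if ((err = async->start()) != srs_success) {
        return srs_error_wrap(err, "async start");
    }

    // ... existing publish logic ...
    return err;
}

void SrsHls::on_unpublish()
{
    // ... existing unpublish logic ...

    // Stop the worker so its coroutine is released once the stream goes idle.
    async->stop();
}
```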
|
Enable SRS support for gperf and valgrind; refer to "SRS Performance (CPU) and Memory Optimization Tool Usage".
We found that CPU usage is concentrated in the HTTP handler lookup. Since each stream source registers its own handler, the time needed to find a handler grows with the number of sources (an illustration is at the end of this comment).
In the GMP (gperf memory profile) analysis, the main memory overhead is in:
In the Valgrind analysis, the memory leaks are mainly in SrsFastVector.
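As an illustration only (this is not the actual SrsHttpServeMux code, and the names here are made up), the pattern behind the handler-lookup cost is a linear scan over one registered entry per stream:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler entry; one is registered per live stream.
struct Handler {
    std::string mount;   // e.g. "/live/stream_0001.flv"
};

// Linear lookup: with 4,000 streams each request may scan up to 4,000 entries,
// so the cost of finding a handler grows with the number of sources.
Handler* find_handler(std::vector<Handler*>& handlers, const std::string& path)
{
    for (size_t i = 0; i < handlers.size(); i++) {
        if (handlers[i]->mount == path) {
            return handlers[i];
        }
    }
    return NULL;
}
```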
|
When there are 4,000 streams, changing the default capacity of the fast vector from 8,000 entries down to 8 reduces memory to 45MB, with CPU usage at 4.7%.
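A minimal sketch of the idea, not the actual SrsFastVector implementation (the class name, fields, and doubling policy here are assumptions): start with a tiny initial capacity and grow only on demand, so thousands of mostly idle vectors no longer preallocate large buffers.

```cpp
// Sketch only, not the actual SrsFastVector.
template <typename T>
class FastVector {
public:
    FastVector() : count_(0), capacity_(8) {   // tiny initial capacity (was ~8,000)
        items_ = new T[capacity_];
    }
    ~FastVector() { delete[] items_; }

    void push_back(const T& v) {
        if (count_ >= capacity_) {
            // Grow the buffer only when it is actually needed.
            int nc = capacity_ * 2;
            T* ni = new T[nc];
            for (int i = 0; i < count_; i++) ni[i] = items_[i];
            delete[] items_;
            items_ = ni;
            capacity_ = nc;
        }
        items_[count_++] = v;
    }

    int size() const { return count_; }
    T& at(int i) { return items_[i]; }

private:
    // Copy operations omitted for brevity in this sketch.
    FastVector(const FastVector&);
    FastVector& operator=(const FastVector&);

    T* items_;
    int count_;
    int capacity_;
};
```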
|
Will be fixed in #1579, which adds support for hot (smooth) upgrade. Dup to #1509
|
Dup to #413 |
Description
Describe the problem you encountered.
When pushing the stream to SRS, the SRS process crashed.
Environment
...
...
client identified, type=fmle-publish, stream_name=62c1352ea3a64621999bfd946081f392, duration=-1.00
connected stream, tcUrl=rtmp://172.16.129.97/netposa, pageUrl=, swfUrl=, schema=rtmp, vhost=defaultVhost, port=1935, app=netposa, stream=62c1352ea3a64621999bfd946081f392, args=null
st_thread_create failed. ret=1017(Cannot allocate memory)
Reproduction
The steps to reproduce the bug are as follows:
...
...
Expected Behavior
Describe what you expect to happen.
From analyzing the stack trace, the cause is as follows:
During publishing, SrsSource::initialize calls hls->initialize, which fails with st_thread_create failed and returns an error code. SrsSource::fetch_or_create then calls srs_freep(source), which invokes the SrsSource destructor. That releases the member play_edge, whose destructor in turn destroys the SrsPlayEdge member ingester and calls its stop function, which finally calls _source->on_unpublish(). Because hls->initialize failed and SrsSource::initialize returned before play_edge->initialize(this, _req) was ever called, the _source pointer in SrsEdgeIngester is still null, causing a core dump. Adding a null check before using _source resolves the issue.
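A minimal sketch of the suggested null check, assuming SrsEdgeIngester::stop() is shaped roughly like this (not the exact SRS code):

```cpp
// Sketch only; SrsEdgeIngester and _source come from the SRS edge code.
void SrsEdgeIngester::stop()
{
    // ... stop the ingest coroutine, close the upstream connection, etc. ...

    // _source is only assigned in initialize(). If SrsSource::initialize()
    // failed earlier (hls->initialize() returned an error), the edge is
    // destroyed before play_edge->initialize() ever ran, so _source is NULL.
    if (_source) {
        _source->on_unpublish();
    }
}
```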
I hope someone knowledgeable can help explain why st_thread_create failed in the first place.