CLUSTER: Request non-existent HTTP streams, log flooding, thread does not exit, FD leakage. #636

Closed
MaxLoveThree opened this issue Sep 6, 2016 · 13 comments


MaxLoveThree commented Sep 6, 2016

SRS version:
[root@localhost trunk]# ./objs/srs -v
2.0.214
Configuration:

listen              1936;
max_connections     1000;
pid                 ./objs/play_edge.pid;
srs_log_level       info;
srs_log_file        ./objs/play_edge.log;
chunk_size          4096;

http_server {
    enabled         on;
    listen          8090;
    dir             ./objs/nginx/html;
}

vhost __defaultVhost__ {
    mode            remote;
    origin          127.0.0.1:19350;
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].aac;
        hstrs       on;
    }
}

Steps to reproduce:
With no stream being published, request playback from the server over HTTP, from a browser or with curl:

curl http://172.16.198.129:8090/my_test/test.aac -o /dev/null

Symptom:
The following log lines repeat indefinitely, and they keep appearing even after the client closes its HTTP connection.

[2016-09-06 04:24:24.713][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:25.715][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:25.728][trace][56189][113] complex handshake success.
[2016-09-06 04:24:25.728][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:25.808][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:25.808][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=117, dsu=1
[2016-09-06 04:24:25.809][trace][56189][113] out chunk size to 60000
[2016-09-06 04:24:28.856][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:29.857][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:29.873][trace][56189][113] complex handshake success.
[2016-09-06 04:24:29.873][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:29.953][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:29.953][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=120, dsu=1
[2016-09-06 04:24:29.953][trace][56189][113] out chunk size to 60000

Author

MaxLoveThree commented Sep 6, 2016

Tracing the code shows that the HTTP client thread stays inside SrsLiveStream::serve_http and never returns. As a result, the thread never reaches the code that checks whether the HTTP connection is still alive, even after the client has disconnected. Judging from how other features are implemented, when the edge server pulls the stream from the origin it should start a separate thread for the pull rather than run it on the HTTP client thread.

Author

MaxLoveThree commented Sep 6, 2016

The while loop in SrsLiveStream::serve_http lacks a check for whether the HTTP client connection is still alive.


winlinvip added this to the srs 2.0 release milestone Sep 9, 2016
winlinvip added the Bug label Sep 9, 2016
Member

winlinvip commented Sep 9, 2016

With a long-lived HTTP connection, the server cannot tell that the client has disconnected unless it writes data, because once FLV streaming starts it no longer reads anything from the client.
The only way to solve this is to return 404 to the client if the stream does not exist when the edge pulls from the origin.
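
The point above is that a closed peer only becomes visible through socket I/O. As a minimal sketch of that idea (not what SRS does; `peer_has_closed` is a hypothetical helper, and `MSG_DONTWAIT` is assumed to be available as it is on Linux and the BSDs), a server can peek at the socket without blocking and treat end-of-file as a disconnect:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper, not an SRS API: returns true when the peer has closed
// the connection or the socket has failed. It peeks without blocking and
// without consuming any pending bytes.
bool peer_has_closed(int fd) {
    assert(fd >= 0);
    char byte;
    for (;;) {
        ssize_t n = ::recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
        if (n > 0) return false;       // unread data is pending, no EOF seen yet
        if (n == 0) return true;       // orderly shutdown by the peer (EOF)
        if (errno == EINTR) continue;  // interrupted by a signal, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) return false;  // open, idle
        return true;                   // e.g. ECONNRESET
    }
}
```

Calling such a check once per iteration of a send loop would let the loop exit even when there is nothing to write.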

Member

winlinvip commented Feb 10, 2017

The root cause is the origin-pull strategy. If the stream does not exist, the origin server should return a 404, the edge server should receive that 404 from the origin, and the edge should in turn return a 404 to the player. However, this requires significant changes that cannot be made in SRS2 in time.
Postponed to SRS3+.
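
The proposed 404 propagation can be sketched as follows. This is only an illustration under assumptions: `OriginLookup`, `status_for_player`, and `response_head` are hypothetical names, not SRS code, and how the edge actually learns that the origin has no stream is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical outcome of the edge asking the origin for a stream.
enum class OriginLookup { Found, NotFound };

// Maps the origin's answer to the HTTP status the edge returns to the player.
int status_for_player(OriginLookup result) {
    return result == OriginLookup::Found ? 200 : 404;
}

// Builds a minimal HTTP/1.1 response head for the given status.
std::string response_head(int status) {
    assert(status == 200 || status == 404);
    std::string head = (status == 200) ? "HTTP/1.1 200 OK\r\n"
                                       : "HTTP/1.1 404 Not Found\r\n";
    if (status == 404) head += "Content-Length: 0\r\n";  // no body follows
    head += "Connection: close\r\n\r\n";
    return head;
}
```

With a response like this the player gets a definite answer and the edge can close the connection instead of retrying the origin forever.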


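winlinvip changed the milestone from srs 2.0 release to srs 3.0 release Feb 10, 2017
winlinvip changed the title from "With http_remux configured, requesting a non-existent stream leaves a thread that cannot be cleaned up and floods the log" to "CLUSTER: With http_remux configured, requesting a non-existent stream leaves a thread that cannot be cleaned up and floods the log" Feb 10, 2017
winlinvip changed the title from "CLUSTER: With http_remux configured, requesting a non-existent stream leaves a thread that cannot be cleaned up and floods the log" to "CLUSTER: With http_remux configured on the edge, requesting a non-existent stream leaves a thread that cannot be cleaned up and floods the log" Feb 10, 2017
winlinvip changed the milestone from srs 3.0 release to srs 2.0 release Apr 24, 2017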
Member

winlinvip commented Apr 24, 2017

On a second look, this also leaves the fd unclosed.

Member

winlinvip commented Apr 25, 2017

Start a coroutine that reads from the fd. If the client closes the fd, the reading coroutine gets an error.
In each iteration of the send loop, check whether that coroutine has hit an error, and stop the loop if it has.

Member

winlinvip commented Apr 25, 2017

This is similar to how an RTMP connection is read:

if ((ret = trd->error_code()) != ERROR_SUCCESS) {

An RTMP playback client sends very few messages to the server, while most traffic flows from the server to the client. Therefore, a separate coroutine is created for reading, and the main coroutine is mainly responsible for sending.

An FLV player sends nothing after its initial request; the server only writes. Therefore, a separate coroutine can block on a read, and if the client closes the file descriptor, that read returns an error. The main coroutine's loop mainly sends data and checks the read coroutine on each iteration.
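
The reader-plus-sender pattern described above can be sketched like this. It is a simplified illustration under assumptions, not the SRS implementation: it uses an OS thread (`std::thread`) in place of an ST coroutine, and `DisconnectWatcher` and `wait_for_media_or_disconnect` are hypothetical names. It also relies on `shutdown(SHUT_RD)` waking a blocked `recv`, which is the behaviour on Linux.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: a background reader that blocks in recv() and records
// the first EOF or socket error, like the "read coroutine" in the discussion.
class DisconnectWatcher {
public:
    explicit DisconnectWatcher(int fd)
        : fd_(fd), failed_(false), reader_([this] { read_until_error(); }) {}

    ~DisconnectWatcher() {
        ::shutdown(fd_, SHUT_RD);  // wake the reader if it is still blocked
        reader_.join();
    }

    DisconnectWatcher(const DisconnectWatcher&) = delete;
    DisconnectWatcher& operator=(const DisconnectWatcher&) = delete;

    bool failed() const { return failed_.load(); }

private:
    void read_until_error() {
        char buf[512];
        for (;;) {
            ssize_t n = ::recv(fd_, buf, sizeof(buf), 0);
            if (n > 0) continue;                    // discard unexpected input
            if (n < 0 && errno == EINTR) continue;  // interrupted, retry
            failed_.store(true);                    // EOF or socket error
            return;
        }
    }

    int fd_;
    std::atomic<bool> failed_;
    std::thread reader_;  // declared last so it starts after the other members
};

// Stand-in for the sending loop: polls up to max_polls times and returns true
// as soon as the watcher reports that the client is gone.
bool wait_for_media_or_disconnect(int fd, int max_polls) {
    assert(fd >= 0);
    DisconnectWatcher watcher(fd);
    for (int i = 0; i < max_polls; ++i) {
        if (watcher.failed()) return true;
        // A real server would write FLV data here whenever the source has any.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return watcher.failed();
}
```

The sending loop never reads from the socket itself; it only consults the watcher's flag, so exactly one reader ever touches the fd.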

Contributor

walkermi commented Apr 26, 2017

Some players, once connected, do not actively close the connection when the HTTP-FLV stream they request does not exist; they simply keep it open. No CLOSE_WAIT state appears in this case, but both sides stay stuck, so it is better for SRS to actively respond with a 404.

Member

winlinvip commented Apr 30, 2017

The SRS configuration is as follows:

listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          8080;
}
vhost __defaultVhost__ {
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].flv;
        hstrs       on;
    }
}

Open the player page several times: http://www.ossrs.net/players/srs_player.html?app=live&stream=livestream.flv&server=localhost&port=8080&autostart=true&vhost=localhost&schema=http

Many connections are left in CLOSE_WAIT:

winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52870        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52866        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52864        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52862        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52855        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52852        CLOSE_WAIT 

The corresponding FDs (10 to 15) are never closed:

winlin:srs winlin$ lsof |grep 10671|grep CLOSE_WAIT
srs       10671 winlin   10u     IPv4 0x17dcbc0eb6e11347       0t0      TCP localhost:http-alt->localhost:52852 (CLOSE_WAIT)
srs       10671 winlin   11u     IPv4 0x17dcbc0eb6ab6a4f       0t0      TCP localhost:http-alt->localhost:52855 (CLOSE_WAIT)
srs       10671 winlin   12u     IPv4 0x17dcbc0eb6f32347       0t0      TCP localhost:http-alt->localhost:52862 (CLOSE_WAIT)
srs       10671 winlin   13u     IPv4 0x17dcbc0eb6a98f67       0t0      TCP localhost:http-alt->localhost:52864 (CLOSE_WAIT)
srs       10671 winlin   14u     IPv4 0x17dcbc0ea4a5485f       0t0      TCP localhost:http-alt->localhost:52866 (CLOSE_WAIT)
srs       10671 winlin   15u     IPv4 0x17dcbc0eb6aca157       0t0      TCP localhost:http-alt->localhost:52870 (CLOSE_WAIT)
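
CLOSE_WAIT means the peer has already closed its side while the local process still holds the socket open, so each line above is an fd the server never closed. As a general illustration (a hypothetical `ScopedFd` wrapper, not how SRS manages its connections), tying `close()` to scope exit makes it hard to leave an fd open on any exit path:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical RAII wrapper: closes the descriptor when it goes out of scope.
class ScopedFd {
public:
    explicit ScopedFd(int fd) : fd_(fd) { assert(fd >= 0); }
    ~ScopedFd() { ::close(fd_); }

    ScopedFd(const ScopedFd&) = delete;
    ScopedFd& operator=(const ScopedFd&) = delete;

    int get() const { return fd_; }

private:
    int fd_;
};

// Returns true while fd refers to an open descriptor in this process.
bool fd_is_open(int fd) {
    return ::fcntl(fd, F_GETFD) != -1;
}
```

A wrapper like this only helps once the serving loop actually returns, which is the part the thread's fix addresses.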

Member

winlinvip commented Apr 30, 2017

Start a new ST receive thread that reads the HTTP request. Since HTTP-FLV has no subsequent requests, the receive thread gets an error and exits when the client closes the connection. HTTP requests are normally handled like this:

SrsHttpConn::do_cycle
    parser->parse_message(&req)
    process_request(writer, req)

However, in process_request, we need to open another thread to detect if the FD is closed:

process_request(writer, req)
    trd->start()
    while trd->error_code() == ERROR_SUCCESS 
        write FLV data.

In the thread trd, we need to call the function that directly reads the HTTP message:

SrsHttpConn::pop_message(&req)

Note: This API can only be used on connections that carry no further requests, that is, purely for FD closure detection. It would be a disaster if two threads read from the same FD.

One correction: HTTP streaming (FLV/TS) is actually served by SrsResponseOnlyHttpConn rather than SrsHttpConn, and when it reads the request it discards the entire body.

When the player is closed, the receive thread detects that the socket has been reset, meaning the client closed it, and the loop is interrupted:

[2017-04-30 11:58:43.928][trace][16163][107] HTTP client ip=127.0.0.1
[2017-04-30 11:58:43.928][trace][16163][107] HTTP GET http://localhost:8080/live/livestream.flv, content-length=-1
[2017-04-30 11:58:43.963][trace][16163][107] http: mount flv stream for vhost=/live/livestream, mount=/live/livestream.flv
[2017-04-30 11:58:43.964][trace][16163][107] hstrs: source url=/live/livestream, is_edge=0, source_id=-1[-1]
[2017-04-30 11:58:43.964][trace][16163][107] dispatch cached gop success. count=0, duration=-1
[2017-04-30 11:58:43.964][trace][16163][107] create consumer, queue_size=30.00, jitter=1
[2017-04-30 11:58:46.988][warn][16163][107][54] client disconnect peer. ret=1004

No FD leak remains:

winlin:srs winlin$ lsof |grep 16163|grep CLOSE_WAIT
winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT

It actually took 52 minutes to solve this problem. For an architecture as simple as ST's, that counts as a relatively troublesome problem...


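winlinvip changed the title from "CLUSTER: With http_remux configured on the edge, requesting a non-existent stream leaves a thread that cannot be cleaned up and floods the log" to "CLUSTER: Request non-existent HTTP streams, log flooding, thread does not exit, FD leakage" Apr 30, 2017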

chenliang2017 commented May 17, 2017

Compilation failed under Ubuntu.
src/app/srs_app_recv_thread.cpp:557:9: error: ‘ISrsHttpMessage’ was not declared in this scope.


@winlinvip
Member

@chenliang2017 Please file another bug.

@winlinvip
Member

Fixed by f2b4bc7

winlinvip self-assigned this Sep 23, 2021
winlinvip changed the title from its Chinese wording to the English "CLUSTER: Request non-existent HTTP streams, log flooding, thread does not exit, FD leakage." Jul 27, 2023
winlinvip added the TransByAI label Jul 27, 2023