K8s: CloudNative: Support SLB health check. #1598

Closed
winlinvip opened this issue Feb 12, 2020 · 6 comments

winlinvip commented Feb 12, 2020

Description

  1. SRS version: 3.0.112
  2. The log of SRS is as follows:
[2020-02-12 11:08:02.825][Warn][1][430][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4749
thread [1][430]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][430]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

[2020-02-12 11:08:03.013][Warn][1][431][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4750
thread [1][431]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][431]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

Reproduction

Steps to reproduce the bug:

  1. Put SRS behind a cloud provider's load balancing service (SLB).
  2. Set the health check to TCP or HTTP.
  3. SRS writes a large number of warning logs, approximately one every 100 milliseconds.

Expected behavior:

Support SLB health checks over TCP or HTTP; refer to Aliyun SLB Health Check.

@winlinvip winlinvip added Bug It might be a bug. Enhancement Improvement or enhancement. labels Feb 12, 2020
@winlinvip winlinvip added this to the SRS 4.0 release milestone Feb 12, 2020

winlinvip commented Feb 12, 2020

SRS3 will be the main version in use for some time, so cloud-native support will be prioritized in SRS3, unless the changes are large enough to affect stability.

winlinvip commented Feb 16, 2020

Currently, the SLB's TCP keep-alive (health check) detection itself works without problems, but it causes SRS to write a large number of useless logs.

[2020-02-16 14:00:24.542][Warn][1][471][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=8288
thread [1][471]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][471]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

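Filtering out these useless entries shrinks the log by more than half. In the listing below, the file goes from 2485201 bytes (t.log) to 242670 bytes (t2.log), roughly a tenfold reduction.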

-rw-r--r--   1 chengli.ycl  staff  2485201 Feb 16 22:00 t.log
-rw-r--r--   1 chengli.ycl  staff   242670 Feb 16 22:12 t2.log
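
A minimal sketch of such a filter, assuming each noisy entry is one warning line containing `fd2conn : ignore empty ip` followed by its `thread [...]` stack lines, as in the log above. The name `filter_health_check_noise` is hypothetical and not part of SRS.

```cpp
#include <string>
#include <vector>

// Returns true if `line` starts with `prefix`.
static bool starts_with(const std::string& line, const std::string& prefix) {
    return line.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: drops every "ignore empty ip" warning together with
// the "thread [...]" stack lines that immediately follow it. All other
// lines are kept in their original order.
std::vector<std::string> filter_health_check_noise(const std::vector<std::string>& lines) {
    std::vector<std::string> kept;
    bool skipping_stack = false;
    for (const std::string& line : lines) {
        if (line.find("fd2conn : ignore empty ip") != std::string::npos) {
            skipping_stack = true;  // first line of a noisy entry
            continue;
        }
        if (skipping_stack && starts_with(line, "thread [")) {
            continue;               // stack line that belongs to the noisy entry
        }
        skipping_stack = false;     // any other line ends the noisy entry
        kept.push_back(line);
    }
    return kept;
}
```

Reading `t.log` line by line into a vector, passing it through this function, and writing the result out would give a file comparable to `t2.log`.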

winlinvip commented Feb 16, 2020

Logs need to be collected centrally.
Otherwise, if there is a problem playing livestream.flv, for example, you have to check each edge node one by one:

Mac:srs chengli.ycl$ kubectl get po |grep edge
srs-edge-deploy-5cfd4b5b74-7hwfh      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-crgtn      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-gbzsp      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-rx856      1/1     Running   0          75m

Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-7hwfh grep 'livestream.flv' objs/srs.log
[2020-02-16 13:38:12.800][Trace][1][552] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-crgtn grep 'livestream.flv' objs/srs.log
[2020-02-16 14:33:35.624][Trace][1][780] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-gbzsp grep 'livestream.flv' objs/srs.log
command terminated with exit code 1
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-rx856 grep 'livestream.flv' objs/srs.log
[2020-02-16 13:42:44.325][Trace][1][369] HTTP GET http://r.ossrs.net:8080/live/livestream.flv, content-length=-1
[2020-02-16 13:42:44.325][Trace][1][369] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
[2020-02-16 13:42:44.325][Trace][1][369] FLV /live/livestream.flv, encoder=FastFLV, nodelay=0, mw_sleep=350ms, cache=0, msgs=128
Mac:srs.wiki chengli.ycl$ 

You then have to correlate the results by timestamp. If the logs are collected in SLS instead, searching becomes easy: just enter livestream.flv in SLS to find all the information about this stream on all nodes.

winlinvip commented Feb 21, 2020

For the SLB's TCP keep-alive detection connections, SRS fails to retrieve the peer information. Such a connection appears as follows in lsof:

COMMAND PID   USER   FD   TYPE  DEVICE SIZE/OFF    NODE NAME
srs     693 winlin   14u  sock     0,6      0t0 7163442 can't identify protocol

SRS gets an error when it tries to obtain the peer address of the file descriptor:

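string srs_get_peer_ip(int fd)
{
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    // For the SLB detection connection, getpeername() fails here
    // (errno=107 in the log, ENOTCONN on Linux), so an empty IP is returned.
    if (getpeername(fd, (sockaddr*)&addr, &addrlen) == -1) {
        return "";
    }
    // ... (rest of the function omitted)
}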

Every such failure is logged as an "ignore empty ip" warning, which results in a large number of error messages.
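
The failure mode can be reproduced in isolation. The sketch below is an illustration under assumptions, not SRS code: it uses a TCP socket that was never connected, which is a different situation from a connection the SLB has already closed, but `getpeername()` fails in the same way, with `ENOTCONN` (107 on Linux, matching `errno=107` in the logs). The helper names are hypothetical.

```cpp
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: returns true when getpeername() fails with ENOTCONN,
// i.e. the socket has no peer whose address could be reported.
bool peer_is_gone(int fd) {
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    if (getpeername(fd, (sockaddr*)&addr, &addrlen) == -1) {
        return errno == ENOTCONN;
    }
    return false;
}

// Demonstration: a freshly created, never-connected TCP socket has no peer,
// so getpeername() on it fails with ENOTCONN.
bool demo_unconnected_socket_has_no_peer() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;  // could not create a socket at all
    }
    bool gone = peer_is_gone(fd);
    close(fd);
    return gone;
}
```

A server can use a check like this to tell a vanished health check connection apart from a real client before deciding what to log.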

Capture packets using tcpdump:

sudo tcpdump -i eth0 tcp port 2935 -w t.pcap

The SRS server IP is 172.17.1.57, and the SLB IP is 100.121.184.64:

(Image: packet capture of the SLB health check traffic)

The figure above shows that packets 1-2-3-4 form one heartbeat: the second packet is sent by the SRS server, after which the SLB immediately closes the connection. The second heartbeat, packets 5-6-7-8, starts only 0.3 seconds later, even though the detection interval configured on the SLB is 2 seconds. The gap is shorter because an SLB has around 10 LVS nodes performing the detection. For the health check mechanism, refer to: TCP Listening Health Check Mechanism
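
As a rough back-of-the-envelope check, the sketch below computes the average gap between probes as seen by the server, assuming every LVS node probes independently at the configured interval and the probes are spread evenly over time. Both assumptions are simplifications, and the function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: average gap, in milliseconds, between health check
// probes arriving at one backend when `probers` nodes each probe once per
// `interval_ms`, assuming the probes are evenly spread.
int expected_probe_gap_ms(int interval_ms, int probers) {
    assert(interval_ms > 0 && probers > 0);
    return interval_ms / probers;
}
```

With a 2000 ms interval and 10 probing nodes this gives 200 ms, the same order of magnitude as the 0.3 second gap in the capture and the roughly 100 ms log rate reported earlier.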

winlinvip commented Feb 21, 2020

SRS now supports TCP-based health checks, and the support is enabled by default: connections for which the peer IP cannot be obtained are ignored without warnings or errors.

# Whether client empty IP is ok, for example, health checking by SLB.
# If ok(on), we will ignore this connection without warnings or errors.
# default: on
empty_ip_ok on;
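
The behavior described by this option can be sketched as a small decision function. This is an illustration of the documented semantics only, not the actual SRS implementation; `AcceptAction` and `classify_accepted_connection` are hypothetical names.

```cpp
#include <string>

// Hypothetical outcome of inspecting a freshly accepted connection.
enum class AcceptAction {
    Serve,            // normal client with a known peer IP
    DropSilently,     // empty IP and empty_ip_ok is on: no warning, no error
    DropWithWarning   // empty IP and empty_ip_ok is off: log a warning
};

// Decides what to do with an accepted connection, given the peer IP
// (empty when it could not be obtained) and the empty_ip_ok setting.
AcceptAction classify_accepted_connection(const std::string& peer_ip, bool empty_ip_ok) {
    if (!peer_ip.empty()) {
        return AcceptAction::Serve;
    }
    return empty_ip_ok ? AcceptAction::DropSilently : AcceptAction::DropWithWarning;
}
```

With the default `empty_ip_ok on;`, a health check connection that yields an empty IP maps to `DropSilently`, so the warnings shown earlier in this issue are no longer written.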

Fixed

@winlinvip winlinvip self-assigned this Sep 5, 2021
@winlinvip winlinvip changed the title Support SLB heath check. K8s: CloudNative: Support SLB heath check. Feb 19, 2022
@winlinvip winlinvip added the Kubernetes For K8s, Prometheus, APM and Grafana. label Sep 1, 2022
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 28, 2023