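After deploying PaddleServing as a service, the server and client intermittently (every few days) report gRPC connection-reset errors, making the service unavailable #1829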

Closed
AI-Mart opened this issue Jul 22, 2022 · 1 comment

AI-Mart commented Jul 22, 2022

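This is already running in production, so we need a fix as soon as possible. Thank you.

Environment:

Base image: registry.baidubce.com/paddlepaddle/serving:0.8.3-cuda10.1-cudnn7-runtime
The gRPC version had to be changed, so grpcio==1.37.1 and grpcio-tools==1.37.1 are installed.
The following three packages are custom builds: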
paddle_serving_app-0.8.3-py3-none-any.whl
paddle_serving_client-0.8.3-cp37-none-any.whl
paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl

Error

ERROR 2022-07-21 03:15:39,100 [app.py:1892] Exception on /nlp/v1/company_policy/policy_res [POST]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/lib/python3.7/site-packages/flask_restful/__init__.py", line 467, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 89, in view
return self.dispatch_request(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
resp = meth(*args, **kwargs)
File "/deploy/main_api.py", line 53, in post
top_k_)
File "/deploy/semantics_match.py", line 103, in simi_search
ret = self.client.predict(feed_dict=feed)
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/pipeline/pipeline_client.py", line 202, in predict
resp = self._stub.inference(req)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1658373339.099364704","description":"Error received from peer ipv4:193.168.57.222:30088","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Connection reset by peer","grpc_status":14}"

Root cause investigation

The behavior matches what this post describes: http://longfan.me/post/devops/2020-07-09

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

ipvsadm -l --timeout
Timeout (tcp tcpfin udp): 900 120 300

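Because 900 < 7200 + 9 × 75 (all values in seconds), the Kubernetes VIP (IPVS) times out first and resets the long-lived gRPC connection between the server and the client.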
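
The comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the sysctl and ipvsadm values quoted in this issue; the function and constant names are made up for illustration. It also assumes kernel TCP keepalive is enabled on the socket at all, which this thread does not confirm for the gRPC connections.

```python
def keepalive_detection_window(time_s, probes, intvl_s):
    """Seconds an idle TCP connection can sit before the kernel gives up on it:
    idle time before the first probe, plus all unanswered probes."""
    return time_s + probes * intvl_s


# Values reported by sysctl and ipvsadm in this issue (seconds).
TCP_KEEPALIVE_TIME = 7200
TCP_KEEPALIVE_PROBES = 9
TCP_KEEPALIVE_INTVL = 75
IPVS_TCP_TIMEOUT = 900

window = keepalive_detection_window(
    TCP_KEEPALIVE_TIME, TCP_KEEPALIVE_PROBES, TCP_KEEPALIVE_INTVL
)

# The first keepalive probe is only sent after TCP_KEEPALIVE_TIME of idleness,
# so IPVS drops the idle entry long before any probe could refresh it.
ipvs_expires_before_first_probe = IPVS_TCP_TIMEOUT < TCP_KEEPALIVE_TIME

print(window)                           # 7875
print(ipvs_expires_before_first_probe)  # True
```

Note that the IPVS timeout (900) is already shorter than tcp_keepalive_time alone (7200), so the probe count and interval do not change the outcome here.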

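Question

Following the linked post (http://longfan.me/post/devops/2020-07-09):
gRPC is bundled inside the three packages paddle_serving_app-0.8.3-py3-none-any.whl,
paddle_serving_client-0.8.3-cp37-none-any.whl, and paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl. How can I configure gRPC's own timeout in code so that the timeout on both the server and the client is shorter than the system IPVS timeout of 900? Or do I have to modify the underlying PaddleServing code and rebuild these three packages?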
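
For reference, plain gRPC Python accepts keepalive settings as channel and server options, so one possible direction is sketched below. This is a generic gRPC sketch, not PaddleServing's API: the helper names (`keepalive_channel_options`, `keepalive_server_options`, `open_channel`) and the chosen intervals are assumptions for illustration, and this thread does not say whether PaddleServing lets callers pass such options without patching its code.

```python
def keepalive_channel_options(idle_timeout_s=900, divisor=3):
    """Client-side gRPC options that send HTTP/2 keepalive pings well inside
    the load balancer's idle timeout (hypothetical helper)."""
    ping_ms = idle_timeout_s * 1000 // divisor
    return [
        ("grpc.keepalive_time_ms", ping_ms),          # ping interval
        ("grpc.keepalive_timeout_ms", 20_000),        # wait for ping ack
        ("grpc.keepalive_permit_without_calls", 1),   # ping even when idle
        ("grpc.http2.max_pings_without_data", 0),     # no cap on idle pings
    ]


def keepalive_server_options(idle_timeout_s=900, divisor=3):
    """Server-side options so the server tolerates the client's ping rate
    instead of closing the connection for pinging too often."""
    ping_ms = idle_timeout_s * 1000 // divisor
    return [
        ("grpc.keepalive_permit_without_calls", 1),
        ("grpc.http2.min_ping_interval_without_data_ms", ping_ms // 2),
    ]


def open_channel(target, idle_timeout_s=900):
    """Create a channel with the keepalive options applied."""
    import grpc  # imported lazily; requires the grpcio package

    return grpc.insecure_channel(
        target, options=keepalive_channel_options(idle_timeout_s)
    )


if __name__ == "__main__":
    for name, value in keepalive_channel_options():
        print(name, value)
```

With the defaults, the client pings every 300 seconds, which stays below the IPVS timeout of 900. The server options would be passed as `grpc.server(..., options=keepalive_server_options())` in code that constructs the server directly.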

AI-Mart commented Jul 22, 2022

详细报错日志.txt (detailed error log, attached)

@paddle-bot paddle-bot bot closed this as completed Apr 16, 2024