[Bug]: After running for a while, the client can no longer connect to the port-simulation server process #421
Comments
Most likely the process crashed due to insufficient memory. We suggest first verifying the program's correctness with a small dataset and a small model.
After verification: I switched to a host with 128 GB of RAM. Over a long run it went from about 20 datapoints on the previous 64 GB machine to roughly 200 datapoints now, but memory keeps growing slowly the whole time. I have reviewed my own program and it contains no growing lists or similar structures, so I suspect a memory leak in the server.
Could you share the code so we can try to reproduce it?
Here is the memory log I collected, one measurement after each inference run. The client's memory barely changes between runs, but total memory usage keeps increasing:
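For reference, per-iteration statistics like the log above can be collected with just the standard library. This is a sketch with my own (hypothetical) names, not the reporter's actual logging code:

```python
import resource
import sys

def peak_rss_kib() -> int:
    """Peak resident set size of the current process, in KiB.

    getrusage reports ru_maxrss in KiB on Linux but in bytes on macOS,
    so normalize before returning.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss // 1024 if sys.platform == "darwin" else rss

# Log one sample after each inference, e.g.:
# for i, x in enumerate(dataset):
#     run_inference(x)
#     print(f"iter {i}: peak RSS {peak_rss_kib()} KiB")
```

Note this only covers the logging process itself; for the server processes you would sample their RSS externally (e.g. via htop, as requested later in this thread).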
I sent the code to your colleague.
distributed itself is a framework intended for performance testing; for performance reasons it defers GC until a certain number of objects have accumulated. Rewriting the code so the model is loaded once and then reused fixes this; the diff is as follows
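The actual diff is not reproduced in this thread. Purely as an illustration (all names hypothetical, not the maintainers' patch), the load-once-reuse pattern being described looks like:

```python
# Illustrative sketch only -- not the actual diff from the maintainers.
# The idea: hoist the expensive parameter load out of the per-datapoint
# loop so only one copy of the parameters ever exists.
from functools import lru_cache

@lru_cache(maxsize=1)
def load_params():
    # Stand-in for loading PRETRAINED_MODEL.params once.
    return {"weights": list(range(1000))}

def run_inference(x: int) -> int:
    params = load_params()  # same cached object on every call
    return x * len(params["weights"])
```

With this shape, repeated inference calls no longer hand the framework a fresh copy of the parameters to track (and eventually garbage-collect) on every iteration.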
@warriorpaw , after applying the change it errors out:
Replace the failing line with spu_params = ppd.device("SPU")._place_arguments(ppd.device("P2")(lambda x: x)(PRETRAINED_MODEL.params))[0][0]
@warriorpaw After the modification it runs now, thanks. The current memory situation is as follows: the model parameters are effectively never released, so memory is high right from the start, yet it still grows slowly, which is consistent with the earlier analysis. One more note: even on the 128 GB machine, memory starts at around 20% and eventually fills up completely, causing the server to die.
input_id has the same problem, but it is much smaller than the model itself, so the effect is far less visible. There is no explicit GC API. You can manually decrease the _GC_COLLECT_THRESHOLD variable at line 1052 of distributed.py to trigger GC sooner.
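To illustrate why a high deferral threshold makes memory climb between collections, here is a generic toy model of threshold-triggered cleanup (not SPU's actual implementation):

```python
class ObjectStore:
    """Toy model of deferred GC: released objects are only reclaimed
    once enough of them have piled up (cf. _GC_COLLECT_THRESHOLD)."""

    def __init__(self, gc_threshold: int = 16):
        self._objs: dict = {}
        self._dead: list = []
        self.gc_threshold = gc_threshold

    def put(self, key, value):
        self._objs[key] = value

    def release(self, key):
        self._dead.append(key)                    # not freed yet!
        if len(self._dead) >= self.gc_threshold:  # deferred GC fires
            for k in self._dead:
                self._objs.pop(k, None)
            self._dead.clear()
```

Lowering the threshold trades some throughput for a smaller peak footprint, which is exactly the trade-off the advice above suggests.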
We cannot reproduce the problem you describe. We ran the code you provided for over 12 hours, 418 iterations in total, and final memory usage never exceeded 35 GB. Memory did grow slightly during the test, possibly because of unreleased fragments inside malloc, but we saw nothing like the rapid growth past 120+ GB that you describe. Please provide htop comparison screenshots: one right after the first iteration and one after usage has grown to 100+ GB, detailed enough to show which processes account for the increase. Thanks.
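One quick way to test the malloc-fragmentation hypothesis on Linux (assuming glibc is the allocator) is to force the allocator to return freed pages to the OS and watch whether RSS drops:

```python
import ctypes

# glibc-specific: malloc_trim(0) releases free memory at the top of the
# heap (and, in modern glibc, free pages inside arenas) back to the OS.
libc = ctypes.CDLL("libc.so.6")
released = libc.malloc_trim(0)  # returns 1 if any memory was released, else 0
```

If calling this periodically on the server makes RSS fall, the growth is fragmentation rather than a true leak.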
@warriorpaw I will share an Aliyun environment with you privately; the problem reproduces every time there. On the clean Aliyun system below, I have to restart the client and server roughly every 2 hours to clear memory before the remaining datapoints can run (I even wrote a daemon just for this...)
After running on two other Ubuntu 22.04 machines, the loop keeps running indefinitely, so this is presumably a problem with glibc malloc on Ubuntu 20.04. The following approaches can be tried:
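The thread does not reproduce the actual list of methods. Common workarounds for glibc-malloc fragmentation on Ubuntu 20.04 include tuning the allocator's environment knobs or swapping the allocator out entirely (the library path below is typical for x86_64 Ubuntu; verify it on your system):

```shell
# Limit the number of per-thread malloc arenas
# (less fragmentation, at the cost of some lock contention):
export MALLOC_ARENA_MAX=2
# Return freed memory to the OS more eagerly
# (threshold in bytes; note the trailing underscore in the name):
export MALLOC_TRIM_THRESHOLD_=131072

# Or bypass glibc malloc with jemalloc (apt install libjemalloc2),
# then start the server however you normally do, e.g.:
# LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python nodectl.py ...
```

These only mitigate allocator-level fragmentation; they would not help if the growth were a genuine object leak.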
1. I run inference for a classification task (not text generation) on a RoBERTa model with SPU, looping over many datapoints. After a few dozen datapoints, the client side (the model-inference side) reliably errors out with:
Traceback (most recent call last):
File "roberta.py", line 114, in <module>
pre = run_on_spu(inputs_ids, PRETRAINED_MODEL.params)
File "roberta.py", line 59, in run_on_spu
outputs = ppd.device("SPU")(
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 679, in __call__
results = [future.result() for future in futures]
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 679, in <listcomp>
results = [future.result() for future in futures]
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 249, in run
return self._call(self._stub.Run, fn, *args, **kwargs)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 238, in _call
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 218, in rebuild_messages
return b''.join([msg for msg in msgs])
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 218, in <listcomp>
return b''.join([msg for msg in msgs])
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 238, in <genexpr>
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/grpc/_channel.py", line 541, in __next__
return self._next()
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/grpc/_channel.py", line 967, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Socket closed", grpc_status:14, created_time:"2023-11-27T21:18:41.065482153+08:00"}"
2. At the same time, the server side (the port-simulation side) reports:
[2023-11-27 21:15:33.074] [info] [api.cc:204] Link details: total send bytes 4412083004, send actions 25207
W1127 21:18:40.795603 3191480 external/com_github_brpc_brpc/src/brpc/input_messenger.cpp:375] Fail to read from Socket{id=0 fd=17 addr=127.0.0.1:9930:35130} (0x7fae24068280): Connection reset by peer
I1127 21:18:40.895106 3191540 external/com_github_brpc_brpc/src/brpc/socket.cpp:2466] Checking Socket{id=0 addr=127.0.0.1:9930} (0x7fae20068280)
I1127 21:18:40.905504 3191551 external/com_github_brpc_brpc/src/brpc/socket.cpp:2466] Checking Socket{id=0 addr=127.0.0.1:9930} (0x7fae24068280)
[2023-11-27 21:18:41.428] [info] [channel.cc:346] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '104', http status code '0', response header '', response body '', error msg '[E104]Fail to read from Socket{id=0 fd=17 addr=127.0.0.1:9930:35130} (0x0x7fae24068280): Connection reset by peer'
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x7faece344ed0
#1 yacl::link::transport::Channel::SendRequestWithRetry()+0x7faece358639
#2 yacl::link::transport::Channel::SendMono()+0x7faece358987
#3 yacl::link::transport::SendTask::Proc()+0x7faece35ba72
#4 bthread::TaskGroup::task_runner()+0x7faece4dce17
#5 bthread_make_fcontext+0x7faece4f1d71
......
[2023-11-27 21:18:45.440] [info] [channel.cc:346] send request failed and retry, retry_count=3, max_retry=3, interval_ms=5000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 127.0.0.1:9930 yet, server_id=0'
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x7faece344ed0
#1 yacl::link::transport::Channel::SendRequestWithRetry()+0x7faece358639
#2 yacl::link::transport::Channel::SendMono()+0x7faece358987
#3 yacl::link::transport::SendTask::Proc()+0x7faece35ba72
#4 bthread::TaskGroup::task_runner()+0x7faece4dce17
#5 bthread_make_fcontext+0x7faece4f1d71
[2023-11-27 21:18:50.451] [error] [channel.cc:98] SendImpl error [external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 127.0.0.1:9930 yet, server_id=0'
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x7faece344ed0
#1 yacl::link::transport::Channel::SendRequestWithRetry()+0x7faece358639
#2 yacl::link::transport::Channel::SendMono()+0x7faece358987
#3 yacl::link::transport::SendTask::Proc()+0x7faece35ba72
#4 bthread::TaskGroup::task_runner()+0x7faece4dce17
#5 bthread_make_fcontext+0x7faece4f1d71
[2023-11-27 21:18:50.451] [error] [channel.cc:98] SendImpl error [external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 127.0.0.1:9930 yet, server_id=0'
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x7faece344ed0
#1 yacl::link::transport::Channel::SendRequestWithRetry()+0x7faece358639
#2 yacl::link::transport::Channel::SendMono()+0x7faece358987
#3 yacl::link::transport::SendTask::Proc()+0x7faece35ba72
#4 bthread::TaskGroup::task_runner()+0x7faece4dce17
#5 bthread_make_fcontext+0x7faece4f1d71
[2023-11-27 21:18:50.452] [error] [channel.cc:98] SendImpl error [external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 127.0.0.1:9930 yet, server_id=0'
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x7faece344ed0
#1 yacl::link::transport::Channel::SendRequestWithRetry()+0x7faece358639
#2 yacl::link::transport::Channel::SendMono()+0x7faece358987
#3 yacl::link::transport::SendTask::Proc()+0x7faece35ba72
#4 bthread::TaskGroup::task_runner()+0x7faece4dce17
#5 bthread_make_fcontext+0x7faece4f1d71
I1127 21:18:51.702651 3191492 external/com_github_brpc_brpc/src/brpc/socket.cpp:2466] Checking Socket{id=1 addr=127.0.0.1:9931} (0x7fae20068500)
terminate called after throwing an instance of 'yacl::IoError'
what(): [external/yacl/yacl/link/transport/channel.cc:405] Get data timeout, key=root-176919:P2P-1:0->2
Stacktrace:
#0 yacl::link::Context::RecvInternal()+0x7faece34f99b
#1 yacl::link::Context::Recv()+0x7faece351e06
#2 spu::mpc::Communicator::rotate()+0x7faecdf9eda5
#3 spu::mpc::aby3::MatMulAA::proc()+0x7faecd68c54b
#4 spu::dynDispatch<>()+0x7faecdfe7708
#5 spu::mpc::mmul_aa()+0x7faecdffcf45
#6 spu::mpc::mmul_ss()+0x7faecdfef909
#7 spu::kernel::hal::_mmul_ss()+0x7faecdfdb86b
#8 spu::kernel::hal::_mmul_impl()+0x7faecdfc80a2
#9 spu::kernel::hal::_mmul()+0x7faecdfcdf6e
#10 spu::kernel::hal::f_mmul()+0x7faecdf63390
#11 spu::kernel::hal::(anonymous namespace)::dtypeBinaryDispatch<>()+0x7faecdf4ba4c
#12 spu::kernel::hal::matmul()+0x7faecdf4c49d
#13 spu::kernel::hlo::Dot()+0x7faecdf3030f
#14 spu::device::pphlo::dispatchOp<>()+0x7faecd911a2c
#15 spu::device::pphlo::dispatchOp<>()+0x7faecd913e9c
Process Process-4:
Traceback (most recent call last):
File "nodectl.py", line 57, in <module>
worker.join()
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/process.py", line 149, in join
^C File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 224, in __call__
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 464, in close_fds
AttributeError: 'NoneType' object has no attribute 'close'
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 224, in __call__
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 464, in close_fds
AttributeError: 'NoneType' object has no attribute 'close'
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 224, in __call__
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 464, in close_fds
AttributeError: 'NoneType' object has no attribute 'close'
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 224, in __call__
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/multiprocess/util.py", line 464, in close_fds
AttributeError: 'NoneType' object has no attribute 'close'
3. Without restarting the server, any further client use is guaranteed to fail; after restarting the server and resuming from the failed datapoint, it again runs normally for a few dozen datapoints. If the server is not restarted, the client errors with:
Traceback (most recent call last):
File "roberta.py", line 87, in <module>
ppd.init(conf["nodes"], conf["devices"])
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 1110, in init
_CONTEXT = HostContext(nodes_def, devices_def)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 1037, in __init__
self.devices[name] = SPU(
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 952, in __init__
results = [future.result() for future in futures]
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 952, in <listcomp>
results = [future.result() for future in futures]
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/xxx/miniconda3/envs/spu/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 249, in run
return self._call(self._stub.Run, fn, *args, **kwargs)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 238, in _call
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 218, in rebuild_messages
return b''.join([msg for msg in msgs])
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 218, in <listcomp>
return b''.join([msg for msg in msgs])
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/spu/utils/distributed.py", line 238, in <genexpr>
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/grpc/_channel.py", line 541, in __next__
return self._next()
File "/xxx/miniconda3/envs/spu/lib/python3.8/site-packages/grpc/_channel.py", line 950, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:9920: Failed to connect to remote host: Connection refused"
debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:9920: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-11-28T08:47:33.462023675+08:00"}"