[fleet] fix bind failed with Address already in use #38174
Merged
PR types
Bug fixes
PR changes
Others
Describe
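Fix bind failing with Address already in use. The cause of Address already in use was analyzed in #32892 and a fix was attempted there, but the error was later found to still occur.
Cause
Beyond the causes analyzed previously, further analysis found one more.
3. Paddle has a wait_server_ready function, which runs on card 0 to check whether the services on the other cards have started. The problem is that wait_server_ready on card 0 may run before the other cards call bind. wait_server_ready uses connect, which occupies a local port when it initiates a connection. That port may happen to be the one another card is about to use, which produces a TCP self-connection (the connection's source port equals its destination port) and causes that card's bind to fail. See https://my.oschina.net/u/2310891/blog/652323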
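To make the mechanism concrete, here is a minimal sketch (not Paddle code) showing that an outgoing connect makes the kernel assign a local port to the client socket. The function name demo_connect_uses_local_port is hypothetical, and the listener exists only so that the connect has something to reach. When nothing is listening on the target yet, that automatically chosen local port can coincide with the target port, which is the self-connection case described above.

```python
import socket


def demo_connect_uses_local_port():
    """Return (target_port, client_port) for one loopback connection."""
    # Stand-in for a service that is already listening; port 0 lets the
    # kernel pick any free port for it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    target_port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        # connect() implicitly binds the client to a local (ephemeral) port.
        client.connect(("127.0.0.1", target_port))
        client_port = client.getsockname()[1]
    finally:
        client.close()
        listener.close()
    return target_port, client_port


if __name__ == "__main__":
    target, local = demo_connect_uses_local_port()
    print("target port:", target, "local port taken by connect:", local)
```

The local port is held for as long as the client socket lives, so a probe that runs before the real service has bound its port competes for the same port range.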
Solution
Set reuse_port on the socket in the wait_server_ready function.
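A readiness probe along these lines might look like the sketch below. This is an illustration under assumptions, not the actual wait_server_ready implementation: the name endpoint_is_ready and its parameters are hypothetical, and SO_REUSEPORT is set only where the platform's socket module defines it.

```python
import socket
from contextlib import closing


def endpoint_is_ready(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.settimeout(timeout)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # SO_REUSEPORT is not available on every platform, so guard it.
        if hasattr(socket, "SO_REUSEPORT"):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        # connect_ex returns an error code instead of raising on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed endpoint for illustration only.
    print(endpoint_is_ready("127.0.0.1", 6170))
```

A caller would typically invoke such a probe in a loop with a short sleep until every endpoint reports ready.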
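That said, the previously noted problem still remains: there is a time gap between finding a free port and handing it to the C++ side, during which another program may take the port. There is currently no solution for this.
Reproduction test code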