-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tcp_no_deplay setting by using int type #4058
Conversation
Thanks for this @htgeis ! I'm not qualified to approve this PR, but want to note for other reviewers that this code is covered by the tests on |
thanks for your comments @jameslamb and could you tell me who is qualified to approve this PR. This PR could improve the training speed a lot as I have tested by reducing the network cost and it's better to be included in the next release version if possible. |
Oh! I didn't understand that this had any impact on training time. From the PR description, I thought this was just a small fix to suppress a warning. Maybe @shiyu1994 will have time to quickly review before the release goes out, but if not this is going to miss release 3.2.0 (#3872) which I expect will go out in the next day or two. In the future, when you open pull requests it would be helpful to include information about how the change improves LightGBM, either by directly explaining that or by linking to a previous discussion that led to the change. If I'd understood that this change had performance implications, I would have Thanks again for taking the time to contribute to LightGBM! |
Sorry, I forget to add some prerequisites. It would improve the speed of distributed training. Enable tcp_no_deplay would disable Nagle’s algorithm. Instead of coalescing small packets, it sends them to the pipe as soon as possible. In general, Nagle’s algorithm’s goal is to reduce the number of packets sent to save bandwidth and increase throughput with the trade-off sometimes to introduce increased latency to services. Distributed LightGBM training is sensitive to the network latency and it would reduce 30% training time in our tests case(2 workers, socket version, data parallel) by applying this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@htgeis Thanks for your contribution! LGTM.
@shiyu1994 @StrikerRUS at this point, can I merge code changes like this one? I remember last release that we didn't merge things for a few days after the release went out, in case it was necessary to do a hotfix. |
Oh, great question! I completely forgot about this, was about clicking |
ok great, thanks! and @htgeis thanks again for your help, we really appreciate it. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Issue: set TCP_NODELAY failed in distributed training and it could be found from the warning log.
The reason is that the type of kNoDelay should be int from https://man7.org/linux/man-pages/man2/getsockopt.2.html and https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-setsockopt