Fix tcp_no_deplay setting by using int type #4058

htgeis · 2021-03-10T05:30:38Z

Issue: set TCP_NODELAY failed in distributed training and it could be found from the warning log.

The reason is that the type of kNoDelay should be int from https://man7.org/linux/man-pages/man2/getsockopt.2.html and https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-setsockopt

int setsockopt(int sockfd, int level, int optname,
                      const void *optval, socklen_t optlen);

Most socket-level options utilize an int argument for optval.
To enable a Boolean option, the optval parameter points to a nonzero integer. To disable the option optval points to an integer equal to zero.

jameslamb · 2021-03-10T18:06:31Z

Thanks for this @htgeis ! I'm not qualified to approve this PR, but want to note for other reviewers that this code is covered by the tests on lightgbm.dask, so the fact that those tests are passing is a good sign that this change is working as expected.

htgeis · 2021-03-18T10:07:21Z

Thanks for this @htgeis ! I'm not qualified to approve this PR, but want to note for other reviewers that this code is covered by the tests on lightgbm.dask, so the fact that those tests are passing is a good sign that this change is working as expected.

thanks for your comments @jameslamb and could you tell me who is qualified to approve this PR. This PR could improve the training speed a lot as I have tested by reducing the network cost and it's better to be included in the next release version if possible.

jameslamb · 2021-03-18T14:19:04Z

Oh! I didn't understand that this had any impact on training time. From the PR description, I thought this was just a small fix to suppress a warning.

Maybe @shiyu1994 will have time to quickly review before the release goes out, but if not this is going to miss release 3.2.0 (#3872) which I expect will go out in the next day or two.

In the future, when you open pull requests it would be helpful to include information about how the change improves LightGBM, either by directly explaining that or by linking to a previous discussion that led to the change. If I'd understood that this change had performance implications, I would have @-ed other maintainers a few days ago to try to get this reviewed and included in the next release.

Thanks again for taking the time to contribute to LightGBM!

htgeis · 2021-03-19T04:12:22Z

Oh! I didn't understand that this had any impact on training time. From the PR description, I thought this was just a small fix to suppress a warning.

Maybe @shiyu1994 will have time to quickly review before the release goes out, but if not this is going to miss release 3.2.0 (#3872) which I expect will go out in the next day or two.

In the future, when you open pull requests it would be helpful to include information about how the change improves LightGBM, either by directly explaining that or by linking to a previous discussion that led to the change. If I'd understood that this change had performance implications, I would have @-ed other maintainers a few days ago to try to get this reviewed and included in the next release.

Thanks again for taking the time to contribute to LightGBM!

Sorry, I forget to add some prerequisites. It would improve the speed of distributed training. Enable tcp_no_deplay would disable Nagle’s algorithm. Instead of coalescing small packets, it sends them to the pipe as soon as possible. In general, Nagle’s algorithm’s goal is to reduce the number of packets sent to save bandwidth and increase throughput with the trade-off sometimes to introduce increased latency to services. Distributed LightGBM training is sensitive to the network latency and it would reduce 30% training time in our tests case(2 workers, socket version, data parallel) by applying this patch.

shiyu1994

@htgeis Thanks for your contribution! LGTM.

jameslamb · 2021-03-24T17:26:52Z

@shiyu1994 @StrikerRUS at this point, can I merge code changes like this one? I remember last release that we didn't merge things for a few days after the release went out, in case it was necessary to do a hotfix.

StrikerRUS · 2021-03-24T23:35:08Z

@jameslamb

at this point, can I merge code changes like this one? I remember last release that we didn't merge things for a few days after the release went out, in case it was necessary to do a hotfix.

Oh, great question! I completely forgot about this, was about clicking Merge button but had to go.
I think you can merge this. If there will be any hotfix release, this PR is a good thing to have there in the opposite of RC releases where we really should froze the codebase and fix only reported bugs for the RC release.

jameslamb · 2021-03-24T23:38:14Z

ok great, thanks!

and @htgeis thanks again for your help, we really appreciate it.

github-actions · 2023-08-23T23:57:38Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

fix tcp_no_deplay type by using int

bdced90

htgeis requested review from btrotta, chivee and guolinke as code owners March 10, 2021 05:30

jameslamb added the fix label Mar 10, 2021

jameslamb requested a review from shiyu1994 March 18, 2021 14:16

shiyu1994 approved these changes Mar 24, 2021

View reviewed changes

jameslamb merged commit c591b77 into microsoft:master Mar 24, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tcp_no_deplay setting by using int type #4058

Fix tcp_no_deplay setting by using int type #4058

htgeis commented Mar 10, 2021

jameslamb commented Mar 10, 2021

htgeis commented Mar 18, 2021

jameslamb commented Mar 18, 2021

htgeis commented Mar 19, 2021

shiyu1994 left a comment

jameslamb commented Mar 24, 2021

StrikerRUS commented Mar 24, 2021

jameslamb commented Mar 24, 2021

github-actions bot commented Aug 23, 2023

Fix tcp_no_deplay setting by using int type #4058

Fix tcp_no_deplay setting by using int type #4058

Conversation

htgeis commented Mar 10, 2021

jameslamb commented Mar 10, 2021

htgeis commented Mar 18, 2021

jameslamb commented Mar 18, 2021

htgeis commented Mar 19, 2021

shiyu1994 left a comment

Choose a reason for hiding this comment

jameslamb commented Mar 24, 2021

StrikerRUS commented Mar 24, 2021

jameslamb commented Mar 24, 2021

github-actions bot commented Aug 23, 2023