
Fix tcp_no_deplay setting by using int type #4058

Merged: 1 commit merged into microsoft:master on Mar 24, 2021

Conversation

@htgeis (Contributor) commented Mar 10, 2021

Issue: setting TCP_NODELAY fails in distributed training, as can be seen from the warning log.
[screenshot: warning log from the failed TCP_NODELAY call]

The reason is that kNoDelay should be of type int, per https://man7.org/linux/man-pages/man2/getsockopt.2.html and https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-setsockopt:

```c
int setsockopt(int sockfd, int level, int optname,
               const void *optval, socklen_t optlen);
```

> Most socket-level options utilize an int argument for optval. To enable a Boolean option, the optval parameter points to a nonzero integer. To disable the option, optval points to an integer equal to zero.
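
For illustration, a minimal sketch of the corrected pattern under POSIX sockets (the function name `EnableNoDelay` is hypothetical, not LightGBM's actual code). On Windows, optval is declared `const char*`, so the address is usually cast, but the pointed-to value is still expected to be an int-sized BOOL.

```cpp
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_NODELAY
#include <sys/socket.h>   // setsockopt

// Enable TCP_NODELAY on a connected socket. The option value must be an
// int: nonzero enables, zero disables. Passing a smaller type (a bool or
// char) makes optlen smaller than sizeof(int), which is exactly the kind
// of mismatch that can make setsockopt fail with the warning shown above.
int EnableNoDelay(int sockfd) {
  const int kNoDelay = 1;
  return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY,
                    &kNoDelay, sizeof(kNoDelay));
}
```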

@jameslamb (Collaborator) commented:

Thanks for this @htgeis! I'm not qualified to approve this PR, but I want to note for other reviewers that this code is covered by the tests on lightgbm.dask, so the fact that those tests are passing is a good sign that this change is working as expected.

@htgeis (Contributor, Author) commented Mar 18, 2021

> Thanks for this @htgeis! I'm not qualified to approve this PR, but I want to note for other reviewers that this code is covered by the tests on lightgbm.dask, so the fact that those tests are passing is a good sign that this change is working as expected.

Thanks for your comments @jameslamb. Could you tell me who is qualified to approve this PR? This PR improves training speed a lot by reducing network cost, as I have verified in testing, so it would be good to include it in the next release if possible.

@jameslamb (Collaborator) commented:

Oh! I didn't understand that this had any impact on training time. From the PR description, I thought this was just a small fix to suppress a warning.

Maybe @shiyu1994 will have time to quickly review before the release goes out, but if not, this is going to miss release 3.2.0 (#3872), which I expect will go out in the next day or two.

In the future, when you open pull requests it would be helpful to include information about how the change improves LightGBM, either by directly explaining that or by linking to a previous discussion that led to the change. If I'd understood that this change had performance implications, I would have @-ed other maintainers a few days ago to try to get this reviewed and included in the next release.

Thanks again for taking the time to contribute to LightGBM!

@htgeis (Contributor, Author) commented Mar 19, 2021

> Oh! I didn't understand that this had any impact on training time. From the PR description, I thought this was just a small fix to suppress a warning.
>
> Maybe @shiyu1994 will have time to quickly review before the release goes out, but if not, this is going to miss release 3.2.0 (#3872), which I expect will go out in the next day or two.
>
> In the future, when you open pull requests it would be helpful to include information about how the change improves LightGBM, either by directly explaining that or by linking to a previous discussion that led to the change. If I'd understood that this change had performance implications, I would have @-ed other maintainers a few days ago to try to get this reviewed and included in the next release.
>
> Thanks again for taking the time to contribute to LightGBM!

Sorry, I forgot to add some prerequisites. This improves the speed of distributed training: enabling TCP_NODELAY disables Nagle's algorithm, so instead of coalescing small packets, the kernel sends them to the wire as soon as possible. In general, the goal of Nagle's algorithm is to reduce the number of packets sent, saving bandwidth and increasing throughput, with the trade-off of sometimes introducing extra latency. Distributed LightGBM training is sensitive to network latency, and applying this patch reduced training time by about 30% in our test case (2 workers, socket version, data parallel).
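
For concreteness, here is a self-contained loopback sketch (illustrative only, not LightGBM code; the port, message sizes, and names are made up) of the classic write-write-read pattern that Nagle's algorithm penalizes: with TCP_NODELAY off, the second small send can sit in the kernel until the first is acknowledged, and the peer's delayed-ACK timer can hold that acknowledgement back for tens of milliseconds per round trip.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdio>
#include <thread>

static constexpr int kIters = 200;

// Read exactly n bytes (error handling omitted for brevity).
static void RecvAll(int fd, char* buf, size_t n) {
  size_t got = 0;
  while (got < n) {
    ssize_t r = recv(fd, buf + got, n - got, 0);
    if (r <= 0) return;
    got += static_cast<size_t>(r);
  }
}

int main() {
  const bool use_nodelay = true;  // flip to false to see Nagle's effect

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(15151);  // arbitrary port for this demo

  int srv = socket(AF_INET, SOCK_STREAM, 0);
  bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(srv, 1);

  // Echo server: waits for a full 8-byte message, then replies.
  std::thread server([srv] {
    int conn = accept(srv, nullptr, nullptr);
    char buf[8];
    for (int i = 0; i < kIters; ++i) {
      RecvAll(conn, buf, 8);
      send(conn, buf, 8, 0);
    }
    close(conn);
  });

  int cli = socket(AF_INET, SOCK_STREAM, 0);
  connect(cli, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  const int flag = use_nodelay ? 1 : 0;  // int, per the setsockopt contract
  setsockopt(cli, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

  char header[4] = "hdr", body[4] = "bdy", reply[8];
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    send(cli, header, 4, 0);  // small write #1
    send(cli, body, 4, 0);    // small write #2: Nagle holds this one back
    RecvAll(cli, reply, 8);   // read: server replies only after both arrive
  }
  auto ms = std::chrono::duration<double, std::milli>(
      std::chrono::steady_clock::now() - start).count();
  std::printf("%d round trips in %.1f ms\n", kIters, ms);

  server.join();
  close(cli);
  close(srv);
  return 0;
}
```

Compile with `-pthread`; flipping `use_nodelay` to false typically makes the loop dramatically slower on Linux, which is the same effect the distributed training synchronization traffic sees.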

@shiyu1994 (Collaborator) left a comment:


@htgeis Thanks for your contribution! LGTM.

@jameslamb (Collaborator) commented:

@shiyu1994 @StrikerRUS at this point, can I merge code changes like this one? I remember that last release we didn't merge things for a few days after it went out, in case it was necessary to do a hotfix.

@StrikerRUS (Collaborator) commented:

@jameslamb
> at this point, can I merge code changes like this one? I remember that last release we didn't merge things for a few days after it went out, in case it was necessary to do a hotfix.

Oh, great question! I completely forgot about this; I was about to click the Merge button but had to go.
I think you can merge this. If there is any hotfix release, this PR would be a good thing to have there, as opposed to RC releases, where we really should freeze the codebase and fix only bugs reported against the RC.

@jameslamb (Collaborator) commented:

ok great, thanks!

and @htgeis thanks again for your help, we really appreciate it.

@jameslamb merged commit c591b77 into microsoft:master on Mar 24, 2021
@github-actions (bot) commented:

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023