Use double precision in threaded calculation of linear tree coefficients (fixes #5226) #5368

Merged
merged 9 commits into from
Jul 29, 2022

Collaborator

@btrotta btrotta commented Jul 12, 2022

Fixes #5266

When calculating the linear tree coefficients, we need to calculate some matrix products. This calculation is multi-threaded for efficiency. However, floating-point addition can give slightly different results depending on the order in which terms are added, and this was causing the calculated coefficients to vary with the number of threads. I've resolved this by making these matrices double precision instead of single, so the inaccuracies are less significant. This will not use much additional memory, since the size of these matrices is only O(num_features ^ 2).
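
To illustrate the order-dependence described above, here is a minimal Python sketch (not LightGBM's actual code). It emulates single precision by rounding every partial sum through `struct`, splits a list into contiguous chunks as a stand-in for threads, and then combines the per-chunk sums. The helper names `to_f32` and `chunked_sum` and the input values are assumptions chosen for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_f64(x):
    """Python floats are already double precision, so nothing to round."""
    return x


def chunked_sum(values, num_threads, round_fn):
    """Sum contiguous chunks separately, then add up the partial sums.

    round_fn is applied after every addition to emulate the accumulator's
    precision. Each chunk plays the role of one thread's share of the work.
    """
    size = -(-len(values) // num_threads)  # ceiling division
    partials = []
    for start in range(0, len(values), size):
        acc = 0.0
        for v in values[start:start + size]:
            acc = round_fn(acc + v)
        partials.append(acc)
    total = 0.0
    for p in partials:
        total = round_fn(total + p)
    return total


# One large term followed by many small ones. Near 1e8 the spacing between
# adjacent single-precision values is 8, so adding 1.0 at a time is lost.
values = [1e8] + [1.0] * 16

for n in (1, 2, 4):
    print(n, chunked_sum(values, n, to_f32), chunked_sum(values, n, to_f64))
# 1 100000000.0 100000016.0
# 2 100000008.0 100000016.0
# 4 100000016.0 100000016.0
```

In this toy input the single-precision result changes with the number of chunks, while the double-precision result does not. Double precision does not remove the order-dependence in general; it only makes the discrepancies much smaller.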

I also added a preprocessor directive to make Eigen calls single-threaded, since Eigen is always called inside a for-loop that is already parallelized.

@jameslamb jameslamb added the fix label Jul 12, 2022
Collaborator

@jameslamb jameslamb left a comment

Thanks so much for fixing this! I left a few initial suggestions for your consideration.

@@ -109,6 +109,7 @@ include_directories(${EIGEN_DIR})

# See https://gitlab.com/libeigen/eigen/-/blob/master/COPYING.README
add_definitions(-DEIGEN_MPL2_ONLY)
add_definitions(-DEIGEN_DONT_PARALLELIZE)
Collaborator

Can you please add this to the flags used by the R package as well?

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY"

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY"

Collaborator Author

Done!

Collaborator Author

This change seems to be causing CI to fail; I think the problem is that I need to regenerate the configure file in R-package. Is there a way to do this on Windows?

Collaborator

For non-Windows, we have a comment-triggered CI job that will update configure based on changes to configure.ac. I'll trigger that job right now.

For configure.win, nothing needs to be regenerated. configure.win is executed directly, instead of being used as a template.

These things are documented at https://github.com/microsoft/LightGBM/tree/master/R-package#changing-the-cran-package but that README is fairly large so it's easy to miss.

Comment on lines 3052 to 3053
fd = FileLoader(EXAMPLES_DIR / 'binary_classification', 'binary',
'train_linear.conf')
Collaborator

Would you consider just re-defining EXAMPLES_DIR at the top of this file + using load_breast_cancer() to get a binary classification dataset, the way that other tests in this file do?

I worry that having one test file import from another could cause issues for pytest in the future (even though right now I don't notice any). There are not any other places in this project's Python tests today where one test_*.py file imports from another.

Collaborator Author

I've changed it to use some randomly generated data (I was unable to reproduce the failure with the breast cancer dataset).

Collaborator

Ah ok, I didn't realize that it could be dataset-specific. Sorry for creating extra work for you! I guess we could have also avoided "test file importing another test file" by moving FileLoader to utils.py. That would probably be good to do anyway (in a separate PR).

Anyway, if the new randomly-generated data is sufficient to reproduce the underlying issue fixed by this PR, I'm good with it!

@jameslamb
Collaborator

/gha run r-configure

@jameslamb
Collaborator

hmmm the run-configure job failed (build link)

Error: fatal: couldn't find remote ref refs/heads/linear-threading

I'll try one more run, and if it still fails I'll look into this later. Sorry for the disruption 😭

@jameslamb
Collaborator

/gha run r-configure

@jameslamb
Collaborator

I see the issue! I believe that updating R-package/configure on a PR from a fork doesn't currently work. I've documented that in #5371 and will work on a fix, but it doesn't need to block this PR.

@btrotta I just regenerated configure locally using the dockerized steps mentioned at https://github.com/microsoft/LightGBM/tree/master/R-package#changing-the-cran-package. I tried to push those changes to your branch (which I thought I could do as a maintainer here), but unfortunately I got a "permission denied".

Thankfully, autoconf only generated a change on one line.

Can you please change this line

https://github.com/btrotta/LightGBM/blob/32f564c26017fbae7c78663046dffd7ca0eec177/R-package/configure#L1716

to

LGB_CPPFLAGS="${LGB_CPPFLAGS} -DEIGEN_MPL2_ONLY -DEIGEN_DONT_PARALLELIZE"

Sorry again for the disruption.

@jameslamb
Collaborator

jameslamb commented Jul 13, 2022

One other thing... as a maintainer you have permissions to push branches directly to LightGBM instead of using your fork. I recommend doing that in the future.

@btrotta
Collaborator Author

btrotta commented Jul 16, 2022

@jameslamb Thanks for your help, I've updated R-package/configure now

Collaborator

@guolinke guolinke left a comment

Thank you!

@btrotta
Collaborator Author

btrotta commented Jul 24, 2022

@jameslamb If you're happy with the changes, could you please approve?

Collaborator

@StrikerRUS StrikerRUS left a comment

Thanks a lot for the fix!

@jameslamb
Collaborator

If you're happy with the changes, could you please approve?

I'm leaving my "request changes" review in place and not re-reviewing until the issue that has been breaking LightGBM's CI for the last few days (#5362 (comment)) is fixed. I don't want this to be merged until the CI is fixed and we run it over these changes one more time.

Sorry for the delay. Hopefully that issue will be fixed soon.

@btrotta
Collaborator Author

btrotta commented Jul 25, 2022

@jameslamb Ok, no problem.

@jameslamb
Collaborator

Ok @btrotta our CI concerns have been resolved (#5388).

Since I don't have access to push to your fork...could you please merge latest master into this branch? Once you do that and CI passes, I'll merge this fix.

Sorry for the delay, and thanks for your patience.

@shiyu1994
Collaborator

Maybe we can close and reopen this PR to rerun the ci? Since the ci tests are run with the merged version automatically.

@shiyu1994 shiyu1994 closed this Jul 29, 2022
@shiyu1994 shiyu1994 reopened this Jul 29, 2022
@jameslamb
Collaborator

Maybe we can close and reopen this PR to rerun the ci? Since the ci tests are run with the merged version automatically.

I personally prefer to merge the target branch (in this repo, master) into PRs directly as a way to trigger CI, instead of just closing and re-opening, in situations like this one where the PR has changes to a central part of the codebase and when there are several other active PRs touching other related parts of the codebase.

To avoid this situation:

  1. branch-1 and branch-2 are created off of master at the same time
  2. changes made on branch-1 are proposed as PR-1
  3. changes made on branch-2 are proposed as PR-2
  4. CI passes on both PR-1 and PR-2
  5. PR-1 is merged, then PR-2 is merged without re-running CI on PR-2
  6. master is broken because the changes in PR-1 and PR-2 are incompatible (for example, PR-2 removes a function that was referenced by the code in PR-1)
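
The sequence above can be sketched as a toy model in Python. This is a hypothetical illustration, not part of LightGBM's CI: the `Repo` type, the `ci_passes` check, and the function name `helper` are all made up, and "CI" is reduced to checking that every referenced function is still defined.

```python
from typing import FrozenSet, NamedTuple


class Repo(NamedTuple):
    """Toy repository state: which functions exist and which are called."""
    defined: FrozenSet[str]
    referenced: FrozenSet[str]


def ci_passes(repo):
    """Stand-in for CI: every referenced function must be defined."""
    return repo.referenced <= repo.defined


def apply_pr1(repo):
    """PR-1 adds code that calls the existing function `helper`."""
    return repo._replace(referenced=repo.referenced | {"helper"})


def apply_pr2(repo):
    """PR-2 removes `helper`, which nothing on master calls yet."""
    return repo._replace(defined=repo.defined - {"helper"})


master = Repo(defined=frozenset({"train", "helper"}),
              referenced=frozenset({"train"}))

# Steps 1-4: each PR is tested against the master it branched from.
pr1_ok = ci_passes(apply_pr1(master))
pr2_ok = ci_passes(apply_pr2(master))

# Steps 5-6: both are merged without re-running CI on PR-2.
master_after_both = apply_pr2(apply_pr1(master))
master_ok = ci_passes(master_after_both)

# Merging the updated master into PR-2 before merging would have run CI on
# exactly this combined state and caught the problem.
print(pr1_ok, pr2_ok, master_ok)  # True True False
```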

If CI passes for this particular PR I'm ok with merging it to keep making progress on all the other PRs, but please consider that for the future.

@shiyu1994
Collaborator

@jameslamb Thank you. I will follow this rule in the future.

@shiyu1994
Collaborator

It seems that all CI tests have passed. Maybe we can cancel the change request now?

@jameslamb
Collaborator

Maybe we can cancel the change request now?

oh yep, sorry about that! Meant to come back and approve yesterday.

@jameslamb jameslamb self-requested a review July 29, 2022 16:08
Collaborator

@jameslamb jameslamb left a comment

🎉 thanks again for the help @btrotta !

@jameslamb jameslamb merged commit 44d3718 into microsoft:master Jul 29, 2022
@jameslamb
Collaborator

I will follow this rule in the future.

No problem! I've found it helpful. There is definitely a trade-off there. Especially since our CUDA CI jobs take 1-2 hours to run and we can only have a single job running at a time across the whole repo, extra CI runs can slow down development in the whole project.

My approach is:

  • if I need to re-trigger CI anyway, merge latest master and use that as a way to re-trigger it
  • if a PR recently passed CI and has very small, non-functional changes such as documentation fixes, just merge it without rebuilding

So, for example, once #5384 builds successfully, I'll just merge it even though this PR was merged to master after that build started, since #5384 just contains documentation changes.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Successfully merging this pull request may close these issues.

Result of model with linear trees depends on the number of used during fitting CPU threads
5 participants