Always respect forced splits, even when feature_fraction < 1.0 (fixes #4601) #4725
Thanks very much for working on this! Please see my initial comments below.
Also, I changed the title of this pull request. In this project, pull request titles become items in the release notes (see, for example, https://github.com/microsoft/LightGBM/releases/tag/v3.3.0), and I don't think "fix issue 4601" would be very informative for a user reading the release notes to understand what has changed.
I'll try to test this as soon as possible using the reproducible example provided in #4601.
The change in the cpp side LGTM. Just a suggestion in the test case. Thanks!
Thanks for the changes. Please see a few suggestions I left about the tests. I think it's important to test with force_col_wise=true and force_row_wise=true to be confident that this fixes #4601.
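A minimal sketch of how such a test matrix might be set up, using only the standard library. `force_col_wise`, `force_row_wise`, `feature_fraction` and `forcedsplits_filename` are LightGBM parameters; the helper names, the forced split on feature 0 at threshold 0.5 (borrowed from the assertions discussed in this review) and the `feature_fraction` value are assumptions for illustration, not the test added in this PR.

```python
import json
import os
import tempfile


def write_forced_split(directory, feature=0, threshold=0.5):
    """Write a one-level forced-split file and return its path (hypothetical helper)."""
    path = os.path.join(directory, "forced_split.json")
    with open(path, "w") as f:
        json.dump({"feature": feature, "threshold": threshold}, f)
    return path


def build_param_grid(forced_split_path, feature_fraction=0.6):
    """Return one parameter dict per histogram layout (hypothetical helper)."""
    base = {
        "objective": "regression",
        "feature_fraction": feature_fraction,  # < 1.0, the setting that triggered #4601
        "forcedsplits_filename": forced_split_path,
        "verbose": -1,
    }
    # The two layouts are mutually exclusive, so each gets its own config.
    return [
        {**base, "force_col_wise": True},
        {**base, "force_row_wise": True},
    ]


tmp_dir = tempfile.mkdtemp()
param_grid = build_param_grid(write_forced_split(tmp_dir))
```

Each dict in `param_grid` could then be used for a separate training run, with the same assertions on the root split applied to every resulting model.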
LGTM, thank you very much for working on this!
I left only one suggestion for your consideration to make the test more strict.
assert len(tree_info) > 1
for tree in tree_info:
    tree_structure = tree["tree_structure"]
    assert tree_structure['split_feature'] == 0
Suggested change:
-     assert tree_structure['split_feature'] == 0
+     assert tree_structure['split_feature'] == 0
+     assert tree_structure['threshold'] == pytest.approx(0.5, abs=1e-1)
The actual threshold is the value of the nearest record, which is about 0.52.
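To illustrate why the assertion needs a tolerance, here is a small standard-library sketch of the same check. The `example_tree_info` list is hand-written and only mimics the shape of the `tree_info` entries used in the test; the helper name and the 0.52 value (taken from the comment above) are assumptions, and `math.isclose` stands in for `pytest.approx`.

```python
import math


def root_split_matches(tree_info, feature, threshold, abs_tol):
    """Check that every tree's root splits on `feature` near `threshold` (hypothetical helper)."""
    return all(
        tree["tree_structure"]["split_feature"] == feature
        and math.isclose(
            tree["tree_structure"]["threshold"], threshold, rel_tol=0.0, abs_tol=abs_tol
        )
        for tree in tree_info
    )


# Hand-written stand-in for a dumped model: the forced threshold of 0.5
# ends up at roughly 0.52, the nearest value present in the data.
example_tree_info = [
    {"tree_structure": {"split_feature": 0, "threshold": 0.52}},
    {"tree_structure": {"split_feature": 0, "threshold": 0.52}},
]

passes_loose = root_split_matches(example_tree_info, 0, 0.5, abs_tol=1e-1)
passes_tight = root_split_matches(example_tree_info, 0, 0.5, abs_tol=1e-2)
```

With an absolute tolerance of 0.1 the check accepts 0.52, while a tolerance of 0.01 would reject it.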
@jameslamb Hi James, we found that when Can we leave this for a subsequent PR to fix, and merge this for now, given that all tests pass?
@shiyu1994 just to be sure I understand, which of these do you mean?
If it's number 1, then I support merging this PR as-is and fixing the other bug in a later PR. If it's number 2, then that means that bug would be introduced by this PR, and I don't think it should be merged.
@jameslamb thanks for checking. It is number 1: the same test with force_row_wise=True also reproduces the failure on the current master branch.
Got it, thanks! Then I think it's ok to merge this and fix that other bug separately. Thanks for explaining it to me 😄
Thanks for working through this!
Never considered, when I encountered #4601, that a source of randomness could be in the choice of row-wise vs. column-wise Dataset construction. So I learned something really important through reviewing this and through your explanations ❤️
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Fixes #4601.
Root cause for this issue: the histogram for the forced-split feature might not be constructed if feature_fraction < 1.0, in which case the forced split would fail in SplitInner.
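The root cause can be pictured with a toy model in Python. This is only a sketch under stated assumptions, not LightGBM's C++ implementation: the function names are hypothetical, and the idea that the fix amounts to always including the forced-split features in the set that gets histograms is an illustrative simplification.

```python
import random


def sampled_features(num_features, feature_fraction, rng):
    """Toy column sampling: pick a random subset of feature indices."""
    k = max(1, int(round(num_features * feature_fraction)))
    return set(rng.sample(range(num_features), k))


def features_needing_histograms(num_features, feature_fraction, forced_features, rng):
    """Toy version of the fix: forced-split features are always included."""
    return sampled_features(num_features, feature_fraction, rng) | set(forced_features)


forced = {0}
missing_without_fix = 0
missing_with_fix = 0
for seed in range(200):
    # Without the union, the forced feature may simply not be sampled.
    if not forced <= sampled_features(10, 0.5, random.Random(seed)):
        missing_without_fix += 1
    # With the union, it is present for every seed.
    if not forced <= features_needing_histograms(10, 0.5, forced, random.Random(seed)):
        missing_with_fix += 1
```

In this toy model the forced feature is left out of the plain sample for some seeds, which corresponds to the missing histogram described above, whereas the union always contains it.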