Large prediction results unless using repartition(1) in Databricks with LightGBM model #986
I am seeing something similar: enormous predictions that are sometimes positive and sometimes negative (there are no negative values in the target). If I use mse as the objective, the predictions are all extremely negative; with a tweedie objective they are extremely positive. The rank order / discrimination is relatively good, but the estimates are miscalibrated by orders of magnitude. Using Databricks 7.3 LTS, Spark 3.0.1.
@AllardJM @user673 Could you possibly share an example dataset for us to reproduce the issue? Adding @imatiach-msft, who built LightGBM on Spark.
@user673 @AllardJM I've found that the last iteration sometimes creates a bad tree that seems to predict inf values. I've created an issue in the LightGBM repo to track this: the very last tree, with 1 leaf and depth 1, seems to output the huge values; limiting the number of iterations to one less than this seems to prevent the issue.
This might be different from what @AllardJM is seeing, though, since I only see inf values, not negative values, and this is just the LightGBMRegressor with the regression objective function.
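As an illustrative sketch (not from the thread), the runaway values described above can be caught early with a simple sanity check on the prediction vector; the `1e6` cutoff below is an arbitrary assumption chosen for a target in the 0–200 range:

```python
import numpy as np

# Hypothetical prediction vector showing the symptom described above:
# a couple of sane values plus the huge / non-finite outputs attributed
# to the bad final tree.
preds = np.array([12.3, 45.6, 1e37, float("inf")])

# Flag predictions that are non-finite or implausibly large for the target.
# The 1e6 cutoff is an assumption; pick one appropriate to your target range.
bad = ~np.isfinite(preds) | (np.abs(preds) > 1e6)

print(int(bad.sum()))  # number of suspect predictions
```

A check like this makes the failure loud at scoring time instead of silently feeding huge values downstream.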
@user673 @AllardJM FYI, I believe this issue has been fixed with this PR in the LightGBM repository:
@imatiach-msft The PR has been merged; please check.
Thanks, I've updated the code on latest master and confirmed the issue is fixed. I will leave this GitHub issue open until the next release, since the bug is pretty bad, so that others who hit it can find it more easily.
@imatiach-msft Any update on this issue? We are facing the same issue and using the repartition(1) workaround; however, it is not feasible for large datasets.
hey,
I'm using the mmlspark LightGBM model for a regression problem and ran into something strange. Using the normal code as in the example, the results are terrible, because the predictions are huge (around 10^37, while the target is in the range 0 to 200).
While testing, I found that using
dataset.repartition(1).cache()
fixed the problem, but with one drawback: modelling began to take longer (around 1h, versus 20m before). This is logical, since all the data (about 4M rows and 150 columns) is collected into one partition before training. I tried setting the LightGBM param
useBarrierExecutionMode
to True, as well as different
parallelism
params, but these changes don't affect the result. Is there a way to get normal results without this repartition workaround?
Code used for training:
AB#1984587