Check sampling aliases are set correctly in H2O XGBoost #8458
Comments
Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98]. Thank you for reporting this bug. Could you please provide more information on how you set up the model parameters? Are you using Flow only? I found that the problem occurs when both col_sample_rate and colsample_bylevel are set to a value other than 1. For example, if you have colsample_bylevel=0.5 and also col_sample_rate=0.3, the colsample_bylevel value overwrites the col_sample_rate value. In this case, you can see this message in the log: {{Using user-provided parameter colsample_bylevel instead of col_sample_rate.}} If I train two models with different col_sample_rate values, for example in Python, everything works as expected. But when you reuse the native parameters from one model where colsample_bylevel was already set, the models will always be the same if you change only col_sample_rate. It is definitely a bug on our side and we are working on fixing it. In the meantime, you can work around the issue by setting colsample_bylevel manually. Let me know if you have any other questions about this issue. 🙂 Veronika |
Mathijs de Jong commented: Hi [~accountid:5bd237b8dd3cc64b77e71676] , thanks for the detailed explanation. I define the model and its parameters through pysparkling python code, and I set only {{col_sample_rate}} or {{colsample_bylevel}}. After that, I checked the parameters in Flow and in the MOJO file, where they look as expected. For either parameter, when I choose it to be < 1.0, I didn’t see any sampling happening. I will have a look again if setting only {{colsample_bylevel}} will do the trick, and if it doesn’t I will share a MWE that can reproduce the issue. I am out of office for the next two weeks, so unfortunately I can only check when I am back, I hope that’s ok. :) Thanks! |
Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for your response. I am sure that the problem is with setting {{colsample_bylevel}} together with {{col_sample_rate}}. I can see your settings in the first image. You have {{colsample_bylevel}} set to 0.3, so if you change {{col_sample_rate}} with this setting, nothing happens. But let me know when you try it. 🙂 !param.png|width=846,height=1089! I hope to fix this as soon as possible. |
Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], I finished the fix, in which we check that the dual parameters are not set simultaneously to different values. The rules are now:
- if you set col_sample_rate to a value other than the default and don't change the colsample_bylevel default value, the col_sample_rate value will be used
- if you set colsample_bylevel to a value other than the default and don't change the col_sample_rate default value, the colsample_bylevel value will be used
- if you set both col_sample_rate and colsample_bylevel to the same value, colsample_bylevel is used
- if you set col_sample_rate and colsample_bylevel to different values, an error is thrown
This change will be available from the next major release (version 3.34.0.1), which will be out soon. Let me know if it helps. 🙂 |
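The four rules in the comment above can be sketched in plain Python. This is only an illustration of the described behaviour, not the H2O implementation: the function name `resolve_col_sampling` and the constant `DEFAULT_RATE` are hypothetical, and a default of 1.0 for both parameters is an assumption based on the values mentioned in this thread.

```python
from typing import Tuple

# Assumption: both parameters default to 1.0 (no column sampling).
DEFAULT_RATE = 1.0


def resolve_col_sampling(
    col_sample_rate: float = DEFAULT_RATE,
    colsample_bylevel: float = DEFAULT_RATE,
) -> Tuple[str, float]:
    """Return (parameter name, value) that takes effect under the four rules.

    Hypothetical helper for illustration; it does not call H2O.
    """
    h2o_changed = col_sample_rate != DEFAULT_RATE
    native_changed = colsample_bylevel != DEFAULT_RATE

    # Rule 4: both changed, but to different values -> error.
    if h2o_changed and native_changed and col_sample_rate != colsample_bylevel:
        raise ValueError(
            "col_sample_rate and colsample_bylevel are set to different values"
        )

    # Rule 1: only col_sample_rate changed -> it is used.
    if h2o_changed and not native_changed:
        return ("col_sample_rate", col_sample_rate)

    # Rule 2 (only colsample_bylevel changed) and rule 3 (both equal):
    # colsample_bylevel is used.
    return ("colsample_bylevel", colsample_bylevel)
```

For example, `resolve_col_sampling(col_sample_rate=0.3)` selects `col_sample_rate`, while `resolve_col_sampling(0.3, 0.5)` raises an error instead of silently preferring one alias.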
Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] Thanks for the detailed debugging, and apologies for the late response. Unfortunately when I try out the different parameters, I get different results from your debugging. When I set only {{col_sample_rate}} or only {{colsample_bylevel}} to < 1.0, the result is the same: there is no stochasticity, which I assume means that no sampling takes place (I do not fix the seed of course; when I set another stochastic parameter like {{sample_rate}} to < 1.0, this does result in stochasticity). The screenshot that you describe in your comment was the result of a model that had only {{colsample_bylevel}} set (to 0.3). I see indeed that {{col_sample_rate}} also has an input value (of 1.0), but this was not set when I defined the model in Python. I just double checked that. Maybe that indicates that somehow H2O thinks that the parameter is given as input, even though it’s not. I also tried setting both {{colsample_bylevel}} and {{col_sample_rate}} to the same value, and that results in the screenshot that I attach. However, also this does not give any stochasticity, so I think still no sampling takes place in that case. This gives me the idea that we might be chasing two different bugs here? !Screenshot 2021-09-06 at 16.33.16.png|width=557,height=722! |
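The stochasticity check described in the comment above (train several times without fixing the seed and see whether the results differ) can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_train` is a hypothetical stand-in for model training that only records which columns were sampled, and `is_stochastic` is a hypothetical helper; neither calls H2O or XGBoost.

```python
import random
from typing import Callable, List, Optional


def toy_train(
    col_sample_rate: float, seed: Optional[int] = None, n_cols: int = 20
) -> List[int]:
    """Hypothetical stand-in for training: returns the sampled column indices."""
    rng = random.Random(seed)  # seed=None means a fresh random state per call
    k = max(1, round(col_sample_rate * n_cols))
    return sorted(rng.sample(range(n_cols), k))


def is_stochastic(train: Callable[[], object], n_runs: int = 5) -> bool:
    """Run `train` several times and report whether any result differs."""
    results = [train() for _ in range(n_runs)]
    return any(result != results[0] for result in results[1:])


# With a rate of 1.0 every column is used, so repeated runs are identical.
no_sampling = is_stochastic(lambda: toy_train(1.0))

# With a rate below 1.0 and no fixed seed, repeated runs should differ.
# If they do not, sampling is probably not taking place.
with_sampling = is_stochastic(lambda: toy_train(0.3), n_runs=20)
```

In a real check, the callable would train a model with the parameter under test and return something comparable, such as its predictions on a fixed frame.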
Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for your description. Could you send me the logs from the H2O server, please? There we can see which parameters were exactly sent to the machine. Or if you can provide an example to reproduce this issue, let me know! I am currently trying to reproduce it, but I am not successful yet. |
Mathijs de Jong commented: Hi [~accountid:5bd237b8dd3cc64b77e71676] , I have attached a MWE notebook to this comment. I added a few comments, to explain what I am testing. I hope this helps to reproduce the issue. I can of course also provide H2O logs, can you provide more detail about which logs exactly are useful, and how to obtain them? [^MWE_PUBDEV_8266_1.ipynb] |
Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for the MWE. I tried to reproduce the issue, but your MWE does not reproduce it for me. I tried both 3.32.0.1 and the current master branch, and I always get the results we expect… 🤔 [^MWE_PUBDEV_8266_maurever.ipynb] |
Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] The plot thickens. When I created the MWE, I ran it on our company H2O cluster, and then column sampling does not work (like I said). When I try it on my local machine (macOS), then I get the same results as you. In both cases with H2O version {{3.32.0.2}}. Is there a way to debug this further, based on logs for example? |
Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], yes, logs could be beneficial. Here is a detailed description of how to download logs: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/logs.html . If you have any questions, let me know! |
Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] I did some additional debugging with the help of the logs, and I found the following details:
This leaves two options:
- There is a bug in XGBoost itself (not in the H2O implementation)
- I have an incorrect understanding of approximate tree building: when the approximate tree building method is used, do we not expect column sampling to take place for {{colsample_bynode}} and {{colsample_bylevel}}? |
Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for debugging! It looks like an XGBoost bug. I quickly looked into their code and found this issue: https://github.com/dmlc/xgboost/issues/7002 . In XGBoost, approx histogram creation and sampling are mixed together. The parameter colsample_bytree should not affect histograms, so it works. However, colsample_bynode and colsample_bylevel are affected by histogram creation, I think. I will try to find more details in the XGBoost code to be sure. |
Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] I can create an issue for the xgboost team, but I am also happy if you would want to do it (given that you have debugged it in more detail). Do you have a preference? |
Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], please do create an issue. I reproduced the problem with xgboost directly, so I am sure it is their bug. I tried to find out what is wrong, but I was not successful. I have also closed this Jira as resolved. We improved our API a little bit based on this bug, and that is perfect. 🙂 Thank you so much for being so cooperative! |
Veronika Maurerová commented: We can't fix this issue because it is a bug on the XGBoost side. However, we improved our XGBoost API to make sure alias parameters are used correctly. |
Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] Thanks again for your help. As FYI, here is the issue I reported to XGBoost: |
JIRA Issue Migration Info
Jira Issue: PUBDEV-8266
Linked PRs from JIRA
Attachments From Jira
Attachment Name: MWE_PUBDEV_8266_1.ipynb
Attachment Name: MWE_PUBDEV_8266_maurever.ipynb
Attachment Name: param.png
Attachment Name: Screenshot 2021-08-10 at 11.21.10.png
Attachment Name: Screenshot 2021-08-10 at 11.21.17.png
Attachment Name: Screenshot 2021-08-10 at 11.32.03.png
Attachment Name: Screenshot 2021-08-10 at 11.32.18.png
Attachment Name: Screenshot 2021-09-06 at 16.33.16.png |
[~accountid:5d9dc9eb87dd6f0dcb4d4d98] reported a bug where col_sample_rate does not work with tree_method="approx" in XGBoost core. This issue will be addressed in a new Jira ticket: https://h2oai.atlassian.net/browse/PUBDEV-8368 .
Within this ticket, we improved the H2O XGBoost API to make sure that col_sample_rate and colsample_bylevel (and other XGBoost parameter aliases) are set correctly.