Reproducibility: different result on first run with gpu_hist on single GPU #8820
Comments
I think it's caused by the global random engine used inside xgboost. The booster trained in the second iteration is affected by the one from the first iteration, as they share the same random engine.
Similar to calling …
@trivialfis, to stay with your analogy, it is possible to get reproducible results by saying:
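A minimal sketch of that idea (not the original snippet), assuming the analogy refers to NumPy's global random engine: re-seeding the shared global RNG before each draw makes every draw come out the same.

```python
import numpy as np

# Re-seeding the shared global RNG before each draw makes every draw identical,
# even though all draws go through the same global random engine.
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)

assert (first == second).all()
```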
Is there a way to achieve the same reproducibility for XGBoost? The example in the original repro does in fact produce reproducible (i.e., identical) results in each iteration when run on a CPU. However, when run on a single GPU, the first run is always different from the subsequent runs (I ran this with 1000 iterations and got 999 identical models after the first one). This seems like a seed/RNG-initialization issue within the GPU code, because once the Python interpreter has run the repro once, it will produce identical results when run again; to reproduce the problem I have to restart the Python interpreter, which then again produces a different model on the first iteration.
@trivialfis, just wanted to bump this up. Do you know if there is a way to set the global random engine seed you mentioned? As the example above shows, setting the seed parameter alone does not seem to be enough.
Let me take another look later |
Based on #5023, it seems like XGBoost aims to guarantee reproducibility for single-GPU training with gpu_hist. That is, training again on the same hardware with the same data and the same seed should give precisely the same model, bit for bit.
However, I am consistently seeing different results on the very first training run (on a freshly started Python interpreter - this is important) for the following code:
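A minimal sketch of this kind of repro (not the original script), assuming the native xgb.train API on synthetic data with tree_method="gpu_hist" and a fixed seed:

```python
import numpy as np
import xgboost as xgb

# Synthetic binary-classification data generated with a fixed NumPy seed.
rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = rng.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "gpu_hist",
    "objective": "binary:logistic",
    "seed": 0,
}

# Train the same model several times with identical data, parameters, and seed,
# then compare the raw model dumps byte for byte.
dumps = [xgb.train(params, dtrain, num_boost_round=50).save_raw() for _ in range(3)]
print([d == dumps[-1] for d in dumps])
```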
which results in a first model that differs from the subsequent, otherwise identical runs.
This is on Linux with a Tesla T4 and CUDA 11.7.
Is there another seed that needs to be set to ensure that the first run works off of the same seed as the subsequent runs? Or is this potentially a bug?