Reproducibility: different result on first run with gpu_hist on single GPU #8820

Open · cstefansen opened this issue Feb 16, 2023 · 5 comments

cstefansen commented Feb 16, 2023

Based on #5023 it seems like XGBoost aims to guarantee reproducibility for single GPU training with gpu_hist. That is, training again on the same hardware with the same data and the same seed should give precisely the same model bit for bit.

However, I am consistently seeing different results on the very first training run (on a freshly started Python interpreter - this is important) for the following code:

import numpy as np
import xgboost


n_rows = 25_000
n_features = 1_000
n_rows_val = 12_500

np.random.seed(0)

x = np.random.normal(-3.0e-05, 0.5, (n_rows, n_features))
y = np.random.normal(-0.05, 1.0, n_rows)
w = np.clip(np.random.normal(7.25e+06, 1.25e+07, n_rows), 0.0, None)

x_val = np.random.normal(-3.0e-05, 0.5, (n_rows_val, n_features))
y_val = np.random.normal(-0.05, 1.0, n_rows_val)
w_val = np.clip(np.random.normal(7.25e+06, 1.25e+07, n_rows_val), 0.0, None)


print(f'XGBoost version {xgboost.__version__}')

models = []

for i in range(3):
    xgbr = xgboost.XGBRegressor(
        gpu_id=0,
        tree_method='gpu_hist',
        sampling_method='gradient_based',
        verbosity=0,
        booster='gbtree',
        n_jobs=1,
        random_state=np.random.RandomState(0),
        seed=0,
        single_precision_histogram=False,
        max_delta_step = 0,
        colsample_bylevel = 1.0,
        scale_pos_weight = 1.0,
        base_score = 0.0,
        colsample_bynode=0.5,
        colsample_bytree=0.13,
        gamma=7_500, 
        objective='reg:squarederror',
        learning_rate=0.007,
        max_depth=6,
        min_child_weight=30_000,
        n_estimators=2_500,
        reg_alpha=8.0,
        reg_lambda=0.5,
        subsample=0.45,
    )

    xgbr.fit(x, y, sample_weight=w)
    score = xgbr.score(x_val, y_val, sample_weight=w_val)
    models.append(xgbr)
    print(i, score)

which results in

0 -0.009656077054927659
1 -0.010103391088486235
2 -0.010103391088486235

This is on Linux with a Tesla T4 and CUDA 11.7.
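
To compare the trained models bit for bit rather than by validation score, the serialized boosters can be hashed (a small sketch; save_raw() returns the raw bytes of the underlying Booster):

import hashlib

# Differing digests mean the serialized models themselves differ,
# not just the validation scores.
for i, m in enumerate(models):
    raw = bytes(m.get_booster().save_raw())
    print(i, hashlib.sha256(raw).hexdigest())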

Is there another seed that needs to be set to ensure that the first run works off of the same seed as the subsequent runs? Or is this potentially a bug?

trivialfis (Member) commented Feb 16, 2023

I think it's caused by the global random engine used inside xgboost. The booster trained in the second iteration is affected by the one from the first iteration as they share the same random engine.

trivialfis (Member) commented
This is similar to calling

x = np.random.normal(-3.0e-05, 0.5, (n_rows, n_features))

twice: even with the NumPy seed set once at the start, the second x is different because the first call advances the engine's state.

cstefansen (Author) commented
@trivialfis, to stay with your analogy, it is possible to get reproducible results by reseeding before each call:

np.random.seed(0)
x1 = np.random.normal(-3.0e-05, 0.5, (n_rows, n_features))

np.random.seed(0)
x2 = np.random.normal(-3.0e-05, 0.5, (n_rows, n_features))

np.testing.assert_equal(x1, x2)

Is there a way to achieve the same reproducibility for XGBoost?

The example in the original repro does in fact produce reproducible (i.e., identical) results in each iteration when run on a CPU. However, when run on a single GPU, the first run is always different from the subsequent runs. (I ran this with 1,000 iterations and got 999 identical models after the first one.)

This looks like a seed/RNG initialization issue within the GPU code: once the Python interpreter has run the repro once, it produces identical results on every subsequent run, and to reproduce the discrepancy I have to restart the interpreter so that the first iteration again yields a different model.
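
As a possible workaround (a sketch, not a fix): running each fit in a freshly spawned subprocess makes every run a "first run", so the global engine starts from the same state each time. The data and parameters below are trimmed down for brevity:

from multiprocessing import get_context

def train_once(_):
    # Import inside the worker so each spawned interpreter initializes
    # xgboost (and its global random engine) from scratch.
    import numpy as np
    import xgboost

    rng = np.random.RandomState(0)
    x = rng.normal(-3.0e-05, 0.5, (1_000, 50))
    y = rng.normal(-0.05, 1.0, 1_000)
    model = xgboost.XGBRegressor(
        gpu_id=0,
        tree_method='gpu_hist',
        sampling_method='gradient_based',
        subsample=0.45,
        n_estimators=50,
        random_state=0,
    )
    model.fit(x, y)
    return bytes(model.get_booster().save_raw())

if __name__ == '__main__':
    # maxtasksperchild=1 gives every task its own fresh worker process.
    with get_context('spawn').Pool(1, maxtasksperchild=1) as pool:
        raws = pool.map(train_once, range(3))
    # All True if every spawned run produced a bit-identical model.
    print([a == b for a, b in zip(raws, raws[1:])])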

mingli-ts commented
@trivialfis, just want to bump this up. Do you know if there is a way to set the seed of the global random engine you mentioned? As the example above shows, setting np.random.seed does not fix the issue. The weird part is that only the first run is non-deterministic; all subsequent runs give the same results.
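
One more data point worth collecting (a sketch, assuming the same synthetic data as the original repro): whether the native API shows the same first-run drift when the seed is passed explicitly as a booster parameter, to rule out the sklearn wrapper:

import numpy as np
import xgboost

rng = np.random.RandomState(0)
x = rng.normal(-3.0e-05, 0.5, (25_000, 1_000))
y = rng.normal(-0.05, 1.0, 25_000)
w = np.clip(rng.normal(7.25e+06, 1.25e+07, 25_000), 0.0, None)

# Native API with the seed set directly in the booster parameters.
dtrain = xgboost.DMatrix(x, label=y, weight=w)
params = {
    'gpu_id': 0,
    'tree_method': 'gpu_hist',
    'sampling_method': 'gradient_based',
    'subsample': 0.45,
    'seed': 0,
}
bst = xgboost.train(params, dtrain, num_boost_round=100)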

trivialfis (Member) commented
Let me take another look later.
