
[BUG] LARS solver in fp32 mode fails to fit due to NaNs in X #3189

Closed
tfeher opened this issue Nov 25, 2020 · 1 comment
Labels
bug Something isn't working

tfeher commented Nov 25, 2020

Describe the bug

The error happens if the following conditions are met:

  • input is fp32
  • n_rows >= 65536
  • normalize=True solver parameter is used.

Symptoms:

  • The input feature matrix X has NaN values by the time the initial correlations are computed. As a result, the initial correlations are NaN.
  • Since the initial correlations are NaN, we exit at the first iteration.

Things to note:

  • The error only happens when all three conditions above are true. This might indicate a problem in the interaction between the CuPy preprocessing and our cpp solver.
  • A cpp unit test with large input works properly.
  • Adding print("X number of nans ", cp.sum(cp.isnan(X))) before and after the call to the cpp solver shows that there are 0 NaNs in X according to CuPy.
  • Adding a similar check for X just before the initial correlation calculation shows that there are NaNs in X.
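
The contradictory checks above are consistent with a dtype mismatch between the Python-side buffer and what the solver reads. A minimal NumPy sketch (NumPy standing in for CuPy/device memory here) of why the two checks can disagree: a clean fp64 array passes the Python-side NaN check, while reinterpreting its raw bytes as fp32, as a solver expecting fp32 would, yields unrelated values.

```python
import numpy as np

# X as Python/CuPy sees it: a clean fp64 array with no NaNs
X64 = np.arange(6, dtype=np.float64)
assert not np.isnan(X64).any()  # the Python-side check passes

# What a solver expecting fp32 effectively reads from the same
# buffer: the raw bytes reinterpreted as 32-bit floats
X_as_f32 = X64.view(np.float32)

# The reinterpreted values bear no relation to the original data
assert not np.array_equal(X_as_f32[:X64.size], X64.astype(np.float32))
```

A typed check on the Python side (`cp.isnan(X)`) can therefore report zero NaNs while the solver, reading the same memory with the wrong element type, sees garbage.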

Steps/Code to reproduce bug

Note that fp32 support is currently disabled; one needs to enable it by changing the source of the Cython wrappers.

import numpy as np
import cuml
from cuml.linear_model import Lars as cumlLars
import sklearn.datasets
from sklearn.linear_model import Lars as sklLars

X, y = sklearn.datasets.make_regression(n_samples=65536, n_features=10,
                                        n_informative=10, random_state=0)
dtype = np.float32
X = X.astype(dtype)
y = y.astype(dtype)

cpp_lars = cumlLars(precompute=False, verbose=cuml.common.logger.level_debug,
                    fit_path=False, normalize=True)
cpp_lars.fit(X, y)
print(cpp_lars.score(X, y))

Output:

[E] [11:27:28.775891] Correlation is not finite, aborting.
[D] [11:27:28.776140] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 0, selected feature 0 with correlation nan

Expected behavior

Fit the model correctly. Here is a sample output from fp64 fit:

[D] [11:32:53.732715] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 0, selected feature 3 with correlation 23072.404597
[D] [11:32:53.733112] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 1, selected feature 5 with correlation 20162.130672
[D] [11:32:53.733439] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 2, selected feature 8 with correlation 19225.012854
...
1.0

The last value is the score; it should be very close to 1 even in fp32.
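
For reference, the scikit-learn solver (the sklLars import in the repro script) reaches such a score on the same fp32 data. A minimal sketch of that comparison run:

```python
import numpy as np
import sklearn.datasets
from sklearn.linear_model import Lars

# Same data as the repro script above
X, y = sklearn.datasets.make_regression(n_samples=65536, n_features=10,
                                        n_informative=10, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.float32)

# Fit the reference LARS implementation and score it
skl_lars = Lars(fit_path=False).fit(X, y)
print(skl_lars.score(X, y))  # very close to 1.0
```

Since make_regression generates noiseless data by default, the R^2 score of a correct fit is essentially 1.0.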

@tfeher tfeher added the bug Something isn't working label Nov 25, 2020
@tfeher tfeher self-assigned this Nov 25, 2020
tfeher commented Nov 28, 2020

The problem was caused by a value-dependent implicit type conversion in the following lines:

x_scale = cp.sqrt(cp.var(X, axis=0) * X.shape[0])
X = (X - x_mean) / x_scale

  • Initially X is fp32.
  • If X.shape[0] > 65535, then x_scale becomes fp64; otherwise it stays fp32.
  • The data type of X after scaling is the same as that of x_scale.
  • The pointer to X was passed to the cpp layer by casting it to an fp32 pointer.
  • The cpp solver read X assuming fp32 elements, but the actual data was fp64. This produced NaNs and, in general, invalid values.
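
The 65535/65536 boundary comes from value-based promotion: a Python int scalar is first reduced to its minimal dtype, and float32 combined with a 32-bit integer type promotes to float64. The underlying dtype rules can be checked in NumPy (whose promotion table CuPy followed at the time):

```python
import numpy as np

# A Python int scalar is first mapped to its minimal dtype
assert np.min_scalar_type(65535) == np.uint16  # fits in 16 bits
assert np.min_scalar_type(65536) == np.uint32  # needs 32 bits

# float32 combined with a 16-bit integer type stays float32 ...
assert np.result_type(np.float32, np.uint16) == np.float32
# ... but combined with a 32-bit integer type it promotes to float64,
# because float32 cannot represent every 32-bit integer exactly
assert np.result_type(np.float32, np.uint32) == np.float64
```

So with n_rows <= 65535 the scalar behaves like a 16-bit integer and x_scale stays fp32, while with n_rows >= 65536 it behaves like a 32-bit integer and silently promotes x_scale (and then X) to fp64.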

The fix is trivial: an explicit type cast, X.dtype.type(X.shape[0]). Additionally, type checks were added before the pointers are passed to the cpp solver. The fix was implemented in bda29cc
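
A minimal sketch of the cast in context (NumPy standing in for CuPy, with x_mean computed as the column means):

```python
import numpy as np  # stands in for cupy in this sketch

X = np.random.rand(65536, 10).astype(np.float32)
x_mean = X.mean(axis=0)

# Cast the row count to the array's own dtype so the multiplication
# cannot trigger a value-dependent promotion to fp64
x_scale = np.sqrt(np.var(X, axis=0) * X.dtype.type(X.shape[0]))
X = (X - x_mean) / x_scale

assert X.dtype == np.float32  # dtype preserved regardless of n_rows
```

Since np.float32(65536) is an exact value, the cast changes only the dtype of the scalar, not the result of the scaling.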

@tfeher tfeher closed this as completed Nov 28, 2020