
[BUG] LARS solver in fp32 mode fails to fit due to NaNs in X #3189

Closed
tfeher opened this issue Nov 25, 2020 · 1 comment
Labels
bug Something isn't working

tfeher commented Nov 25, 2020

Describe the bug

The error happens if the following conditions are met:

  • input is fp32
  • n_rows >= 65536
  • normalize=True solver parameter is used.

Symptoms:

  • The input feature matrix X has NaN values by the time the initial correlations are computed. As a result, the initial correlations are NaN.
  • Since the initial correlations are NaN, we exit at the first iteration.

Things to note:

  • The error only happens when all three conditions above are true. This might indicate a problem in the interaction between the CuPy preprocessing and our cpp solver.
  • A cpp unit test with large input works properly.
  • Adding print("X number of nans ", cp.sum(cp.isnan(X))) before and after the call to the cpp solver shows that there are 0 NaNs in X according to CuPy.
  • Adding a similar check for X just before the initial correlation calculation shows that there are NaNs in X.
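
The contradictory checks above are consistent with a dtype mismatch between the Python-side buffer and what the solver reads. A minimal NumPy sketch (NumPy standing in for CuPy/device memory here) of why the two checks can disagree: a clean fp64 array passes the Python-side NaN check, while reinterpreting its raw bytes as fp32, as a solver expecting fp32 would, yields unrelated values.

```python
import numpy as np

# X as Python/CuPy sees it: a clean fp64 array with no NaNs
X64 = np.arange(6, dtype=np.float64)
assert not np.isnan(X64).any()  # the Python-side check passes

# What a solver expecting fp32 effectively reads from the same
# buffer: the raw bytes reinterpreted as 32-bit floats
X_as_f32 = X64.view(np.float32)

# The reinterpreted values bear no relation to the original data
assert not np.array_equal(X_as_f32[:X64.size], X64.astype(np.float32))
```

A typed check on the Python side (`cp.isnan(X)`) can therefore report zero NaNs while the solver, reading the same memory with the wrong element type, sees garbage.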

Steps/Code to reproduce bug

Note that fp32 support is currently disabled; one needs to enable it by changing the source of the Cython wrappers.

import numpy as np
import cuml
from cuml.linear_model import Lars as cumlLars
import sklearn.datasets
from sklearn.linear_model import Lars as sklLars

X, y = sklearn.datasets.make_regression(n_samples=65536, n_features=10,
                                        n_informative=10, random_state=0)
dtype = np.float32
X = X.astype(dtype)
y = y.astype(dtype)

cpp_lars = cumlLars(precompute=False, verbose=cuml.common.logger.level_debug,
                    fit_path=False, normalize=True)
cpp_lars.fit(X, y)
print(cpp_lars.score(X, y))

Output:

[E] [11:27:28.775891] Correlation is not finite, aborting.
[D] [11:27:28.776140] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 0, selected feature 0 with correlation nan

Expected behavior

Fit the model correctly. Here is a sample output from fp64 fit:

[D] [11:32:53.732715] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 0, selected feature 3 with correlation 23072.404597
[D] [11:32:53.733112] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 1, selected feature 5 with correlation 20162.130672
[D] [11:32:53.733439] /mydata/cuml_lars/cpp/src/solver/lars_impl.cuh:781 Iteration 2, selected feature 8 with correlation 19225.012854
...
1.0

The last value is the score; it should be very close to 1 even in fp32.
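
For reference, the scikit-learn solver (the sklLars import in the repro script) reaches such a score on the same fp32 data. A minimal sketch of that comparison run:

```python
import numpy as np
import sklearn.datasets
from sklearn.linear_model import Lars

# Same data as the repro script above
X, y = sklearn.datasets.make_regression(n_samples=65536, n_features=10,
                                        n_informative=10, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.float32)

# Fit the reference LARS implementation and score it
skl_lars = Lars(fit_path=False).fit(X, y)
print(skl_lars.score(X, y))  # very close to 1.0
```

Since make_regression generates noiseless data by default, the R^2 score of a correct fit is essentially 1.0.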

@tfeher tfeher added the bug Something isn't working label Nov 25, 2020
@tfeher tfeher self-assigned this Nov 25, 2020
tfeher commented Nov 28, 2020

The problem was caused by a value-dependent implicit type conversion in the following lines:

x_scale = cp.sqrt(cp.var(X, axis=0) * X.shape[0])
X = (X - x_mean) / x_scale

  • Initially X is fp32.
  • If X.shape[0] > 65535, then x_scale becomes fp64; otherwise it stays fp32.
  • The data type of X after scaling is the same as that of x_scale.
  • The pointer to X was passed to the cpp layer by casting it to an fp32 pointer.
  • The cpp solver read X assuming fp32 elements, but the actual data was fp64. This produced NaNs and, in general, invalid values.
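
The 65535/65536 boundary comes from value-based promotion: a Python int scalar is first reduced to its minimal dtype, and float32 combined with a 32-bit integer type promotes to float64. The underlying dtype rules can be checked in NumPy (whose promotion table CuPy followed at the time):

```python
import numpy as np

# A Python int scalar is first mapped to its minimal dtype
assert np.min_scalar_type(65535) == np.uint16  # fits in 16 bits
assert np.min_scalar_type(65536) == np.uint32  # needs 32 bits

# float32 combined with a 16-bit integer type stays float32 ...
assert np.result_type(np.float32, np.uint16) == np.float32
# ... but combined with a 32-bit integer type it promotes to float64,
# because float32 cannot represent every 32-bit integer exactly
assert np.result_type(np.float32, np.uint32) == np.float64
```

So with n_rows <= 65535 the scalar behaves like a 16-bit integer and x_scale stays fp32, while with n_rows >= 65536 it behaves like a 32-bit integer and silently promotes x_scale (and then X) to fp64.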

The fix is trivial: an explicit type cast, X.dtype.type(X.shape[0]). Additionally, type checks were added before the pointers are passed to the cpp solver. The fix was implemented in bda29cc
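
A minimal sketch of the cast in context (NumPy standing in for CuPy, with x_mean computed as the column means):

```python
import numpy as np  # stands in for cupy in this sketch

X = np.random.rand(65536, 10).astype(np.float32)
x_mean = X.mean(axis=0)

# Cast the row count to the array's own dtype so the multiplication
# cannot trigger a value-dependent promotion to fp64
x_scale = np.sqrt(np.var(X, axis=0) * X.dtype.type(X.shape[0]))
X = (X - x_mean) / x_scale

assert X.dtype == np.float32  # dtype preserved regardless of n_rows
```

Since np.float32(65536) is an exact value, the cast changes only the dtype of the scalar, not the result of the scaling.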

@tfeher tfeher closed this as completed Nov 28, 2020