[Bug]: Feature name containing numbers may lead to an error in the ROC calculation process. #240

Open
1 task done
pengDLDG opened this issue Oct 21, 2024 · 0 comments
Labels
bug Something isn't working

Contact Details

pengdldg@gmail.com

Short description of the problem here.

Suppose I have a node named "A0516" that takes two categories (0 and 1) after discretization. Calling 'bn.predict_probability()' returns a DataFrame ('predictions') with two columns ('A0516_0' and 'A0516_1'). Unfortunately, the following code in the 'roc_auc()' function turns these into 4 columns, which leads to an error when roc_auc() is called.

predictions = bn.predict_probability(data, node)
predictions.rename(columns=lambda x: x.lstrip(node + "_"), inplace=True)
predictions = predictions[sorted(predictions.columns)]

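The original purpose of 'x.lstrip(node + "_")' was to convert 'A0516_0' and 'A0516_1' into '0' and '1'. However, str.lstrip() treats its argument as a set of characters to strip rather than as a prefix, and both '0' and '1' are present in the string "A0516". Both column names are therefore reduced to the same empty string, so selecting 'predictions[sorted(predictions.columns)]' by these duplicate labels doubles the number of columns in "predictions" and leads to the subsequent error.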
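
To make the behaviour concrete, here is a minimal sketch using plain strings only (no CausalNex or pandas). The helper 'strip_prefix' is a hypothetical name introduced for illustration and is not part of CausalNex; it is one possible way to remove the prefix that also works on Python 3.8, where str.removeprefix() is not available.

```python
node = "A0516"
columns = ["A0516_0", "A0516_1"]

# str.lstrip removes any leading characters found in the given set
# {"A", "0", "5", "1", "6", "_"}; it does not remove a literal prefix.
stripped = [c.lstrip(node + "_") for c in columns]


def strip_prefix(name, prefix):
    """Hypothetical helper: drop `prefix` only if `name` starts with it."""
    return name[len(prefix):] if name.startswith(prefix) else name


fixed = [strip_prefix(c, node + "_") for c in columns]

print(stripped)  # ['', '']
print(fixed)     # ['0', '1']
```

With a prefix-based rename like this, the two column labels stay distinct, so sorting and selecting the columns would no longer duplicate them.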

CausalNex Version

0.12.1

Python Version

3.8.20

Relevant code snippet

from causalnex.structure.notears import from_pandas
from causalnex.network import BayesianNetwork
from causalnex.discretiser import Discretiser
from causalnex.evaluation import roc_auc

sm = from_pandas(df)
...
bn = BayesianNetwork(sm)

df_discrete = df.copy()
for col in df_discrete.columns:
    df_discrete[col] = Discretiser(method="quantile", num_buckets=2).fit_transform(df_discrete[col].values)

bn = bn.fit_node_states_and_cpds(df_discrete, method="BayesianEstimator", bayes_prior="K2")

roc, auc = roc_auc(bn, df_discrete, "A0516")

Relevant log output

ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 roc, auc = roc_auc(bn, df_discrete, "A0516")
      2 print(auc)

File ~\.conda\envs\causenet_python\lib\site-packages\causalnex\evaluation\evaluation.py:106, in roc_auc(bn, data, node)
    103 predictions.rename(columns=lambda x: x.lstrip(node + "_"), inplace=True)
    104 predictions = predictions[sorted(predictions.columns)]
--> 106 fpr, tpr, _ = metrics.roc_curve(
    107     ground_truth.values.ravel(), predictions.values.ravel()
    108 )
    109 roc = list(zip(fpr, tpr))
    110 auc = metrics.auc(fpr, tpr)

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\utils\_param_validation.py:214, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\metrics\_ranking.py:1095, in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    993 @validate_params(
    994     {
    995         "y_true": ["array-like"],
   (...)
   1004     y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True
   1005 ):
   1006     """Compute Receiver operating characteristic (ROC).
   1007 
   1008     Note: this implementation is restricted to the binary classification task.
   (...)
   1093     array([ inf, 0.8 , 0.4 , 0.35, 0.1 ])
   1094     """
-> 1095     fps, tps, thresholds = _binary_clf_curve(
   1096         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
   1097     )
   1099     # Attempt to drop thresholds corresponding to points in between and
   1100     # collinear with other points. These are always suboptimal and do not
   1101     # appear on a plotted ROC curve (and thus do not affect the AUC).
   (...)
   1106     # but does not drop more complicated cases like fps = [1, 3, 7],
   1107     # tps = [1, 2, 4]; there is no harm in keeping too many thresholds.
   1108     if drop_intermediate and len(fps) > 2:

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\metrics\_ranking.py:806, in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    803 if not (y_type == "binary" or (y_type == "multiclass" and pos_label is not None)):
    804     raise ValueError("{0} format is not supported".format(y_type))
--> 806 check_consistent_length(y_true, y_score, sample_weight)
    807 y_true = column_or_1d(y_true)
    808 y_score = column_or_1d(y_score)

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\utils\validation.py:407, in check_consistent_length(*arrays)
    405 uniques = np.unique(lengths)
    406 if len(uniques) > 1:
--> 407     raise ValueError(
    408         "Found input variables with inconsistent numbers of samples: %r"
    409         % [int(l) for l in lengths]
    410     )

ValueError: Found input variables with inconsistent numbers of samples: [36, 72]

Code of Conduct

  • I agree to follow this project's Code of Conduct
@pengDLDG pengDLDG added the bug Something isn't working label Oct 21, 2024