Error when predicting with a RandomForest whose first trees were trained on only some of the data classes (batched training) #40

Open
aiah123 opened this issue Jul 5, 2021 · 0 comments

aiah123 commented Jul 5, 2021

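This happens when the forest is trained on batches of data with warm_start=True and the data is unbalanced, so that the first trees are fitted on a batch containing only some of the classes.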

Error:

/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
    test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
    return _predict_forest(model, X, joint_contribution=joint_contribution)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
    return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
  File "<__array_function__ internals>", line 6, in mean
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
    out=out, **kwargs)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
    arr = asanyarray(a)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (2,1) into shape (2)
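
The ValueError is consistent with np.mean being handed a list of per-tree arrays whose last dimension differs, one column for a tree that saw a single class and two columns for a tree that saw both. The following is a minimal NumPy-only sketch of that situation, not treeinterpreter's code: the array values and the helper names mean_fails and pad_to_n_classes are made up for illustration, and zero-padding the unseen class columns is only one possible way to align the shapes.

```python
import numpy as np

# Hypothetical per-tree outputs for 2 samples: the first tree saw one class,
# the second saw two, so their last dimensions differ.
tree1_pred = np.array([[1.0], [1.0]])            # shape (2, 1)
tree2_pred = np.array([[0.5, 0.5], [0.2, 0.8]])  # shape (2, 2)


def mean_fails(arrays):
    """Return True if np.mean cannot stack the arrays (ragged shapes)."""
    try:
        np.mean(arrays, axis=0)
    except ValueError:
        return True
    return False


def pad_to_n_classes(pred, class_idx, n_classes):
    """Copy the columns of pred into positions class_idx of a zero array."""
    out = np.zeros((pred.shape[0], n_classes))
    out[:, class_idx] = pred
    return out


# Averaging the ragged list fails; averaging after alignment works.
ragged = mean_fails([tree1_pred, tree2_pred])
aligned = [pad_to_n_classes(tree1_pred, [0], 2), tree2_pred]
averaged = np.mean(aligned, axis=0)
```

Under the assumption that a tree assigns zero probability to any class it never saw, padding gives every per-tree array the same shape, so the mean across trees is well defined.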

Reproduction:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd

# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)

# data of chunk1
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})
# data of chunk2
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})


# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])
# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])

# test
test_data = chunk2_df.drop(['label'], axis='columns')
# regular predict
rf.predict_proba(test_data)
# tree interpreter predict
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
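
One way to check whether a forest is affected is to compare the number of classes each tree was fitted with against the number the forest reports. The sketch below rebuilds the same two chunks as plain NumPy arrays (an assumption made to keep it short; random_state=0 is also added here) and reads scikit-learn's fitted attributes estimators_, n_classes_ and tree_.value. It only diagnoses the mismatch and does not change treeinterpreter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same data as the reproduction, as arrays: chunk 1 has one label, chunk 2 has two.
y1 = np.array([0, 0])
X1 = np.column_stack([y1, y1])
y2 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
X2 = np.column_stack([y2, y2])

rf = RandomForestClassifier(warm_start=True, n_estimators=1, random_state=0)
rf.fit(X1, y1)          # first tree: fitted on a single class
rf.n_estimators += 1
rf.fit(X2, y2)          # second tree: fitted on both classes

# Number of classes each tree was fitted with, and the width of its value array.
per_tree_classes = [int(est.n_classes_) for est in rf.estimators_]
value_widths = [est.tree_.value.shape[2] for est in rf.estimators_]

# Trees whose class count differs from the forest's are the problematic ones.
mismatched = [i for i, n in enumerate(per_tree_classes) if n != rf.n_classes_]
print(per_tree_classes, value_widths, mismatched)
```

If mismatched is non-empty, the per-tree arrays that treeinterpreter averages will not share a shape, which matches the traceback above.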
