Error when predicting with a RandomForest whose first trees were trained on only some of the classes (batched training) #40

aiah123 · 2021-07-05T10:22:25Z

This happens when the forest is trained on batches of data with warm_start=True and the data is unbalanced, so that an early batch contains only some of the classes.

Error:

/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: **Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes)** is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
    test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
    return _predict_forest(model, X, joint_contribution=joint_contribution)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
    return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
  File "<__array_function__ internals>", line 6, in mean
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
    out=out, **kwargs)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
    arr = asanyarray(a)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
**ValueError: could not broadcast input array from shape (2,1) into shape (2)**
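The final ValueError is consistent with NumPy being asked to average per-tree arrays of different widths. The sketch below is only an illustration of that failure mode, not treeinterpreter's code: the two arrays are hypothetical stand-ins for the predictions of a tree that saw one class and a tree that saw two, and `try_mean` is a helper name made up for this example.

```python
import numpy as np

# Hypothetical stand-ins for per-tree predictions on 2 samples:
# a tree fitted on one class yields 1 column, a tree fitted on two yields 2.
one_class_tree = np.ones((2, 1))
two_class_tree = np.full((2, 2), 0.5)


def try_mean(per_tree):
    """Return the mean over trees, or None if NumPy rejects the ragged input."""
    try:
        return np.mean(per_tree, axis=0)
    except ValueError:
        return None


ragged = try_mean([one_class_tree, two_class_tree])   # widths 1 and 2
uniform = try_mean([two_class_tree, two_class_tree])  # widths 2 and 2
```

Only the list with equal widths can be stacked and averaged; the mixed list is rejected before any mean is computed.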

Reproduction:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd

# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)

# data of chunk1
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})
# data of chunk2
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})


# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])
# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])

# test
test_data = chunk2_df.drop(['label'], axis='columns')
# regular predict
rf.predict_proba(test_data)
# tree interpreter predict
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
