Errors thrown when executing FeatBoost on 6-feature dataset #1

dunnkers · 2020-05-11T19:17:29Z

Input is a 6-feature dataset, found here. FeatBoost is executed using the following setup:

    # Set up the estimator
    xgboost_ensemble = XGBClassifier(
        max_depth=3, learning_rate=0.1, n_estimators=200, silent=True,
        objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None,
        gamma=0, min_child_weight=1, max_delta_step=0, subsample=1,
        colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
        scale_pos_weight=1, base_score=0.5, random_state=0, seed=None,
        missing=None)

    # Set up the feature selection method
    fs = FeatBoostClassification(
        estimator=[xgboost_ensemble, xgboost_ensemble, xgboost_ensemble],
        number_of_folds=10,
        siso_ranking_size=8,
        max_number_of_features=100,
        siso_order=4,
        epsilon=1e-18,
        verbose=2)

    # Run feature selection
    fs.fit(X, y)

(exactly the same setup as test.py)

First, with verbose=2, an IndexError is thrown from a print statement.

Full error log

(venv) ➜  feature-selection git:(master) ✗  env DEBUGPY_LAUNCHER_PORT=53859 /Users/dunnkers/git/feature-selection/venv/bin/python /Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/launcher /Users/dunnkers/git/feature-selection/jobs/run-featboost.py /Users/dunnkers/git/feature-selection/data/6_bit_mutliplexer 
Ranking pool [FeatBoost_XGBoost]
Running pool... [4 workers, 1 datasets]






Ranking features iteration 01
feature importances of all available feature:
x_001   3.792205
x_003   3.277614
x_004   2.644713
x_002   2.451928
x_006   2.280755
x_005   2.112983
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 22, in ranking_pool
    ranking = ranking_func(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 25, in FeatBoost_XGBoost
    fs.fit(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 188, in fit
    return self._fit(X, Y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 272, in _fit
    selected_variable,best_acc_t = self._siso(X,Y,iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 396, in _siso
    ranking, self.all_ranking_ = self._input_ranking(X, Y, iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 559, in _input_ranking
    print("%s   %05f" % (self._feature_names[feature_rank[i]], feature_importance[feature_rank[i]]))
IndexError: index -7 is out of bounds for axis 0 with size 6
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 36, in <module>
    run_ranking_pool(FeatBoost_XGBoost)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 42, in run_ranking_pool
    run_pool(ranking_pool, 'ranking', ranking_func, ranking_method)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 99, in run_pool
    pool_results = pool.starmap(func, pool_args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
IndexError: index -7 is out of bounds for axis 0 with size 6

Second, with verbose=1, a different error (a ValueError) is thrown.

Full error log

(venv) ➜  feature-selection git:(master) ✗  env DEBUGPY_LAUNCHER_PORT=53886 /Users/dunnkers/git/feature-selection/venv/bin/python /Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/launcher /Users/dunnkers/git/feature-selection/jobs/run-featboost.py /Users/dunnkers/git/feature-selection/data/6_bit_mutliplexer 
Ranking pool [FeatBoost_XGBoost]
Running pool... [4 workers, 1 datasets]






Ranking features iteration 01
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 22, in ranking_pool
    ranking = ranking_func(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 25, in FeatBoost_XGBoost
    fs.fit(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 188, in fit
    return self._fit(X, Y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 272, in _fit
    selected_variable,best_acc_t = self._siso(X,Y,iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 397, in _siso
    self.siso_ranking_[(iteration_number-1), :] = ranking
ValueError: could not broadcast input array from shape (6) into shape (8)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 36, in <module>
    run_ranking_pool(FeatBoost_XGBoost)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 42, in run_ranking_pool
    run_pool(ranking_pool, 'ranking', ranking_func, ranking_method)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 99, in run_pool
    pool_results = pool.starmap(func, pool_args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: could not broadcast input array from shape (6) into shape (8)


amjams · 2020-05-12T08:17:39Z

I don't know if these are the reasons behind the error, but consider the following. If it's a 6-feature dataset, then:
siso_ranking_size = 8: this should be less than or equal to 6.
max_number_of_features = 100: same as above.

Reasonable values in this case would be 1 and 6, respectively.
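
Both tracebacks are consistent with a size mismatch of this kind. The following is a minimal sketch using plain NumPy, not FeatBoost's actual code; the array names are hypothetical stand-ins. It shows that reaching past 6 available features produces the same two kinds of error:

```python
import numpy as np

n_features = 6          # features in the dataset
siso_ranking_size = 8   # value used in the failing setup

# Hypothetical stand-in for a per-feature importance array.
feature_importance = np.zeros(n_features)


def index_error_message():
    """Index the 7th-from-last element of a 6-element array."""
    try:
        feature_importance[-7]
    except IndexError as exc:
        return str(exc)


def broadcast_error_message():
    """Assign a 6-element ranking into a row sized for 8 ranks."""
    siso_ranking = np.zeros((1, siso_ranking_size))
    ranking = np.arange(n_features)
    try:
        siso_ranking[0, :] = ranking
    except ValueError as exc:
        return str(exc)


print(index_error_message())
print(broadcast_error_message())
```

The exact wording of the broadcast message may differ between NumPy versions, but in both cases the failure comes from asking for more ranked features than the dataset has.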

dunnkers · 2020-05-12T08:41:35Z

That seems to explain the error: the 6-feature dataset now runs normally. I didn't know that siso_ranking_size should be <= the number of features in the dataset; an assertion in the code and a note in the docs would be nice.
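
As a rough sketch of what such a check could look like (the helper name `check_feature_limits` is hypothetical and not part of FeatBoost; it only encodes the constraint discussed in this thread), a caller could validate both settings before calling `fit`:

```python
def check_feature_limits(n_features, siso_ranking_size, max_number_of_features):
    """Raise ValueError if either setting exceeds the number of dataset features.

    Hypothetical helper: assumes both settings must lie in [1, n_features].
    """
    if n_features < 1:
        raise ValueError("dataset must have at least one feature")
    settings = (
        ("siso_ranking_size", siso_ranking_size),
        ("max_number_of_features", max_number_of_features),
    )
    for name, value in settings:
        if not 1 <= value <= n_features:
            raise ValueError(
                f"{name}={value} must be between 1 and n_features={n_features}"
            )


# Passes silently for settings that fit a 6-feature dataset.
check_feature_limits(n_features=6, siso_ranking_size=1, max_number_of_features=6)
```

With the settings from the original report (8 and 100 on a 6-feature dataset), this would raise a ValueError naming `siso_ranking_size` before any ranking is attempted. In practice `n_features` would be taken from the input, for example `X.shape[1]`.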

What are reasonable values of siso_ranking_size for my tests? The number of features in my datasets ranges from 6 to 100000, so a value of 8 is probably fine for all the other datasets. Alternatively, I could use a fixed value of 5, so that the same value works for all tests.

amjams · 2020-05-12T08:50:24Z

Yes, you're right. Some assertions would be helpful.
You could use 5, but for larger datasets it might help to increase it a bit. Just keep in mind how this affects your runtime. A good rule of thumb is to set it to 10 for datasets with over 100 features.
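
One way to apply this rule of thumb across datasets of very different sizes is sketched below. The function name and its default arguments are assumptions for illustration, not part of FeatBoost; the values 5, 10, and 100 are the ones mentioned in this thread.

```python
def suggest_siso_ranking_size(n_features, small_default=5, large_default=10,
                              large_threshold=100):
    """Pick a siso_ranking_size from the rule of thumb discussed above.

    Hypothetical helper: uses 10 for datasets with more than 100 features,
    5 otherwise, and never returns more than the number of features.
    """
    if n_features < 1:
        raise ValueError("dataset must have at least one feature")
    base = large_default if n_features > large_threshold else small_default
    return min(base, n_features)


for n in (3, 6, 100, 101, 100000):
    print(n, suggest_siso_ranking_size(n))
```

Capping the result at `n_features` keeps the value valid for very small datasets, which is the situation that triggered the errors reported here.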

dunnkers mentioned this issue May 11, 2020

Duplicate features in selected subset #2

Open
