Remove nan-likes from category header #1037

Merged
PGijsbers merged 2 commits into develop from fix_1036
Mar 12, 2021
PGijsbers
Collaborator

Pandas does not accept None/NaN as a category (though it does, of course, allow NaN values in the data itself). However, outside sources (e.g. ARFF files) do allow NaN as a category, so we must filter these out of the category list.

Penguins has the column: @ATTRIBUTE sex {?,FEMALE,MALE,_}
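
The pandas behaviour can be reproduced in isolation, without downloading the dataset. This is a minimal sketch, assuming the ARFF parser turns the `?` entry of the nominal header into `None`; the variable and helper names are illustrative and not taken from openml-python.

```python
import numpy as np
import pandas as pd

# Assumed result of parsing "{?,FEMALE,MALE,_}": "?" becomes None.
header_categories = [None, "FEMALE", "MALE", "_"]
values = ["MALE", "FEMALE", np.nan]


def null_category_rejected(categories):
    """Return True if pandas refuses the given category list."""
    try:
        pd.Categorical(values, categories=categories, ordered=True)
    except ValueError:
        return True
    return False


# A null entry in the category list is rejected...
rejected = null_category_rejected(header_categories)

# ...while a NaN among the values is accepted and stored as missing.
accepted = pd.Categorical(values, categories=["FEMALE", "MALE", "_"], ordered=True)
print(rejected, accepted.isna().sum())
```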

Running

import openml
penguins = openml.datasets.get_dataset(42585)
data, *_ = penguins.get_data()
print(data.head())

Before:

Traceback (most recent call last):
  File "E:/repositories/openml-python/mwe.py", line 4, in <module>
    data, *_ = penguins.get_data()
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 693, in get_data
    data, categorical, attribute_names = self._load_data()
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 531, in _load_data
    return self._cache_compressed_file_from_file(file_to_load)
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 488, in _cache_compressed_file_from_file
    data, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 445, in _parse_data_from_arff
    self._unpack_categories(X[column_name], categories_names[column_name])
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 650, in _unpack_categories
    raw_cat = pd.Categorical(col, ordered=True, categories=categories)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\arrays\categorical.py", line 316, in __init__
    dtype = CategoricalDtype._from_values_or_dtype(
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 330, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 222, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 369, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 543, in validate_categories
    raise ValueError("Categorial categories cannot be null")
ValueError: Categorial categories cannot be null

Process finished with exit code 1

After:

  species     island  culmen_length_mm  ...  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen              39.1  ...              181.0       3750.0    MALE
1  Adelie  Torgersen              39.5  ...              186.0       3800.0  FEMALE
2  Adelie  Torgersen              40.3  ...              195.0       3250.0  FEMALE
3  Adelie  Torgersen               NaN  ...                NaN          NaN     NaN
4  Adelie  Torgersen              36.7  ...              193.0       3450.0  FEMALE

[5 rows x 7 columns]
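
One way to filter NaN-likes from the category header is sketched below. It is not the code merged in this PR: it assumes the column arrives as numeric codes indexing into the header's category list, and `is_valid_category` and `unpack_categories_sketch` are hypothetical names.

```python
import numpy as np
import pandas as pd


def is_valid_category(cat):
    """Return True unless cat is None or NaN."""
    if isinstance(cat, str):
        return True
    return cat is not None and not pd.isna(cat)


def unpack_categories_sketch(codes, categories):
    """Map numeric codes to labels, dropping NaN-like categories."""
    # NaN-likes cannot be declared as categories, so filter them out.
    filtered = [c for c in categories if is_valid_category(c)]
    col = []
    for x in codes:
        try:
            # Index into the unfiltered list so the codes still line up.
            col.append(categories[int(x)])
        except (TypeError, ValueError):
            # A missing code (None or NaN) becomes a missing value.
            col.append(np.nan)
    raw_cat = pd.Categorical(col, ordered=True, categories=filtered)
    return pd.Series(raw_cat, index=codes.index, name=codes.name)


sex_codes = pd.Series([2.0, 1.0, np.nan, 0.0], name="sex")
result = unpack_categories_sketch(sex_codes, [None, "FEMALE", "MALE", "_"])
print(result)
```

Values that map to a filtered category (code `0.0` here) end up as missing values, which matches how NaN appears in the `sex` column of the output above.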

PGijsbers requested a review from mfeurer on March 12, 2021 09:59
PGijsbers merged commit 4aec00a into develop on Mar 12, 2021
PGijsbers deleted the fix_1036 branch on March 12, 2021 13:09
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
* Remove nan-likes from category header

Pandas does not accept None/NaN as a category (though it does, of course,
allow NaN values in the data itself). However, outside sources
(e.g. ARFF files) do allow NaN as a category, so we must filter these out.

* Test output of _unpack_categories