Better error messages for string constraints #920

Closed
joaquinvanschoren opened this issue Jun 24, 2020 · 3 comments · Fixed by #927

Comments

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jun 24, 2020

Description

I often hear users complain that they don't know what to do when create_dataset complains about string constraints. Typically this is because people used a space (' ') in the name (I'm not actually sure why we don't allow that) or a special character in the description.

Could we maybe return a more informative general error message, like "Character ' ' is not allowed in field x"?

Alternatively, let the Python API replace spaces in the dataset name with underscores automatically, and replace special characters with '?' or ' '.
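For illustration, the automatic replacement could look roughly like the sketch below. Both helper names are hypothetical and not part of openml-python. The issue does not say what counts as a "special character" in the description, so this sketch assumes it means any non-ASCII character.

```python
import re


def sanitize_name(name):
    """Replace each whitespace character in a dataset name with an underscore.

    Hypothetical helper, not part of openml-python.
    """
    return re.sub(r"\s", "_", name)


def sanitize_description(text, replacement="?"):
    """Replace every non-ASCII character with `replacement`.

    Assumption: "special character" means non-ASCII; the real server
    constraint on descriptions may differ.
    """
    return re.sub(r"[^\x00-\x7F]", replacement, text)


print(sanitize_name("My cool dataset"))
print(sanitize_description("caf\u00e9 data"))
```

Note that this silently changes what the user typed, so a caller would probably want to be told about the change.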

Steps/Code to Reproduce

Example:

import openml

# `df` is assumed to be a pandas DataFrame with a 'label' column (not shown in this report).
my_dataset = openml.datasets.create_dataset(
    name="My cool dataset",  # the spaces trigger the ValueError below
    description="foo",
    creator="bar",
    contributor=None,
    collection_date='01-01-2011',
    language='English',
    licence=None,
    default_target_attribute='label',
    row_id_attribute=None,
    ignore_attribute=None,
    citation="foo",
    attributes='auto',
    data=df,
    version_label='1.0',
)

Expected Results

A more informative general error message, like "Character ' ' is not allowed in field x".
Or: replace the 'bad' characters automatically.

Actual Results

A hard-to-read stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-6289268889ab> in <module>
     13     attributes='auto',
     14     data=df,
---> 15     version_label='1.0',
     16 )

~/anaconda3/lib/python3.7/site-packages/openml/datasets/functions.py in create_dataset(name, description, creator, contributor, collection_date, language, licence, attributes, data, default_target_attribute, ignore_attribute, citation, row_id_attribute, original_data_url, paper_url, update_comment, version_label)
    774         paper_url=paper_url,
    775         update_comment=update_comment,
--> 776         dataset=arff_dataset,
    777     )
    778 

~/anaconda3/lib/python3.7/site-packages/openml/datasets/dataset.py in __init__(self, name, description, format, data_format, dataset_id, version, creator, contributor, collection_date, upload_date, language, licence, url, default_target_attribute, row_id_attribute, ignore_attribute, version_label, citation, tag, visibility, original_data_url, paper_url, update_comment, md5_checksum, data_file, features, qualities, dataset)
    121             if not re.match("^[a-zA-Z0-9_\\-\\.\\(\\),]+$", name):
    122                 # regex given by server in error message
--> 123                 raise ValueError("Invalid symbols in name: {}".format(name))
    124         # TODO add function to check if the name is casual_string128
    125         # Attributes received by querying the RESTful API

ValueError: Invalid symbols in name: My cool dataset

Versions

Darwin-19.4.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.18.4
SciPy 1.4.1
Scikit-Learn 0.23.1
OpenML 0.11.0dev

@mfeurer
Collaborator

mfeurer commented Jul 1, 2020

> I'm not actually sure why we don't allow that

"The string must be quoted if the name includes spaces" (from https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/)

> Alternatively, let the Python API replace spaces in the dataset name with underscores automatically, and replace special characters with '?' or ' '.

I don't like the idea of meddling with data provided by the user. I prefer the idea of having better error messages.

I opened an issue in the arff parser: renatopp/liac-arff#110

@jnothman

jnothman commented Jul 2, 2020

I'm not convinced that this is about ARFF parsing... Apart from anything else, the user is not providing ARFF input in the above code.

@mfeurer
Collaborator

mfeurer commented Jul 2, 2020

> I'm not convinced that this is about ARFF parsing... Apart from anything else, the user is not providing ARFF input in the above code.

Thanks a lot, you're totally right. I assumed that this happened during conversion of the dataset to ARFF, but it fails before.

So the actual issue is that we raise our own error message on the client side, so that users don't have to upload the dataset and wait for the server's error message. But yes, we could check which characters are illegal and print those.
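A minimal sketch of such a check, assuming the same character class as the regex shown in the traceback above. The helper names `find_illegal_characters` and `check_string_field` are hypothetical and not part of openml-python, and the wording of the message is only a suggestion.

```python
import re

# Character class taken from the check in openml/datasets/dataset.py (see traceback).
_ALLOWED_CHAR = re.compile(r"[a-zA-Z0-9_\-\.\(\),]")


def find_illegal_characters(value):
    """Return the distinct disallowed characters, in order of first appearance."""
    illegal = []
    for char in value:
        if not _ALLOWED_CHAR.fullmatch(char) and char not in illegal:
            illegal.append(char)
    return illegal


def check_string_field(value, field="name"):
    """Raise a ValueError that names each offending character (hypothetical helper)."""
    if not value:
        raise ValueError("Field '{}' must not be empty".format(field))
    illegal = find_illegal_characters(value)
    if illegal:
        listed = ", ".join(repr(char) for char in illegal)
        raise ValueError(
            "Character(s) {} not allowed in field '{}': {}".format(listed, field, value)
        )


try:
    check_string_field("My cool dataset")
except ValueError as error:
    print(error)
```

With the name from the report, the message would point at the space character instead of only repeating the whole name.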
