Checking active datasets #671

ArlindKadra · 2019-04-10T13:18:37Z

openml-python/openml/datasets/functions.py

Lines 295 to 296 in 3dc6dee

    
    dataset_list = list_datasets(status='all')

    active = {}

Maybe we should call list_datasets(status='active') instead. If a dataset_id from the iterable of dataset_ids that we want to check appears in the results, we mark it as True, otherwise False.
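A minimal sketch of this idea, not the library's actual implementation. The server call is replaced by a hypothetical stub, `list_active_stub`, and the listing is assumed to be a dict keyed by dataset id:

```python
def list_active_stub():
    # Hypothetical stand-in for list_datasets(status='active').
    # Assumption: the listing is a dict keyed by dataset id.
    return {
        2: {"did": 2, "status": "active"},
        5: {"did": 5, "status": "active"},
    }


def check_active_via_active_listing(dataset_ids):
    # An id is marked True only if it shows up in the active listing.
    active_listing = list_active_stub()
    return {did: did in active_listing for did in dataset_ids}


result = check_active_via_active_listing([2, 3, 5])
print(result)
```

Note that with this approach an id that is absent from the active listing is simply marked False, whatever the reason for its absence.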


PGijsbers · 2019-04-10T13:53:34Z

The distinction here is that you no longer check whether the dataset id exists at all. That is, a dataset which is inactive and one that does not exist at all would both be returned as False. Unless I misunderstand your meaning.

ArlindKadra · 2019-04-10T14:31:01Z

@PGijsbers you understood the idea right and the point you are making is valid.

Then I would propose refining my initial idea: call list_datasets(data_id=[dataset_ids]) only for the dataset_ids that we have. This is now supported on the live server (we should also update the documentation and function of list_datasets, because data_id is not shown as a valid filter).

In the end we can just compare len(results) == len(dataset_ids) if we want to know whether any dataset_id does not exist. (If we also want the specific dataset_id, we can check the keys of the results.)
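A rough sketch of the length comparison and the key check. `list_by_ids_stub` is a hypothetical stand-in for list_datasets(data_id=[...]), and both the return format (a dict keyed by dataset id that contains only ids existing on the server) and the status strings are assumptions:

```python
def list_by_ids_stub(data_id):
    # Hypothetical stand-in for list_datasets(data_id=[...]).
    # Assumption: only ids that exist on the server appear in the result.
    server = {1: "active", 2: "deactivated", 4: "active"}
    return {
        did: {"did": did, "status": server[did]}
        for did in data_id
        if did in server
    }


dataset_ids = [1, 2, 3]  # assumed to contain no duplicates
results = list_by_ids_stub(data_id=dataset_ids)

# Cheap check: did the server return an entry for every requested id?
all_exist = len(results) == len(dataset_ids)

# If not, the keys tell us which ids are missing.
missing = sorted(set(dataset_ids) - set(results))
print(all_exist, missing)
```

The length comparison is only meaningful if dataset_ids contains no duplicates; otherwise the ids should be deduplicated first.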

PGijsbers · 2019-04-10T21:05:16Z

I definitely think we should raise an error if the passed dataset id does not exist. Consider the following use case of the function:

dids = [ ... some ids ... ]
dids_active = check_datasets_active(dids)
if all(dids_active.values()):
    # All datasets are active, do something
    ...

If there is a more efficient way to retrieve active status for the datasets from the server, I am all for it.
Even if that means afterwards specifically checking if there is a dataset missing.
Having the mentioned clause

if len(results) != len(dataset_ids):
    <find out missing datasets>
    <raise error about missing datasets>

is absolutely fine, as in good-weather scenarios the cost of the simple length comparison should be negligible. Adding a parameter raise_error_if_not_exist to make raising the error optional is also fine, but it should (in my opinion) default to True.
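To make the clause concrete, here is one way it could look. This is a sketch under assumptions, not the code that was eventually merged. `check_datasets_active_sketch`, `fake_list_fn`, the choice of ValueError, and the status strings are all hypothetical; the listing function is injected so that the example runs without a server:

```python
def check_datasets_active_sketch(dataset_ids, list_fn, raise_error_if_not_exist=True):
    # list_fn is assumed to behave like list_datasets(data_id=[...]):
    # it returns a dict keyed by dataset id, holding only existing ids.
    unique_ids = set(dataset_ids)
    listing = list_fn(data_id=sorted(unique_ids))

    # Good-weather path: one length comparison and nothing else.
    if raise_error_if_not_exist and len(listing) != len(unique_ids):
        missing = sorted(unique_ids - set(listing))
        raise ValueError(f"Could not find dataset(s) {missing} on the server.")

    # Missing ids (only reachable when errors are disabled) map to False.
    return {
        did: did in listing and listing[did]["status"] == "active"
        for did in dataset_ids
    }


def fake_list_fn(data_id):
    # Hypothetical in-memory replacement for the server call.
    server = {1: "active", 2: "deactivated"}
    return {did: {"status": server[did]} for did in data_id if did in server}


print(check_datasets_active_sketch([1, 2], fake_list_fn))
```

With raise_error_if_not_exist=False, a missing id is reported as False, which matches the behaviour discussed earlier in the thread.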

ArlindKadra added the enhancement label Apr 10, 2019

ArlindKadra self-assigned this Oct 26, 2020

ArlindKadra mentioned this issue Oct 28, 2020

Improving the performance of check_datasets_active, modifying unit test #980

Merged

PGijsbers closed this as completed in #980 Oct 29, 2020
