
float precision #559

Closed
joaquinvanschoren opened this issue Sep 27, 2018 · 5 comments
Labels: Run, OpenML concept

Comments

@joaquinvanschoren (Contributor)

When a dataset is loaded (from an arff file), the values are converted to floats:

datasets.py:348: target_dtype = int if target_categorical[0] else float

That's fine in itself, but when doing regression experiments we need to upload both the predicted and the actual (ground-truth) values. However, when the actual values are exported back to strings (in the predictions arff file), precision is lost.

For instance, here are the first values of the training set of task 738 (test server):

6.1, 6.3, 5.8, 6.1, 6.0, 5.9

This is what is exported to the predictions arff file:

6.0999999 6.30000019 5.80000019 6.0999999 6.19999981 6. 5.9000001

What would be the best way to fix this? Load the data as Decimals instead of floats? Round the data before writing to arff?
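
For illustration, here is a minimal sketch of the precision loss and of the two proposed fixes (assuming the values pass through a 32-bit float, which matches the digits in the exported file):

import numpy as np
from decimal import Decimal

raw = "6.1"                        # value as it appears in the arff file
as_float = float(np.float32(raw))  # roughly 6.0999999046..., rendered as "6.0999999"
print(as_float)
print(round(as_float, 6))          # 6.1 -- rounding before writing restores the string
print(Decimal(raw))                # 6.1 -- Decimal keeps the exact literal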

@joaquinvanschoren (Contributor, Author)

Another solution could be to store the exact values (string or decimal) as a component of the Dataset object when it first reads the data. That would ensure that you could pickle and cache it and still get the exact values back?

@mfeurer (Collaborator) commented Sep 28, 2018

Thanks for figuring this out. Indeed, the issue is the conversion of the target column into floating-point representation.

> Another solution could be to store the exact values (string or decimal) as a component of the Dataset object when it first reads the data.

That's a great idea, but doing it exactly like this will probably not work, because the dataset does not know that it is a regression dataset. However, a function Dataset.get_column() that returns the values of a column as raw Python data types (and caches them) could help here. It would then be called by the task class with the necessary argument (the target column). What do you think of this solution?
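
A rough sketch of what such a helper could look like (hypothetical: get_column does not exist in openml-python, and the _arff_data attribute below is an assumption about where the raw arff strings would live):

class Dataset:
    def get_column(self, name):
        """Return one column as raw Python values, preserving the arff literals."""
        if not hasattr(self, "_column_cache"):
            self._column_cache = {}  # simple per-column cache
        if name not in self._column_cache:
            # _arff_data is assumed to map column names to the string values
            # read from the arff file, before any conversion to float.
            self._column_cache[name] = list(self._arff_data[name])
        return self._column_cache[name]

The task class could then request the target column through this helper when writing the predictions file, avoiding the float round-trip entirely.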

@PGijsbers (Collaborator)

An actual run on the live server is e.g. 10154904.

@LennartPurucker (Contributor)

So I looked into this again, and I am somewhat sure that the problem described above was a result of the issue fixed in #1209.

I am not able to reproduce the problem for task 738. The ground truth in the arff file seems to have the correct precision.

I used the following code (with openml from the current dev branch):

from sklearn.dummy import DummyRegressor
from openml import tasks, runs
import openml

# Point the client at the OpenML test server, where task 738 lives.
openml.config.server = "https://test.openml.org/api/v1/xml"

# A trivial regressor is enough to produce a predictions arff file.
reg = DummyRegressor()
task = tasks.get_task(738)
run = runs.run_model_on_task(reg, task, avoid_duplicate_runs=False)
# Write the run, including predictions.arff with the ground-truth column, to disk.
run.to_filesystem("./tmp_dir")

The exported ground truth values and their precision match the values of the downloaded dataset.

It might be that this only looked like a precision problem because a better model than the DummyRegressor quickly approaches the ground truth with its predictions (the task is very simple to solve). Or the problem was already fixed at some point by updates to the floating-point handling of other libraries (e.g., pandas).

Do we have another case that I could look into where this might happen/has happened?
Otherwise, I am not sure how to proceed here.

@PGijsbers (Collaborator) commented Feb 27, 2023

I can't currently verify with that particular task (test server failure), but testing with a task on production (id: 5514), the target features do indeed seem to behave. Both the dataset and the prediction file produce the column with :.1f formatting. Model predictions still have arbitrary decimal precision (as expected; this should not be an issue).
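
For reference, a :.1f format is enough to recover the original string even after a float32 round-trip (a small illustration, not the library's exact code):

import numpy as np

ground_truth = float(np.float32("6.1"))  # roughly 6.0999999046...
print(f"{ground_truth:.1f}")             # prints 6.1, matching the dataset value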

It is indeed likely that the swapped columns already existed when this issue was opened and that we were looking at the model's predictions rather than the ground truth.

Either way, the ground truth is reported correctly now (on dev / in the next release), so I am closing this issue. Feel free to re-open if you find a reproducible example where the error persists. I don't think we should look into any kind of "fixing" of the precision of model predictions; we should leave it as is until we move to parquet for model predictions, IMO.
