float precision #559
Another solution could be to store the exact values (string or decimal) as a component of the Dataset object when it first reads the data. That would ensure that you could pickle and cache it and still get the exact values back?
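A minimal sketch of that idea, assuming a simplified stand-in for the Dataset class; `SimpleDataset` and `raw_target` are invented names for illustration, not openml-python API:

```python
import pickle


class SimpleDataset:
    """Stand-in for a dataset that keeps the exact arff strings around."""

    def __init__(self, raw_rows, target_index):
        # Exact values as read from the arff file.
        self.raw_target = [row[target_index] for row in raw_rows]
        # Lossy float view used for modelling.
        self.target = [float(v) for v in self.raw_target]


ds = SimpleDataset([["1.1", "0.0000031"], ["2.2", "17"]], target_index=1)
restored = pickle.loads(pickle.dumps(ds))      # survives pickling/caching
assert restored.raw_target == ["0.0000031", "17"]  # exact strings recovered
```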
Thanks for figuring this out. Indeed, the issue is the conversion of the target column into floating point representation.
That's a great idea, but doing it exactly like this will probably not work because the dataset won't know that it is a regression dataset. However, having a function called …
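One possible shape for such a helper, building on the sketch above; `exact_target_values` is a hypothetical name and not part of the openml-python API:

```python
# Hypothetical helper: the caller, who knows the task is regression,
# asks for the exact textual targets; the Dataset itself stays agnostic.
def exact_target_values(dataset):
    """Return the target column exactly as it appeared in the arff
    file, e.g. for writing a predictions file without precision loss."""
    return list(dataset.raw_target)  # raw_target from the sketch above
```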
An actual run on the live server is, e.g., 10154904.
So I looked into this again, and I am somewhat sure that the problem described above was a result of the problem fixed in #1209. I am not able to reproduce the problem for task 738: the ground truth in the arff file seems to have the correct precision. I used the following code (with openml-python from the current dev branch):

```python
from sklearn.dummy import DummyRegressor

import openml
from openml import tasks, runs

# Run against the test server so no production runs are created.
openml.config.server = "https://test.openml.org/api/v1/xml"

reg = DummyRegressor()
task = tasks.get_task(738)
run = runs.run_model_on_task(reg, task, avoid_duplicate_runs=False)
run.to_filesystem("./tmp_dir")
```

The exported ground truth values and their precision match the values of the downloaded dataset. It might be that this only looked like a precision problem because a better model than the DummyRegressor was used.

Do we have another case that I could look into where this might happen/has happened?
I can't currently verify with that particular task (test server failure), but testing it with a task on production (id: 5514), the target features do indeed seem to behave: both the dataset and the prediction file produce the column with the same precision.

It is indeed likely that the swapped columns already existed when this issue was opened, and we were looking at the model's predictions rather than ground truth. Either way, ground truth is reported correctly now (on dev/in the next release), so I am closing this issue. Feel free to re-open if you find a reproducible example where the error persists.

I don't think we should look into any kind of "fixing" of the precision for model predictions; we should leave it as is until we move to parquet for model predictions, IMO.
When a dataset is loaded (from an arff file), the values are converted to floats.
That's fine in itself, but when doing regression experiments, we need to upload both the predicted and actual (truth) values. However, when exporting the actual values back to strings (in the arff file), precision is lost.
For instance, here are the first values of the training set of task 738 (test server):
This is what is exported to the predictions arff file:
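A generic illustration of the kind of mismatch described above (plain Python, not openml-python's actual export path): round-tripping a textual value through float and fixed-precision formatting can change what is written back.

```python
# Generic illustration of the precision loss, not openml's export code.
raw = "0.0000031"          # exact string as stored in the arff file
x = float(raw)
print(str(x))              # '3.1e-06'  -- notation already differs
print("%f" % x)            # '0.000003' -- fixed-precision formatting truncates

raw2 = "0.12345678901234567890"
print(repr(float(raw2)))   # '0.12345678901234568' -- float64 can't hold all digits
```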
What would be the best way to fix this? Load the data as Decimals instead of floats? Round the data before writing to arff?
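A minimal sketch of the two options, assuming plain Python parsing rather than the actual arff loader; neither snippet is openml-python code:

```python
from decimal import Decimal

raw_column = ["0.0000031", "0.12345678901234567890", "17"]

# Option 1: parse as Decimal so the exact textual value survives
# until serialization.
exact = [Decimal(v) for v in raw_column]
print(str(exact[1]))                 # '0.12345678901234567890'

# Option 2: keep floats but format consistently before writing,
# accepting that the output is lossy.
floats = [float(v) for v in raw_column]
print(["%.6g" % v for v in floats])  # ['3.1e-06', '0.123457', '17']
```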