Slightly better Typechecking when exporting to SQL #174

Merged: 21 commits merged into fjosw:develop on May 18, 2023

Conversation

jkuhl-uni (Collaborator)

Hi,
I had a problem where one of my pandas DataFrames was not exported correctly: the check for which datatype is in a column was based on the first entry of that column, which in my case was None. I think @PiaLJP had a similar issue a few weeks ago.
With this change, at least someone can have an empty cell in the first row and the DataFrame will still be exported.
Maybe we can find an even better way to do this?
Specifically, I don't like my changes to input/json.py yet, but I am not sure how to improve them.
I'd be glad if someone could have a look at this.
Note that I have NOT tested the implications for the JSON export yet, as the tests so far do not provide the utilities for this.
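
To illustrate the failure mode, here is a minimal sketch (simplified; not the actual pyerrors code): when a column's serialization strategy is chosen from its first entry, a leading None misclassifies the whole column.

```python
import pandas as pd
import pyerrors as pe

df = pd.DataFrame({"Obs1": [None, pe.pseudo_Obs(87, 21, "test_ensemble")]})

# Type detection based on the first entry sees NoneType, not Obs:
first_entry = df["Obs1"][0]
print(isinstance(first_entry, pe.Obs))  # False -> the Obs column is misclassified
```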

@fjosw (Owner) commented May 4, 2023

@s-kuberski should take a closer look at this. Just a quick comment: maybe null is a good way to represent None in JSON.
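
For reference, Python's built-in json module already maps None to null and back:

```python
import json

print(json.dumps({"value": None}))    # {"value": null}
print(json.loads('{"value": null}'))  # {'value': None}
```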

@jkuhl-uni (Collaborator, Author)

Sure, that could be good. I was not sure how to do that in SQL and JSON at the same time, but since the function is originally for JSON, that may be the better idea.
However, @s-kuberski, is there a use case for exporting None to JSON?

@jkuhl-uni jkuhl-uni closed this May 4, 2023
@jkuhl-uni jkuhl-uni reopened this May 4, 2023
@s-kuberski (Collaborator)

I'll have a look. It is certainly possible to use either NaN or null in JSON (and we use it for NaN entries in correlators).
(BTW: rather than returning "None" and doing a string comparison, you could just return None, as this is unambiguous. We can leave this for now, as we might want to take a different approach.)

@jkuhl-uni (Collaborator, Author) left a comment


Hi, I think I handled this in the way you meant. If not, could you point me to the lines/files that do not work as you expect?

```
@@ -479,8 +481,10 @@ def import_json_string(json_string, verbose=True, full_output=False):
    result : dict
        if full_output=True
    """

    return _parse_json_dict(json.loads(json_string), verbose, full_output)
    if json_string != "NONE":
```
@s-kuberski (Collaborator)


In the current version, this always evaluates to True, right?

@jkuhl-uni (Collaborator, Author)


Yes, that is true. Do you think we should return None, or leave it as NaN, as it currently is?

@s-kuberski (Collaborator)


Actually, I don't really see why changes in json.py are necessary at all, as the routines work if the passed parameters have the correct form (a list of Obs, list, numpy.ndarray, or Corr for create_json_string, and a JSON string with data for import_json_string). In my opinion, it would be better to call the functions only when the parameters have the correct format and to handle the other cases where they appear (here: when applying the functions to the data frames) by not calling the functions at all. This would avoid putting these hacks into the json functions.
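
A hypothetical sketch of that structure (the wrapper name is made up; create_json_string is the real pyerrors routine):

```python
from pyerrors.input.json import create_json_string

def serialize_entry(entry):
    """Illustrative wrapper: decide at the data-frame level whether to serialize."""
    if entry is None:
        return None  # handled here; json.py is never called with invalid input
    return create_json_string(entry)  # only ever called with valid input
```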

Comment on lines 34 to 35:

```python
assert np.all(reconstructed_df.loc[1]) == np.all(my_df.loc[1])
assert np.all(reconstructed_df.loc[3:]) == np.all(my_df.loc[3:])
```
@s-kuberski (Collaborator)


Wouldn't we rather want to check that each of the entries matches between `my_df` and `reconstructed_df`?
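
For example, an entry-wise comparison could look like this (a sketch; np.all reduces each frame to a single boolean before the comparison above, so individual mismatches can slip through):

```python
# pandas' equals compares element-wise, assuming == (or NaN equivalence)
# is meaningful for the cell contents:
assert reconstructed_df.loc[1].equals(my_df.loc[1])
assert reconstructed_df.loc[3:].equals(my_df.loc[3:])
```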

@s-kuberski (Collaborator)

Hi,
I am not completely sure I understand your changes, sorry for that.
It would help me to see an example of a data frame where the export fails in the current version of the code and is fixed by your changes, and, as a next step, how the original data frame is reconstructed when it is exported and re-imported. I guess this is partly what you have implemented in the tests?
Maybe I confused you with my last comment. I did not mean that we should convert None entries to np.nan when exporting and importing a data frame. In my view, the optimal version of these routines would not change a data frame at all when it is exported and imported (and would cope with every kind of data frame a person might want to export).

@jkuhl-uni (Collaborator, Author)

Hi,
I checked all changes again and I think your confusion was legitimate. So, here is a minimal example of what happens in my analysis:

```python
my_dict = {"int": 1,
           "float": -0.01,
           "Obs1": pe.pseudo_Obs(87, 21, "test_ensemble"),
           "Obs2": pe.pseudo_Obs(-87, 21, "test_ensemble2")}
for gz in [True]:
    my_df = pd.DataFrame([my_dict] * 4)
    my_df.loc[0, "Obs1"] = None
    my_df.loc[2, "Obs1"] = None
    print(my_df)
    pe.input.pandas.to_sql(my_df, 'test', 'test.db')
```
This does not work: the export fails due to `my_df.loc[0, "Obs1"] = None` and the handling in the current release of pyerrors. To fix this, we need a way to handle None in the 0th row of a df. That in itself is no problem; one only needs to alter
https://github.com/fjosw/pyerrors/blob/15d07de87f8cd78fd83ffec82c24bc0824801e4a/pyerrors/input/pandas.py#LL139C1-L143C33
to look for the first entry that is not None (see the sketch below).
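
A sketch of that lookup using pandas' Series.first_valid_index (my assumption for illustration, not necessarily what this PR implements):

```python
import pandas as pd
import pyerrors as pe

df = pd.DataFrame({"Obs1": [None, pe.pseudo_Obs(87, 21, "test_ensemble")]})

idx = df["Obs1"].first_valid_index()   # skips leading None/NaN entries
sample = df["Obs1"][idx] if idx is not None else None
print(isinstance(sample, pe.Obs))      # True -> column is classified correctly
```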
In the last two pushes I reverted the changes to json.py, as you requested, and fixed the export and import to SQL.
However, the CSV export and import now seem a bit inconsistent to me, as loading the df from CSV turns None into NaN, as seen in the adapted tests.

@s-kuberski (Collaborator)

Hi,
thanks for your explanation and for cleaning up (you could also revert the very minor change in json.py to keep its history clean). The inconsistency comes from the handling of None values in pandas.to_csv and pandas.read_csv.
When I export your data frame, the content of the csv file is

```
Obs1,Obs2,float,int
,-87.0,-0.01,1
87.0,-87.0,-0.01,1
,-87.0,-0.01,1
87.0,-87.0,-0.01,1
```

so the None entries are just left empty (as one would perhaps expect). When I re-import the whole thing, these empty entries are filled with NaN. If you set keep_default_na=False in the call to read_csv, the entries are filled with empty strings instead (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
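
This behaviour is plain pandas and easy to reproduce in isolation:

```python
import pandas as pd
from io import StringIO

csv = "Obs1,Obs2\n,-87.0\n87.0,-87.0\n"
print(pd.read_csv(StringIO(csv))["Obs1"].tolist())
# [nan, 87.0]
print(pd.read_csv(StringIO(csv), keep_default_na=False)["Obs1"].tolist())
# ['', '87.0']
```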

The question is really what kinds of entries you want to allow and how to handle each case. When I adapt your code so that I can export entries containing np.NaN, the resulting csv file does not distinguish between the two and just writes an empty entry. You would therefore have to decide once and for all (and perhaps document it in the doc string) how empty cells are handled and what the content of the imported data frame will be.

@jkuhl-uni (Collaborator, Author)

Thank you for testing this. You are bringing up a good question that I had not thought about.
My preferred solution would be to keep None entries as None, as I think there is more use in allowing None in pandas DataFrames than in allowing np.NaN. This is obviously a very subjective answer, since my use case requires it. So I wanted to ask whether you (or someone else) can think of a use case in which preserving NaN is necessary, keeping in mind that Obs and Corr objects would keep their NaNs even in the current form.

@jkuhl-uni (Collaborator, Author)

I thought about this a little more. The alternative to keeping only one of the two would again be to use some kind of token for the other case, i.e. either None or NaN would receive a token form, and the export and import would be altered accordingly.
Again, I have no use case for NaN outside of Obs or Corr in the kind of databases people would produce with pyerrors. However, I thought we could maybe throw a warning or an error if someone tries to export a NaN cell.

@s-kuberski (Collaborator)

Hi,
in the current version, an error would be thrown if someone tried to export a NaN (as opposed to a NaN-Obs). So for the moment, as long as nobody has a problem because they want to export these NaNs, you could assume that empty strings (using keep_default_na=False) can be set to None after the import, because nothing else could have been exported.
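
A sketch of that import-side rule (a hypothetical snippet, assuming keep_default_na=False on read):

```python
import pandas as pd

df = pd.read_csv("test.csv", keep_default_na=False)
# Empty cells can only have come from None entries, so map them back:
df = df.applymap(lambda entry: None if entry == "" else entry)
```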

@jkuhl-uni (Collaborator, Author)

So, I think everything now works as it should. There is only one remaining inconsistency that I can think of at the moment:
if you write zipped data to a database and then unzip it outside of pyerrors, there are no None entries but empty strings, as .encode() and gzip are not compatible with None. Other than that, it seems to work now.
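
The root of that limitation is easy to see in isolation: the gzip path goes through str.encode, which None does not provide:

```python
import gzip

entry = None
try:
    gzip.compress(entry.encode("utf-8"))
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'encode'
```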

@s-kuberski (Collaborator)

The test fails because you confused input with output, I guess. The way it should work is that None is automatically written to an empty string, so you don't have to do anything when serializing the data frame. On import, you want to convert empty strings back to None.

@jkuhl-uni (Collaborator, Author)

Hi, I'm not sure I understand what you mean.
I think what happens here is that under Python 3.7 the replacement is not done correctly when reading the gzipped data. I am not sure why this happens.

Concerning your answer: you are right, but I think your way only works if the cell is not zipped. Otherwise there is an error in pandas.py line 151, as None cannot be encoded in UTF-8.

@jkuhl-uni jkuhl-uni marked this pull request as ready for review May 11, 2023 19:00
@jkuhl-uni jkuhl-uni requested a review from fjosw as a code owner May 11, 2023 19:00
@s-kuberski (Collaborator)

Thanks for fixing the remaining issues! The failure of the Windows test seems to be related to a third-party library.

@jkuhl-uni (Collaborator, Author)

Hi @fjosw, thanks for having another look at this. There was in fact an oversight where, in some cases, the zipping was not performed. This should be fixed now. The reason this didn't show up in the tests is that _deserialize_df detects the difference by itself.

@fjosw (Owner) commented May 17, 2023

Is this ready to be merged or are you still testing the changes?

@jkuhl-uni (Collaborator, Author)

Hi, I found one last bug. I think this should now be safe to merge into develop; I'll pull the develop branch into my analysis workflow and play around with it there. For now I cannot think of any test that would further improve the quality of the code.

@fjosw (Owner) left a comment


Thanks!

@fjosw fjosw merged commit a5b6f69 into fjosw:develop May 18, 2023
@fjosw fjosw mentioned this pull request May 30, 2023