Slightly better Typechecking when exporting to SQL #174

Merged: 21 commits merged into fjosw:develop on May 18, 2023

Conversation

jkuhl-uni (Collaborator)

Hi,
I had a problem where one of my pandas DataFrames was not exported correctly: the check for which datatype is in a column was based on the first entry of that column, which in my case was None. I think @PiaLJP had a similar issue a few weeks ago.
With this change, at least someone can have an empty cell in the first row and the DataFrame will still be exported.
Maybe we can find an even better way to do this?
Specifically, I don't like my changes to input/json.py yet, but I am not sure how to improve them.
I'd be glad if someone could have a look at this.
Note that I have NOT tested the implications for the JSON export yet, as the tests so far do not provide the utilities for this.
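
To illustrate the failure mode, here is a minimal sketch (simplified; not the actual pyerrors code): when a column's serialization strategy is chosen from its first entry, a leading None misclassifies the whole column.

```python
import pandas as pd
import pyerrors as pe

df = pd.DataFrame({"Obs1": [None, pe.pseudo_Obs(87, 21, "test_ensemble")]})

# Type detection based on the first entry sees NoneType, not Obs:
first_entry = df["Obs1"][0]
print(isinstance(first_entry, pe.Obs))  # False -> the Obs column is misclassified
```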

@fjosw (Owner) commented May 4, 2023

@s-kuberski should take a closer look at this. Just a quick comment: maybe null is a good way to represent None in JSON.
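
For reference, Python's built-in json module already maps None to null and back:

```python
import json

print(json.dumps({"value": None}))    # {"value": null}
print(json.loads('{"value": null}'))  # {'value': None}
```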

@jkuhl-uni (Collaborator, Author)

Sure, that could be good. I was not sure how to do that in SQL and JSON at the same time, but since the function is originally for JSON, that may be the better idea.
However, @s-kuberski, is there a use case for exporting None to JSON?

@jkuhl-uni jkuhl-uni closed this May 4, 2023
@jkuhl-uni jkuhl-uni reopened this May 4, 2023
@s-kuberski (Collaborator)

I'll have a look. It is certainly possible to use either NaN or null in JSON (and we use it for NaN entries in correlators).
(BTW: rather than returning "None" and doing a string comparison, you could just return None, as this is unambiguous. We can leave this for now, as we might want to take a different approach.)

@jkuhl-uni (Collaborator, Author) left a comment


Hi, I think I handled this in the way you meant. If not, could you point me to the lines/files that do not work as you expect?

```
@@ -479,8 +481,10 @@ def import_json_string(json_string, verbose=True, full_output=False):
    result : dict
        if full_output=True
    """

    return _parse_json_dict(json.loads(json_string), verbose, full_output)
    if json_string != "NONE":
```
@s-kuberski (Collaborator)


In the current version, this always evaluates to True, right?

@jkuhl-uni (Collaborator, Author)


Yes, that is true. Do you think we should return None, or leave it as NaN, as it currently is?

@s-kuberski (Collaborator)


Actually, I don't really see why changes in json.py are necessary at all, as the routines work if the passed parameters have the correct form (a list of Obs, list, numpy.ndarray, or Corr for create_json_string, and a JSON string with data for import_json_string). In my opinion, it would be better to call the functions only when the parameters have the correct format and to handle the other cases where they appear (here: when applying the functions to the data frames) by not calling the functions at all. This would avoid putting these hacks into the json functions.
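
A hypothetical sketch of that structure (the wrapper name is made up; create_json_string is the real pyerrors routine):

```python
from pyerrors.input.json import create_json_string

def serialize_entry(entry):
    """Illustrative wrapper: decide at the data-frame level whether to serialize."""
    if entry is None:
        return None  # handled here; json.py is never called with invalid input
    return create_json_string(entry)  # only ever called with valid input
```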

Comment on lines 34 to 35:

```python
assert np.all(reconstructed_df.loc[1]) == np.all(my_df.loc[1])
assert np.all(reconstructed_df.loc[3:]) == np.all(my_df.loc[3:])
```
@s-kuberski (Collaborator)


Wouldn't we rather want to check that each of the entries matches between `my_df` and `reconstructed_df`?
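
For example, an entry-wise comparison could look like this (a sketch; np.all reduces each frame to a single boolean before the comparison above, so individual mismatches can slip through):

```python
# pandas' equals compares element-wise, assuming == (or NaN equivalence)
# is meaningful for the cell contents:
assert reconstructed_df.loc[1].equals(my_df.loc[1])
assert reconstructed_df.loc[3:].equals(my_df.loc[3:])
```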

@s-kuberski (Collaborator)

Hi,
I am not completely sure I understand your changes, sorry for that.
It would help me to see an example of a data frame where the export fails in the current version of the code and is fixed by your changes, and, as a next step, how the original data frame is reconstructed when it is exported and re-imported. I guess this is partly what you have implemented in the tests?
Maybe I confused you with my last comment. I did not mean that we should convert None entries to np.nan when exporting and importing a data frame. In my view, the optimal version of these routines would not change a data frame at all when it is exported and imported (and would cope with every kind of data frame a person might want to export).

@jkuhl-uni (Collaborator, Author)

Hi,
I checked all changes again and I think your confusion was legitimate. So, here is a minimal example of what happens in my analysis:

```python
my_dict = {"int": 1,
           "float": -0.01,
           "Obs1": pe.pseudo_Obs(87, 21, "test_ensemble"),
           "Obs2": pe.pseudo_Obs(-87, 21, "test_ensemble2")}
for gz in [True]:
    my_df = pd.DataFrame([my_dict] * 4)
    my_df.loc[0, "Obs1"] = None
    my_df.loc[2, "Obs1"] = None
    print(my_df)
    pe.input.pandas.to_sql(my_df, 'test', 'test.db')
```
This does not work: the export fails due to `my_df.loc[0, "Obs1"] = None` and the handling in the current release of pyerrors. To fix this, we need a way to handle None in the 0th row of a df. That in itself is no problem; one only needs to alter
https://github.com/fjosw/pyerrors/blob/15d07de87f8cd78fd83ffec82c24bc0824801e4a/pyerrors/input/pandas.py#LL139C1-L143C33
to look for the first entry that is not None (see the sketch below).
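
A sketch of that lookup using pandas' Series.first_valid_index (my assumption for illustration, not necessarily what this PR implements):

```python
import pandas as pd
import pyerrors as pe

df = pd.DataFrame({"Obs1": [None, pe.pseudo_Obs(87, 21, "test_ensemble")]})

idx = df["Obs1"].first_valid_index()   # skips leading None/NaN entries
sample = df["Obs1"][idx] if idx is not None else None
print(isinstance(sample, pe.Obs))      # True -> column is classified correctly
```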
In the last two pushes I reverted the changes to json.py, as you requested, and fixed the export and import to SQL.
However, the CSV export and import now seem a bit inconsistent to me, as loading the df from CSV turns None into NaN, as seen in the adapted tests.

@s-kuberski (Collaborator)

Hi,
thanks for your explanation and for cleaning up (you could also revert the very minor change in json.py to keep its history clean). The inconsistency comes from the handling of None values in pandas.to_csv and pandas.read_csv.
When I export your data frame, the content of the csv file is

```
Obs1,Obs2,float,int
,-87.0,-0.01,1
87.0,-87.0,-0.01,1
,-87.0,-0.01,1
87.0,-87.0,-0.01,1
```

so the None entries are just left empty (as one would perhaps expect). When I re-import the whole thing, these empty entries are filled with NaN. If you set keep_default_na=False in the call to read_csv, the entries are filled with empty strings instead (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
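
This behaviour is plain pandas and easy to reproduce in isolation:

```python
import pandas as pd
from io import StringIO

csv = "Obs1,Obs2\n,-87.0\n87.0,-87.0\n"
print(pd.read_csv(StringIO(csv))["Obs1"].tolist())
# [nan, 87.0]
print(pd.read_csv(StringIO(csv), keep_default_na=False)["Obs1"].tolist())
# ['', '87.0']
```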

The question is really what kinds of entries you want to allow and how to handle each case. When I adapt your code so that I can export entries containing np.NaN, the resulting csv file does not distinguish between the two and just writes an empty entry. You would therefore have to decide once and for all (and perhaps document it in the doc string) how empty cells are handled and what the content of the imported data frame will be.

@jkuhl-uni (Collaborator, Author)

Thank you for testing this. You are bringing up a good question that I had not thought about.
My preferred solution would be to keep None entries as None, as I think there is more use in allowing None in pandas DataFrames than in allowing np.NaN. This is obviously a very subjective answer, since my use case requires it. So I wanted to ask whether you (or someone else) can think of a use case in which preserving NaN is necessary, keeping in mind that Obs and Corr objects would keep their NaNs even in the current form.

@jkuhl-uni (Collaborator, Author)

I thought about this a little more. The alternative to keeping only one of the two would again be to use some kind of token for the other case, i.e. either None or NaN would receive a token form, and the export and import would be altered accordingly.
Again, I have no use case for NaN outside of Obs or Corr in the kind of databases people would produce with pyerrors. However, I thought we could maybe throw a warning or an error if someone tries to export a NaN cell.

@s-kuberski (Collaborator)

Hi,
in the current version, an error would be thrown if someone tried to export a NaN (as opposed to a NaN-Obs). So for the moment, as long as nobody has a problem because they want to export these NaNs, you could assume that empty strings (using keep_default_na=False) can be set to None after the import, because nothing else could have been exported.
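
A sketch of that import-side rule (a hypothetical snippet, assuming keep_default_na=False on read):

```python
import pandas as pd

df = pd.read_csv("test.csv", keep_default_na=False)
# Empty cells can only have come from None entries, so map them back:
df = df.applymap(lambda entry: None if entry == "" else entry)
```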

@jkuhl-uni (Collaborator, Author)

So, I think everything now works as it should. There is only one remaining inconsistency that I can think of at the moment:
if you write zipped data to a database and then unzip it outside of pyerrors, there are no None entries but empty strings, as .encode() and gzip are not compatible with None. Other than that, it seems to work now.
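
The root of that limitation is easy to see in isolation: the gzip path goes through str.encode, which None does not provide:

```python
import gzip

entry = None
try:
    gzip.compress(entry.encode("utf-8"))
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'encode'
```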

@s-kuberski (Collaborator)

The test fails because you confused input with output, I guess. The way it should work is that None is automatically written to an empty string, so you don't have to do anything when serializing the data frame. On import, you want to convert empty strings back to None.

@jkuhl-uni (Collaborator, Author)

Hi, I'm not sure I understand what you mean.
I think what happens here is that under Python 3.7 the replacement is not done correctly when reading the gzipped data. I am not sure why this happens.

Concerning your answer: you are right, but I think your way only works if the cell is not zipped. Otherwise there is an error in pandas.py line 151, as None cannot be encoded in UTF-8.

@jkuhl-uni jkuhl-uni marked this pull request as ready for review May 11, 2023 19:00
@jkuhl-uni jkuhl-uni requested a review from fjosw as a code owner May 11, 2023 19:00
@s-kuberski (Collaborator)

Thanks for fixing the remaining issues! The failure of the Windows test seems to be related to a third-party library.

@jkuhl-uni (Collaborator, Author)

Hi @fjosw, thanks for having another look at this. There was in fact an oversight where, in some cases, the zipping was not performed. This should be fixed now. The reason this didn't show up in the tests is that _deserialize_df detects the difference by itself.

@fjosw (Owner) commented May 17, 2023

Is this ready to be merged or are you still testing the changes?

@jkuhl-uni (Collaborator, Author)

Hi, I found one last bug. I think this should now be safe to merge into develop; I'll pull the develop branch into my analysis workflow and play around with it there. For now I cannot think of any test that would further improve the quality of the code.

@fjosw (Owner) left a comment


Thanks!

@fjosw fjosw merged commit a5b6f69 into fjosw:develop May 18, 2023
@fjosw fjosw mentioned this pull request May 30, 2023