Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Integer column index breaks json roundtrip with orient=table #46392

Open
2 of 3 tasks
rikardn opened this issue Mar 16, 2022 · 12 comments
Open
2 of 3 tasks

BUG: Integer column index breaks json roundtrip with orient=table #46392

rikardn opened this issue Mar 16, 2022 · 12 comments
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@rikardn
Copy link

rikardn commented Mar 16, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

col1 = [1.0, 2.0, 3.5, 6.75]
col2 = [2.1, 3.1, 4.1, 5.1]
df = pd.DataFrame({1: col1, 2:col2}, index=[110, 112, 113, 121])
df.index.name = 'ID'
s = df.to_json(orient='table')
new = pd.read_json(s, orient='table')

Issue Description

The new dataframe will become

      1   2
ID         
110 NaN NaN
112 NaN NaN
113 NaN NaN
121 NaN NaN

Expected Behavior

The expected dataframe would look like this:

        1    2
ID            
110  1.00  2.1
112  2.00  3.1
113  3.50  4.1
121  6.75  5.1

Changing to strings instead of integers in the column index will give the expected result:

col1 = [1.0, 2.0, 3.5, 6.75]
col2 = [2.1, 3.1, 4.1, 5.1]
df = pd.DataFrame({'1': col1, '2':col2}, index=[110, 112, 113, 121])
df.index.name = 'ID'
s = df.to_json(orient='table')
new = pd.read_json(s, orient='table')

Installed Versions

This crashed in my environment with the error assert '_distutils' in core.__file__, core.__file__ raised from lib/python3.9/site-packages/_distutils_hack/__init__.py", line 59, in ensure_local_distutils

@rikardn rikardn added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 16, 2022
@SeanRBurton
Copy link

I'm willing to tackle this, but I'm not sure of the best approach:

  1. Disallow non-string keys unless orient="values".
  2. If orient="table", reconstruct the keys with the correct type using the schema. Raise an exception if the table has two keys which serialize to the same value, e.g. 1 and "1".
  3. Something else?

@rikardn
Copy link
Author

rikardn commented Mar 17, 2022

I can add my user perspective:

The use case I have is to serialize a DataFrame to json and be able to get back the exact same DataFrame regardless of type of Index etc. Given this use case it is really not interesting to select orient.

@gpapadok
Copy link

Playing around I found another issue.

col1 = [1.0, 2.0, 3.5, 6.75]
col2 = [2.1, 3.1, 4.1, 5.1]
col3 = [2.2, 3.3, 4.4, 5.5]
df = pd.DataFrame({'1': col1, '2':col2, 1:col3}, index=[110, 112, 113, 121])
df.index.name = 'ID'
s = df.to_json()
new = pd.read_json(s)

The df data frame has two columns with same name but different type:

        1    2    1
ID                 
110  1.00  2.1  2.2
112  2.00  3.1  3.3
113  3.50  4.1  4.4
121  6.75  5.1  5.5

By serializing and deserializing we get back new as:

       1    2
110  2.2  2.1
112  3.3  3.1
113  4.4  4.1
121  5.5  5.1

Column types also get converted from strings to ints:

In [118]: df.columns, new.columns
Out[118]: (Index(['1', '2', 1], dtype='object'), Int64Index([1, 2], dtype='int64'))

@mroeschke mroeschke changed the title BUG: Integer column index breaks json roundtrip BUG: Integer column index breaks json roundtrip with orient=table Jul 6, 2022
@mroeschke mroeschke added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2022
@jmg-duarte
Copy link
Contributor

jmg-duarte commented Sep 9, 2022

The issue is in:

content = lib.dicts_to_array(data, list(columns))

This is the result:
image

Called by:

df = DataFrame(table["data"], columns=col_order)[col_order]

I'm interested in helping more, do you where can I look further @mroeschke?

@mroeschke
Copy link
Member

Not too familiar with this code path, so I'm not exactly sure off the top of my head how nans are getting introduced, but further investigation welcome!

@jmg-duarte
Copy link
Contributor

I found the issue and have a fix, but the code I am touching is ~10 years old according to Gitlens git blame.

Either way, the problem is that when reading the JSON, pandas will have an array of dicts like so:

{"ID":110,"1":1.0,"2":2.1}

Since JSON only supports strings as keys, we have an issue here, because we don't have a 1:1 mapping (https://stackoverflow.com/questions/1450957/pythons-json-module-converts-int-dictionary-keys-to-strings).

In the following code, the row will be a dictionary with strings as keys, and columns (and k) will be the proper column array.

pandas/pandas/_libs/lib.pyx

Lines 432 to 455 in 1b2646a

@cython.wraparound(False)
@cython.boundscheck(False)
def dicts_to_array(dicts: list, columns: list):
cdef:
Py_ssize_t i, j, k, n
ndarray[object, ndim=2] result
dict row
object col, onan = np.nan
k = len(columns)
n = len(dicts)
result = np.empty((n, k), dtype='O')
for i in range(n):
row = dicts[i]
for j in range(k):
col = columns[j]
if col in row:
result[i, j] = row[col]
else:
result[i, j] = onan
return result

Meaning there is a mismatch as 1 is not in {"ID":110,"1":1.0,"2":2.1}, though "1" is.

I've tested the following fix:

- col = columns[j]
+ col = str(columns[j])

And it solves this issue, I'm currently running the test suite to check if nothing else got broken.

@jmg-duarte
Copy link
Contributor

I've finished running the test suite, before and after, this change seems to 4 introduce failures. Is there a way I can check the summary of both runs and diff between them, even if manually?

@coatless
Copy link

coatless commented Dec 5, 2022

@jmg-duarte did you end up solving the issue?

If not, @mroeschke could you suggest a way for @jmg-duarte to diff between summary runs? It's not ideal that the JSON being generated is invalid.

@jmg-duarte
Copy link
Contributor

@coatless I discussed a potential fix in #46392 (comment) but got no response as you can see :/

@Coderambling
Copy link

Hi @jmg-duarte, @coatless and @mroeschke . Do you know of any updates to this issue? It seems @jmg-duarte has identified the cause and proposed a solution with a small code change.

Is there a way forward, or will this potentially be solved by another PR that you know of?

@coatless
Copy link

Relies on @mroeschke & pandas team. Nothing has changed AFAIK.

@Coderambling
Copy link

Thanks for your reply @mroeschke !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

7 participants