Add ability to specify schema of pandas columns #603

Open
ektar opened this issue Feb 4, 2022 · 0 comments
ektar commented Feb 4, 2022

Inspired by #600; see the further discussion there.

Problem:

  • DataFrames have dimension checking (number of rows/columns) and column name checking, but no dtype checking.
  • Dtype checking would be particularly useful on deserialization: datetimes and numbers can be ambiguous in JSON, and are currently loaded as ints or strings, depending on how they were serialized.

Proposal:

  • Add a "schema" argument to the DataFrame init function that accepts a dict of dtypes, in the same formats as pandas' astype.
  • For the columns that have a schema set, check them in _validate. This would likely involve casting the columns specified in the schema and failing on error. The cast result could potentially also be saved back to the dataframe, so downstream code wouldn't have to do the casting.
  • Allow the schema to be used by the JSON deserializer: cast the specified columns after pandas.read_json.
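
A minimal sketch of the cast-and-fail check described above, written as a standalone function rather than as part of _validate. The name validate_schema and its signature are hypothetical and only illustrate the idea; they are not an existing API of this project.

```python
import pandas as pd


def validate_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast the columns named in `schema`; raise ValueError on any mismatch.

    Hypothetical helper for illustration only.
    `schema` maps column names to anything DataFrame.astype accepts.
    """
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"schema refers to missing columns: {sorted(missing)}")
    try:
        # astype with a dict only touches the listed columns.
        return df.astype(schema)
    except (TypeError, ValueError) as err:
        raise ValueError(f"DataFrame does not match schema: {err}") from err


df = pd.DataFrame({"count": ["1", "2"], "label": ["x", "y"]})

# Columns not named in the schema are left untouched.
checked = validate_schema(df, {"count": "int64"})
print(checked.dtypes)
```

Returning the cast frame, instead of only checking it, corresponds to the "save that to the dataframe" option, so downstream code receives columns that already have the requested dtypes.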

Example of the problem with dates. Full recovery is only possible when "iso" string output is used instead of epoch, and the column is then cast from str to datetime by pandas:

from io import StringIO

import pandas as pd
from IPython.display import display

df = pd.DataFrame({'a': [pd.Timestamp('20200309'),
                         pd.Timestamp('20200309')],
                   'b': [1, 2]})
# Also works with time-zone aware timestamps
# df = pd.DataFrame({'a': [pd.Timestamp('20200309T120000.000000-0000'),
#                          pd.Timestamp('20200309T130000.000000-0000')],
#                    'b': [1, 2]})
display(df)

# Case 1: default to_json() output, which writes datetimes as epoch values
df_json_1 = df.to_json()
display(df_json_1)
# Wrapped in StringIO because newer pandas versions deprecate passing a
# literal JSON string to read_json
df_deser_1 = pd.read_json(StringIO(df_json_1))
display(df_deser_1)
display(df_deser_1.dtypes)
# Cast the deserialized column to a tz-aware datetime dtype
df_deser_1 = df_deser_1.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_1)
display(df_deser_1.dtypes)

# Case 2: ISO 8601 string output
df_json_2 = df.to_json(date_format='iso')
display(df_json_2)
df_deser_2 = pd.read_json(StringIO(df_json_2))
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
# Cast the deserialized column to a tz-aware datetime dtype
df_deser_2 = df_deser_2.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
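
For comparison, here is a rough sketch of how a schema-aware deserializer could handle the "iso" case. The helper read_json_with_schema is a hypothetical name, not an existing function. It assumes the data was written with date_format='iso' and that the schema uses a dtype string accepted by astype. Passing convert_dates=False is a choice made for this sketch, so that only the schema decides which columns become datetimes.

```python
from io import StringIO

import pandas as pd


def read_json_with_schema(json_str: str, schema: dict) -> pd.DataFrame:
    """Parse JSON with pandas, then cast the columns listed in `schema`.

    Hypothetical helper for illustration only.
    """
    # Disable pandas' name-based date detection; the schema is authoritative.
    df = pd.read_json(StringIO(json_str), convert_dates=False)
    return df.astype(schema)


df = pd.DataFrame({
    "a": [pd.Timestamp("2020-03-09T12:00:00", tz="UTC"),
          pd.Timestamp("2020-03-09T13:00:00", tz="UTC")],
    "b": [1, 2],
})

# Serialize datetimes as ISO 8601 strings rather than epoch values.
payload = df.to_json(date_format="iso")

restored = read_json_with_schema(payload, {"a": "datetime64[ns, UTC]"})
print(restored.dtypes)
```

The same schema dict could then be reused for the dtype check at validation time, so the serializer and the validator agree on what each column should be.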


MridulS added this to the Wishlist milestone on Feb 7, 2022