Add ability to specify schema of pandas columns #603

ektar · 2022-02-04T16:56:11Z

Inspired by #600, further discussion there

Problem:

DataFrames have dimension checking (number of rows/columns) and column name checking, but no dtype checking.
This would be particularly useful on deserialization: datetimes and numbers can be ambiguous in JSON, and are currently loaded as ints or strings, depending on how they were serialized.

Proposal:

Add a "schema" argument to the DataFrame init function that accepts a dict of dtypes, in the same formats as pandas' astype.
For the columns that have a schema set, check them in _validate. This would likely involve casting the columns specified in the schema and failing on error. The cast result could potentially also be saved to the dataframe so that downstream code would not have to do the casting.
Allow the schema to be used by the JSON deserializer: cast the specified columns after pandas.read_json.
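A minimal sketch of the cast-and-fail check described above, written as a standalone function. The name `apply_schema` and its error messages are hypothetical and are not part of any existing API; the only pandas call it relies on is `DataFrame.astype` with a dict of dtypes.

```python
import pandas as pd


def apply_schema(df, schema):
    """Cast the columns named in `schema`; raise ValueError if that fails.

    Hypothetical helper for illustration only. `schema` is a dict mapping
    column names to dtypes, in the formats accepted by DataFrame.astype.
    """
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"schema refers to missing columns: {sorted(missing)}")
    try:
        # Columns not named in the schema are left untouched.
        return df.astype(schema)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"DataFrame does not match schema: {exc}") from exc


df = pd.DataFrame({'a': ['1', '2'], 'b': [1.5, 2.5]})
cast = apply_schema(df, {'a': 'int64'})
```

Returning the cast frame, rather than only checking it, corresponds to the "save that to the dataframe" option, so downstream code would receive columns that already have the requested dtypes.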

Example of the problem with dates: full recovery is only possible when "iso" string output is used instead of epoch, and the column is cast from str to datetime by pandas:

import io

import pandas as pd
from IPython.display import display

df = pd.DataFrame({'a': [pd.Timestamp('20200309'),
                         pd.Timestamp('20200309')],
                   'b': [1, 2]})
# Also works with time-zone aware timestamps
# df = pd.DataFrame({'a': [pd.Timestamp('20200309T120000.000000-0000'),
#                          pd.Timestamp('20200309T130000.000000-0000')],
#                    'b': [1, 2]})
display(df)

# Case 1: default (epoch) date output
df_json_1 = df.to_json()
display(df_json_1)
# StringIO avoids the deprecation warning newer pandas emits for literal JSON strings
df_deser_1 = pd.read_json(io.StringIO(df_json_1))
display(df_deser_1)
display(df_deser_1.dtypes)
# Casting the epoch values does not fully recover the original timestamps
df_deser_1 = df_deser_1.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_1)
display(df_deser_1.dtypes)

# Case 2: "iso" string date output
df_json_2 = df.to_json(date_format='iso')
display(df_json_2)
df_deser_2 = pd.read_json(io.StringIO(df_json_2))
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
# Casting the ISO strings to a datetime dtype recovers the timestamps
df_deser_2 = df_deser_2.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
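For the deserializer part of the proposal, the cast could be folded into a single step after `pandas.read_json`. The following is only a sketch: `read_json_with_schema` is a hypothetical name, and it assumes the JSON was written with `date_format='iso'`, since epoch integers would be interpreted in the wrong unit by a plain cast.

```python
import io

import pandas as pd


def read_json_with_schema(json_str, schema):
    """Read a JSON string, then cast the columns named in `schema`.

    Hypothetical helper for illustration only.
    """
    # convert_dates=False stops pandas from guessing which columns are
    # dates, so the schema alone decides the resulting dtypes.
    df = pd.read_json(io.StringIO(json_str), convert_dates=False)
    return df.astype(schema)


df = pd.DataFrame({'a': [pd.Timestamp('20200309'),
                         pd.Timestamp('20200309')],
                   'b': [1, 2]})
schema = {'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')}
restored = read_json_with_schema(df.to_json(date_format='iso'), schema)
```

As in the example above, the naive timestamps come back as UTC-aware values, because that is the dtype the schema requests.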

MridulS assigned jlstevens Feb 7, 2022

MridulS added this to the Wishlist milestone Feb 7, 2022
