Why storing types in DataFrames is a Bad Idea #127

Closed
bengsparks opened this issue Aug 14, 2022 · 0 comments
bengsparks commented Aug 14, 2022

TL;DR: To unpickle a DataFrame that contains type objects, every type stored therein must be imported prior to calling pandas.read_pickle.

Our initial implementation of the Tracer would store raw type objects in the DataFrame.
This was advantageous for us, as every instance's real type, its location on disk for import-related tasks, and its base class information were made available to anyone who loaded the DataFrame.
We had tested this on several basic types, such as str and int, then went on to trace a numpy test, which produced instances of numpy.ndarray. At this stage, the DataFrame still remained loadable.

Satisfied with this, we declared that we could build upon this DataFrame format and began implementing further features on top of it. Then, while working on an MRE for inline annotation generation that would use class names for the type hints, our DataFrame refused to load, reporting that the stored types could not be loaded.

As the TL;DR states, this problem was fixed by importing the relevant class prior to unpickling.
However, this was not a viable solution for us.
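For reference, the workaround amounts to roughly the following (a minimal sketch; the module name mypkg.shapes and the pickle path are placeholders, not our actual layout):

import importlib

import pandas as pd

# Make the defining module importable *before* unpickling, so pickle can
# resolve the raw type objects stored inside the DataFrame.
importlib.import_module("mypkg.shapes")  # hypothetical module defining the stored types

df = pd.read_pickle("/tmp/df_with_types.pkl")  # loads now that the types are importable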

The reason that even tracing the numpy test worked is that when we imported pandas for unpickling, numpy was imported as a dependency of pandas, which made numpy (and therefore numpy.ndarray) accessible:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import sys
>>> sys.modules["numpy"]
<module 'numpy' from '/home/benji/.cache/pypoetry/virtualenvs/solid-eureka-K39uNdS7-py3.10/lib/python3.10/site-packages/numpy/__init__.py'>

To reproduce: import pandas and numpy, store an object from numpy in the DataFrame, then pickle it to disk:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"a": [np.array([1, 2, 3])]}, columns=["a"])
>>> df
           a
0  [1, 2, 3]
>>> df.values[0]
array([array([1, 2, 3])], dtype=object)
>>> df.to_pickle("/tmp/df.pkl")

Then, import only pandas for the unpickling procedure:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_pickle("/tmp/df.pkl")
>>> df["a"].values[0]
array([1, 2, 3])

This was fixed by storing the module and qualified name of each type as strings within the DataFrame, which is sufficient to dynamically import the desired types from the project, the standard library, and from virtualenvs.
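A minimal sketch of that approach (the helpers describe_type and resolve_type are illustrative names, not the actual Tracer API):

import importlib

import pandas as pd

def describe_type(obj):
    # Record only strings: the defining module and the qualified class name.
    t = type(obj)
    return t.__module__, t.__qualname__

def resolve_type(module_name, qualname):
    # Import the module on demand and walk the qualified name,
    # so nested classes such as Outer.Inner also resolve.
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj

df = pd.DataFrame([describe_type([1, 2, 3])], columns=["module", "name"])
df.to_pickle("/tmp/df.pkl")

# The pickle now contains only strings, so unpickling needs no prior imports.
restored = pd.read_pickle("/tmp/df.pkl")
module_name, qualname = restored.iloc[0]
assert resolve_type(module_name, qualname) is list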
