Why storing types in DataFrames is a Bad Idea #127

Closed
bengsparks opened this issue Aug 14, 2022 · 0 comments
bengsparks commented Aug 14, 2022

TL;DR: To unpickle a DataFrame that contains type objects, every type stored therein must be imported prior to calling pandas.read_pickle.

Our initial implementation of the Tracer would store raw type objects in the DataFrame.
This was advantageous for us, as every instance's real type, its location on disk for import-related tasks, and its base class information were made available to anyone who loaded the DataFrame.
We had tested this on several basic types, such as str and int, then went on to trace a numpy test, which produced instances of numpy.ndarray. At this stage, the DataFrame still remained loadable.

Satisfied with this, we declared that we could build upon this DataFrame format and began implementing further features on top of it. Then, while working on an MRE for inline annotation generation that would use class names for the type hints, our DataFrame refused to load, reporting that the stored types could not be loaded.

As the TL;DR states, this problem was fixed by importing the relevant class prior to unpickling.
However, this was not a viable solution for us.
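For reference, the workaround amounts to roughly the following (a minimal sketch; the module name mypkg.shapes and the pickle path are placeholders, not our actual layout):

import importlib

import pandas as pd

# Make the defining module importable *before* unpickling, so pickle can
# resolve the raw type objects stored inside the DataFrame.
importlib.import_module("mypkg.shapes")  # hypothetical module defining the stored types

df = pd.read_pickle("/tmp/df_with_types.pkl")  # loads now that the types are importable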

The reason that even tracing the numpy test worked is that when we imported pandas for unpickling, numpy was imported as a dependency of pandas, which made numpy (and therefore numpy.ndarray) accessible:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import sys
>>> sys.modules["numpy"]
<module 'numpy' from '/home/benji/.cache/pypoetry/virtualenvs/solid-eureka-K39uNdS7-py3.10/lib/python3.10/site-packages/numpy/__init__.py'>

To reproduce: import pandas and numpy, store an object from numpy in the DataFrame, then pickle it to disk:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"a": [np.array([1, 2, 3])]}, columns=["a"])
>>> df
           a
0  [1, 2, 3]
>>> df.values[0]
array([array([1, 2, 3])], dtype=object)
>>> df.to_pickle("/tmp/df.pkl")

Then, import only pandas for the unpickling procedure:

λ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_pickle("/tmp/df.pkl")
>>> df["a"].values[0]
array([1, 2, 3])

This was fixed by storing the module and qualified name of each type as strings within the DataFrame, which is sufficient to dynamically import the desired types from the project, the standard library, and from virtualenvs.
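A minimal sketch of that approach (the helpers describe_type and resolve_type are illustrative names, not the actual Tracer API):

import importlib

import pandas as pd

def describe_type(obj):
    # Record only strings: the defining module and the qualified class name.
    t = type(obj)
    return t.__module__, t.__qualname__

def resolve_type(module_name, qualname):
    # Import the module on demand and walk the qualified name,
    # so nested classes such as Outer.Inner also resolve.
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj

df = pd.DataFrame([describe_type([1, 2, 3])], columns=["module", "name"])
df.to_pickle("/tmp/df.pkl")

# The pickle now contains only strings, so unpickling needs no prior imports.
restored = pd.read_pickle("/tmp/df.pkl")
module_name, qualname = restored.iloc[0]
assert resolve_type(module_name, qualname) is list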
