
BUG/FEATURE: to_sql data types not reflecting types accurately #35347

@danieldjewell

Description


I'm not sure whether this falls into feature or bug territory, or is perhaps an "unfulfilled/unrealistic assumption". Further, my experience here might be the tip of the iceberg of a bigger underlying issue with to_sql() (or it could be a red herring that is only a problem for me 😁)

Conditions

DataFrame with multiple dtypes - my example has 200+ columns with dtypes of: {dtype('O'), dtype('int16'), dtype('uint8'), dtype('uint16'), dtype('uint32'), dtype('float64')}. My test case is using Postgres 12 as the destination DB.

Current Behavior

DataFrame.to_sql(...) does successfully write to Postgres. However, for the dtypes listed above, only 3 Postgres column types are created:

  1. bigint (for all integers)
  • equivalent to int64 (64-bit / 8-byte precision integer)
  2. double precision (for all floats)
  • technically not directly equivalent (I'm not having an issue, but others might)
  3. text (for all objects)

See: Postgres Numeric Data Types for reference.

The Issue/Effects

The ultimate issue is that the created table, although it appears to work, does not really conform to the data types in the DataFrame. Specifically, with regard to integer types. Postgres (unlike MySQL) doesn't have unsigned integer types but does have (generally) 3 integer types:

  • smallint (2-bytes / 16-bit integer)
  • integer (4-bytes / 32-bit integer)
  • bigint (8-bytes / 64-bit integer)
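To illustrate what a narrowest-fit mapping from NumPy integer dtypes to these Postgres types could look like, here is a sketch (the `PG_INT_TYPES` table and its fallbacks are hypothetical — this is what I'd like to see happen, not what pandas currently does):

```python
import numpy as np
import pandas as pd

# Hypothetical narrowest-fit mapping; unsigned dtypes are promoted one size up,
# since Postgres has no unsigned integer types.
PG_INT_TYPES = {
    np.dtype("int8"): "smallint",
    np.dtype("int16"): "smallint",
    np.dtype("uint8"): "smallint",
    np.dtype("int32"): "integer",
    np.dtype("uint16"): "integer",
    np.dtype("int64"): "bigint",
    np.dtype("uint32"): "bigint",
}

df = pd.DataFrame({
    "a": np.array([1, 2], dtype="int16"),
    "b": np.array([1, 2], dtype="uint32"),
    "c": np.array([1.5, 2.5], dtype="float64"),
})

# Fall back to the current behavior ("double precision"/"text") for other dtypes.
columns = {
    col: PG_INT_TYPES.get(
        df[col].dtype,
        "double precision" if df[col].dtype.kind == "f" else "text",
    )
    for col in df.columns
}
# columns -> {'a': 'smallint', 'b': 'bigint', 'c': 'double precision'}
```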

I'm still reading through the to_sql() code but if the docstring (quoted below) from SQLTable() is accurate, it would appear that Pandas is relying on SQLAlchemy to handle the conversion? If so, the assumption that SQLAlchemy is doing the conversion correctly/well looks to be unfounded.

pandas/pandas/io/sql.py

Lines 663 to 669 in bfac136

```python
class SQLTable(PandasObject):
    """
    For mapping Pandas tables to SQL tables.

    Uses fact that table is reflected by SQLAlchemy to
    do better type conversions.
    Also holds various flags needed to avoid having to
    pass them between functions all the time.
```

Some of the Impacts

Bigger tables. Presumably longer insert times. Potential conversion issues (I haven't tried feeding to_sql() a uint64 ...)
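On that last point, a quick check shows why uint64 would be a problem: Postgres bigint is a signed 64-bit integer, so the top half of the uint64 range cannot be represented in any Postgres integer type.

```python
import numpy as np

# Postgres bigint is a signed 64-bit integer: max value 2**63 - 1.
bigint_max = np.iinfo(np.int64).max

# NumPy's uint64 ranges up to 2**64 - 1, about twice the bigint maximum.
uint64_max = np.iinfo(np.uint64).max

assert uint64_max > bigint_max
```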

Thoughts

If my understanding is correct that Pandas is relying on SQLAlchemy to do the type conversions, I guess this could be seen as either a Pandas issue or an SQLAlchemy issue:

A. As an SQLAlchemy Issue

  • Pandas is feeding the data to SQLAlchemy with the assumption that it will convert the datatypes correctly for the various dialects (e.g. PostgreSQL, MySQL, SQLite)
  • Therefore, SQLAlchemy needs to do a better job of converting the input data

B. As a Pandas issue

  • AFAIK, SQLAlchemy was never designed around the plethora of "fully functional" data types provided by NumPy and used by Pandas - it was designed around Python's built-in data types, which are much simpler. So, SQLAlchemy can't be expected to make intelligent/informed decisions based on NumPy data types.
  • Therefore, Pandas needs to provide SQLAlchemy with more metadata (I'm not sure this is possible) and/or stop assuming that SQLAlchemy will do the conversion properly
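In the meantime, to_sql() does accept an explicit dtype= mapping, so a caller can pin the column types per column. A minimal sketch using the sqlite3 fallback connection (where dtype values may be plain strings; with a SQLAlchemy engine you would pass types such as sqlalchemy.types.SmallInteger instead — the table name "demo" and the type choices here are just for illustration):

```python
import sqlite3

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small": np.array([1, 2], dtype="int16"),
    "big": np.array([1, 2], dtype="int64"),
})

con = sqlite3.connect(":memory:")

# Explicitly pin the column types instead of relying on the default mapping.
df.to_sql("demo", con, index=False, dtype={"small": "SMALLINT", "big": "BIGINT"})

# The generated CREATE TABLE statement now carries the requested types.
ddl = con.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'demo'"
).fetchone()[0]
```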

Potential Further Impacts

I'm really not well versed enough in the Pandas/Numpy/SQLAlchemy code base to understand the potential impacts. However, I'm especially curious about the handling of float values and also about the interaction with other database engines. (For example, something that seems absolutely crazy to me about SQLite3: column data types aren't strict, they are more of a suggestion ... )

Summary/The Bottom Line/TL;DR

I'd like to see to_sql() create a table with the best (e.g. smallest/most appropriate) data type - preferably a close match to the DataFrame.dtype. Currently, at least in the case of Postgres, it does not.

Finally, perhaps someone with more knowledge than I could double check to see if the assumption that SQLAlchemy is actually reliably converting all datatypes is correct?
