Description
I'm not sure whether this falls into the feature/bug category or is perhaps an "unfulfilled/unrealistic assumption". Further, my experience here might be the tip of the iceberg of a bigger underlying issue with to_sql() (or it could be a red herring that is only a problem for me 😁)
Conditions
DataFrame with multiple dtypes - my example has 200+ columns with dtypes of: {dtype('O'), dtype('int16'), dtype('uint8'), dtype('uint16'), dtype('uint32'), dtype('float64')}. My test case is using Postgres 12 as the destination DB.
Current Behavior
DataFrame.to_sql(...) does write to Postgres. However, from the various dtypes listed above, only 3 Postgres column types are created (a minimal reproduction is sketched after this list):
- bigint (for all integers)
  - equivalent to int64 (a 64-bit / 8-byte integer)
- double precision (for all floats)
  - technically not a direct equivalent (I'm not having an issue here, but others might)
- text (for all objects)
See: Postgres Numeric Data Types for reference.
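For reference, here is a minimal reproduction sketch of what I'm seeing. The connection string, table name, and column names are just placeholders for my local setup:

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for a local Postgres 12 instance
engine = create_engine("postgresql://user:password@localhost:5432/testdb")

df = pd.DataFrame(
    {
        "small": np.array([1, 2, 3], dtype="int16"),
        "unsigned": np.array([1, 2, 3], dtype="uint8"),
        "medium": np.array([1, 2, 3], dtype="uint32"),
        "floats": np.array([1.0, 2.0, 3.0], dtype="float64"),
        "labels": ["a", "b", "c"],
    }
)

df.to_sql("dtype_test", engine, if_exists="replace", index=False)

# Inspecting the created table (e.g. \d dtype_test in psql) shows bigint for
# every integer column, double precision for the float column, and text for
# the object column.
```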
The Issue/Effects
The ultimate issue is that the created table, although it appears to work, does not really conform to the data types in the DataFrame - specifically with regard to the integer types. Postgres (unlike MySQL) doesn't have unsigned integer types, but it does have (generally) 3 integer types:
- smallint (2-byte / 16-bit integer)
- integer (4-byte / 32-bit integer)
- bigint (8-byte / 64-bit integer)
I'm still reading through the to_sql() code, but if the docstring (quoted below) from SQLTable() is accurate, it would appear that Pandas is relying on SQLAlchemy to handle the conversion. If so, the assumption that SQLAlchemy is doing that conversion correctly/well looks to be unfounded.
Lines 663 to 669 in bfac136:

```python
class SQLTable(PandasObject):
    """
    For mapping Pandas tables to SQL tables.
    Uses fact that table is reflected by SQLAlchemy to
    do better type conversions.
    Also holds various flags needed to avoid having to
    pass them between functions all the time.
```
Some of the Impacts
Bigger tables, presumably longer insert times, and potentially conversion issues (I haven't tried feeding to_sql() a uint64 ...).
Thoughts
If my understanding is correct that Pandas is relying on SQLAlchemy to do the type conversions, I guess this could be seen as either a Pandas issue or an SQLAlchemy issue:
A. As an SQLAlchemy Issue
- Pandas is feeding the data to SQLAlchemy with the assumption that it will correctly convert the datatypes into the various dialects (e.g. PostgreSQL, MySQL, SQLite, etc.)
- Therefore, SQLAlchemy needs to do a better job of converting the input data
B. As a Pandas Issue
- AFAIK, SQLAlchemy was never designed around the plethora of "fully functional" data types provided by Numpy (and used by Pandas); it was designed around Python's much simpler native data types. So, SQLAlchemy can't be expected to make intelligent/informed decisions based on Numpy data types.
- Therefore, Pandas needs to either provide SQLAlchemy with more metadata (I'm not sure this is possible) or stop assuming that SQLAlchemy is going to do the conversion properly - a rough sketch of what such a mapping might look like is below.
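To illustrate option B, something along these lines could be applied before the table definition is handed to SQLAlchemy. This is purely illustrative - the function name and the exact promotion rules for unsigned types are my own assumptions, not anything that exists in the current code base:

```python
import numpy as np
from sqlalchemy.types import BigInteger, Float, Integer, SmallInteger, Text

def sqlalchemy_type_for(dtype: np.dtype):
    """Illustrative dtype -> SQLAlchemy type mapping (not the actual pandas code).

    Unsigned types are promoted to the next larger signed type because
    Postgres has no unsigned integers.
    """
    if dtype in (np.dtype("int8"), np.dtype("int16"), np.dtype("uint8")):
        return SmallInteger
    if dtype in (np.dtype("int32"), np.dtype("uint16")):
        return Integer
    if dtype in (np.dtype("int64"), np.dtype("uint32")):
        return BigInteger
    if dtype == np.dtype("uint64"):
        # uint64 can overflow bigint; it would need NUMERIC or an explicit error
        raise NotImplementedError("uint64 does not fit any Postgres integer type")
    if dtype.kind == "f":
        return Float(precision=53)  # maps to double precision in Postgres
    return Text
```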
Potential Further Impacts
I'm really not well versed enough in the Pandas/Numpy/SQLAlchemy code bases to understand the potential impacts. However, I'm especially curious about the handling of float values and about the interaction with other database engines. (For example, something that seems absolutely crazy to me about SQLite3: column data types aren't strict - they're more of a suggestion ...)
Summary/The Bottom Line/TL;DR
I'd like to see to_sql() create a table with the best (i.e. smallest appropriate) data type for each column - preferably a close match to the DataFrame dtype. Currently, at least in the case of Postgres, it does not.
Finally, perhaps someone with more knowledge than I have could double-check whether the assumption that SQLAlchemy is actually reliably converting all datatypes is correct?
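In the meantime, a per-call workaround is to pass the existing dtype= argument of to_sql() to override the defaults column by column (continuing the reproduction sketch above; the column names are just examples):

```python
from sqlalchemy.types import Integer, SmallInteger

# Override the default bigint mapping for the example columns
df.to_sql(
    "dtype_test",
    engine,
    if_exists="replace",
    index=False,
    dtype={"small": SmallInteger(), "medium": Integer()},
)
```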