BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER #9206
Comments
@simonvanderveldt Would you mind providing more details and, if possible, a reproducible code sample? Any other info would also be really helpful for investigating the cause, thanks!
@HemangChothani Sure! I'll take some time tomorrow to create a reproducible case.
Thanks for the report @simonvanderveldt, I was able to reproduce the issue. It can be fixed by providing an explicit schema:

```python
job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
    schema=[
        bigquery.SchemaField(name="id", field_type="STRING"),
        bigquery.SchemaField(name="status", field_type="STRING"),
        bigquery.SchemaField(name="created_at", field_type="TIMESTAMP"),
        bigquery.SchemaField(name="execution_date", field_type="TIMESTAMP"),
    ],
)
```

FWIW, autodetecting the schema for new tables has recently been scheduled for deprecation, since it turned out that autodetection is unreliable in too many cases. One can see the pending deprecation warning by switching on these warnings at the top of the script (they are disabled by default in Python < 3.7):

```python
import warnings

warnings.simplefilter("always", category=PendingDeprecationWarning)
warnings.simplefilter("always", category=DeprecationWarning)
```

This reveals the following in the output, just prior to the exception traceback:
@simonvanderveldt Just checking, did an explicit schema solve the issue for you? Or is it necessary to investigate this further? Thanks.
@plamut sorry, super busy with other stuff, that's why I didn't get to this yet. Using an explicit schema won't solve the issue for us, since we're using […], so we don't want to/can't provide a schema in the code. We're using parquet files as input, which already contain all the required schema information; after reading in a parquet file, that schema carries over to the dataframe as well.

P.S. We've switched the problematic (datetime) columns to TIMESTAMP instead of integer.
@simonvanderveldt Are there any […]? The `to_parquet` method is used, which we have less control over (thus why we have deprecated that path).
@plamut I wonder if we can avoid […]. Rather than returning `None` when any column's type can't be determined, maybe we could explicitly set the `SchemaField`'s type to `None` and fall back to pyarrow's type conversions when we convert from a pandas Series to a pyarrow Array?
@tswast Yeah, we have two columns that are of dtype […].
@tswast I am experimenting with that, but there are quite a few moving parts. I did notice that […]

Update: The devil is probably in the detail, though, depending on how reliable pyarrow's detection is. But it seems it would still be an improvement over the current state.
Versions:
We were initially using 1.18.0 but I noticed #9044 was included in 1.19.0 so we tried that as well, but it made no difference.
We're using a pandas dataframe read from parquet; example data would be […]

`df.dtypes` shows: […]

When trying to load this into BigQuery using `load_table_from_dataframe()` and setting the job config's `time_partitioning` to `bigquery.table.TimePartitioning(field="execution_date")`, we get the following error: […]

Which doesn't really make sense, since the field is clearly a `datetime64`. The job config shown in the console looks correct (i.e. it's set to partition by day and it's using the correct field).
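A minimal reconstruction of such a dataframe (column names taken from the schema shown elsewhere in the thread; the values are invented) confirms the partitioning field really is `datetime64`:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe read from parquet.
df = pd.DataFrame(
    {
        "id": ["a", "b"],
        "status": ["done", "failed"],
        "created_at": pd.to_datetime(
            ["2019-09-03 10:00:00", "2019-09-03 11:00:00"]
        ),
        "execution_date": pd.to_datetime(["2019-09-01", "2019-09-02"]),
    }
)
print(df.dtypes)  # execution_date shows as datetime64[ns]
```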
edit:
It seems the cause for this is that dataframe columns of type `datetime64` are being converted to type `INTEGER` instead of `DATE` (or `TIMESTAMP`? I'm not sure which one would be the correct type in BigQuery).

edit2:
Could it be this mapping is wrong and it should be `DATE` instead of `DATETIME`?

google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py
Line 55 in dce1326
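To make the suspected failure mode concrete, here is an illustrative sketch of a dtype-name lookup in the spirit of `_pandas_helpers.py`; the dict contents and the `bq_type` helper are invented for illustration, not the library's actual table:

```python
from typing import Optional

import pandas as pd

# Invented lookup table for illustration; a missing or wrong entry for
# "datetime64[ns]" would explain the column being loaded as INTEGER.
_DTYPE_TO_BQ = {
    "datetime64[ns]": "TIMESTAMP",
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
}


def bq_type(series: pd.Series) -> Optional[str]:
    # None means "unknown"; callers could then fall back to pyarrow
    # inference or require an explicit schema instead of guessing.
    return _DTYPE_TO_BQ.get(str(series.dtype))


print(bq_type(pd.Series(pd.to_datetime(["2019-09-01"]))))  # TIMESTAMP
```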