
BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER #9206

Closed
simonvanderveldt opened this issue Sep 11, 2019 · 8 comments · Fixed by #9321

Comments

@simonvanderveldt

simonvanderveldt commented Sep 11, 2019

Versions:

google-cloud-bigquery==1.19.0

We were initially using 1.18.0 but I noticed #9044 was included in 1.19.0 so we tried that as well, but it made no difference.

We're using a pandas dataframe read from parquet; example data would be:

  id status              created_at      execution_date
0  1    NEW 2018-09-12 09:24:31.291 2019-05-10 17:40:00
1  2    NEW 2018-09-12 09:26:45.890 2019-05-10 17:40:00

[628 rows x 4 columns]

df.dtypes shows:

id                        object
status                    object
created_at        datetime64[ns]
execution_date    datetime64[ns]
dtype: object

When trying to load this into BigQuery using load_table_from_dataframe(), with the job config's time_partitioning set to bigquery.table.TimePartitioning(field="execution_date"), we get the following error:

The field specified for time partitioning can only be of type TIMESTAMP or DATE. The type found is: INTEGER.

This doesn't really make sense, since the field is clearly a datetime64.
The job config shown in the console looks correct (i.e. it's set to partition by day and it's using the correct field).
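
A minimal sketch that reproduces the failing call, assuming an authenticated client and an existing dataset; the dataset and table names here are illustrative:

import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {
        "id": ["1", "2"],
        "status": ["NEW", "NEW"],
        "created_at": pd.to_datetime(
            ["2018-09-12 09:24:31.291", "2018-09-12 09:26:45.890"]
        ),
        "execution_date": pd.to_datetime(
            ["2019-05-10 17:40:00", "2019-05-10 17:40:00"]
        ),
    }
)

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # hypothetical names
job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date")
)
# Raises: "The field specified for time partitioning can only be of type
# TIMESTAMP or DATE. The type found is: INTEGER."
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()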

edit:
It seems the cause of this is that DataFrame columns of type datetime64 are being converted to INTEGER instead of DATE (or TIMESTAMP? I'm not sure which one would be the correct type in BigQuery).

edit2:
Could it be that this mapping is wrong and it should be DATE instead of DATETIME?

@plamut plamut added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels Sep 11, 2019
@HemangChothani
Contributor

@simonvanderveldt Would you mind providing more details and, if possible, a reproducible code sample? Any other info would also be really helpful in investigating the cause, thanks!

@simonvanderveldt
Author

@HemangChothani Sure! I'll take some time tomorrow to create a reproducible case.

@plamut
Contributor

plamut commented Sep 14, 2019

Thanks for the report @simonvanderveldt, I was able to reproduce the issue.

It can be fixed by providing an explicit schema:

job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
    schema=[
        bigquery.SchemaField(name="id", field_type="STRING"),
        bigquery.SchemaField(name="status", field_type="STRING"),
        bigquery.SchemaField(name="created_at", field_type="TIMESTAMP"),
        bigquery.SchemaField(name="execution_date", field_type="TIMESTAMP"),
    ]
)
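
With that config the load call itself stays the same; a sketch, reusing the client, df, and table_ref names from the reproduction above:

load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to complete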

FWIW, autodetecting schema for new tables has recently been scheduled for deprecation, since it turned out that autodetection is unreliable in too many cases. One can see the pending deprecation warning by switching on these warnings at the top of the script (they are disabled in Python < 3.7):

import warnings
warnings.simplefilter("always", category=PendingDeprecationWarning) 
warnings.simplefilter("always", category=DeprecationWarning) 

This reveals the following in the output, just prior to the exception traceback:

/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'id'.
  warnings.warn(u"Unable to determine type of column '{}'.".format(column))
/home/peter/workspace/google-cloud-python/reproduce/reproduce_9206.py:52: PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.
  load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)

@plamut
Contributor

plamut commented Sep 23, 2019

@simonvanderveldt Just checking, did an explicit schema solve the issue for you? Or is it necessary to investigate this further? Thanks.

@simonvanderveldt
Author

simonvanderveldt commented Sep 23, 2019

@plamut sorry, I've been super busy with other stuff, which is why I didn't get to this yet.

Using an explicit schema won't solve the issue for us, since we're using load_table_from_dataframe() in a generic component that is used by our users. The users define the schema for the input data themselves, so the component needs to be generic.

So we don't want to/can't provide a schema in the code. We're using parquet files as input, which already contain all the required schema information; after reading in a parquet file, that schema carries over to the dataframe as well.
Since a dataframe already has a schema, which kind of "autodetecting schema for new tables" do you mean exactly? The mapping from dataframe/pandas/numpy types to BigQuery types?

P.S. We've switched the problematic (datetime) columns to datetime64[ns, UTC] after looking at

which does get correctly mapped to TIMESTAMP instead of INTEGER.
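
For reference, a sketch of that workaround using the standard pandas API (column name taken from the example above):

# Localize the naive datetime64[ns] column to UTC, giving datetime64[ns, UTC],
# which gets mapped to TIMESTAMP.
df["execution_date"] = df["execution_date"].dt.tz_localize("UTC")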

@tswast
Contributor

tswast commented Sep 23, 2019

@simonvanderveldt Are there any object dtype columns (e.g. for strings or NULL values) in the DataFrame you're providing? If so, the mapping you linked to (

) does not get used. Instead, the pandas to_parquet method is used, which we have less control over (which is why we have deprecated that path).

@plamut I wonder if we can avoid to_parquet altogether if pyarrow is installed? This is the problematic line:


Rather than return None when any column's type can't be determined, maybe we could explicitly set the SchemaField's type to None and fall back to pyarrow's type conversions when we convert from a pandas Series to a pyarrow Array?
return pyarrow.array(series, type=arrow_type)
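
A sketch of that fallback: when the type argument is None, pyarrow.array() infers the type from the values themselves (the series here is illustrative):

import pandas as pd
import pyarrow

series = pd.Series(["a", "b", None])  # object dtype in pandas
arrow_type = None                     # type could not be determined from the dtype
array = pyarrow.array(series, type=arrow_type)
print(array.type)                     # pyarrow infers: string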

@simonvanderveldt
Author

@tswast Yeah, we have two columns of dtype object (a list of all columns and their dtypes is in the description at the top), so it seems you're right and that might be what's happening.

@plamut
Contributor

plamut commented Sep 25, 2019

@tswast I am experimenting with that, but there are quite a few moving parts.

I did notice that pyarrow.Table types are correct (in a limited test case), and it might work to use that to fill in the missing types (if we send null values, the backend complains).

Update:
Seems like this could work - pyarrow correctly detects strings and dates (did not test other types yet), and this info can be used to set a correct field_type on every SchemaField() instance.

The devil is probably in the details, though, depending on how reliable pyarrow's detection is. But it seems that it would still be an improvement over the current state.
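
A sketch of the idea, assuming the conversion goes through pyarrow.Table.from_pandas (illustrative data; preserve_index=False keeps the dataframe index out of the schema):

import pandas as pd
import pyarrow

df = pd.DataFrame(
    {
        "id": ["1", "2"],  # object dtype in pandas
        "execution_date": pd.to_datetime(
            ["2019-05-10 17:40:00", "2019-05-10 17:40:00"]
        ),
    }
)

table = pyarrow.Table.from_pandas(df, preserve_index=False)
for field in table.schema:
    # pyarrow detects 'id' as string and 'execution_date' as timestamp[ns];
    # these types could be used to fill in the missing SchemaField types.
    print(field.name, field.type)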
