
BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER #9206

Closed
simonvanderveldt opened this issue Sep 11, 2019 · 8 comments · Fixed by #9321

Comments

@simonvanderveldt

simonvanderveldt commented Sep 11, 2019

Versions:

google-cloud-bigquery==1.19.0

We were initially using 1.18.0 but I noticed #9044 was included in 1.19.0 so we tried that as well, but it made no difference.

We're using a pandas dataframe read from parquet; example data would be:

  id status              created_at      execution_date
0  1    NEW 2018-09-12 09:24:31.291 2019-05-10 17:40:00
1  2    NEW 2018-09-12 09:26:45.890 2019-05-10 17:40:00

[628 rows x 4 columns]

df.dtypes shows:

id                        object
status                    object
created_at        datetime64[ns]
execution_date    datetime64[ns]
dtype: object

When trying to load this into BigQuery using load_table_from_dataframe(), with the job config's time_partitioning set to bigquery.table.TimePartitioning(field="execution_date"), we get the following error:

The field specified for time partitioning can only be of type TIMESTAMP or DATE. The type found is: INTEGER.

This doesn't really make sense, since the field is clearly a datetime64.
The job config shown in the console looks correct (i.e. it's set to partition by day and it's using the correct field).
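
A minimal sketch that reproduces the failing call, assuming an authenticated client and an existing dataset; the dataset and table names here are illustrative:

import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {
        "id": ["1", "2"],
        "status": ["NEW", "NEW"],
        "created_at": pd.to_datetime(
            ["2018-09-12 09:24:31.291", "2018-09-12 09:26:45.890"]
        ),
        "execution_date": pd.to_datetime(
            ["2019-05-10 17:40:00", "2019-05-10 17:40:00"]
        ),
    }
)

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # hypothetical names
job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date")
)
# Raises: "The field specified for time partitioning can only be of type
# TIMESTAMP or DATE. The type found is: INTEGER."
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()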

edit:
It seems the cause of this is that DataFrame columns of type datetime64 are being converted to INTEGER instead of DATE (or TIMESTAMP? I'm not sure which one would be the correct type in BigQuery).

edit2:
Could it be that this mapping is wrong and it should be DATE instead of DATETIME?

@plamut plamut added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels Sep 11, 2019
@HemangChothani
Contributor

@simonvanderveldt Would you mind providing more details and, if possible, a reproducible code sample? Any other info would also be really helpful in investigating the cause, thanks!

@simonvanderveldt
Author

@HemangChothani Sure! I'll take some time tomorrow to create a reproducible case.

@plamut
Contributor

plamut commented Sep 14, 2019

Thanks for the report @simonvanderveldt, I was able to reproduce the issue.

It can be fixed by providing an explicit schema:

job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
    schema=[
        bigquery.SchemaField(name="id", field_type="STRING"),
        bigquery.SchemaField(name="status", field_type="STRING"),
        bigquery.SchemaField(name="created_at", field_type="TIMESTAMP"),
        bigquery.SchemaField(name="execution_date", field_type="TIMESTAMP"),
    ]
)
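
With that config the load call itself stays the same; a sketch, reusing the client, df, and table_ref names from the reproduction above:

load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to complete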

FWIW, autodetecting schema for new tables has recently been scheduled for deprecation, since it turned out that autodetection is unreliable in too many cases. One can see the pending deprecation warning by switching on these warnings at the top of the script (they are disabled in Python < 3.7):

import warnings
warnings.simplefilter("always", category=PendingDeprecationWarning) 
warnings.simplefilter("always", category=DeprecationWarning) 

This reveals the following in the output, just prior to the exception traceback:

/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'id'.
  warnings.warn(u"Unable to determine type of column '{}'.".format(column))
/home/peter/workspace/google-cloud-python/reproduce/reproduce_9206.py:52: PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.
  load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)

@plamut
Contributor

plamut commented Sep 23, 2019

@simonvanderveldt Just checking, did an explicit schema solve the issue for you? Or is it necessary to investigate this further? Thanks.

@simonvanderveldt
Author

simonvanderveldt commented Sep 23, 2019

@plamut sorry, I've been super busy with other stuff, which is why I didn't get to this yet.

Using an explicit schema won't solve the issue for us, since we're using load_table_from_dataframe() in a generic component that is used by our users. The users define the schema for the input data themselves, so the component needs to be generic.

So we don't want to/can't provide a schema in the code. We're using parquet files as input, which already contain all the required schema information; after reading in a parquet file, that schema carries over to the dataframe as well.
Since a dataframe already has a schema, which kind of "autodetecting schema for new tables" do you mean exactly? The mapping from dataframe/pandas/numpy types to BigQuery types?

P.S. We've switched the problematic (datetime) columns to datetime64[ns, UTC] after looking at

which does get correctly mapped to TIMESTAMP instead of INTEGER.
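
For reference, a sketch of that workaround using the standard pandas API (column name taken from the example above):

# Localize the naive datetime64[ns] column to UTC, giving datetime64[ns, UTC],
# which gets mapped to TIMESTAMP.
df["execution_date"] = df["execution_date"].dt.tz_localize("UTC")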

@tswast
Contributor

tswast commented Sep 23, 2019

@simonvanderveldt Are there any object dtype columns (e.g. for strings or NULL values) in the DataFrame you're providing? If so, the mapping you linked to (

) does not get used. Instead, the pandas to_parquet method is used, which we have less control over (which is why we have deprecated that path).

@plamut I wonder if we can avoid to_parquet altogether if pyarrow is installed? This is the problematic line:


Rather than return None when any column's type can't be determined, maybe we could explicitly set the SchemaField's type to None and fall back to pyarrow's type conversions when we convert from a pandas Series to a pyarrow Array?
return pyarrow.array(series, type=arrow_type)
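
A sketch of that fallback: when the type argument is None, pyarrow.array() infers the type from the values themselves (the series here is illustrative):

import pandas as pd
import pyarrow

series = pd.Series(["a", "b", None])  # object dtype in pandas
arrow_type = None                     # type could not be determined from the dtype
array = pyarrow.array(series, type=arrow_type)
print(array.type)                     # pyarrow infers: string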

@simonvanderveldt
Author

@tswast Yeah, we have two columns of dtype object (a list of all columns and their dtypes is in the description at the top), so it seems you're right and that might be what's happening.

@plamut
Contributor

plamut commented Sep 25, 2019

@tswast I am experimenting with that, but there are quite a few moving parts.

I did notice that pyarrow.Table types are correct (in a limited test case), and it might work to use that to fill in the missing types (if we send null values, the backend complains).

Update:
Seems like this could work - pyarrow correctly detects strings and dates (did not test other types yet), and this info can be used to set a correct field_type on every SchemaField() instance.

The devil is probably in the details, though, depending on how reliable pyarrow's detection is. But it seems that it would still be an improvement over the current state.
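
A sketch of the idea, assuming the conversion goes through pyarrow.Table.from_pandas (illustrative data; preserve_index=False keeps the dataframe index out of the schema):

import pandas as pd
import pyarrow

df = pd.DataFrame(
    {
        "id": ["1", "2"],  # object dtype in pandas
        "execution_date": pd.to_datetime(
            ["2019-05-10 17:40:00", "2019-05-10 17:40:00"]
        ),
    }
)

table = pyarrow.Table.from_pandas(df, preserve_index=False)
for field in table.schema:
    # pyarrow detects 'id' as string and 'execution_date' as timestamp[ns];
    # these types could be used to fill in the missing SchemaField types.
    print(field.name, field.type)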
