Schema conflict when storing dataframes with datetime objects using load_table_from_dataframe() #6542
@tatome I'm not reproducing this one:

```python
>>> import pandas as pd
>>> df = pd.DataFrame({'idx': [0, 1], 'test_column': ['2018-12-24', '2019-01-01']})
>>> df
   idx test_column
0    0  2018-12-24
1    1  2019-01-01
>>> pd.to_datetime(df['test_column'])
0   2018-12-24
1   2019-01-01
Name: test_column, dtype: datetime64[ns]
>>> df['test_column'] = pd.to_datetime(df['test_column']).values.astype('datetime64[ms]')
>>> df
   idx test_column
0    0  2018-12-24
1    1  2019-01-01
>>> df['test_column']
0   2018-12-24
1   2019-01-01
Name: test_column, dtype: datetime64[ns]
>>> df['test_column'][0]
Timestamp('2018-12-24 00:00:00')
>>> type(df['test_column'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
>>> df = df.set_index('idx')
>>> from google.cloud import bigquery
>>> client = bigquery.Client()
>>> ds = client.create_dataset('gcp_6542')
>>> tref = ds.table('test')
>>> job = client.load_table_from_dataframe(df, tref, location='US')
>>> print(job.result())
<google.cloud.bigquery.job.LoadJob object at 0x7f72a7adc6d8>
>>> job.done()
True
```

Can you suggest something I missed?
@tseaver I think you missed setting the schema first, because this way (I guess) you are using BigQuery's schema autodetection. I'm facing the same problem when trying to load data from a dataframe into a table with a previously defined schema.
Pre-defining schema:
Error when it runs:
Does it help? Thanks.
@tiagobevilaqua You are converting the
@tseaver So, it means that I cannot force it to be stored as
Not if you've already converted the value using the
@tseaver Sorry if I'm bothering you... I just tried using as
I'm getting a similar error message:
Can you suggest something I missed?
@tiagobevilaqua Not to worry. Summoning @tswast to see if he can help figure out the right path.
I think this has to do with the conversion from pandas dataframe to Parquet. Edit: Avro has a
I'll have to investigate further to see if Parquet has something equivalent. Then we'll have to convince pyarrow to use that type when encoding a dataframe with a naive datetime, or maybe it has a way to supply logical types directly?
I can't find an equivalent DATETIME in the Parquet logical types list at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. I've filed https://jira.apache.org/jira/browse/PARQUET-1545. In the meantime, we have a few options:
Thoughts?
Good to know that! I'll try a workaround with the 1st option and wait for the 2nd one. Thanks a lot for your help.
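The options list itself was lost from this transcript. Assuming the first option is converting naive datetimes to timezone-aware values (so the column loads as TIMESTAMP rather than DATETIME), a minimal sketch of that workaround:

```python
import pandas as pd

df = pd.DataFrame({"test_column": ["2018-12-24", "2019-01-01"]})

# Localize the parsed naive datetimes to UTC so they serialize through
# Parquet as timezone-aware timestamps (BigQuery TIMESTAMP).
df["test_column"] = pd.to_datetime(df["test_column"]).dt.tz_localize("UTC")

print(df["test_column"].dtype)  # datetime64[ns, UTC]
```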
https://jira.apache.org/jira/browse/PARQUET-1545 wasn't the cause. Parquet does support a logical type for naive datetimes and timezone-aware timestamps. Arrow also supports both naive and timezone-aware timestamps, but it seems that by default it converts to naive types (https://issues.apache.org/jira/browse/ARROW-4965). I haven't seen where
Closed by #8105 and #9064, which allow BigQuery types to be overridden by supplying a schema (see google-cloud-python/bigquery/samples/load_table_dataframe.py, lines 18 to 77 at cab728b).
Hi @tswast, I do not understand how supplying a schema to QueryJobConfig.schema can fix the problem. Can you provide an example specific to this issue (with a DATETIME field)? Thanks!
Unfortunately, the backend API for loading Parquet files (the serialization format we use for dataframes) does not support DATETIME values, yet.
Thank you for the clarification, @tswast!
The output
is produced by the following code: