BigQuery: 'load_table_from_dataframe' raises OSError with STRUCT / RECORD columns. #9024

dabasmoti · 2019-08-13T14:59:47Z

pyarrow 0.14.0
pandas 0.24.2
Windows 10
Hi,
I am trying to load a dataframe into BigQuery that looks like this:

uid_first	agg_col
1001	[{'page_type': 1}, {'record_type': 1}, {'non_consectutive_home': 0}]

agg_col is a list of dicts.
I also tried a single dict.

Schema config:

schema = [
    bigquery.SchemaField("uid_first", "STRING", mode="NULLABLE"),
    bigquery.SchemaField(
        "agg_col", "RECORD", mode="NULLABLE",
        fields=[
            bigquery.SchemaField("page_type", "INTEGER", mode="NULLABLE"),
            bigquery.SchemaField("record_type", "INTEGER", mode="NULLABLE"),
            bigquery.SchemaField("non_consectutive_home", "INTEGER", mode="NULLABLE"),
        ],
    ),
]

Load command:

dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('table')
table = bigquery.Table(table_ref, schema=schema)
table = client.load_table_from_dataframe(dff, table).result()

The error message:

Traceback (most recent call last):

  File "<ipython-input-167-60a73e366976>", line 4, in <module>
    table = client.load_table_from_dataframe(dff, table).result()

  File "C:\ProgramData\Anaconda3\envs\p37\lib\site-packages\google\cloud\bigquery\client.py", line 1546, in load_table_from_dataframe
    os.remove(tmppath)

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpcxotr6mb_job_5c94b06f.parquet'


tseaver · 2019-08-13T18:35:32Z

@dabasmoti Is there another exception being raised when that os.remove() statement (which occurs in a finally: clause) raises this exception? Can you show the full traceback?

dabasmoti · 2019-08-13T18:40:52Z

@tseaver - No.
The client.load_table_from_dataframe() call comes right before it.

peter765 · 2019-08-14T19:17:10Z

I'm getting the same issue on mac.

Traceback (most recent call last):
  File "loadjob.py", line 19, in <module>
    job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1567, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/_v/wj4pm45x4txg4vv02kptkl7c0000gn/T/tmpvsbi2rrx_job_1cb60ca1.parquet'

tswast · 2019-08-14T20:25:49Z

Do you have write permissions to those temp directories?

We originally started using tempfiles because fastparquet does not support in-memory file objects, but I wonder if there are systems in which tempfiles cannot be created?

tswast · 2019-08-14T20:26:34Z

Note: pyarrow-0.14.0 had some bad pip wheels, so this may be related to that.

dabasmoti · 2019-08-14T20:31:57Z

@tswast - What version should I use?

dabasmoti · 2019-08-14T20:33:34Z

I should mention that the error occurs only when a dataframe column contains dict values.

dabasmoti · 2019-08-14T20:35:07Z

I am running as admin

tswast · 2019-08-14T20:39:45Z

what version should i use?

0.14.1 and 0.13.0 are good releases of pyarrow.

I have to mention that the error occur only when use type dict in the dataframe column

Thank you for mentioning this. STRUCT / RECORD columns are not yet supported by the pandas connector (https://github.com/googleapis/google-cloud-python/issues/8191). Neither are ARRAY / REPEATED columns, unfortunately (https://github.com/googleapis/google-cloud-python/issues/8544). Those issues are currently blocked on improvements to the Parquet file serialization logic.

@plamut Can you investigate this further? Hopefully pyarrow can provide an exception that we can catch when trying to write a table with unsupported data types to a parquet file. If no exception is thrown, perhaps we need to check for these and raise a ValueError?
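The "check for these and raise a ValueError" idea could look roughly like the sketch below. It is only an illustration, not the library's code: find_nested_columns and ensure_no_nested_columns are hypothetical names, and the check only looks for dict values. It accepts any iterable of (name, values) pairs, which is the shape that df.items() yields for a pandas DataFrame; a plain dict stands in for the dataframe here so the snippet runs without pandas.

```python
from typing import Any, Iterable, List, Tuple


def find_nested_columns(columns: Iterable[Tuple[str, Iterable[Any]]]) -> List[str]:
    """Return the names of columns that hold at least one dict value.

    Hypothetical helper: `columns` is any iterable of (name, values) pairs.
    """
    nested = []
    for name, values in columns:
        if any(isinstance(value, dict) for value in values):
            nested.append(name)
    return nested


def ensure_no_nested_columns(columns) -> None:
    """Raise ValueError up front instead of failing inside the Parquet writer."""
    nested = find_nested_columns(columns)
    if nested:
        raise ValueError(
            "STRUCT / RECORD columns are not supported: {}".format(", ".join(nested))
        )


# Plain-dict stand-in for the dataframe from the report.
sample = {
    "uid_first": ["1001", "1001", "1001"],
    "agg_col": [{"page_type": 1}, {"record_type": 1}, {"non_consectutive_home": 0}],
}
nested_columns = find_nested_columns(sample.items())
```

With a real dataframe the call would be ensure_no_nested_columns(df.items()), made before any temp file is written.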

plamut · 2019-08-15T16:59:49Z

TL;DR: pyarrow does not yet support serializing nested fields to parquet (there is an active PR for it, though), so for the time being we can either catch these exceptions and propagate them to users in an informative way, or detect nested columns ourselves without relying on pyarrow's exceptions.

I was able to reproduce the reported behavior. Using the posted code and the following dataframe:

import pandas

data = {
    "uid_first": "1001",
    "agg_col": [
        {"page_type": 1},
        {"record_type": 1},
        {"non_consectutive_home": 0},
    ]
}
df = pandas.DataFrame(data=data)

I got the following traceback in Python 3.6:

Traceback (most recent call last):
  File "/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/client.py", line 1552, in load_table_from_dataframe
    dataframe.to_parquet(tmppath, compression=parquet_compression)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/io/parquet.py", line 122, in write
    coerce_timestamps=coerce_timestamps, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1271, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 427, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../reproduce/reproduce_9024.py", line 41, in <module>
    load_job = client.load_table_from_dataframe(df, table)
  File "/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/client.py", line 1568, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr7gxstqv_job_2f382186.parquet'

Trying the same with Python 2.7, I only got the second part of the traceback, i.e. the OSError about a missing file - seems like @dabasmoti is using Python 2.7.

That was with pandas==0.24.2 and pyarrow==0.14.1, and the root cause in both Python versions was an ArrowInvalid error: "Nested column branch had multiple children."

We could try catching this error in client.load_table_from_dataframe() and act upon it.
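One possible shape for this, as a rough sketch only: guard the cleanup so that a missing temp file cannot hide the error raised by the writer. The names write_with_safe_cleanup and failing_writer are made up for illustration, write_func stands in for dataframe.to_parquet, and the simulated writer simply deletes the file and raises; this is not the actual client code.

```python
import os
import tempfile


def write_with_safe_cleanup(write_func):
    """Call write_func(path) on a temp path and always clean up afterwards.

    Hypothetical helper: write_func stands in for dataframe.to_parquet.
    """
    fd, tmppath = tempfile.mkstemp(suffix=".parquet")
    os.close(fd)
    try:
        write_func(tmppath)
    finally:
        # Guard the removal so a missing file cannot mask the original error.
        if os.path.isfile(tmppath):
            os.remove(tmppath)


def failing_writer(path):
    """Simulate a writer that leaves no file behind and then fails."""
    os.remove(path)
    raise ValueError("Nested column branch had multiple children")


try:
    write_with_safe_cleanup(failing_writer)
except ValueError as exc:
    # The writer's error reaches the caller, not a FileNotFoundError.
    surfaced_error = exc
```

The caller then sees the writer's own exception and can translate it into a more helpful message if desired.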

Edit:
FWIW, trying the same with pyarrow==0.13.0 produces a different error:

Traceback (most recent call last):
    ...
    raise NotImplementedError(str(arrow_type))
NotImplementedError: struct<non_consectutive_home: int64, page_type: int64, record_type: int64>

More recent versions of pyarrow do not raise NotImplementedError anymore when determining the logical type of composite types, and instead return 'object' for them, hence the difference.

dabasmoti · 2019-08-15T17:11:42Z

@plamut - I am using python 3.7

plamut · 2019-08-15T17:14:26Z

@dabasmoti I see, let me try with Python 3.7, too, just in case ... although the outcome should probably be the same.

Update:
The same error occurs:

pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

... which is then followed by the FileNotFoundError when trying to remove the temp .parquet file that was never created.
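The masking described here can be reproduced with the standard library alone. The snippet below is a minimal sketch with made-up names: a ValueError stands in for ArrowInvalid, and the path is assumed not to exist. In Python 3 the first exception stays reachable through __context__, which is what produces the "During handling of the above exception" section of the traceback.

```python
import os


def remove_in_finally(path):
    """Fail inside try, then fail again while cleaning up in finally."""
    try:
        raise ValueError("write failed")  # stands in for ArrowInvalid
    finally:
        os.remove(path)  # path does not exist, so this raises as well


try:
    remove_in_finally(os.path.join("no_such_dir_9024", "missing.parquet"))
except FileNotFoundError as exc:
    outer_error = exc
    # Python 3 chains the masked exception onto the one that replaced it.
    original_error = exc.__context__
```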

tseaver added the "api: bigquery" and "type: question" labels Aug 13, 2019

tseaver changed the title ~~Push Pandas DataFrame to BigQuery with nested column~~ BigQuery: 'load_table_from_dataframe' raises OSError. Aug 13, 2019

tswast assigned plamut Aug 14, 2019

tswast changed the title ~~BigQuery: 'load_table_from_dataframe' raises OSError.~~ BigQuery: 'load_table_from_dataframe' raises OSError with STRUCT / RECORD columns. Aug 14, 2019

tswast mentioned this issue Aug 16, 2019

BigQuery: Deprecate automatic schema conversion in load_table_from_dataframe #9042

Closed

4 tasks

plamut mentioned this issue Aug 18, 2019

BigQuery: Raise helpful error when loading table from dataframe with STRUCT columns #9053

Merged

plamut closed this as completed in #9053 Aug 21, 2019
