
BigQuery integration incorrect for list type columns #16938

Closed
2 tasks done
lmmx opened this issue Jun 13, 2024 · 4 comments · Fixed by #20292
Labels
A-io-parquet Area: reading/writing Parquet files bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@lmmx

lmmx commented Jun 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

If I write a `list[int]` type field (let's say `linked_ids`) from Polars to BigQuery with the example code

import io

from google.cloud import bigquery

client = bigquery.Client()

# Write DataFrame to stream as parquet file; does not hit disk
with io.BytesIO() as stream:
    df.write_parquet(stream)
    stream.seek(0)
    job = client.load_table_from_file(
        stream,
        destination="tablename",
        project="projectname",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
        ),
    )
job.result()  # Waits for the job to complete

What you get is a record named "linked_ids" within which is a record named "list" within which is a repeated integer field named "item"!

[Screenshot from 2024-06-13: BigQuery schema showing linked_ids as a RECORD containing a repeated "list" record with an "item" INTEGER field]

I can of course create this manually: a BigQuery table with an integer column and an array-of-integers column

CREATE TABLE `your-project.your_dataset.LouisExample` (
    id INT64,
    linked_ids ARRAY<INT64>
);

and write to it

INSERT INTO `your-project.your_dataset.LouisExample` (id, linked_ids)
VALUES
    (1, [101, 102, 103]),
    (2, [201, 202]),
    (3, [301, 302, 303, 304]);

Log output

No response

Issue description

I can see that this comes from pyarrow.parquet.write_table, but it causes a major discontinuity between Polars and BigQuery.

It's well known that list field types are one of Polars' main USPs, so this is an equally major hurdle when uploading to a database that supports array-type fields.

Expected behavior

I'd like some way to upload list-type columns, and am unclear on the best approach to achieve that here.

To my understanding, the point of streaming Parquet is compression and automatic schema application. Here the schema is essentially being lost in transit.

Perhaps the right approach is to go via JSON [without schema interference]? 🤔
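Newline-delimited JSON carries nested values natively, with no intermediate schema layer to lose in transit. A stdlib-only sketch of what such a payload looks like (the rows mirror the LouisExample table above):

```python
import io
import json

# Rows with a list-typed column, as in the LouisExample table
rows = [
    {"id": 1, "linked_ids": [101, 102, 103]},
    {"id": 2, "linked_ids": [201, 202]},
    {"id": 3, "linked_ids": [301, 302, 303, 304]},
]

# One JSON object per line -- the shape BigQuery's
# NEWLINE_DELIMITED_JSON source format expects
with io.StringIO() as stream:
    for row in rows:
        stream.write(json.dumps(row) + "\n")
    payload = stream.getvalue()

# Round-trip: every line parses back with the list values intact
decoded = [json.loads(line) for line in payload.splitlines()]
```

Because each list is just a JSON array, BigQuery can map it straight onto a `mode="REPEATED"` field with no nested-record wrapping.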

I appreciate this is in a dependency, but it's also the suggested way to use this library, so I think it falls within the remit of Polars development to consider how to make it work. I hope it's not seen as out of place to raise it in this issue tracker; for me it's a major usability concern.

Installed versions

--------Version info---------
Polars:              0.19.16
Index type:          UInt32
Platform:            Linux-5.15.0-87-generic-x86_64-with-glibc2.31
Python:              3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          3.5.2
numpy:               1.24.3
openpyxl:            <not installed>
pandas:              1.5.1
pyarrow:             9.0.0
pydantic:            2.5.3
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          1.4.49
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@lmmx lmmx added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 13, 2024
@lmmx
Author

lmmx commented Jun 13, 2024

I've confirmed that JSON works instead of Parquet for nested dtype fields:

  • Instead of write_parquet use write_ndjson
  • Instead of BytesIO use StringIO
  • Schema must be supplied if the table does not already exist, e.g.
        job = bq_client.load_table_from_file(
            stream,
            destination=dst_table,
            project=bq_project_id,
            job_config=bigquery.LoadJobConfig(
                schema=[
                    bigquery.SchemaField("id", "INT64"),
                    bigquery.SchemaField("linked_ids", "INT64", mode="REPEATED"),
                ],
                source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            ),
        )

@alexander-beedie alexander-beedie added the A-io-parquet Area: reading/writing Parquet files label Jun 14, 2024
@deanm0000
Collaborator

Are you aware of any library that can produce a parquet file that you can load into bigquery and have it load the way you expect?

If the answer is no, then there's nothing for Polars to do and it's something Google needs to fix.

If yes, then please supply such an example.

@henryharbeck
Contributor

henryharbeck commented Sep 2, 2024

I believe this bug is on the BigQuery side so I have raised it there - googleapis/python-bigquery#2008

I've reproduced the issue with parquet files written by Polars and PyArrow - so nothing seems to be astray on the Polars side.

@lmmx - I appreciate that it is frustrating to not have the method suggested in the user guide work, but in this case there is nothing Polars can do about it. Please feel free to thumbs up or comment on the BQ issue I raised.
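For readers landing here later: the google-cloud-bigquery client exposes a `ParquetOptions.enable_list_inference` flag that asks the load job to unwrap the Parquet three-level list encoding into an `ARRAY` column. A hedged config sketch, assuming the names here match your installed client library version (it cannot run without GCP credentials, so treat it as a starting point):

```python
from google.cloud import bigquery
from google.cloud.bigquery.format_options import ParquetOptions

# Assumption: with list inference enabled, a list[int] column such as
# linked_ids loads as ARRAY<INT64> rather than a nested RECORD
parquet_options = ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)
```

This `job_config` would then be passed to `load_table_from_file` exactly as in the original reproduction.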

@henryharbeck
Contributor

@lmmx @deanm0000 @alexander-beedie - I believe this can now be closed if one of you could please do so. There is nothing for Polars to do here. Details in previous comment.
