Cannot append to REQUIRED field when using client.load_table_from_file without providing table schema #1981

Open
henryharbeck opened this issue Jul 20, 2024 · 8 comments

@henryharbeck

Environment details

  • OS type and version: Ubuntu 20.04 (WSL)
  • Python version: 3.11.8
  • pip version: 24.1.2
  • google-cloud-bigquery version: 3.25.0

Steps to reproduce

  1. Create a table that has a single field with mode=REQUIRED
  2. Attempt to append to the table using client.load_table_from_file with a parquet file written from memory to a BytesIO buffer. The library writing to the buffer either does not have an option of nullable/required fields (Polars), or nullable=False is provided to the field (PyArrow).

Issue details

I am unable to use client.load_table_from_file to append to an existing table with a REQUIRED field without providing the table schema in the LoadJobConfig. The docs say that the schema does not need to be supplied if the table already exists.

This issue is similar to googleapis/google-cloud-python#8093, but relates to load_table_from_file rather than load_table_from_dataframe. It is also somewhat related to googleapis/google-cloud-python#8142 (as explicitly supplying the BigQuery table schema fixes the issue), but again this relates to load_table_from_file rather than load_table_from_dataframe.

As an aside, the fix should definitely not require PyArrow. The current Polars code functions without PyArrow if the BigQuery table schema is provided.

I am filing this as a bug rather than a feature request as the docs for schema in JobConfigurationLoad say

The schema can be omitted if the destination table already exists

This does not hold up in the example below.

Code example

Apologies in advance that the example is a bit long.

It demonstrates that Parquet files written to BytesIO buffers by both Polars and PyArrow cannot be appended to a BigQuery table with a mode=REQUIRED field.

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

import polars as pl

PROJECT = "<project>"


def create_and_return_table(table_name: str, client: bigquery.Client) -> bigquery.Table:
    schema = [bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED")]
    table = bigquery.Table(f"{PROJECT}.testing.{table_name}", schema=schema)

    client.delete_table(table, not_found_ok=True)
    return client.create_table(table)


def polars_way(table: bigquery.Table, client: bigquery.Client):
    df = pl.DataFrame({"foo": [1, 2, 3]})

    with BytesIO() as stream:
        df.write_parquet(stream)

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )

        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def pyarrow_way(table: bigquery.Table, client: bigquery.Client):
    # nullable=True is the default, but make it explicit
    # This issue does not occur if nullable=False, but the docs imply that shouldn't be
    # required
    pyarrow_schema = pa.schema([pa.field("foo", pa.int64(), nullable=True)])
    pyarrow_table = pa.Table.from_pydict({"foo": [1, 2, 3]}, schema=pyarrow_schema)

    with BytesIO() as stream:
        writer = pq.ParquetWriter(stream, pyarrow_schema)
        writer.write(pyarrow_table)
        writer.close()

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )
        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def main():
    client = bigquery.Client()
    table = create_and_return_table("test_pl", client)
    polars_way(table, client)
    table = create_and_return_table("test_pa", client)
    pyarrow_way(table, client)


if __name__ == "__main__":
    main()

Stack trace

Both polars_way and pyarrow_way fail with the same error. Here are both tracebacks.

# polars_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 93, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 80, in main
    polars_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 41, in polars_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pl.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pl. Field foo has changed mode from REQUIRED to NULLABLE

# pyarrow_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 86, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 82, in main
    pyarrow_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 74, in pyarrow_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pa.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pa. Field foo has changed mode from REQUIRED to NULLABLE
@tswast
Contributor

tswast commented Jul 29, 2024

For some additional context, load_table_from_dataframe makes a call to get_table if no schema override is supplied by the user:

table = self.get_table(destination)
to solve a very similar problem (googleapis/google-cloud-python#8142, googleapis/google-cloud-python#8093).

@henryharbeck
Author

Thanks for the pointer @tswast
Would you accept a PR implementing something similar for load_table_from_file?
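
To make the idea concrete, here is roughly what I have in mind, written as a standalone helper (a sketch only; the helper name is made up and the real change would live inside load_table_from_file):

from google.api_core.exceptions import NotFound
from google.cloud import bigquery


def _maybe_fill_schema_from_table(
    client: bigquery.Client,
    destination: str,
    job_config: bigquery.LoadJobConfig,
) -> None:
    # Sketch: mirror load_table_from_dataframe by reusing the existing
    # table's schema when the caller has not supplied one.
    if job_config.schema or job_config.autodetect:
        return
    try:
        table = client.get_table(destination)
    except NotFound:
        # New table: let the backend infer the schema from the file.
        return
    job_config.schema = table.schema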

@chalmerlowe
Collaborator

Conceptually, I am amenable to accepting such a PR. Thoughts/Questions:

  • It does represent a hit on latency. There is a round trip cost to check the schema. For a large upload, this is likely negligible, but for many very small uploads, repeatedly checking the schema could become prohibitive (same as with the dataframe case).
  • Do you feel comfortable providing unit tests as well? (testing multiple use cases: local schema present/local schema not present... there may be more, but it is early in the morning. We can discuss what makes the most sense.)

@tswast How do you feel about this change? I am leaning toward this being a breaking change. Users will start having checks performed and having schemas set in ways that they are not expecting.

@tswast
Contributor

tswast commented Jul 30, 2024

Potentially, but we need to be very careful, as we don't want to call get_table twice from the load-from-dataframe code path. It might be worth a refactor to remove the call from load_table_from_dataframe and move it into load_table_from_file.

A few things we should test for / watch out for:

  • Only do the fetch logic when all of the following are true: the file type is parquet, no schema on the job config, no autodetect set, no nullable relaxation set (I've sketched this guard in code below the list).
  • Test with a file with fewer columns than the original.
  • Test with a file with reordered columns from the original.
  • Test with a file with more columns than the original (and field addition turned on)
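
In code, the guard from the first bullet might look roughly like this (a sketch built on the public LoadJobConfig properties; the helper name is hypothetical):

from google.cloud import bigquery


def _should_fetch_table_schema(job_config: bigquery.LoadJobConfig) -> bool:
    # Hypothetical guard: only fall back to get_table for parquet loads with
    # no user-supplied schema, no autodetect, and no field relaxation.
    relaxation_allowed = bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION in (
        job_config.schema_update_options or []
    )
    return (
        job_config.source_format == bigquery.SourceFormat.PARQUET
        and job_config.schema is None
        and not job_config.autodetect
        and not relaxation_allowed
    )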

As far as extra latency goes, I agree that's a concern. I would much prefer this get addressed in the backend. At first glance that seems feasible: as seen in this issue, the initial schema nullability check appears to be redundant with the per-row validation that checks for NULLs.

@tswast
Contributor

tswast commented Jul 30, 2024

Regarding whether it's a breaking change, that's a concern for me too. I would want to make sure we've validated every possible combination of local data and table schema if we were to accept such a change.

@tswast
Contributor

tswast commented Jul 30, 2024

I would much prefer this get addressed in the backend.

I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.

@henryharbeck
Author

Thank you both for the responses, and apologies for taking a while to come back.

I would much prefer this get addressed in the backend.

I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.

Yes, that would be my ideal preference as well. As stated in the issue description, the current behaviour when using load_table_from_file could be argued to be a bug, given that the docs for the schema parameter in JobConfigurationLoad say "The schema can be omitted if the destination table already exists". Yet, providing a schema explicitly is currently the only way to append data into a REQUIRED field using this method.
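
For anyone else who hits this in the meantime, the explicit-schema workaround looks roughly like this (a sketch; it assumes the destination table already exists, and the parquet file path is just a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
destination = "<project>.testing.test_pl"  # existing table with the REQUIRED field

# Workaround: copy the existing table's schema onto the load job so the parquet
# file's nullable fields are not relaxed from REQUIRED to NULLABLE.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema=client.get_table(destination).schema,
)

with open("data.parquet", "rb") as stream:
    client.load_table_from_file(
        stream, destination=destination, job_config=job_config
    ).result()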

So, I would agree that an internal fix that does not require fetching the schema would be the ideal solution.

If that does not come to pass, then see below for responses to the other comments:

  • It does represent a hit on latency. There is a round trip cost to check the schema.
    • Yes, I definitely acknowledge this. Another option could be to not make this the default behaviour for load_table_from_file (which would then differ from load_table_from_dataframe)
  • Do you feel comfortable providing unit tests as well?
    • Yes, gladly
  • We need to be very careful, as we don't want to call get_table twice from the load-from-dataframe code path. It might be worth a refactor to remove the call from load_table_from_dataframe and move it into load_table_from_file.
    • I agree multiple calls to get_table should be avoided, so a refactor seems reasonable

I'll also mention that should #1979 be accepted and implemented for load_table_from_dataframe, that would remove my personal need for this issue. In that case, the current get_table path would also be applied to Polars DataFrames. But again, a fix in the backend that avoids the need to fetch the schema at all would be great in both cases!

If any updates on the internal issue can be provided here as they arise, that would be greatly appreciated. With that in mind, I will also hold off on a PR for the time being.

@henryharbeck
Author

henryharbeck commented Dec 18, 2024

I've filed internal issue 356379648 to investigate this option. If fixed in the backend, we should be able to remove the workaround from the pandas code path as well.

@tswast Has there been any movement on this issue?
