
BigQuery: Streaming insert returns successful but does not insert data after table update #7433

Closed
bgbraga opened this issue Feb 24, 2019 · 12 comments

bgbraga commented Feb 24, 2019

I know there is a known issue with inserting data after deleting and recreating a table:
#3822

The workaround is to wait for 2 minutes, as described in the comments there.

Here I had the same problem inserting data, but it does not seem to be related to time.

After updating a table, I always had problems inserting data using the insert_rows method, so I ran a test.

  1. Update the table:

     table.schema = schema
     bigquery_client.update_table(table, ['schema'])

  2. Try to insert data by calling insert_rows twice:

     bigquery_client.insert_rows(table, payload)
     bigquery_client.insert_rows(table, payload)

It always writes the payload only once after a table update.
The first call never works; only the second call does.

I think this shows that the case is not related to a time-based buffer cache. The table issue seems to be resolved after the first insert_rows operation.
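
For reference, a self-contained version of this test might look like the sketch below; the dataset and table names, the added phone field, and the payload are illustrative, not taken from the issue:

from google.cloud import bigquery

bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset('my_dataset').table('my_table')  # illustrative names
table = bigquery_client.get_table(table_ref)

# 1. Update the table schema, e.g. append a new field.
table.schema = list(table.schema) + [bigquery.SchemaField('phone', 'STRING')]
table = bigquery_client.update_table(table, ['schema'])

# 2. Insert the same payload twice, immediately after the update.
payload = [{'phone': '555-0100'}]  # illustrative row
first_errors = bigquery_client.insert_rows(table, payload)   # reportedly fails
second_errors = bigquery_client.insert_rows(table, payload)  # reportedly succeeds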

I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. It is also strange to keep a workaround like that in the code.

tseaver commented Feb 26, 2019

@tswast Does this ring any bells for you?

tswast commented Feb 26, 2019

@bgbraga Does this issue still occur if you wait 2 minutes? I believe the buffer issue might affect table updates as well as recreating a table, especially if the table update adds a new column to the schema. If this is the case, I'd expect the first call to insert_rows to fail due to an unexpected column.

/cc @shollyman who is working on additional streaming docs.

tswast commented Feb 26, 2019

I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. It is also strange to keep a workaround like that in the code.

If you specify a manual row ID, BigQuery will de-duplicate the writes.
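
A minimal sketch of that, assuming each row carries a stable unique key (the 'id' field here is hypothetical) and that insert_rows forwards row_ids to the streaming API as insertId:

# De-duplicate retried writes by giving every row a stable insertId.
row_ids = [str(row['id']) for row in payload]  # 'id' is a hypothetical key field
errors = bigquery_client.insert_rows(table, payload, row_ids=row_ids)

Within BigQuery's best-effort de-duplication window, a retry that reuses the same insertId is dropped by the backend instead of being appended a second time.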

tswast closed this as completed Mar 5, 2019

bgbraga commented Mar 6, 2019

@tswast sorry, I still have to test what occurs if I wait for 2 minutes.
But if, immediately after a table update (adding a field), two insert_rows calls return success and only the second one actually writes data, doesn't that break the 2-minute theory... or not?

Two minutes could be the time needed to process/update the cache area.
But inserting data into this cache area also forces that update.
So after the first insert, the cache area is ready to receive data.

For me it is very strange. If there is a way to force the cache update, it should be done before the insert call, fixing the problem without losing any data.
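
Purely as an experiment, one way to try forcing such an update would be to re-fetch the table metadata right before the insert; whether that actually touches whatever the backend caches is exactly the open question:

# Hypothetical refresh attempt; nothing documented guarantees this
# invalidates the backend's streaming-side cache.
table = bigquery_client.get_table(table.reference)
errors = bigquery_client.insert_rows(table, payload)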

tswast commented Mar 6, 2019

What does the response of the first call to bigquery_client.insert_rows(table, payload) look like? Note that you need to check for errors in the response manually because insert_rows is designed to support partial failures.
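
A minimal sketch of that check; insert_rows returns a list of per-row error mappings, which is empty when every row was accepted:

errors = bigquery_client.insert_rows(table, payload)
if errors:
    # Each entry carries the failed row's index and its error details.
    for entry in errors:
        print(entry['index'], entry['errors'])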

tswast reopened this Mar 6, 2019

bgbraga commented Mar 11, 2019

@tswast you are right about the first insert_rows call. This is the error:
[{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'phone', 'debugInfo': '', 'message': 'no such field.'}]}]
Note: phone is the field that I'm inserting.

But the second insert_rows call, made immediately afterwards, always works.

        response = self.bigquery_client.insert_rows(table, payload)
        if self.context.table_updated is True: #hack to force insert in case of table update
            response = self.bigquery_client.insert_rows(table, payload)

Apparently the first call is forcing the cache area update or something like that.

So I ran two extra tests:
1) Add a sleep time (5 seconds) before the first insert_rows call.
Purpose: check whether the second insert_rows works only because it is called about 1 second after the first one, and whether that delay is enough to update the table.

        import time
        time.sleep(5)
        response = self.bigquery_client.insert_rows(table, payload)
        if self.context.table_updated is True: #hack to force table refresh
            response = self.bigquery_client.insert_rows(table, payload)

Result: no. The first call keeps failing.

2) Add a sleep time (2 minutes) before the first insert_rows call.
Purpose: validate the 2-minute theory about the cache area.

        import time
        time.sleep(121)
        response = self.bigquery_client.insert_rows(table, payload)
        if self.context.table_updated is True: #hack to force table refresh
            response = self.bigquery_client.insert_rows(table, payload)

Result: no. The first call keeps failing, even when I wait 2 minutes before the first insert_rows call.

Conclusion:
In my tests, any insert_rows immediately after a table update fails; after that failure, everything works again.
Maybe the failure itself forces the cache update. I don't know.

Do you have any guess about that?

In my tests I'm calling Google remotely from my machine, and my code uses these Google API package versions:
google==2.0.1
google-api-core==1.4.0
google-api-python-client==1.7.4
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.9.0
google-cloud-core==0.28.1
google-cloud-datastore==1.7.0
google-cloud-error-reporting==0.30.0
google-cloud-logging==1.7.0
google-cloud-storage==1.12.0

My files (data1.json is the original file, data2.json is the new file with the phone field, and schema.txt is the new schema that I used to update the table):
data1.json.txt
data2.json.txt
schema.txt

bansalrajat commented

Are there any updates on this issue?

kubaceg commented Oct 9, 2019

I have the same problem using the golang library ;/

kalanyuz commented Oct 31, 2019

I have this problem with the Python Apache Beam SDK when trying to update a table schema as well. Note that this doesn't happen with the Java SDK.

tswast commented Nov 12, 2019

This is a known issue with the backend, due to the same problem as deleting and recreating a table. https://issuetracker.google.com/64329577#comment3

A workaround is to update the table by using a load job with the "ALLOW_FIELD_ADDITION" value set in schema_update_options rather than updating the schema and inserting.

from google.cloud import bigquery

# Configure the load job to append the data to the destination table,
# allowing field addition.
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
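
To complete the sketch, the config would then be passed to a load call; load_table_from_json assumes a recent enough client release, and the table ID is illustrative:

# Append the rows (including the new field) via a single load job
# instead of update_table followed by insert_rows.
load_job = bigquery_client.load_table_from_json(
    payload, 'my_dataset.my_table', job_config=job_config
)
load_job.result()  # block until the job finishes; raises on failure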

lordnynex commented

This is closed? How is this an acceptable issue? Data gets sprayed into the void.

tswast commented Apr 22, 2020

@lordnynex This repo only holds issues for bugs in the client library. Since this is a backend issue, it needs to go to the API team, either through support https://cloud.google.com/support/ or by filing an issue here: https://issuetracker.google.com/issues/new?component=187149&template=1162659
