[BigQuery] Streaming insert drops records? #3344

Closed
martinstuder opened this issue Jun 5, 2018 · 10 comments
Labels
api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. status: blocked Resolving the issue is dependent on other work. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@martinstuder

I'm facing an issue with BigQuery streaming inserts (Table.insert(...), specifically insert(Iterable<InsertAllRequest.RowToInsert> rows, boolean skipInvalidRows, boolean ignoreUnknownValues) with skipInvalidRows = false and ignoreUnknownValues = false) where (sometimes) records don't seem to be available after one or more insert requests. The InsertAllRequests complete successfully, i.e. no exceptions are thrown and no errors are reported (InsertAllResponse.hasErrors returns false). I checked availability of streamed data in the BigQuery Web UI and using the Table.list(...) API. According to https://cloud.google.com/bigquery/streaming-data-into-bigquery I would expect streamed data to be available for query a few seconds after insertion. In cases where some records were missing after the initial check, I tried again after 10s, 30s, 60s, 1h, ... but to no avail. So it looks like the records have been dropped for some reason.
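For reference, a minimal sketch of the call pattern described above, with hypothetical project, dataset, table, and field names (this is an illustration of the reported usage, not a reproduction of the bug):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class StreamingInsertSketch {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical dataset/table names for illustration.
    Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));

    List<InsertAllRequest.RowToInsert> rows = Collections.singletonList(
        InsertAllRequest.RowToInsert.of(Map.of("name", "alpha", "value", 1)));

    // skipInvalidRows = false, ignoreUnknownValues = false, as in the report.
    InsertAllResponse response = table.insert(rows, false, false);
    if (response.hasErrors()) {
      // Per the report, this branch is never taken even when rows go missing.
      response.getInsertErrors().forEach((index, errors) ->
          System.err.println("Row " + index + ": " + errors));
    }
  }
}
```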

@martinstuder (Author)

On https://cloud.google.com/bigquery/troubleshooting-errors#streaming it says: "Certain operations in BigQuery do not interact with the streaming buffer, such as table copy jobs and API methods like tabledata.list". So I guess this includes Table.list(...). Is this also true for the BigQuery Web UI? Is there any other means in the API of paging through a table that does consider the streaming buffer?
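One workaround consistent with that documentation: a SELECT query does read the streaming buffer, unlike `tabledata.list`, so reading back via a query job rather than `Table.list(...)` should see freshly streamed rows. A minimal sketch, with a hypothetical table name:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryStreamingBufferSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // A SELECT query should include rows still in the streaming buffer,
    // whereas tabledata.list / Table.list(...) may not.
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
        "SELECT COUNT(*) AS n FROM `my_project.my_dataset.my_table`").build();
    TableResult result = bigquery.query(config);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println("row count = " + row.get("n").getLongValue());
    }
  }
}
```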

@andreamlin andreamlin added api: bigquery Issues related to the BigQuery API. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jun 5, 2018
@andreamlin (Contributor) commented Jun 5, 2018

Hi @martinstuder,

Thanks for submitting this issue!

The fact that certain BigQuery methods do not read the streaming buffer is a limitation of the API itself, so the problem would persist in the Web UI or with raw HTTP REST requests. Indeed, an internal bug was just filed by someone using the Python library who ran into the same problem.

The internal bug is live and has traction; it definitely seems to be a problem that the Insert request returns without error even though the operation did not complete server-side.

@andreamlin andreamlin added type: question Request for information or clarification. Not an issue. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jun 5, 2018
@andreamlin andreamlin self-assigned this Jun 5, 2018
@andreamlin (Contributor)

@martinstuder Would you mind checking the response status code in addition to checking for the presence of errors? The request might fail before it even reaches the streaming system, for example because a quota was exceeded. In that case a 503 would be returned instead of a 200.

@martinstuder (Author)

@andreamlin Thank you for your reply. InsertAllResponse does not expose a status code, and as such I would expect any non-200 status to be turned into an appropriate BigQueryException when calling table.insert(Iterable<InsertAllRequest.RowToInsert> rows, boolean skipInvalidRows, boolean ignoreUnknownValues).
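To make the two failure channels being discussed concrete, here is a hedged sketch (the `table` and `rows` objects are assumed to come from elsewhere): request-level failures such as quota or permission errors should surface as a BigQueryException whose getCode() carries the HTTP status, while row-level failures land in InsertAllResponse.getInsertErrors() without any exception being thrown.

```java
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.Table;

import java.util.List;

public class InsertErrorHandlingSketch {
  // Sketch only: 'table' and 'rows' are assumed to be built by the caller.
  static void insertWithChecks(Table table, List<InsertAllRequest.RowToInsert> rows) {
    try {
      InsertAllResponse response = table.insert(rows, false, false);
      if (response.hasErrors()) {
        // Row-level failures (e.g. invalid rows) land here, not in an exception.
        response.getInsertErrors().forEach((index, errors) ->
            System.err.println("row " + index + " failed: " + errors));
      }
    } catch (BigQueryException e) {
      // Request-level failures (quota, permissions, 5xx) surface as exceptions;
      // getCode() carries the HTTP status.
      System.err.println("insert failed with HTTP " + e.getCode() + ": " + e.getMessage());
    }
  }
}
```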

@andreamlin (Contributor)

@martinstuder Fair point :) I'll just keep this issue updated while they're working on it.

@yihanzhen yihanzhen added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. status: blocked Resolving the issue is dependent on other work. and removed type: question Request for information or clarification. Not an issue. labels Jun 27, 2018
@JustinBeckwith JustinBeckwith added triage me I really want to be triaged. 🚨 This issue needs some love. labels Jun 27, 2018
@yihanzhen yihanzhen added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Jun 27, 2018
@taylor-rolison

Are there any updates on this issue?

We have a relatively complex Dataflow pipeline in which we stream data from Pub/Sub and write to GCS, Azure, and BigQuery. We are seeing a small number of dropped rows in BigQuery but no missing data in the other destinations. All the writers use the same Pub/Sub code, and we are sure the issue is in the BigQuery writer. The insertAll response does not contain any errors, but we cannot say for certain whether all the rows were inserted, because there is no easy way to tell without doing a separate query, and there are timing/buffer issues related to that.

@JustinBeckwith JustinBeckwith added the 🚨 This issue needs some love. label Dec 2, 2018
@vitamon
Copy link

vitamon commented Dec 13, 2018

I had a similar experience: my service account did not have sufficient permissions, but InsertAll ran smoothly and swallowed any exceptions, while the REST API returned errors like this:

error { domain: "cloud.helix.ErrorDomain" code: "ACCESS_DENIED" argument: "Table" argument: "mytable.usage_events" 
argument: "The user my-service-account@iam.gserviceaccount.com does not have bigquery.tables.updateData permission for table myproject:mytable.usage_events

@andreamlin andreamlin removed their assignment Feb 7, 2019
@JesseLovelace JesseLovelace removed their assignment Mar 14, 2019
@sduskis sduskis assigned pmakani and unassigned sduskis Apr 9, 2019
@pmakani

pmakani commented Apr 26, 2019

I had similar experience, my service account had not enough permissions, but InsertAll run smoothly and swallowed any exceptions. But rest api returned errors like this:

error { domain: "cloud.helix.ErrorDomain" code: "ACCESS_DENIED" argument: "Table" argument: "mytable.usage_events" 
argument: "The user my-service-account@iam.gserviceaccount.com does not have bigquery.tables.updateData permission for table myproject:mytable.usage_events

@vitamon
I tried to reproduce this, but it is working fine for me. Below is the full exception log. Can you please try with the latest version?

Exception in thread "main" com.google.cloud.bigquery.BigQueryException: Access Denied: Table bigquery-3344:3344_dataset.3344_table: The user bigquery-viewer@bigquery-3344.iam.gserviceaccount.com does not have bigquery.tables.updateData permission for table bigquery-3344:3344_dataset.3344_table.
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:100)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:309)
	at com.google.cloud.bigquery.BigQueryImpl.insertAll(BigQueryImpl.java:599)
	at com.google.cloud.bigquery.BigqueryStreamingInsert.insertData(BigqueryStreamingInsert.java:48)
	at com.google.cloud.bigquery.BigqueryStreamingInsert.main(BigqueryStreamingInsert.java:16)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table bigquery-3344:3344_dataset.3344_table: The user bigquery-viewer@bigquery-3344.iam.gserviceaccount.com does not have bigquery.tables.updateData permission for table bigquery-3344:3344_dataset.3344_table.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table bigquery-3344:3344_dataset.3344_table: The user bigquery-viewer@bigquery-3344.iam.gserviceaccount.com does not have bigquery.tables.updateData permission for table bigquery-3344:3344_dataset.3344_table.",
  "status" : "PERMISSION_DENIED"
}
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:401)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1132)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:307)
	... 3 more

@vitamon

vitamon commented Apr 26, 2019

Just checked on version "google-cloud-bigquery:1.66.0" -- yes, it throws the exception now.

@sduskis sduskis closed this as completed Apr 30, 2019
@zmm021

zmm021 commented Aug 26, 2019

Is anyone still experiencing this issue? I am using version 1.88.0 for some testing and have seen records go missing a few times: roughly 1–2 out of every 10K API calls drop records, and the API did not return any error.
PS: it happens when I insert the data in batches (100 rows per request); my recent 11 runs of 10,000 single-row requests worked fine. My test data is the same in both cases, and I am not using a row ID. So maybe it is not an API-side problem but an issue in the server-side streaming buffer?
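On the missing row ID: attaching a client-generated insertId gives BigQuery a best-effort de-duplication key for retried streaming inserts and makes reconciling sent rows against arrived rows easier. A minimal sketch of building such a batch (the record maps are assumed to come from the caller):

```java
import com.google.cloud.bigquery.InsertAllRequest;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class RowsWithInsertIdsSketch {
  // Build a batch in which every row carries a unique insertId, which
  // BigQuery uses for best-effort de-duplication of retried inserts.
  static List<InsertAllRequest.RowToInsert> buildBatch(List<Map<String, Object>> records) {
    List<InsertAllRequest.RowToInsert> rows = new ArrayList<>();
    for (Map<String, Object> record : records) {
      rows.add(InsertAllRequest.RowToInsert.of(UUID.randomUUID().toString(), record));
    }
    return rows;
  }
}
```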

10 participants