[BigQuery] Streaming insert drops records? #3344
On https://cloud.google.com/bigquery/troubleshooting-errors#streaming it says: "Certain operations in BigQuery do not interact with the streaming buffer, such as table copy jobs and API methods like tabledata.list". So I guess this includes `Table.list(...)`.
Hi @martinstuder, Thanks for submitting this issue! The fact that certain BigQuery methods do not interact with the streaming buffer is a limitation of the API itself, so the problem would persist using the Web UI or HTTP REST requests. Indeed, an internal bug was just filed by someone using the Python library who ran into the same problem. The internal bug is live and has traction; it definitely seems to be a problem that the insert request returns without error even though the operation did not complete server-side.
@martinstuder Would you mind checking the response status code in addition to checking for the presence of errors? The request might fail before even reaching the streaming system, for example because a quota was exceeded. In that case a 503 would be returned instead of a 200.
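For context, a minimal sketch (not from this thread) of what checking both layers might look like with the google-cloud-bigquery Java client: per-row errors come back in the `InsertAllResponse`, while request-level failures such as a 503 surface as a `BigQueryException`. Dataset, table, and column names are placeholders.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class StreamingInsertCheck {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    Map<String, Object> row = new HashMap<>();
    row.put("field", "value"); // placeholder column and value

    InsertAllRequest request =
        InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table")) // placeholder names
            .addRow(row)
            .build();

    try {
      InsertAllResponse response = bigquery.insertAll(request);
      // Per-row problems (schema mismatch, invalid values) are reported here, not thrown.
      if (response.hasErrors()) {
        response.getInsertErrors()
            .forEach((index, errors) -> System.err.println("Row " + index + ": " + errors));
      }
    } catch (BigQueryException e) {
      // Request-level failures (quota exceeded, 503, missing permissions) surface here.
      System.err.println("HTTP status " + e.getCode() + ": " + e.getMessage());
    }
  }
}
```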
@andreamlin Thank you for your reply. |
@martinstuder Fair point :) I'll just keep this issue updated while they're working on it. |
Are there any updates on this issue? We have a relatively complex Dataflow pipeline where we are streaming data from Pub/Sub and writing to GCS, Azure, and BigQuery. We are seeing a small number of dropped rows in BigQuery but no missing data in the other destinations. All the writers use the same Pub/Sub code, and we are sure the issue is in the BigQuery writer. The insertAll response does not contain any errors, but we cannot say for certain whether all the rows were inserted, because there is no easy way to tell without running a separate query, and there are timing/buffer issues related to that.
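In case it helps anyone, here is a rough sketch of the kind of separate verification query mentioned above, assuming a hypothetical `batch_id` column that identifies what a pipeline run believes it inserted. The names and the expected count are illustrative, and streaming-buffer lag can still make a freshly inserted batch look incomplete for a short while.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import com.google.cloud.bigquery.TableResult;

public class VerifyBatch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Count the rows of one logical batch; queries do read the streaming buffer,
    // but rows may take some time to become visible after insertAll returns.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder(
                "SELECT COUNT(*) AS n FROM `my_project.my_dataset.my_table` WHERE batch_id = @batchId")
            .addNamedParameter("batchId", QueryParameterValue.string("2019-08-01-0001"))
            .build();

    TableResult result = bigquery.query(query);
    long found = result.iterateAll().iterator().next().get("n").getLongValue();

    long expected = 10_000; // whatever the pipeline recorded as sent for this batch
    if (found != expected) {
      System.err.printf("Expected %d rows for this batch, found %d%n", expected, found);
    }
  }
}
```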
I had a similar experience: my service account did not have enough permissions, but InsertAll ran smoothly and swallowed any exceptions, while the REST API returned errors like this:
@vitamon
Just checked on version "google-cloud-bigquery:1.66.0" -- yes, it throws the exception now. |
Is anyone experiencing this issue now? I am using version 1.88.0 for some testing and have seen missing records a few times; roughly 1-2 out of every 10K API calls result in missing records, and the API does not return any error.
I'm facing an issue with BigQuery streaming inserts (`Table.insert(...)`, specifically `insert(Iterable<InsertAllRequest.RowToInsert> rows, boolean skipInvalidRows, boolean ignoreUnknownValues)` with `skipInvalidRows = false` and `ignoreUnknownValues = false`) where (sometimes) records don't seem to be available after one or more insert requests. The `InsertAllRequest`s complete successfully, i.e. no exceptions are thrown and no errors are reported (`InsertAllResponse.hasErrors` returns `false`). I checked availability of the streamed data in the BigQuery Web UI and using the `Table.list(...)` API. According to https://cloud.google.com/bigquery/streaming-data-into-bigquery I would expect streamed data to be available for query a few seconds after insertion. In cases where some records were missing after the initial check, I tried again after 10s, 30s, 60s, 1h, ... but to no avail. So it looks like the records have been dropped for some reason.
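For reference, a small self-contained sketch of the call pattern described above, with placeholder dataset, table, and column names; this only illustrates the reported usage and is not a confirmed reproduction.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InsertPattern {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(TableId.of("my_dataset", "my_table")); // placeholder names

    List<RowToInsert> rows = new ArrayList<>();
    for (int i = 0; i < 500; i++) {
      Map<String, Object> content = new HashMap<>();
      content.put("field", "value-" + i); // placeholder column
      // An explicit insertId per row lets BigQuery deduplicate retries on a best-effort basis.
      rows.add(RowToInsert.of("row-" + i, content));
    }

    // The call pattern from the issue: skipInvalidRows = false, ignoreUnknownValues = false.
    InsertAllResponse response = table.insert(rows, false, false);

    // The condition reported here: hasErrors() is false, yet some rows never become queryable.
    System.out.println("hasErrors = " + response.hasErrors());
    response.getInsertErrors()
        .forEach((index, errors) -> System.err.println("Row " + index + ": " + errors));
  }
}
```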