Bigquery Stream Insert Missing Data #15

Closed
zmm021 opened this issue Aug 27, 2019 · 10 comments
Labels
api: bigquery Issues related to the googleapis/java-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@zmm021

zmm021 commented Aug 27, 2019

I have been testing data inserts into Google BigQuery and have seen rows go missing. The details are below.
Testing scenario: Insert data into BigQuery using the latest (v1.88.0) streaming API.
Code:

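// Streams the given rows into tableId and returns any per-row insert errors,
// keyed by the row's index in the request (empty map when none are reported).
public Map<Long, List<BigQueryError>> performWriteRequest(TableId tableId,
                                                          List<InsertAllRequest.RowToInsert> rows) {
    // createInsertAllRequest is the reporter's own helper (not shown in the issue).
    InsertAllRequest request = createInsertAllRequest(tableId, rows);
    InsertAllResponse writeResponse = bigQuery.insertAll(request);
    if (writeResponse.hasErrors()) {
        System.out.println("Error inserting into BQ");
        return writeResponse.getInsertErrors();
    } else {
        logger.debug("table insertion completed with no reported errors");
        return new HashMap<>();
    }
}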

Testing information:

  • Each round, the program calls BigQuery 10K-20K times.
  • Each BigQuery.insertAll request inserts only 1 row (some rounds insert 100 rows per request) and does not set a row ID for deduplication, so no rows should be filtered out.
  • Each round takes about 5 minutes and the data is small, so quota limits should not be a factor.
  • 30-40 rounds were tested in total.
  • Some rounds use exactly the same data; others use different data.
  • I never dropped and recreated a table; every round used a new table with a different name.
  • In the final round of the test, the program also pushed the data to GCS and dumped it to a local file for comparison.
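
For the comparison step in the last bullet, here is a minimal sketch of how the local dump could be reconciled against what ends up in the table. It assumes every row carries a unique key and that the keys actually present in BigQuery have already been exported to a list by some other means. The names `RowReconciler` and `findMissing` are hypothetical and are not part of the BigQuery client library.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the BigQuery client library.
final class RowReconciler {
    private RowReconciler() {}

    /**
     * Returns the keys that were sent but are absent from the table,
     * in the order they were sent. Assumes every row has a unique key.
     */
    static List<String> findMissing(List<String> sentKeys, List<String> keysInTable) {
        Set<String> present = new HashSet<>(keysInTable);
        List<String> missing = new ArrayList<>();
        for (String key : sentKeys) {
            if (!present.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

Calling `RowReconciler.findMissing(sent, found)` with the keys from the local file and the keys read back from the table would name the specific rows that were lost, rather than only the difference in counts.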

Observations:

  • In about 5-10 rounds, data went missing. There is no pattern in when or which rows go missing; it looks random.
  • BigQuery.insertAll returned no errors; from the client side, every request executed successfully.
  • The program has a retry policy, but it was never triggered because no error was returned.
  • Important: For new tables, the estimated number of rows in the BigQuery streaming buffer equals the number of rows I inserted, and also equals the number of rows in the file pushed to GCS (final round of testing). But SELECT COUNT(*) returns fewer rows (about 1 missing per 10-20K requests). After a while, the streaming buffer info disappears, and the row count shown in the table info matches SELECT COUNT(*), which is smaller than it should be.
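
To illustrate why a retry policy of this kind stays idle, here is a minimal sketch of a retry loop that is driven only by reported errors. It is not the program's actual retry code: `RetryOnReportedErrors` and `insertWithRetry` are hypothetical names, the `Supplier` stands in for a call such as `performWriteRequest(...)`, and error details are simplified to plain strings instead of `BigQueryError`.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical retry wrapper for illustration. The Supplier stands in for an
// insert call that returns per-row errors keyed by row index.
final class RetryOnReportedErrors {
    private RetryOnReportedErrors() {}

    /**
     * Calls the insert until it reports no errors or maxAttempts is reached.
     * Returns the number of attempts made. A row that is dropped silently
     * (empty error map) ends the loop after the first attempt.
     */
    static int insertWithRetry(Supplier<Map<Long, List<String>>> insert, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            Map<Long, List<String>> errors = insert.get();
            if (errors.isEmpty()) {
                return attempts; // nothing reported, so nothing to retry
            }
        }
        return attempts;
    }
}
```

Under these assumptions, a request whose response carries no errors is treated as final after one attempt, so a row lost after a successful response cannot be recovered by retrying alone.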

I understand that the streaming buffer only provides an estimate and should not be relied on, but after every push the number exactly matched the number of rows pushed. This may suggest that BigQuery received the data but later dropped it for some unknown reason. This might be a bug.

This issue might be related to #7433, #876, googleapis/google-cloud-java#3344, and #3822.

@zmm021
Author

zmm021 commented Sep 4, 2019 via email

@JD-V

JD-V commented Sep 11, 2019

I am also facing this issue. Please keep us posted on any updates.

@janrockdev

+1

@pmakani pmakani transferred this issue from googleapis/google-cloud-java Dec 10, 2019
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Dec 10, 2019
@pmakani pmakani added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Dec 10, 2019
@stephaniewang526 stephaniewang526 self-assigned this Dec 12, 2019
@cumhuronat

I'm facing the same bug with the Node.js library.

@stephaniewang526
Contributor

Hi, thank you for your question.

We are not aware of any known issues for this case. There are known issues if the table was recently deleted and recreated.

Could you provide us with a code example that can allow us to reproduce the issue? We will investigate further based on that.

Thanks,
Stephanie

@stephaniewang526 stephaniewang526 pinned this issue Dec 27, 2019
@stephaniewang526 stephaniewang526 unpinned this issue Dec 27, 2019
@google-cloud-label-sync google-cloud-label-sync bot added the api: bigquery Issues related to the googleapis/java-bigquery API. label Jan 29, 2020
@stephaniewang526
Contributor

Closing due to no activity.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 6, 2020
@anicoll

anicoll commented Apr 30, 2020

> Hi, thank you for your question.
>
> We are not aware of any known issues for this case. There are known issues if the table was recently deleted and recreated.
>
> Could you provide us with a code example that can allow us to reproduce the issue? We will investigate further based on that.
>
> Thanks,
> Stephanie

@stephaniewang526 Can you expand on the issue around newly deleted/created tables?
I am having this exact issue, and it is reproducible 9 times out of 10.

I can provide code examples if needed.

Golang, using v1.6.0 or v1.4.0.
This never used to be an issue for us but has recently become a problem.

@larssn

larssn commented Apr 15, 2021

Also seeing this occasionally, where a row is lost without any apparent errors. It's rare but not negligible, and too sporadic to reproduce.

@jnt0009

jnt0009 commented Jul 20, 2021

I am running into this issue as well: occasionally the first row or two do not make it into the table.
Node.js: 14.16.1
@google-cloud/bigquery: 5.6.0
No errors are returned.

@stephaniewang526
Contributor

@jnt0009 please open an issue in the BigQuery Node.js repo.


10 participants