
Destination Bigquery: rewrite connector to use Bulk upload instead of current one #5296

Closed
etsybaev opened this issue Aug 10, 2021 · 2 comments · Fixed by #5614

Comments

@etsybaev
Contributor

etsybaev commented Aug 10, 2021

Current Behavior

Currently, we create all clients at the very beginning of the connector's lifecycle. However, the Google SDK BigQuery client may fail (with a 404 response code) if we create it and then wait for a long time (e.g. 12 hours) while the source connector is still reading the data to migrate.
Bulk loading (https://cloud.google.com/bigquery/docs/batch-loading-data) would allow us to stage the data on GCS in its entirety before loading it into BigQuery. This is already implemented for Snowflake, so we need something similar for BigQuery.
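For reference, a minimal sketch of what the batch-loading step could look like with the google-cloud-bigquery Java client. The dataset, table, and bucket names are hypothetical placeholders, not the connector's actual config:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BigQueryBulkLoadSketch {

  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical identifiers; the real connector would derive these from its config.
    TableId tableId = TableId.of("my_dataset", "my_table");
    String gcsUri = "gs://my-staging-bucket/staged/part-0.csv";

    // A load job reads the staged file directly from GCS, so no long-lived
    // client connection has to stay open while the source is still reading.
    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(tableId, gcsUri)
            .setFormatOptions(FormatOptions.csv())
            .build();

    Job job = bigquery.create(JobInfo.of(loadConfig));
    job = job.waitFor();
    if (job == null || job.getStatus().getError() != null) {
      throw new RuntimeException("Load job failed: "
          + (job == null ? "job no longer exists" : job.getStatus().getError()));
    }
  }
}
```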

Expected Behavior

The connector should use the bulk upload mechanism.

Logs

See the attached archive (HelpFiles.zip).

ToDo as part of this ticket

1. Update the BigQuery connector to use bulk loading: collect the data on GCS, then on close move it into BigQuery with a bulk load job (see the staging sketch after this list).

Nice to have:

2. Use the "destination bigquery creds" entry from LastPass to get a secret for testing.
3. Run the performance test from the attached archive to make sure the application no longer returns 404 (create a new destination connector -> wait for 12+ hours -> try to write some messages and stop the container -> check that the messages appear in the cloud). For more details, see the last comments on "Destination Bigquery returns 404 when uploading data + resumable" #3549.
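A minimal sketch of the staging upload in step 1, assuming the records have already been buffered to a local file while the source was reading. It uses the google-cloud-storage client; the bucket and object names are hypothetical:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class GcsStagingSketch {

  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Hypothetical paths; the connector would buffer records locally while the
    // source reads, then push the file to GCS on close.
    Path localBuffer = Path.of("/tmp/records-part-0.csv");
    BlobId blobId = BlobId.of("my-staging-bucket", "staged/part-0.csv");
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId).setContentType("text/csv").build();

    // Upload the staged bytes; the BigQuery load job picks the object up afterwards.
    storage.create(blobInfo, Files.readAllBytes(localBuffer));
  }
}
```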

Notes:
Some scoping is required:
We already have a destination-gcs connector; how can it be re-used here?
The Snowflake destination's implementation may also be worth checking before starting work on this one.

@etsybaev added the type/bug ("Something isn't working") label Aug 10, 2021
@etsybaev changed the title from "Destination Bigquery: rewrite connector to use Bulk loading upload instead of current one" to "Destination Bigquery: rewrite connector to use Bulk upload instead of current one" Aug 10, 2021
@etsybaev
Contributor Author

Hi @sherifnada. I've created this follow-up ticket (from #3549) as you proposed.
Could you please provide a bit more information about your expectations and acceptance criteria?
Just to make sure I understood you correctly: it would use GCS (as in Snowflake), so the customer would also create some storage and provide us with credentials for it, right?

I've also found this ticket (#4745) for Snowflake and GCS that hasn't been resolved for a while. Wouldn't we run into the same issue here?
Or maybe I'm just missing something. Many thanks in advance!

@sherifnada
Contributor

@etsybaev this should work pretty much the same way as it does in Snowflake. The idea is to stage the data on GCS first, changing whatever is needed about the connector's input to make that happen.

However, contrary to what the title of this ticket might imply, we should not rewrite the connector entirely: users should still be able to use the current INSERT mechanism (it's easier for PoCs) where possible.
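One way to read that requirement, sketched below with a hypothetical "loading_method" config field. The field name and uploader classes are illustrative stand-ins, not the connector's actual spec:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class UploaderFactorySketch {

  // Minimal stand-ins for the two write paths; real implementations would wrap
  // the streaming INSERT client and the GCS-staging + load-job flow respectively.
  interface Uploader {
    void write(String record);
    void close();
  }

  static final class StandardInsertUploader implements Uploader {
    public void write(String record) { /* existing INSERT-based path */ }
    public void close() { }
  }

  static final class GcsStagingUploader implements Uploader {
    public void write(String record) { /* buffer locally, push to GCS */ }
    public void close() { /* kick off the BigQuery bulk load job */ }
  }

  // "loading_method" is an assumed config field for illustration only.
  static Uploader create(JsonNode config) {
    String method = config.has("loading_method")
        ? config.get("loading_method").asText()
        : "Standard";
    return "GCS Staging".equals(method)
        ? new GcsStagingUploader()
        : new StandardInsertUploader();
  }

  public static void main(String[] args) throws Exception {
    JsonNode config = new ObjectMapper().readTree("{\"loading_method\":\"GCS Staging\"}");
    Uploader uploader = create(config);
    uploader.write("{\"id\": 1}");
    uploader.close();
  }
}
```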
