[AIRFLOW-4161] BigQuery to Mysql Operator #5711
Conversation
# Max results is set to 1000 because a BQ job has a hardcoded limit of 1300.
response = cursor.get_tabledata(dataset_id=self.dataset_id,
                                table_id=self.table_id,
                                max_results=1000,
NIT: It would be nice to have this extracted as a constant, or even better provided as a parameter with a default. I imagine you might want to run it in smaller batches - for various reasons.
Very nice - but one small change requested. Having magic constants like that in the code is not a good thing. Especially if BQ increases the limit in the future, we should be able to change the batch size without changing the operator's code - just by providing the parameter.
Adding 'batch_size' as an input parameter.
Nice!
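For illustration, a minimal sketch of what the batched fetch could look like once `batch_size` is a plain parameter. The standalone function name, the default, and the `start_index` pagination are assumptions for this sketch based on the `get_tabledata()` call in the diff above, not the merged code:

```python
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def bq_get_data(dataset_id, table_id, batch_size=1000,
                bigquery_conn_id='bigquery_default'):
    """Yield rows from a BigQuery table one batch at a time (sketch)."""
    cursor = BigQueryHook(bigquery_conn_id=bigquery_conn_id).get_conn().cursor()
    start_index = 0
    while True:
        # Same get_tabledata() call as in the diff, with the magic 1000
        # replaced by the batch_size parameter; start_index pagination is
        # an assumption for this sketch.
        response = cursor.get_tabledata(dataset_id=dataset_id,
                                        table_id=table_id,
                                        max_results=batch_size,
                                        start_index=start_index)
        rows = response.get('rows')
        if not rows:
            break
        # tabledata responses wrap each row as {'f': [{'v': value}, ...]}.
        yield [tuple(cell['v'] for cell in row['f']) for row in rows]
        start_index += batch_size
```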
table_id,
mysql_table,
selected_fields=None,
bigquery_conn_id='bigquery_default',
Is it better to use 'google_cloud_default' as the conn_id here, because we've already unified all the GCP conn_ids?
Reference: #4818
It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations specify. Last night I started working on unification: PolideaInternal#201
mysql_table,
selected_fields=None,
bigquery_conn_id='bigquery_default',
mysql_conn_id='mysql_default',
Is it better to use 'google_cloud_default' as the conn_id here, because we've already unified all the GCP conn_ids?
Reference: #4818
It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations specify. Last night I started working on unification: PolideaInternal#201
Thanks for the input, this is fixed!
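For reference, after the rename the constructor presumably looks roughly like this sketch; the defaults follow the unified conn_id convention from #4818, and the exact merged signature may differ:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class BigQueryToMySqlOperator(BaseOperator):
    """Copies data from a BigQuery table into a MySQL table (signature sketch)."""

    @apply_defaults
    def __init__(self,
                 dataset_table,
                 mysql_table,
                 selected_fields=None,
                 gcp_conn_id='google_cloud_default',  # renamed from bigquery_conn_id
                 mysql_conn_id='mysql_default',
                 batch_size=1000,                     # from the earlier review thread
                 *args, **kwargs):
        super(BigQueryToMySqlOperator, self).__init__(*args, **kwargs)
        # The 'dataset.table' convention is an assumption for this sketch.
        self.dataset_id, self.table_id = dataset_table.split('.')
        self.mysql_table = mysql_table
        self.selected_fields = selected_fields
        self.gcp_conn_id = gcp_conn_id
        self.mysql_conn_id = mysql_conn_id
        self.batch_size = batch_size
```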
What is expected to happen if one of the selected columns is a nested (RECORD) type?
Fixed pylint too-many-arguments error. Changed bq_conn_id to gcp_conn_id.
I think the script should be able to pass it as a JSON type to MySQL. I didn't try this scenario, though.
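One plausible way to handle that, sketched here purely as an assumption (the PR does not implement this): serialize nested values to JSON strings before handing rows to MySQL.

```python
import json

def mysql_safe(value):
    # Nested BigQuery RECORD/REPEATED values arrive as dicts or lists;
    # serialize them to JSON strings so MySQL can store them in a
    # TEXT or JSON column. Scalars pass through unchanged.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value

row = (1, 'alice', {'street': 'Main St', 'zip': '12345'})
print(tuple(mysql_safe(v) for v in row))
# (1, 'alice', '{"street": "Main St", "zip": "12345"}')
```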
if you added a new file:
Codecov Report
@@            Coverage Diff             @@
##           master    #5711      +/-   ##
==========================================
+ Coverage   79.58%   79.84%   +0.25%
==========================================
  Files         494      495       +1
  Lines       31725    31781      +56
==========================================
+ Hits        25248    25375     +127
+ Misses       6477     6406      -71
Continue to review full report at Codecov.
Thanks @adussarps! Really nice addition to the suite of GCP operators.
@potiuk This commit adds a top-level …
Ouch :( It is an edge case, but I wonder if we could do something to prevent it. Mounting the whole source directory / adding it to .dockerignore is for sure a bad idea - performance, especially on Mac, will suffer, and Docker caching will be impacted. We could, however, add a simple test that validates there are no unexpected entries at the top level and fails the build if there are unexpected files/directories. We can simply whitelist those that are expected, and add new files to that whitelist when we intentionally add them. I think that might solve the problem permanently. WDYT?
Could do that. It's probably an exception that won't ever bite us again anyway.
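A guard like that could be as simple as the following sketch; the file location and the whitelist entries are illustrative assumptions, not an actual Airflow test:

```python
import os
import unittest

# Illustrative whitelist; the real one would enumerate every expected
# top-level entry in the repository.
EXPECTED_TOP_LEVEL = {
    '.dockerignore', '.gitignore', 'airflow', 'dags', 'docs',
    'scripts', 'tests', 'setup.py', 'LICENSE', 'NOTICE', 'README.md',
}

class TestRepoLayout(unittest.TestCase):
    def test_no_unexpected_top_level_entries(self):
        # Assumes this file lives one directory below the repo root.
        repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
        unexpected = set(os.listdir(repo_root)) - EXPECTED_TOP_LEVEL
        self.assertFalse(
            unexpected,
            "Unexpected top-level entries: %s" % sorted(unexpected))
```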
(cherry picked from commit 3724c2a)
Make sure you have checked all steps below.
Jira
Description
New connector to copy a BigQuery table into a MySQL table.
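A hedged usage sketch; the import path and parameter names are assumptions based on this PR and may differ from the merged code:

```python
from datetime import datetime

from airflow import DAG
# Import path assumed for this sketch; adjust to the operator's merged location.
from airflow.contrib.operators.bigquery_to_mysql_operator import BigQueryToMySqlOperator

with DAG(dag_id='bq_to_mysql_example',
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    copy_table = BigQueryToMySqlOperator(
        task_id='copy_bq_table',
        dataset_table='my_dataset.my_table',  # source BigQuery table
        mysql_table='my_mysql_table',         # destination MySQL table
        gcp_conn_id='google_cloud_default',
        mysql_conn_id='mysql_default',
        batch_size=1000,
    )
```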
Tests
Tested base behaviour in the BQ test suite.
Commits
Documentation
Code Quality
flake8