[AIRFLOW-4161] BigQuery to Mysql Operator #5711
Conversation
# Max results is set to 1000 because a BQ job has a hardcoded limit of 1300.
response = cursor.get_tabledata(dataset_id=self.dataset_id,
                                table_id=self.table_id,
                                max_results=1000,
NIT: It would be nice to have this extracted as a constant, or even better provided as a parameter with a default. I imagine you might want to run it in smaller batches - for various reasons.
Very nice - but one small change requested. Having magic constants like that in the code is not a good thing. Especially if BQ increases the limit in the future, we should be able to change the batch size without changing the operator's code - just by providing the parameter.
Adding 'batch_size' as an input parameter.
Nice!
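For illustration, a minimal sketch of what the batched fetch could look like once `batch_size` is a plain parameter. The standalone function name, the default, and the `start_index` pagination are assumptions for this sketch based on the `get_tabledata()` call in the diff above, not the merged code:

```python
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def bq_get_data(dataset_id, table_id, batch_size=1000,
                bigquery_conn_id='bigquery_default'):
    """Yield rows from a BigQuery table one batch at a time (sketch)."""
    cursor = BigQueryHook(bigquery_conn_id=bigquery_conn_id).get_conn().cursor()
    start_index = 0
    while True:
        # Same get_tabledata() call as in the diff, with the magic 1000
        # replaced by the batch_size parameter; start_index pagination is
        # an assumption for this sketch.
        response = cursor.get_tabledata(dataset_id=dataset_id,
                                        table_id=table_id,
                                        max_results=batch_size,
                                        start_index=start_index)
        rows = response.get('rows')
        if not rows:
            break
        # tabledata responses wrap each row as {'f': [{'v': value}, ...]}.
        yield [tuple(cell['v'] for cell in row['f']) for row in rows]
        start_index += batch_size
```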
table_id,
mysql_table,
selected_fields=None,
bigquery_conn_id='bigquery_default',
Is it better to use 'google_cloud_default' as the conn_id here, because we've already unified all the GCP conn_ids?
Reference: #4818
It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations specify. Last night I started working on unification: PolideaInternal#201
mysql_table,
selected_fields=None,
bigquery_conn_id='bigquery_default',
mysql_conn_id='mysql_default',
Is it better to use 'google_cloud_default' as the conn_id here, because we've already unified all the GCP conn_ids?
Reference: #4818
It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations specify. Last night I started working on unification: PolideaInternal#201
Thanks for the input, this is fixed!
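For reference, after the rename the constructor presumably looks roughly like this sketch; the defaults follow the unified conn_id convention from #4818, and the exact merged signature may differ:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class BigQueryToMySqlOperator(BaseOperator):
    """Copies data from a BigQuery table into a MySQL table (signature sketch)."""

    @apply_defaults
    def __init__(self,
                 dataset_table,
                 mysql_table,
                 selected_fields=None,
                 gcp_conn_id='google_cloud_default',  # renamed from bigquery_conn_id
                 mysql_conn_id='mysql_default',
                 batch_size=1000,                     # from the earlier review thread
                 *args, **kwargs):
        super(BigQueryToMySqlOperator, self).__init__(*args, **kwargs)
        # The 'dataset.table' convention is an assumption for this sketch.
        self.dataset_id, self.table_id = dataset_table.split('.')
        self.mysql_table = mysql_table
        self.selected_fields = selected_fields
        self.gcp_conn_id = gcp_conn_id
        self.mysql_conn_id = mysql_conn_id
        self.batch_size = batch_size
```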
What is expected to happen if one of the selected columns is a nested (RECORD) type?
Fixed pylint too-many-arguments error. Changed bq_conn_id to gcp_conn_id.
I think the script should be able to pass it as a JSON type to MySQL. I didn't try this scenario, though.
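One plausible way to handle that, sketched here purely as an assumption (the PR does not implement this): serialize nested values to JSON strings before handing rows to MySQL.

```python
import json

def mysql_safe(value):
    # Nested BigQuery RECORD/REPEATED values arrive as dicts or lists;
    # serialize them to JSON strings so MySQL can store them in a
    # TEXT or JSON column. Scalars pass through unchanged.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value

row = (1, 'alice', {'street': 'Main St', 'zip': '12345'})
print(tuple(mysql_safe(v) for v in row))
# (1, 'alice', '{"street": "Main St", "zip": "12345"}')
```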
if you added a new file:
Codecov Report
@@            Coverage Diff             @@
##           master    #5711      +/-   ##
==========================================
+ Coverage   79.58%   79.84%   +0.25%
==========================================
  Files         494      495       +1
  Lines       31725    31781      +56
==========================================
+ Hits        25248    25375     +127
+ Misses       6477     6406      -71
Continue to review full report at Codecov.
Thanks @adussarps! Really nice addition to the suite of GCP operators.
@potiuk This commit adds a top-level …
Ouch :( It is an edge case, but I wonder if we could do something to prevent it. Mounting the whole source directory / adding it to .dockerignore is for sure a bad idea - performance, especially on Mac, will suffer, and Docker caching will be impacted. We could, however, add a simple test that validates there are no unexpected entries at the top level and fails the build if there are unexpected files/directories. We can simply whitelist those that are expected, and add new files to that whitelist when we intentionally add them. I think that might solve the problem permanently. WDYT?
Could do that. It's probably an exception that won't ever bite us again anyway.
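A guard like that could be as simple as the following sketch; the file location and the whitelist entries are illustrative assumptions, not an actual Airflow test:

```python
import os
import unittest

# Illustrative whitelist; the real one would enumerate every expected
# top-level entry in the repository.
EXPECTED_TOP_LEVEL = {
    '.dockerignore', '.gitignore', 'airflow', 'dags', 'docs',
    'scripts', 'tests', 'setup.py', 'LICENSE', 'NOTICE', 'README.md',
}

class TestRepoLayout(unittest.TestCase):
    def test_no_unexpected_top_level_entries(self):
        # Assumes this file lives one directory below the repo root.
        repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
        unexpected = set(os.listdir(repo_root)) - EXPECTED_TOP_LEVEL
        self.assertFalse(
            unexpected,
            "Unexpected top-level entries: %s" % sorted(unexpected))
```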
(cherry picked from commit 3724c2a)
Make sure you have checked all steps below.
Jira
Description
New connector to copy a BigQuery table into a MySQL table.
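A hedged usage sketch; the import path and parameter names are assumptions based on this PR and may differ from the merged code:

```python
from datetime import datetime

from airflow import DAG
# Import path assumed for this sketch; adjust to the operator's merged location.
from airflow.contrib.operators.bigquery_to_mysql_operator import BigQueryToMySqlOperator

with DAG(dag_id='bq_to_mysql_example',
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    copy_table = BigQueryToMySqlOperator(
        task_id='copy_bq_table',
        dataset_table='my_dataset.my_table',  # source BigQuery table
        mysql_table='my_mysql_table',         # destination MySQL table
        gcp_conn_id='google_cloud_default',
        mysql_conn_id='mysql_default',
        batch_size=1000,
    )
```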
Tests
Tested base behaviour in the BQ test suite.
Commits
Documentation
Code Quality
flake8