
Conversation

@danielvdende (Contributor) commented Feb 9, 2018

Hi all,
This PR is to add a Spark JDBC hook/operator pair. I've personally started using it to replace
Sqoop, and have found it much more stable and reliable. Thought it would be nice to make it
available to the community. Curious to hear your thoughts.

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Add the Spark JDBC hook/operator pair. This pair extends
    the existing SparkSubmitOperator. Also includes the
    spark_jdbc_script, which is the actual Spark job that is run
    (a usage sketch follows below).
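
For illustration, a minimal sketch of how the pair might be wired into a DAG. The connection IDs, table names, and the DAG itself are hypothetical placeholders; the cmd_type/jdbc_* parameter names are assumed to match the operator added in this PR.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_jdbc_operator import SparkJDBCOperator

dag = DAG('spark_jdbc_example',
          start_date=datetime(2018, 2, 1),
          schedule_interval=None)

# Read a JDBC table into a Hive metastore table via Spark.
jdbc_to_spark = SparkJDBCOperator(
    task_id='jdbc_to_spark',
    cmd_type='jdbc_to_spark',
    jdbc_conn_id='my_jdbc_connection',   # hypothetical Airflow connection
    jdbc_table='public.my_table',        # hypothetical source table
    jdbc_driver='org.postgresql.Driver',
    metastore_table='my_hive_table',     # hypothetical target table
    dag=dag,
)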

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason: Tests included for both the spark_jdbc_hook and the spark_jdbc_operator.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":

    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

@codecov-io commented Feb 9, 2018

Codecov Report

Merging #3021 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3021      +/-   ##
==========================================
- Coverage   73.18%   73.17%   -0.01%     
==========================================
  Files         177      177              
  Lines       12412    12412              
==========================================
- Hits         9084     9083       -1     
- Misses       3328     3329       +1
Impacted Files       Coverage Δ
airflow/models.py    87.03% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44551e2...13b5c56.

@Fokko (Contributor) commented Feb 9, 2018

👏

@Fokko (Contributor) left a comment:

LGTM in the greater scheme of things; some small changes.

Contributor:

We should generalise the operator and not make it specific to PostgreSQL.

Contributor:

Can we condense this a bit?

reader = spark.read
reader = set_common_options(reader, url, jdbc_table, user, password, driver)

to

reader = set_common_options(spark.read, url, jdbc_table, user, password, driver)

Mutable variables make me cry in general, but I don't see a nicer way to do this in Python.
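
To make the suggestion concrete, a sketch assuming set_common_options simply chains .option() calls on a DataFrameReader/DataFrameWriter, as the PR's spark_jdbc_script appears to do; since each .option() returns the builder itself, the helper can take spark.read directly and the two-line form collapses to one:

def set_common_options(spark_source, url, jdbc_table, user, password, driver):
    # Each .option() returns the reader/writer itself, so the whole
    # chain can be built and returned in a single expression.
    return (spark_source
            .format('jdbc')
            .option('url', url)
            .option('dbtable', jdbc_table)
            .option('user', user)
            .option('password', password)
            .option('driver', driver))

# Assuming a SparkSession named `spark` and the JDBC parameters in scope:
reader = set_common_options(spark.read, url, jdbc_table, user, password, driver)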

Contributor:

Maybe change this. I guess this is legacy from Sqoop. This part has always made me cry, since import and export don't really describe the direction of the data. Now that we have the opportunity to fix this, my suggestion would be to change these to spark_to_sql and sql_to_spark to describe the direction of the flow 👍

Contributor (Author):

I've renamed them now to spark_to_jdbc and jdbc_to_spark to avoid any confusion with spark-sql.

Contributor:

Perfect, thanks! 👍

Contributor:

For now these tests are sufficient for me. Maybe in the future we can try to actually perform a Spark-to-Postgres job and assert the result :)

Contributor (Author):

That would be a nice test indeed :)

Contributor:

Why don't you add one, @danielvdende? Postgres is already available; if you launch a stand-alone Spark job, it should be possible.

Contributor:

This would also require the Spark binary, which is quite heavy, but I really like the idea.

Contributor (Author):

👍 I'll try to get it working. Just to check I've understood you correctly: you mean using the Postgres instance that is started as part of the Travis CI config? (In that case, we could even consider doing the same for MySQL.)
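
For what it's worth, a hypothetical sketch of what such an integration test could look like, assuming a local Spark session, the Postgres JDBC driver on the classpath, and airflow/airflow credentials as in the Travis config; the table name is a placeholder:

import unittest

from pyspark.sql import SparkSession


class TestSparkToPostgres(unittest.TestCase):
    def test_spark_to_jdbc_roundtrip(self):
        spark = (SparkSession.builder
                 .master('local[1]')
                 .appName('spark-jdbc-integration-test')
                 .getOrCreate())
        df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
        jdbc_options = {
            'url': 'jdbc:postgresql://localhost:5432/airflow',  # Travis Postgres
            'dbtable': 'spark_jdbc_test',                        # hypothetical table
            'user': 'airflow',
            'password': 'airflow',
            'driver': 'org.postgresql.Driver',
        }
        # Write the frame out over JDBC, read it back, and compare counts.
        df.write.format('jdbc').options(**jdbc_options).mode('overwrite').save()
        read_back = spark.read.format('jdbc').options(**jdbc_options).load()
        self.assertEqual(read_back.count(), df.count())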

Contributor:

Try it, but don't overcomplicate it. It's more of an integration test, and we should only run those once, not for every Travis setup. So it might be more for the future. @Fokko, can we merge as is?

Contributor:

Yes, let me merge it. I've checked it and it looks good to me.

I really like this operator. This PR introduces a new layer of operators: instead of having low-level operations that just do one thing, we leverage the existing SparkSubmitOperator to add a new layer of logic.
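
To make that layering concrete, a simplified, hypothetical sketch (not the PR's exact code; the script's flag names here are assumptions) of how a JDBC operator can build on SparkSubmitOperator by translating its own arguments into a spark-submit invocation of the bundled script:

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator


class ExampleSparkJDBCOperator(SparkSubmitOperator):
    # Hypothetical name, to avoid clashing with the PR's SparkJDBCOperator.
    def __init__(self, jdbc_table, cmd_type='spark_to_jdbc', *args, **kwargs):
        # Reuse the parent's spark-submit machinery: point it at the bundled
        # Spark job and pass the JDBC options through as application arguments.
        kwargs['application'] = 'spark_jdbc_script.py'
        kwargs['application_args'] = ['-cmdType', cmd_type,
                                      '-jdbcTable', jdbc_table]
        super(ExampleSparkJDBCOperator, self).__init__(*args, **kwargs)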

Add the Spark JDBC hook/operator pair. This pair extends
the existing SparkSubmitOperator. Also includes the
spark_jdbc_script, which is the actual Spark job that is run.
@danielvdende force-pushed the AIRFLOW-2085-add-spark-jdbc-operator branch from b783dee to 13b5c56 on February 9, 2018 09:24
@asfgit closed this in d5b5ae0 on Feb 10, 2018
aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request on Jan 3, 2019
Add the Spark JDBC hook/operator pair. This pair extends
the existing SparkSubmitOperator. Also includes the
spark_jdbc_script, which is the actual Spark job that is run.

Closes apache#3021 from danielvdende/AIRFLOW-2085-add-spark-jdbc-operator