
Conversation

@danielvdende (Contributor) commented Feb 9, 2018

Hi all,
This PR is to add a Spark JDBC hook/operator pair. I've personally started using it to replace
Sqoop, and have found it much more stable and reliable. Thought it would be nice to make it
available to the community. Curious to hear your thoughts.

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Add the Spark JDBC hook/operator pair. This pair extends
    the existing SparkSubmitOperator. Also includes the
    spark_jdbc_script, which is the actual Spark job that is run
    (a usage sketch follows below).
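
For illustration, a minimal sketch of how the pair might be wired into a DAG. The connection IDs, table names, and the DAG itself are hypothetical placeholders; the cmd_type/jdbc_* parameter names are assumed to match the operator added in this PR.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_jdbc_operator import SparkJDBCOperator

dag = DAG('spark_jdbc_example',
          start_date=datetime(2018, 2, 1),
          schedule_interval=None)

# Read a JDBC table into a Hive metastore table via Spark.
jdbc_to_spark = SparkJDBCOperator(
    task_id='jdbc_to_spark',
    cmd_type='jdbc_to_spark',
    jdbc_conn_id='my_jdbc_connection',   # hypothetical Airflow connection
    jdbc_table='public.my_table',        # hypothetical source table
    jdbc_driver='org.postgresql.Driver',
    metastore_table='my_hive_table',     # hypothetical target table
    dag=dag,
)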

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason: Tests included for both the spark_jdbc_hook and the spark_jdbc_operator.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":

    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

@codecov-io commented Feb 9, 2018

Codecov Report

Merging #3021 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3021      +/-   ##
==========================================
- Coverage   73.18%   73.17%   -0.01%     
==========================================
  Files         177      177              
  Lines       12412    12412              
==========================================
- Hits         9084     9083       -1     
- Misses       3328     3329       +1
Impacted Files       Coverage Δ
airflow/models.py    87.03% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44551e2...13b5c56.

@Fokko (Contributor) commented Feb 9, 2018

👏

@Fokko (Contributor) left a comment:

LGTM in the greater scheme of things; some small changes.

Contributor:

We should generalise the operator and not make it specific to PostgreSQL.

Contributor:

Can we condense this a bit?

reader = spark.read
reader = set_common_options(reader, url, jdbc_table, user, password, driver)

to

reader = set_common_options(spark.read, url, jdbc_table, user, password, driver)

Mutable variables make me cry in general, but I don't see a nicer way to do this in Python.
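
To make the suggestion concrete, a sketch assuming set_common_options simply chains .option() calls on a DataFrameReader/DataFrameWriter, as the PR's spark_jdbc_script appears to do; since each .option() returns the builder itself, the helper can take spark.read directly and the two-line form collapses to one:

def set_common_options(spark_source, url, jdbc_table, user, password, driver):
    # Each .option() returns the reader/writer itself, so the whole
    # chain can be built and returned in a single expression.
    return (spark_source
            .format('jdbc')
            .option('url', url)
            .option('dbtable', jdbc_table)
            .option('user', user)
            .option('password', password)
            .option('driver', driver))

# Assuming a SparkSession named `spark` and the JDBC parameters in scope:
reader = set_common_options(spark.read, url, jdbc_table, user, password, driver)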

Contributor:

Maybe change this. I guess this is legacy from Sqoop. This part has always made me cry, since import and export don't really describe the direction of the data. Now that we have the opportunity to fix this, my suggestion would be to change these to spark_to_sql and sql_to_spark to describe the direction of the flow 👍

Contributor (Author):

I've renamed them now to spark_to_jdbc and jdbc_to_spark to avoid any confusion with spark-sql.

Contributor:

Perfect, thanks! 👍

Contributor:

For now these tests are sufficient for me. Maybe in the future we can try to actually perform a Spark-to-Postgres job and assert the result :)

Contributor (Author):

That would be a nice test indeed :)

Contributor:

Why don't you add one, @danielvdende? Postgres is already available; if you launch a stand-alone Spark job, it should be possible.

Contributor:

This would also require the Spark binary, which is quite heavy, but I really like the idea.

Contributor (Author):

👍 I'll try to get it working. Just to check I've understood you correctly: you mean using the Postgres instance that is started as part of the Travis CI config? (In that case, we could even consider doing the same for MySQL.)
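
For what it's worth, a hypothetical sketch of what such an integration test could look like, assuming a local Spark session, the Postgres JDBC driver on the classpath, and airflow/airflow credentials as in the Travis config; the table name is a placeholder:

import unittest

from pyspark.sql import SparkSession


class TestSparkToPostgres(unittest.TestCase):
    def test_spark_to_jdbc_roundtrip(self):
        spark = (SparkSession.builder
                 .master('local[1]')
                 .appName('spark-jdbc-integration-test')
                 .getOrCreate())
        df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
        jdbc_options = {
            'url': 'jdbc:postgresql://localhost:5432/airflow',  # Travis Postgres
            'dbtable': 'spark_jdbc_test',                        # hypothetical table
            'user': 'airflow',
            'password': 'airflow',
            'driver': 'org.postgresql.Driver',
        }
        # Write the frame out over JDBC, read it back, and compare counts.
        df.write.format('jdbc').options(**jdbc_options).mode('overwrite').save()
        read_back = spark.read.format('jdbc').options(**jdbc_options).load()
        self.assertEqual(read_back.count(), df.count())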

Contributor:

Try it, but don't overcomplicate it. It's more of an integration test, and we should only run those once, not for every Travis setup. So it might be more for the future. @Fokko, can we merge as is?

Contributor:

Yes, let me merge it. I've checked it and it looks good to me.

I really like this operator. This PR introduces a new layer of operators: instead of having low-level operations that just do one thing, we leverage the existing SparkSubmitOperator to add a new layer of logic.
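
To make that layering concrete, a simplified, hypothetical sketch (not the PR's exact code; the script's flag names here are assumptions) of how a JDBC operator can build on SparkSubmitOperator by translating its own arguments into a spark-submit invocation of the bundled script:

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator


class ExampleSparkJDBCOperator(SparkSubmitOperator):
    # Hypothetical name, to avoid clashing with the PR's SparkJDBCOperator.
    def __init__(self, jdbc_table, cmd_type='spark_to_jdbc', *args, **kwargs):
        # Reuse the parent's spark-submit machinery: point it at the bundled
        # Spark job and pass the JDBC options through as application arguments.
        kwargs['application'] = 'spark_jdbc_script.py'
        kwargs['application_args'] = ['-cmdType', cmd_type,
                                      '-jdbcTable', jdbc_table]
        super(ExampleSparkJDBCOperator, self).__init__(*args, **kwargs)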

Add the Spark JDBC hook/operator pair. This pair extends
the existing SparkSubmitOperator. Also includes the
spark_jdbc_script, which is the actual Spark job that is run.
@danielvdende force-pushed the AIRFLOW-2085-add-spark-jdbc-operator branch from b783dee to 13b5c56 on February 9, 2018 09:24
@asfgit closed this in d5b5ae0 on Feb 10, 2018
aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request on Jan 3, 2019
Add the Spark JDBC hook/operator pair. This pair extends
the existing SparkSubmitOperator. Also includes the
spark_jdbc_script, which is the actual Spark job that is run.

Closes apache#3021 from danielvdende/AIRFLOW-2085-add-spark-jdbc-operator