
Conversation

@bfeif bfeif commented Sep 29, 2023

Changes included in this PR:

  • Added get_polars_df() method to BigQueryHook and DbApiHook.
  • Added polars to required packages.

Closes #33911.

Note to Reviewer: This is my first PR in Airflow, and my first time encountering mock. Can you provide any guidance or resources on creating tests for these new functions? I tried to create a test analogous to test_get_pandas_df() in airflow/tests/providers/google/cloud/hooks/test_bigquery.py, but wasn't sure how to proceed. #newbiequestion



boring-cyborg bot commented Sep 29, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

setup.py Outdated
"bcrypt>=2.0.0",
"flask-bcrypt>=0.7.1",
]
polars = ["polars>=0.19.5"]
Contributor

@eladkal eladkal Sep 29, 2023

I don't think this is right(?)
Polars should be an extra dependency of the common.sql provider (it should be defined in provider.yaml).

I know we have pandas here, but I think that's due to legacy/backward compatibility (pandas used to be part of Airflow core).

Author

You are probably correct, @eladkal; I mostly just mirrored what I saw pandas doing.

As for common.sql, you mentioned it in the issue as well, but I didn't understand what you meant there. Now that you say provider.yaml, I think I get it; you mean like here?

Contributor

@eladkal eladkal Sep 29, 2023

Not quite... by having it as a dependency of the provider, you force every user of the provider to install this lib, which is not desired.
What we should do is have it as an optional extra for the provider. Thus you need to define it in:

then users can have it when they explicitly ask to install the extra:
pip install apache-airflow-providers-google[polars]

But I think this extra dependency needs to be set in the common.sql provider, not the google one?

Author

Hm, okay. I've removed polars from core dependencies (I think), and added it to airflow/airflow/providers/google/provider.yaml.

Sorry, I'm still not clear on which file you mean by "the common.sql provider"; what's the precise path you're referring to?

Author

@eladkal can you please advise here?

Comment on lines +221 to +239
def get_polars_df(self, sql, **kwargs):
    """
    Execute the sql and return a polars dataframe.

    :param sql: the sql statement to be executed (str) or a list of
        sql statements to execute
    :param kwargs: (optional) passed into the polars.read_database method
    """
    try:
        import polars as pl
    except ImportError:
        raise ImportError(
            "polars library not installed, run: pip install "
            "'apache-airflow-providers-common-sql[polars]'."
        )

    with closing(self.get_conn()) as conn:
        return pl.read_database(sql, connection=conn, **kwargs)
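The optional-import guard at the top of this method is a common pattern for provider extras: fail fast with an actionable install hint. A minimal, stdlib-only sketch of the same idea (the function name and hint text are illustrative, not part of the PR):

```python
import importlib


def import_optional(module_name, extra_hint):
    """Import a module that may not be installed, raising an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Point the user at the exact pip extra that provides the dependency.
        raise ImportError(
            f"{module_name} library not installed, run: pip install '{extra_hint}'"
        ) from None
```

A caller would then do `pl = import_optional("polars", "apache-airflow-providers-common-sql[polars]")` instead of repeating the try/except in every method.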
Contributor

I'm not sure how it is for polars, but for pandas the same method is broken. Let me explain:

  1. Pandas expects to get a SQLAlchemy connection rather than DBAPI, with only one exception for SQLite.
  2. get_sqlalchemy_engine is also broken in some cases because get_url is broken; see the discussion on the dev list: https://lists.apache.org/thread/8rhmz3qh30hvkondct4sfmgk4vd07mn5

Contributor

Maybe it is not the case for polars; I'm not familiar with this lib.

Contributor

In the issue it was suggested to have get_df (a generic function) that accepts as a parameter whether it should be pandas/polars - why did you choose not to have it eventually?

Contributor

I was suggesting that we should not merge a single implementation, especially when no one knows how compatible the pandas and polars interfaces are. I'm just worried that if we merge it now, it could turn into something like BackfillJobRunner (_backfill_job_runner.py), which literally has 170 different statements.

In addition, pandas is broken for at least half of the implementations: #34679 (comment), so my view is that we should fix that first before we create some generic method.

Author

@bfeif bfeif Sep 29, 2023

@Taragolis the polars doc on read_database() states the following:

This function supports a wide range of native database drivers (ranging from local databases such as SQLite to large cloud databases such as Snowflake), as well as generic libraries such as ADBC, SQLAlchemy and various flavours of ODBC.

... and the connection object eventually gets passed in the constructor to the polars ConnectionExecutor here. Does this answer your question?
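Since pl.read_database accepts plain DBAPI connections, the result of the hook's get_conn() can be passed straight through. A stdlib-only sketch of the DBAPI usage in question, with sqlite3 standing in for the hook's connection (polars itself omitted so the example stays self-contained):

```python
import sqlite3
from contextlib import closing


def fetch_all(sql, params=()):
    # sqlite3.connect() stands in for hook.get_conn(); any DBAPI-compliant
    # connection object could be handed to pl.read_database the same way.
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
        return conn.execute(sql, params).fetchall()
```

The `closing(...)` context manager mirrors the one in the proposed get_polars_df, ensuring the connection is closed even if the query raises.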

Author

@bfeif bfeif Sep 29, 2023

In the issue it was suggested to have get_df (generic function) that accept as parameter if it should be pandas/polars - why did you choose not to have it eventually?

@eladkal you are right (link to suggestion in issue), this was suggested.

I made this decision because the APIs of polars and pandas are quite syntactically different for the actual querying, so the generic get_df function would have just been a big switch statement, which seemed to me to add no value (I admit this syntactic difference is larger on the BigQueryHook than on the DbApiHook).

Furthermore, I'm not sure why we'd want to add get_df without also removing (or at least no longer exposing) the underlying methods get_polars_df and get_pandas_df; having all three of get_df, get_polars_df, and get_pandas_df seems redundant. Doing this removal, however, would be a breaking change for a lot of users' code.

It seemed to me like such a get_df redefinition might fall better under the scope of a refactoring ticket.
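For reference, the generic get_df under discussion would amount to a thin dispatcher over the two existing methods. A hypothetical sketch (the class and marker return values are stand-ins for illustration, not the hook API):

```python
class HookSketch:
    # Stand-ins for the real hook methods; they return markers for clarity.
    def get_pandas_df(self, sql, **kwargs):
        return ("pandas", sql)

    def get_polars_df(self, sql, **kwargs):
        return ("polars", sql)

    def get_df(self, sql, df_type="pandas", **kwargs):
        # The "big switch statement" reduces to a small dispatch table.
        dispatch = {"pandas": self.get_pandas_df, "polars": self.get_polars_df}
        if df_type not in dispatch:
            raise ValueError(f"unsupported df_type: {df_type!r}")
        return dispatch[df_type](sql, **kwargs)
```

Whether this adds value over calling get_pandas_df / get_polars_df directly is exactly the trade-off being debated above.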

Contributor

We don't have to remove anything. We can deprecate and raise a deprecation warning.
For such a deprecation we will probably allow a very long time before actually removing it.
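The deprecate-then-remove path suggested here is usually implemented with a warning emitted at call time. A minimal sketch of that pattern (the decorator name and message text are illustrative):

```python
import functools
import warnings


def deprecated(use_instead):
    """Mark a function as deprecated, pointing callers at its replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(
                f"{func.__name__} is deprecated; use {use_instead} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Applied to a hypothetical get_pandas_df, callers keep working while seeing a DeprecationWarning, which is what makes the long removal window practical.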

Author

@bfeif bfeif Oct 22, 2023

It still seems to me to be over-engineering, at least for now @eladkal.

github-actions bot commented Dec 7, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 7, 2023
@github-actions github-actions bot closed this Dec 12, 2023

Labels

area:dev-tools area:providers kind:documentation provider:common-sql provider:google Google (including GCP) related issues stale Stale PRs per the .github/workflows/stale.yml policy file


3 participants