Feature add get polars df to dbapihook and bigqueryhook #34679
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
setup.py (Outdated)
| "bcrypt>=2.0.0", | ||
| "flask-bcrypt>=0.7.1", | ||
| ] | ||
| polars = ["polars>=0.19.5"] |
I don't think this is right(?)
Polars should be an extra dependency of the common.sql provider (it should be defined in provider.yaml).
I know we have pandas here, but I think that's due to legacy/backward compatibility (pandas used to be part of Airflow core).
You are probably correct, @eladkal; I mostly just mirrored what I saw pandas doing.
As for common.sql, you mentioned it in the issue as well, but I didn't understand what you meant there. Now that you say provider.yaml, I suppose I get it; you mean like here?
Not quite... by having it as a dependency of the provider, you force every user of the provider to install this lib, which is not desired.
What we should do is have it as an optional extra for the provider. Thus you need to define it in airflow/airflow/providers/google/provider.yaml, line 1191 (commit 23e2c95):

    additional-extras:

Then users can have it when they explicitly ask to install the extra:

    pip install apache-airflow-providers-google[polars]

But I think this extra dependency needs to be set in the common.sql provider, not the google one?
Hm okay. I've removed polars from core dependencies (I think), and added it to airflow/airflow/providers/google/provider.yaml.
Sorry, I'm still not clear on which file you mean by "the common.sql provider"; what's the precise path you're referring to?
@eladkal can you please advise here?
    def get_polars_df(self, sql, **kwargs):
        """
        Executes the sql and returns a polars dataframe.

        :param sql: the sql statement to be executed (str) or a list of
            sql statements to execute
        :param parameters: The parameters to render the SQL query with.
        :param kwargs: (optional) passed into polars.read_database method
        """
        try:
            import polars as pl
        except ImportError:
            raise Exception(
                "polars library not installed, run: pip install "
                "'apache-airflow-providers-common-sql[polars]'."
            )

        with closing(self.get_conn()) as conn:
            return pl.read_database(sql, connection=conn, **kwargs)
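For context, a minimal usage sketch (not part of the PR diff) of calling the new method on a concrete DbApiHook subclass such as SqliteHook, assuming the polars extra is installed and a default sqlite connection is configured:

    from airflow.providers.sqlite.hooks.sqlite import SqliteHook

    hook = SqliteHook()  # resolves the default sqlite_default connection
    # get_polars_df opens a DBAPI connection via get_conn() and hands it,
    # along with any extra kwargs, to polars.read_database.
    df = hook.get_polars_df("SELECT 1 AS x")
    print(df)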
I'm not sure how it is for polars, but for pandas the same method is broken. Let me explain:
- pandas expects to get a SQLAlchemy connection rather than a DBAPI one, with only one exception for SQLite.
- get_sqlalchemy_engine is also broken in some cases because get_url is broken; see the discussion on the dev list: https://lists.apache.org/thread/8rhmz3qh30hvkondct4sfmgk4vd07mn5
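To illustrate the pandas limitation (my own sketch, not from the PR): pandas.read_sql accepts a raw DBAPI connection only for SQLite; other DBAPI2 connections hit pandas' untested fallback path and emit a UserWarning, whereas a SQLAlchemy engine is the supported route for everything else:

    import sqlite3
    import pandas as pd
    from sqlalchemy import create_engine

    # Supported: a raw DBAPI connection, but only for SQLite.
    with sqlite3.connect(":memory:") as conn:
        df = pd.read_sql("SELECT 1 AS x", conn)

    # Supported: a SQLAlchemy engine, for any database.
    engine = create_engine("sqlite:///:memory:")
    df = pd.read_sql("SELECT 1 AS x", engine)

    # Passing another driver's DBAPI connection (e.g. psycopg2) emits a
    # UserWarning: "pandas only supports SQLAlchemy connectable ..."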
Maybe it is not the case for polars; I'm not familiar with this lib.
In the issue it was suggested to have get_df (a generic function) that accepts a parameter for whether it should be pandas/polars - why did you choose not to have it eventually?
My suggestion was that we should not merge this into a single implementation, especially since no one knows how compatible the pandas and polars interfaces are. I'm just worried that if we merge it now, it could turn into something like BackfillJobRunner (_backfill_job_runner.py), which literally has 170 different statements.
Additionally, pandas is broken for at least half of the implementations: #34679 (comment), so my view is that we should fix that first before we create some generic method.
@Taragolis the polars doc on read_database() states the following:
This function supports a wide range of native database drivers (ranging from local databases such as SQLite to large cloud databases such as Snowflake), as well as generic libraries such as ADBC, SQLAlchemy and various flavours of ODBC.
... and the connection object eventually gets passed to the constructor of the polars ConnectionExecutor here. Does this answer your question?
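As a quick sanity check of that claim (my own sketch, using the stdlib sqlite3 driver, assuming polars>=0.19):

    import sqlite3
    import polars as pl

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])

    # polars wraps the raw DBAPI connection in its ConnectionExecutor,
    # so no SQLAlchemy engine is required here.
    df = pl.read_database("SELECT a FROM t", connection=conn)
    print(df)  # shape: (2, 1)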
In the issue it was suggested to have get_df (a generic function) that accepts a parameter for whether it should be pandas/polars - why did you choose not to have it eventually?
@eladkal you are right (link to suggestion in issue), this was suggested.
I made this decision because the APIs of polars and pandas are quite syntactically different for the actual querying, so the get_df generic function would have just been a big switch statement, which seemed to me to add no value (this syntactic difference is larger on the BigQueryHook than on the DbApiHook, I do admit).
Furthermore, I'm not sure why we'd want to add get_df without also removing (or at least no longer exposing) the underlying methods get_polars_df and get_pandas_df; having get_df, get_polars_df, and get_pandas_df all exposed seems redundant. Doing this removal, however, would be a breaking change for a lot of users' code.
It seemed to me like such a get_df redefinition may fall better under the scope of a refactoring ticket.
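For reference, the suggested generic method would look roughly like this (hypothetical sketch, not in the PR; the names and signature are illustrative):

    def get_df(self, sql, parameters=None, *, df_type="pandas", **kwargs):
        """Dispatch to the backend-specific dataframe method."""
        if df_type == "pandas":
            return self.get_pandas_df(sql, parameters, **kwargs)
        if df_type == "polars":
            return self.get_polars_df(sql, **kwargs)
        raise ValueError(f"Unsupported df_type: {df_type!r}")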
We don't have to remove anything. We can deprecate and raise a deprecation warning.
For such a deprecation we would probably allow a very long time before actually removing it.
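A deprecation shim along those lines might look like this (hypothetical sketch; Airflow's actual deprecation conventions may differ):

    import warnings

    def get_pandas_df(self, sql, parameters=None, **kwargs):
        # Keep the old entry point working, but steer callers to get_df().
        warnings.warn(
            "get_pandas_df is deprecated; use get_df(..., df_type='pandas').",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_df(sql, parameters, df_type="pandas", **kwargs)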
It still seems to me to be over-engineering, at least for now, @eladkal.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Changes included in this PR:
- Adds get_polars_df() method to BigQueryHook and DbApiHook.
- Adds polars to required packages.

Closes #33911.
Note to Reviewer: This is my first PR in Airflow, and my first time encountering mock. Can you provide any guidance or resources on creating tests for these new functions? I looked to create an analogous test to test_get_pandas_df() in airflow/tests/providers/google/cloud/hooks/test_bigquery.py, but wasn't sure how to proceed. #newbiequestion
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.