Skip to content

Conversation

@guan404ming
Copy link
Member

@guan404ming guan404ming commented Apr 16, 2025

Related Issue

Why

get_pandas and get_pandas_by_chunks is deprecated in #48875, which would be replaced by get_df. Thus, we need to migrate them in providers.

How

This PR is focus on migration of test migration only case as they are more simple and general. Thus I thought that put them together would be better.

  • Apache Drill
  • Apache Druid
  • Apache Impala
  • Apache Pinot
  • Elasticsearch
  • SQLite
  • Vertica

In this test, the original get_pandas_df would still be tested since it's implemented like

def get_pandas_df(
self,
sql,
parameters: list | tuple | Mapping[str, Any] | None = None,
**kwargs,
) -> DataFrame:
"""
Execute the sql and returns a pandas dataframe.
:param sql: the sql statement to be executed (str) or a list of sql statements to execute
:param parameters: The parameters to render the SQL query with.
:param kwargs: (optional) passed into pandas.io.sql.read_sql method
"""
return self._get_pandas_df(sql, parameters, **kwargs)

and get_df is like this

if df_type == "pandas":
return self._get_pandas_df(sql, parameters, **kwargs)
if df_type == "polars":
return self._get_polars_df(sql, parameters, **kwargs)

thus it still handled the backwards compatibility.

cc: @eladkal


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@guan404ming guan404ming force-pushed the miragte-test branch 5 times, most recently from 2449640 to 84979e9 Compare April 19, 2025 11:31
@guan404ming
Copy link
Member Author

guan404ming commented Apr 19, 2025

Hi @eladkal after my investigation, I found that the cursor execution logic is different for pandas and polars. Reference to the polars.read_database source code, there is a abstraction for cursor execution when calling read_database:

https://github.com/pola-rs/polars/blob/5c114f0f32a62bf796629e4ccf01b93146ae22c4/py-polars/polars/io/database/functions.py#L247-L256 and the executer https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/database/_executor.py

Thus we have also need to mock the execution abstraction for polars, while pandas just do cursor.fetchall() directly which meaning that the current implement for test is suitable for pandas. Also, we need to skip the polars test for low_deps and AF2 in ci, so I split them into two functions rather than using @pytest.mark.parametrize to handle the test here.

@guan404ming
Copy link
Member Author

guan404ming commented Apr 19, 2025

I currently removed the test for polars for Elasticsearchin this PR, since Elasticsearch provider has its own ElasticsearchSQLCursor which is not compatable with polars cursor execution. I think the modification which the Elasticsearch need is maybe quite different with the other ones in this PR thus I would try open another one to deal with Elasticsearch and I have added it in todo list in #49334

@guan404ming guan404ming requested a review from eladkal April 19, 2025 11:54
@guan404ming guan404ming force-pushed the miragte-test branch 4 times, most recently from 6cf4613 to 5d498dc Compare April 23, 2025 07:08
@guan404ming guan404ming reopened this Apr 23, 2025
@eladkal eladkal requested a review from potiuk April 23, 2025 13:34
@potiuk
Copy link
Member

potiuk commented Apr 23, 2025

I do not think we should base it on polars importability. We should not check for Airflow 3 here, but instead we should put minimum common.sql version (to the new version that will be released ) in all providers that are going to use the new feature. This is a "common.sql" not airflow feature, so theoretically when you install the new provider and they have common.sql >=NEW_VERSION the tests should also work for compatibility tests and polars should be installed - because we are going to install the new common.sql there.

Or am I missing something?.

@guan404ming guan404ming force-pushed the miragte-test branch 2 times, most recently from 8383a71 to 3012315 Compare April 28, 2025 07:49
@potiuk potiuk enabled auto-merge (squash) April 28, 2025 08:30
@potiuk
Copy link
Member

potiuk commented Apr 28, 2025

There is still one problem left - compat tests - because there polars is not present as extra in airflow. But what we can do, we can somewhat modify the scripts to add polars extra to common.sql - cheating a bit.

In this script : https://github.com/apache/airflow/blob/main/scripts/in_container/install_airflow_and_providers.py#L720 we are looping through providers available in /dist folder and install them. So if find apache_airflow_providers_common_sql .whl file there we could add [polars] to it.

@guan404ming
Copy link
Member Author

I agree with you.

Initially, I wanted to bump the Apache Airflow version to >= 3.1.0 in the providers. However, it seems like that is not the best approach because it will break compatibility. Considering this is not "really" incompatible but lack of package for test, I thought we can proceed with that.

auto-merge was automatically disabled April 28, 2025 09:24

Head branch was pushed to by a user without write access

@potiuk
Copy link
Member

potiuk commented Apr 28, 2025

I agree with you.

Initially, I wanted to bump the Apache Airflow version to >= 3.1.0 in the providers. However, it seems like that is not the best approach because it will break compatibility. Considering this is not "really" incompatible but lack of package for test, I thought we can proceed with that.

Nice.. Looks good. Let's see if CI agrees with this approach :)

@guan404ming
Copy link
Member Author

guan404ming commented Apr 28, 2025

It works!! 🎉
Force push for the conflict.

guan404ming and others added 3 commits April 28, 2025 18:12
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
@eladkal
Copy link
Contributor

eladkal commented Apr 28, 2025

It works!! 🎉 Force push for the conflict.

🎉

@potiuk potiuk enabled auto-merge (squash) April 28, 2025 10:36
@potiuk
Copy link
Member

potiuk commented Apr 28, 2025

It works!! 🎉 Force push for the conflict.

🎉

Nice!!!!

@potiuk potiuk merged commit c6ca8a2 into apache:main Apr 28, 2025
96 checks passed
@guan404ming
Copy link
Member Author

guan404ming commented Apr 28, 2025

Thanks for the review and suggestions!

@github-actions
Copy link

Backport failed to create: v3-0-test. View the failure log Run details

Status Branch Result
v3-0-test Commit Link

You can attempt to backport this manually by running:

cherry_picker c6ca8a2 v3-0-test

This should apply the commit to the v3-0-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

@guan404ming guan404ming deleted the miragte-test branch April 28, 2025 11:04
potiuk added a commit to potiuk/airflow that referenced this pull request Apr 28, 2025
…est (apache#49339)

* test: migrate get_pandas_df to get_df in provider test

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>

* chore: add polars in deps

* fix: install polars in ci

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
(cherry picked from commit c6ca8a2)
potiuk added a commit that referenced this pull request Apr 28, 2025
…est (#49339) (#49885)

* test: migrate get_pandas_df to get_df in provider test




* chore: add polars in deps

* fix: install polars in ci

---------




(cherry picked from commit c6ca8a2)

Co-authored-by: Guan Ming(Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
mvfc pushed a commit to mvfc/airflow that referenced this pull request Apr 29, 2025
…49339)

* test: migrate get_pandas_df to get_df in provider test

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>

* chore: add polars in deps

* fix: install polars in ci

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
mvfc pushed a commit to mvfc/airflow that referenced this pull request Apr 29, 2025
…49339)

* test: migrate get_pandas_df to get_df in provider test

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>

* chore: add polars in deps

* fix: install polars in ci

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
mvfc pushed a commit to mvfc/airflow that referenced this pull request Apr 29, 2025
…49339)

* test: migrate get_pandas_df to get_df in provider test

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>

* chore: add polars in deps

* fix: install polars in ci

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
jroachgolf84 pushed a commit to jroachgolf84/airflow that referenced this pull request Apr 30, 2025
…49339)

* test: migrate get_pandas_df to get_df in provider test

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>

* chore: add polars in deps

* fix: install polars in ci

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Jarek Potiuk <potiuk@apache.org>
Co-authored-by: Wesley Chiu <a-wesleychiu@microsoft.com>
@guan404ming guan404ming changed the title test: migrate get_pandas_df to get_df in provider test Migrate get_pandas_df to get_df in provider test Apr 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants