
[kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in test_requirements.txt #216

Open
jmholzer opened this issue May 19, 2023 · 11 comments

Comments

@jmholzer
Contributor

jmholzer commented May 19, 2023

Description

Spark 3.4.0 was released in April. Our Databricks and Spark datasets should support this newer version of Spark, though it currently causes many tests to fail.

As part of this change, we should also enforce Pandas >= 2, since earlier versions of Pandas are not compatible with Spark >= 3.4. This change will also enable us to upgrade delta-spark.
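A sketch of what the updated pins in `test_requirements.txt` might look like (the exact upper bounds here are illustrative assumptions, not decided values):

```
pandas>=2.0,<3.0
pyspark>=3.4,<3.5
delta-spark>=2.4,<3.0
```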

Context

This is an important change as it will ensure our datasets work with the latest version of Spark.

@jmholzer jmholzer changed the title [kedro-datasets] Upgrade PySpark to 3.4.0, Pandas >= 2 in test_requirements.txt [kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in test_requirements.txt May 19, 2023
@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

I can't comment on Spark, but be careful when forcing something like pandas >= 2.0, as users typically rely on other packages that might not yet be compatible with pandas 2.0. For example, great-expectations (see the explicit comment in the requirements file on their GitHub here). Furthermore, I would also set upper limits to avoid getting into trouble later on.


@noklam
Contributor

noklam commented Jun 12, 2023

I also have doubts about pinning pandas >= 2.0; I don't see the ecosystem catching up that quickly, and this shouldn't be done for at least the coming 12 months.

The test suite is a separate problem, and how we should test our datasets is an additional question. In any case, I would say we should tackle this in our test suite rather than forcing it on our users.

For example, a user running pyspark==3.2.0 and pandas==1.5.3 shouldn't be blocked by kedro-datasets[spark].
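To illustrate the concern, here is a minimal sketch using the `packaging` library, with the version numbers taken from the comment above, showing how a hypothetical `pandas>=2.0` install pin would exclude that user's existing environment:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# The hypothetical install-requirement pin under discussion
pin = SpecifierSet(">=2.0")

# The environment described above: pandas 1.5.3
installed = Version("1.5.3")

# A dependency resolver would reject this combination,
# blocking the user's install of kedro-datasets[spark].
print(installed in pin)         # False: 1.5.3 does not satisfy >=2.0
print(Version("2.0.1") in pin)  # True
```

This is exactly why a pin that belongs in `test_requirements.txt` should not leak into the package's install requirements.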

@noklam
Contributor

noklam commented Jun 12, 2023

Okay, I just read the title carefully; this is only about test_requirements.txt. I misunderstood it as being about the installation requirements.

Do we have any idea of what's failing with pandas >= 2.0? We will potentially touch/fix it when we try to add Python 3.11 support.

@astrojuanlu
Member

Whoops, I also misread the title. Thanks @noklam 👍

@MatthiasRoels
Contributor

Haha I also misread the title 😅. Thanks @noklam for pointing it out!

@noklam
Contributor

noklam commented Jun 12, 2023

This is some kind of collective hallucinations 😂

@MatthiasRoels
Contributor

Maybe a good remark to add here: no current version of Spark is compatible with Pandas >= 2! If you look at Spark's Jira issue tracker, compatibility with Pandas 2.0 is planned for the next major Spark release (Spark 4.0).

@astrojuanlu
Member

> If you look at the Jira issue tracker of Spark, compatibility with Pandas 2.0 is foreseen for the next major version upgrade of Spark (Spark 4.0)

Do you have a link? I tried a quick search, but Jira and I cannot be friends.

@MatthiasRoels
Contributor

MatthiasRoels commented Sep 27, 2023

@astrojuanlu: Sure, here is the link (note the affects version).
