
[kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in test_requirements.txt #216

Open
jmholzer opened this issue May 19, 2023 · 11 comments

Comments

@jmholzer
Contributor

jmholzer commented May 19, 2023

Description

Spark 3.4.0 was released in April. Our Databricks and Spark datasets should support this newer version of Spark, though it currently causes many tests to fail.

As part of this change, we should also enforce Pandas >= 2, since earlier versions of Pandas are not compatible with Spark >= 3.4. This change will also enable us to upgrade delta-spark.
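A sketch of what the updated pins in `test_requirements.txt` might look like (the exact upper bounds here are illustrative assumptions, not decided values):

```
pandas>=2.0,<3.0
pyspark>=3.4,<3.5
delta-spark>=2.4,<3.0
```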

Context

This is an important change as it will ensure our datasets work with the latest version of Spark.

@jmholzer jmholzer changed the title [kedro-datasets] Upgrade PySpark to 3.4.0, Pandas >= 2 in test_requirements.txt [kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in test_requirements.txt May 19, 2023
@MatthiasRoels
Contributor

MatthiasRoels commented Jun 12, 2023

I can't comment on Spark, but be careful when forcing something like pandas >= 2.0, as users typically rely on other packages that might not yet be compatible with pandas 2.0. For example, great-expectations (see the explicit comment in the requirements file on their GitHub here). Furthermore, I would also set upper limits to avoid getting into trouble later on.


@noklam
Contributor

noklam commented Jun 12, 2023

I also have doubts about pinning pandas >= 2.0; I don't see the ecosystem catching up that quickly, and this shouldn't be done for at least the coming 12 months.

The test suite is a separate problem, and how we should test our datasets is an additional question. In any case, I would say we should tackle this in our test suite rather than forcing it on our users.

For example, a user running pyspark==3.2.0 and pandas==1.5.3 shouldn't be blocked by kedro-datasets[spark].
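To illustrate the concern, here is a minimal sketch using the `packaging` library, with the version numbers taken from the comment above, showing how a hypothetical `pandas>=2.0` install pin would exclude that user's existing environment:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# The hypothetical install-requirement pin under discussion
pin = SpecifierSet(">=2.0")

# The environment described above: pandas 1.5.3
installed = Version("1.5.3")

# A dependency resolver would reject this combination,
# blocking the user's install of kedro-datasets[spark].
print(installed in pin)         # False: 1.5.3 does not satisfy >=2.0
print(Version("2.0.1") in pin)  # True
```

This is exactly why a pin that belongs in `test_requirements.txt` should not leak into the package's install requirements.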

@noklam
Contributor

noklam commented Jun 12, 2023

Okay, I just read the title carefully; this is only about test_requirements.txt. I misunderstood it as being about the installation requirements.

Do we have any idea of what's failing with pandas >= 2.0? We will potentially touch/fix it when we try to add Python 3.11 support.

@astrojuanlu
Member

Whoops, I also misread the title. Thanks @noklam 👍

@MatthiasRoels
Contributor

Haha I also misread the title 😅. Thanks @noklam for pointing it out!

@noklam
Contributor

noklam commented Jun 12, 2023

This is some kind of collective hallucinations 😂

@MatthiasRoels
Contributor

Maybe a good remark to add here: no current version of Spark is compatible with Pandas >= 2! If you look at Spark's Jira issue tracker, compatibility with Pandas 2.0 is planned for the next major Spark release (Spark 4.0).

@astrojuanlu
Member

> If you look at the Jira issue tracker of Spark, compatibility with Pandas 2.0 is foreseen for the next major version upgrade of Spark (Spark 4.0)

Do you have a link? I tried a quick search, but Jira and I cannot be friends.

@MatthiasRoels
Contributor

MatthiasRoels commented Sep 27, 2023

@astrojuanlu: Sure, here is the link (note the affects version).
