Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50324][PYTHON][CONNECT] Make createDataFrame trigger Config RPC at most once #48856

Closed
wants to merge 4 commits into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Nov 15, 2024

What changes were proposed in this pull request?

Get all configs in batch

Why are the changes needed?

there are too many related configs in createDataFrame, they are fetched one by one (or group by group) in different branches:
1, it is possible no Config RPC is triggered, e.g. in this branch:

elif isinstance(data, Sized) and len(data) == 0:
if _schema is not None:
return DataFrame(LocalRelation(table=None, schema=_schema.json()), self)
else:
raise PySparkValueError(
errorClass="CANNOT_INFER_EMPTY_SCHEMA",
messageParameters={},
)

2, multiple Config RPCs for different configs, e.g. in this branch:

prefer_timestamp_ntz = is_timestamp_ntz_preferred()
(timezone,) = self._client.get_configs("spark.sql.session.timeZone")

Does this PR introduce any user-facing change?

no

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

@xinrong-meng
Copy link
Member

LGTM thank you!

@zhengruifeng zhengruifeng changed the title [SPARK-50324][PYTHON][CONNECT] Make createDataFrame trigger Config RPC at most once [SPARK-50324][PYTHON][CONNECT] Make createDataFrame trigger Config RPC at most once Nov 15, 2024
@@ -706,9 +724,9 @@ def createDataFrame(
else:
local_relation = LocalRelation(_table)

cache_threshold = self._client.get_configs("spark.sql.session.localRelationCacheThreshold")
cache_threshold = conf_getter["spark.sql.session.localRelationCacheThreshold"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just get all the confs in batch eagerly? Seems like we should get the conf once anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 2 cases that we don't need the configs:
1, the local data is empty, and the schema is specified, it returns a valid empty df;
2, the creation fails due to some assertions

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another way of doing this is to maintain a sized dictionary in Python side, and cache the value retrieved within the Python side.

e.g.,

  • spark.get("a")

    • look up cacehd["a"] = v
      • if not, spark.get("a")
  • spark.set("a", aa")

    • empty cache cached["a"]
    • spark.set("a")

and create a dictioanry with TTL and max size

@HyukjinKwon
Copy link
Member

which will work all for Spark Classic and Spark Connect.

@zhengruifeng
Copy link
Contributor Author

I think another way of doing this is to maintain a sized dictionary in Python side, and cache the value retrieved within the Python side.

e.g.,

  • spark.get("a")

    • look up cacehd["a"] = v

      • if not, spark.get("a")
  • spark.set("a", aa")

    • empty cache cached["a"]
    • spark.set("a")

and create a dictioanry with TTL and max size

A problem is that spark.conf is not the only entry point for config operations.
Users can also set/unset config by spark.sql

@zhengruifeng
Copy link
Contributor Author

It seems we don't need this helper class to achieve the goal, will have another try

@zhengruifeng zhengruifeng marked this pull request as draft November 20, 2024 01:19
@zhengruifeng zhengruifeng marked this pull request as ready for review November 20, 2024 01:42
@zhengruifeng zhengruifeng force-pushed the lazy_config branch 2 times, most recently from 035a86d to c8af66d Compare November 22, 2024 03:01
@zhengruifeng
Copy link
Contributor Author

thanks, merged to master

@zhengruifeng zhengruifeng deleted the lazy_config branch November 22, 2024 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants