[SPARK-50324][PYTHON][CONNECT] Make createDataFrame trigger Config RPC at most once #48856
Conversation
LGTM, thank you!
@@ -706,9 +724,9 @@ def createDataFrame(
         else:
             local_relation = LocalRelation(_table)

-        cache_threshold = self._client.get_configs("spark.sql.session.localRelationCacheThreshold")
+        cache_threshold = conf_getter["spark.sql.session.localRelationCacheThreshold"]
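The new code indexes into a conf_getter whose definition is not shown in this hunk. Below is a minimal sketch of how a lazy, batched getter could behave, assuming the Spark Connect client's get_configs(*keys) call seen on the removed line; the class name and shape are illustrative, not the PR's actual code:

    from typing import Dict, Optional, Sequence, Tuple

    class LazyConfGetter:
        """Fetch all required confs in one Config RPC, and only when first needed."""

        def __init__(self, client, keys: Sequence[str]):
            self._client = client                      # assumed to expose get_configs(*keys)
            self._keys: Tuple[str, ...] = tuple(keys)  # every conf this code path may consult
            self._cached: Optional[Dict[str, Optional[str]]] = None

        def __getitem__(self, key: str) -> Optional[str]:
            if self._cached is None:
                # First access triggers a single Config RPC for all keys at once.
                values = self._client.get_configs(*self._keys)
                self._cached = dict(zip(self._keys, values))
            return self._cached[key]

With this shape, a code path that never indexes into the getter triggers no Config RPC at all, and every other path triggers exactly one.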
Shall we just get all the confs in batch eagerly? Seems like we should get the conf once anyway.
There are two cases where we don't need the configs:
1. the local data is empty and the schema is specified, so a valid empty DataFrame is returned;
2. the creation fails due to some assertions.
I think another way of doing this is to maintain a sized dictionary on the Python side, and cache the retrieved values there.
e.g. (a sketch of such a cache follows this comment):

spark.get("a")
- look up cached["a"]
- if missing, call spark.get("a") and store cached["a"] = v

spark.set("a", "aa")
- invalidate cached["a"]
- then call spark.set("a")

and create a dictionary with TTL and max size,
which will work for both Spark Classic and Spark Connect.
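A minimal sketch of the suggested cache, assuming nothing beyond the standard library; the names ConfCache, ttl_seconds, and max_size are hypothetical, not an existing PySpark API:

    import time
    from collections import OrderedDict
    from typing import Optional, Tuple

    class ConfCache:
        """A dictionary with a TTL and a max size; evicts least-recently-used keys."""

        def __init__(self, ttl_seconds: float = 60.0, max_size: int = 128):
            self._ttl = ttl_seconds
            self._max_size = max_size
            self._data: "OrderedDict[str, Tuple[float, Optional[str]]]" = OrderedDict()

        def get(self, key: str) -> Optional[str]:
            item = self._data.get(key)
            if item is None:
                return None                   # not cached (None doubles as a miss here)
            ts, value = item
            if time.monotonic() - ts > self._ttl:
                del self._data[key]           # entry expired: drop it and report a miss
                return None
            self._data.move_to_end(key)       # refresh recency for LRU eviction
            return value

        def put(self, key: str, value: Optional[str]) -> None:
            self._data[key] = (time.monotonic(), value)
            self._data.move_to_end(key)
            while len(self._data) > self._max_size:
                self._data.popitem(last=False)    # evict the least-recently-used key

        def invalidate(self, key: str) -> None:
            # spark.set("a", ...) should call this so stale values are dropped.
            self._data.pop(key, None)

Note that get here conflates a cached None with a miss; a real implementation would use a sentinel value to tell the two apart.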
A problem is that …
It seems we don't need this helper class to achieve the goal; I will have another try.
thanks, merged to master
What changes were proposed in this pull request?
Get all configs in batch
Why are the changes needed?
There are too many related configs in createDataFrame; they are fetched one by one (or group by group) in different branches (a sketch of the batched alternative follows this list):
1. it is possible that no Config RPC is triggered, e.g. in this branch:
spark/python/pyspark/sql/connect/session.py
Lines 502 to 509 in 2633035
2. multiple Config RPCs are issued for different configs, e.g. in this branch:
spark/python/pyspark/sql/connect/session.py
Lines 599 to 601 in 2633035
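As a rough sketch of the batching idea (the function name is hypothetical and the key list is illustrative, though the keys themselves are real Spark SQL configs and get_configs is the Spark Connect client method used in the diff above):

    def fetch_create_dataframe_confs(client):
        """Fetch every conf createDataFrame may consult in one Config RPC."""
        keys = (
            "spark.sql.timestampType",
            "spark.sql.session.timeZone",
            "spark.sql.session.localRelationCacheThreshold",
        )
        values = client.get_configs(*keys)  # a single batched Config RPC
        return dict(zip(keys, values))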
Does this PR introduce any user-facing change?
no
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
no