Skip to content

Conversation

@bogao007
Copy link
Contributor

What changes were proposed in this pull request?

Use UDS for JVM and Python worker communication. This is a followup item from a previous PR comment #47133 (comment)

Why are the changes needed?

UDS is recommended for same-host processes communication.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

valueSerializer: ExpressionEncoder.Serializer[Row])

object TransformWithStateInPandasStateServer {
@volatile private var id = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, TransformWithStateInPandasPythonRunner is instantiated per executor so they are in different JVMs, and TransformWithStateInPandasStateServer is initialized from each TransformWithStateInPandasPythonRunner once. What could be the scenario for handling multiple threads here?

Copy link
Contributor Author

@bogao007 bogao007 Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no much difference comparing to TCP socket. For UDS, we create socket files (we use different ports for TCP) for each server thread connection, and the files are named with different serverId which got updated incrementally. At the end we do cleanup to delete these socket files.

Copy link
Contributor

@jingz-db jingz-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a small question otherwise looks good. Thanks for making the change!

@HyukjinKwon
Copy link
Member

UDS is recommended for same-host processes communication

Should we better change all usage in Python worker communication instead of this alone?

<artifactId>htmlunit3-driver</artifactId>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

@LuciferYang LuciferYang Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. should add this to dependency management first
  2. for newly added dependencies, the corresponding LICENSE should be maintained. am i right ?@dongjoon-hyun @yaooqinn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-ASF ones, we will include their licenses in our licenses-binary directory and LICENSE-binary file, and also update the NOTICE-binary document for any applicable notices.

asm-commons/9.2//asm-commons-9.2.jar
asm-tree/9.2//asm-tree-9.2.jar
asm-util/9.2//asm-util-9.2.jar
asm/9.2//asm-9.2.jar
Copy link
Contributor

@LuciferYang LuciferYang Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark master branch is currently using asm version 9.7.1, and I think we should unify to version 9.7.1 because this involves support for Java versions, asm 9.2 does not support Java 19+

@LuciferYang
Copy link
Contributor

LuciferYang commented Nov 19, 2024

Is this a Unix-only feature? For Windows users, will it crash or are there workarounds available?

@LuciferYang
Copy link
Contributor

LuciferYang commented Nov 19, 2024

UDS is recommended for same-host processes communication

Should we better change all usage in Python worker communication instead of this alone?

+1

And since I'm not a Python developer, please allow me to ask a beginner's question: Is it a consensus that UDS is recommended for same-host process communication? Are there any relevant test data to support this?

Copy link
Contributor

@allisonwang-db allisonwang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the PR title more explicit? Add the FOLLOWUP tag and also mentioning this is for streaming Python worker only?

runner_conf = {}

state_server_port = None
state_server_id = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this change?

@bogao007
Copy link
Contributor Author

UDS is recommended for same-host processes communication

Should we better change all usage in Python worker communication instead of this alone?

+1

And since I'm not a Python developer, please allow me to ask a beginner's question: Is it a consensus that UDS is recommended for same-host process communication? Are there any relevant test data to support this?

@LuciferYang Thanks for bringing this up! Unfortunately I don't have any test data to back this statement. Though it might have a slightly better performance, it may also bring some downsides compared to TCP socket (e.g. Windows compatibility). I'll drop this PR for now given not enough evidence to do the UDS migration, thanks!

@bogao007 bogao007 closed this Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants