
Conversation

@HyukjinKwon (Member)

What changes were proposed in this pull request?

This PR is a followup of #38468 that proposes to remove the notify-wait approach and introduce a new way to collect partitions in parallel and send them in order.

  • Previously, it actually waits until all results are stored first, and then sends them one by one as Protobuf messages (therefore, the notify-wait isn't actually needed).

    In both the worst and the best case, we always collect all partitions first and then send them partition by partition.

  • Now, it sends Protobuf messages in order as soon as the 0th partition is available (and keeps sending the next partitions as they become available); see the sketch below.

    Worst case, we collect all partitions and then send them one by one. Best case, we send each partition as it is collected.
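
For illustration, here is a minimal sketch of the index-guided pattern described above, using hypothetical names (`serializePartition`, `sendBatch`) rather than the actual methods touched by this PR: partitions may finish out of order, but batches are emitted strictly in partition order by flushing the contiguous prefix that starts at partition 0 as it becomes available.

```scala
import scala.collection.mutable

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: collect partitions in parallel, emit them in order.
def collectAndSendInOrder[T](
    sc: SparkContext,
    rdd: RDD[T],
    serializePartition: Iterator[T] => Array[Byte], // e.g. encode one Arrow batch
    sendBatch: Array[Byte] => Unit): Unit = {       // e.g. write one Protobuf response
  val buffered = mutable.Map.empty[Int, Array[Byte]]
  var nextToSend = 0

  // The result handler fires once per finished partition, possibly out of order.
  // It buffers results by partition index and flushes the contiguous prefix that
  // starts at partition 0, so the output order is preserved without a
  // notify/wait handshake.
  sc.runJob(rdd, serializePartition, (partitionId: Int, bytes: Array[Byte]) => {
    buffered(partitionId) = bytes
    while (buffered.contains(nextToSend)) {
      sendBatch(buffered.remove(nextToSend).get)
      nextToSend += 1
    }
  })
}
```

Whether the actual write should happen inside the result handler or on the thread that owns the response stream is exactly what the discussion below turns on.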

Why are the changes needed?

For better performance, lower memory usage, and better readability and maintainability (by removing synchronization).

Does this PR introduce any user-facing change?

No, this feature has not been released yet, and this is a performance-only fix.

How was this patch tested?

CI in this PR should test it out.

Contributor

#38468 (comment)

maybe we can create a threadpool? (shared across collect invocations)

@HyukjinKwon (Member Author)

Hm, I just noticed the review comment. I believe this matches our current implementation in PySpark. If we want to improve it, let's improve both paths together. I would prefer to match them and deduplicate the logic first.

Contributor

+1 for matching the implementations

Contributor

Why use a thread pool if you have a thread sitting around?

Contributor

Does it need to be synchronized?

@HyukjinKwon (Member Author), Nov 11, 2022

Nope, it doesn't (because it's guided by the index). This approach is actually from the initial ordered implementation of collect with Arrow (which was in production for a very long time): 82c18c2#diff-459628811d7786c705fbb2b7a381ecd2b88f183f44ab607d43b3d33ea48d390fR3282-R3318.

@cloud-fan (Contributor)

Previously, it actually waits until all results are stored first

Really? I think the best case is also sending partitions one by one.

Anyway, this PR looks good as it avoids the lock.

@HyukjinKwon
Copy link
Member Author

It collects all results first because the synchronous runJob waits for all results to arrive.
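
For reference, a small sketch of the distinction being discussed, with a hypothetical `send` writer and `batches` RDD rather than the actual code in this PR: `runJob` invokes its result handler per finished partition but only returns once every partition has completed, whereas `submitJob` hands back a future immediately, so the caller is not forced to wait for all results before doing anything else.

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical illustration of blocking vs. non-blocking job submission.
def compareSubmissionStyles(
    sc: SparkContext,
    batches: RDD[Array[Byte]],                  // one pre-serialized batch per partition
    send: (Int, Array[Byte]) => Unit): Unit = {
  // runJob: the handler fires as each partition finishes, but this call itself
  // blocks until every partition has completed.
  sc.runJob(batches, (it: Iterator[Array[Byte]]) => it.next(), send)

  // submitJob: returns a future right away; the handler still fires per
  // partition, and the caller may do other work in the meantime.
  val future = sc.submitJob(
    batches,
    (it: Iterator[Array[Byte]]) => it.next(),
    0 until batches.getNumPartitions,
    send,
    ())
  Await.result(future, Duration.Inf) // block only when the caller chooses to
}
```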

Contributor

The reason I suggested using locks and the main thread to write the results is exactly what this comment is trying to convey. You don't want these operations to happen inside the DAGScheduler thread. If you keep it blocked on something that isn't scheduling-related, you will stall all other scheduling. This is particularly bad in an environment where multiple users might be running code at the same time.

Contributor

We have seen in higher concurrency scenarios that this does become a problem. Throughput will plateau because the DAGScheduler is doing the wrong things. I would like to avoid that.
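
For reference, here is a minimal sketch of the pattern being advocated, using hypothetical names (`ArrowBatch`, `writeResponse`) rather than the actual Spark Connect classes: the job's result handler only stashes each finished partition, and the thread that owns the response stream drains and writes the batches in partition order, so scheduler-side threads are never blocked on network I/O.

```scala
import java.util.concurrent.{ConcurrentHashMap, Semaphore}

// Hypothetical container for one serialized partition.
final case class ArrowBatch(partitionId: Int, bytes: Array[Byte])

class OrderedBatchSender(numPartitions: Int, writeResponse: ArrowBatch => Unit) {
  private val ready = new ConcurrentHashMap[Int, ArrowBatch]()
  private val available = new Semaphore(0)

  // Called from the job's result handler: just stash the batch and signal.
  // This stays cheap, so scheduler-side threads are never blocked on I/O.
  def onPartitionCompleted(batch: ArrowBatch): Unit = {
    ready.put(batch.partitionId, batch)
    available.release()
  }

  // Called from the main/streaming thread: block here (and only here) until
  // the next batches arrive, and write them strictly in order 0, 1, 2, ...
  def sendAll(): Unit = {
    var next = 0
    while (next < numPartitions) {
      available.acquire()
      while (next < numPartitions && ready.containsKey(next)) {
        writeResponse(ready.remove(next))
        next += 1
      }
    }
  }
}
```

This matches the direction the thread converges on below (the #38614 approach of sending batches in the main thread): keep the result handler cheap and let the main thread pay the cost of waiting and writing.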

Contributor

Good point! We should write this down as a code comment. @zhengruifeng, can you help with it?

Contributor

ok. let me add a comment for this

@HyukjinKwon (Member Author)

Actually, let's just go with the #38614 approach, which is simpler. This approach can't easily dedup the code anyway because of the ordering requirement.

HyukjinKwon pushed a commit that referenced this pull request Nov 15, 2022
… batch in main thread

### What changes were proposed in this pull request?
Document the reason of sending batch in main thread

### Why are the changes needed?
as per #38613 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
no, doc-only

Closes #38654 from zhengruifeng/connect_doc_collect.

Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… batch in main thread

@HyukjinKwon deleted the SPARK-41005-followup branch January 15, 2024 00:53