-
Notifications
You must be signed in to change notification settings - Fork 170
[Improvement][AQE] Avoid calling getShuffleResult multiple times #190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jerqi can you help review this pr plz? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #190 +/- ##
============================================
+ Coverage 58.40% 58.99% +0.58%
- Complexity 1273 1325 +52
============================================
Files 158 160 +2
Lines 8446 8699 +253
Branches 784 815 +31
============================================
+ Hits 4933 5132 +199
- Misses 3260 3303 +43
- Partials 253 264 +11 ☔ View full report in Codecov by Sentry. |
|
Is it a compatible feature? If this is a breaking change, we should dicuss about it in the mail list first. But it seems to become a compatible feature if you adjust your implement. |
|
I think it's a compatible feature, could you give me more detail? @jerqi |
|
Or we should keep the original implementation getShuffleResult(partitionId), and then add a new implementation getShuffleResult(startPartition, endPartition)? |
|
A compatible feature means that old version server can use the new version server, the new version client can use the old version server. You change the protobuf field name, it will cause incompatibility in my thought. |
|
Thank you for your reminder, i will make it compatible. |
|
@jerqi can you help review this pr plz? |
|
org.apache.uniffle.common.security.HadoopSecurityContextTest test failed cause by Address already in use, It seems to be caused by running multiple tests at the same time |
|
@zuston Could you help me solve the problem of the flaky kerberos test? |
OK. Let me solve this tomorrow. |
|
I think we should test this with some SQL query in integration test. |
|
You're right , I will add some test in integration test. |
Already added integration test, Can you help review? Thank you @jerqi |
|
If the new version client access the old version server, what will happen? Could we fallback to use old way to access the old version server when there are exceptions? |
integration-test/spark3/src/test/java/org/apache/uniffle/test/MockRssShuffleManager.java
Outdated
Show resolved
Hide resolved
I think the new version of the client does not need to be compatible with the old version of the server. |
OK, I make a mistake. Maybe we just guarantee that old client can access the new server. |
| int32 shuffleId = 2; | ||
| repeated int32 partitions = 3; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a new message for response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind
common/src/main/java/org/apache/uniffle/common/util/RssUtils.java
Outdated
Show resolved
Hide resolved
integration-test/spark3/src/test/java/org/apache/uniffle/test/MockRssShuffleManager.java
Outdated
Show resolved
Hide resolved
| return hostName.replaceAll("[\\.-]", "_"); | ||
| } | ||
|
|
||
| public static Map<Integer, Roaring64NavigableMap> shuffleBitmapToPartitionBitmap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need startPartition and endPartition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the partition may not have a block, but we need to fill it to avoid NPE when used, See
for (int partitionId = startPartition; partitionId <= endPartition; partitionId++) {
result.putIfAbsent(partitionId, Roaring64NavigableMap.bitmapOf());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the partition may not have a block, but we need to fill it to avoid NPE when used, See
for (int partitionId = startPartition; partitionId <= endPartition; partitionId++) { result.putIfAbsent(partitionId, Roaring64NavigableMap.bitmapOf()); }
A little weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment isn't resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind
| public static Map<Integer, Roaring64NavigableMap> shuffleBitmapToPartitionBitmap( | ||
| Roaring64NavigableMap shuffleBitmap, int startPartition, int endPartition) { | ||
| Map<Integer, Roaring64NavigableMap> result = Maps.newHashMap(); | ||
| for (int partitionId = startPartition; partitionId <= endPartition; partitionId++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need <= instead of <?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think we need <= here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think we need
<=here.
Sorry, my mistake. We need < here
|
Could you provide some data about benchmark or performance test? |
Ok, I will do some performance tests to try to prove that the performance will improve in some extreme scenarios of AQE |
EnvironmentShuffle Server Num : 5 We measure the performance of get_shuffle_result by the following metrics:
Test ResultsBefore issue_136
After issue_136
SummarizeThe number of interface requests is reduced by 16%, and the total time is reduced by 12.5%. If we assign consecutive partitions to a server, the improvement will be more obvious. |
What's your test case? TPC-DS? |
|
No, this is our production task. |
|
Please resolve the comments left. |
…reversePartitionToServers
|
Can you help review please? @jerqi |
| return hostName.replaceAll("[\\.-]", "_"); | ||
| } | ||
|
|
||
| public static Map<Integer, Roaring64NavigableMap> shuffleBitmapToPartitionBitmap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment isn't resolved.
...tion-test/spark3/src/test/java/org/apache/uniffle/test/GetShuffleReportForMultiPartTest.java
Outdated
Show resolved
Hide resolved
common/src/test/java/org/apache/uniffle/common/util/RssUtilsTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some comments that is not resolved.
|
Can you help review please? @jerqi |
jerqi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you raise a new issue that Coordinator support to assign consecutive partitions to a server? |
No problem, I'll follow up on this. |
###What changes were proposed in this pull request?
For issue #136 , When we use AQE, we may call shuffleWriteClient.getShuffleResult multiple times. But if both partition 1 and partition 2 are on the server A, we call getShuffleResult(partition 1) to get data form server A, and then we call getShuffleResult(partition 2) to get data form server A, it's not necassray. We can get getShuffleResult(partition 1, partition 2) instead.
###Why are the changes needed?
Improve getShuffleResult
###Does this PR introduce any user-facing change?
No
###How was this patch tested?
Added UT