[Bug] Fix potential negative preAllocatedSize variable #428
@zuston @xianjingfeng @jerqi please have a look |
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
I think we can solve the problem with a lock, something like a row lock.
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
    ShuffleBuffer buffer = entry.getValue();
    long size = buffer.append(spd);
    updateSize(size, isPreAllocated);
    if (!isPreAllocated) {
isPreAllocated is always true.
Not exactly... Some UTs pass isPreAllocated as false.
We can modify the UTs.
We can modify the UTs.
Quite a few places would need to be modified. Could you help create a new issue so that we can address that in a new PR?
isPreAllocated has always been true since #159.
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
advancedxy
left a comment
I just updated the PR with a possible solution.
If you have better ideas, they would be appreciated. @xianjingfeng @zuston
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
Codecov Report
@@             Coverage Diff              @@
##             master     #428      +/-   ##
============================================
+ Coverage     58.62%   58.65%   +0.03%
- Complexity     1642     1647       +5
============================================
  Files           199      199
  Lines         11173    11201      +28
  Branches        989      996       +7
============================================
+ Hits           6550     6570      +20
- Misses         4231     4237       +6
- Partials        392      394       +2
This commit should fix apache#426 and apache#229
c756069 to f4eef62
    rss.jetty.corePool.size 64
    rss.server.heartbeat.timeout 1
    rss.server.write.timeout 2000
    rss.server.shuffleBufferManager.trigger.flush.interval 500
Add a new line.
👌
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
This solution will slow down the release of preAllocatedSize, especially when we use LOCAL_ORDER. So I think we should do a performance test.
You mean the
As for the performance test, do you have an example job for reference?
Yes
It will reorder the blocks. See #293.
No special requirements. You can also refer to #293.
I don't think this will be influenced by |
If there is not enough memory, it will trigger a flush.
I'm afraid there will be a problem here. It's better to do a simple test. |
If I understand this code correctly,
In that sense, it will have less chance to trigger a flush. Therefore
In the original logic, a flush is triggered when data is cached, but in this PR it is triggered by another RPC request.
As for a test for local_order, @zuston do you have any suggestions? |
Yeah, that's possible... But the flush trigger is not deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect
1.If |
In short, I'm not convinced that this change would make a big difference here or affect local_order. cc @zuston or @jerqi, do you have any other input on this? But the behavior is indeed slightly different, so I may amend the change to sync back to the original behavior and make everyone happy.
Maybe this is the reason why the client write speed would be affected. It may cause usedMemory to be released slowly. When I did #159, I tried releasing memory after caching all the data, but I found that usedMemory was released slowly. Maybe I am wrong; I am not sure. This change is dangerous, so I suggest you test it. Just a suggestion. @jerqi @advancedxy
|
Sorry, I missed this thread. I need some time to go through the commit history, so for now I can only share some of my thoughts. From my side, the mechanism for requiring an allocation looks like the token bucket algorithm; the main difference is that the bucket size depends on free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, which is the key to solving the problem mentioned in the description.
+1 |
+1 |
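For readers following the token-bucket analogy above, here is a minimal sketch of a lock-guarded allocator whose "bucket" is whatever memory is free at the moment of the request. It is only an illustration under assumptions: the class name `LockedPreAllocator` and its methods `require`, `commit`, and `free` are hypothetical and do not correspond to the actual Uniffle code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: all names here are hypothetical, not Uniffle's API.
class LockedPreAllocator {
  private final ReentrantLock lock = new ReentrantLock();
  private final long capacity;    // total buffer budget in bytes
  private long usedMemory;        // bytes held by cached data
  private long preAllocatedSize;  // bytes reserved but not yet written

  LockedPreAllocator(long capacity) {
    this.capacity = capacity;
  }

  // Grants the request only if it fits in the memory that is free right now,
  // which is what makes the bucket size dynamic.
  boolean require(long size) {
    lock.lock();
    try {
      long free = capacity - usedMemory - preAllocatedSize;
      if (size <= 0 || size > free) {
        return false;
      }
      preAllocatedSize += size;
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Converts a reservation into used memory once the data is cached.
  // Refuses to release more than is reserved, so the counter never goes negative.
  boolean commit(long size) {
    lock.lock();
    try {
      if (size <= 0 || size > preAllocatedSize) {
        return false;
      }
      preAllocatedSize -= size;
      usedMemory += size;
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Frees used memory, for example after a flush; clamps at zero.
  void free(long size) {
    lock.lock();
    try {
      usedMemory = Math.max(0, usedMemory - size);
    } finally {
      lock.unlock();
    }
  }

  long preAllocated() {
    lock.lock();
    try {
      return preAllocatedSize;
    } finally {
      lock.unlock();
    }
  }
}
```

Because every read-modify-write happens under the same lock, a release can never observe a stale reservation total. The cost is contention on the write path, which is the concern raised later in this thread.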
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java
If this is right, maybe we'd better have an abstraction to implement the throttle policy, instead of binding it to the
Emm... This expands the scope of this PR. Feel free to discuss more.
It's the hottest path... If we can avoid using a lock, why would we use one?
Yes, we'd better not, but without a lock it's hard to avoid OOM if too many requests come in at the same time.
@xianjingfeng ah, yeah. This is a valid concern. But we may solve flush frequency in other ways. But based on this, I will maintain the decrease per
The preAllocated mechanism should already apply back pressure to the client side? So there won't be too many requests?
Do we have OOM? |
|
Hi @xianjingfeng @jerqi @zuston I have updated the code, please take another look when you have time. |
Just supposing |
A lock isn't the only mechanism to avoid OOM.
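As one illustration of a lock-free alternative, a bounded reservation counter can be maintained with a compare-and-set loop on an `AtomicLong`. This is a sketch under assumptions, not what the PR implements: `CasPreAllocator`, `tryRequire`, and `tryRelease` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: all names here are hypothetical, not Uniffle's API.
class CasPreAllocator {
  private final long capacity;                          // upper bound in bytes
  private final AtomicLong reserved = new AtomicLong(); // currently reserved bytes

  CasPreAllocator(long capacity) {
    this.capacity = capacity;
  }

  // Reserves size bytes without a lock; retries if another thread raced us.
  boolean tryRequire(long size) {
    if (size <= 0) {
      return false;
    }
    while (true) {
      long current = reserved.get();
      if (size > capacity - current) {
        return false; // would exceed the budget
      }
      if (reserved.compareAndSet(current, current + size)) {
        return true;
      }
    }
  }

  // Releases size bytes, refusing any release that would go below zero.
  boolean tryRelease(long size) {
    if (size <= 0) {
      return false;
    }
    while (true) {
      long current = reserved.get();
      if (size > current) {
        return false; // more than is reserved: reject instead of going negative
      }
      if (reserved.compareAndSet(current, current - size)) {
        return true;
      }
    }
  }

  long reserved() {
    return reserved.get();
  }
}
```

The bound check and the update are applied as one atomic step, so concurrent requests cannot jointly overshoot the budget. This only covers a single counter; keeping several counters mutually consistent would still need some other coordination.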
|
After going through the whole process of requiring and releasing the pre-allocated size, I have a question: why is the preAllocatedSize added to two vars of
And I still can't find the key part of this PR that solves the negative preAllocatedSize. Could you help describe it in more detail? @advancedxy
Emm.. Yes. But in the current codebase, the pre-allocation request is guarded by the global lock, as shown in the code snippet above. @jerqi
|
I just don't want to add an extra lock.
Sorry, let me add a bit more detail here. Before this PR, there were two places to decrease
Some cases could cause double release of
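The two comments above are truncated, so the exact code paths are not recoverable from this page. As a generic illustration of why a double release drives a counter negative, and one way to make release idempotent, the sketch below tracks each reservation by id and lets only the first caller that removes the id decrement the counter. All names (`IdempotentRelease`, `require`, `release`) are hypothetical and are not taken from the Uniffle codebase.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: all names here are hypothetical, not Uniffle's API.
class IdempotentRelease {
  private final AtomicLong preAllocatedSize = new AtomicLong();
  private final ConcurrentHashMap<Long, Long> pending = new ConcurrentHashMap<>();
  private final AtomicLong nextId = new AtomicLong();

  // Records a reservation and returns an id that identifies it.
  long require(long size) {
    long id = nextId.incrementAndGet();
    preAllocatedSize.addAndGet(size);
    pending.put(id, size);
    return id;
  }

  // ConcurrentHashMap.remove is atomic, so exactly one caller per id sees the
  // stored size; any later caller gets null and leaves the counter alone.
  boolean release(long id) {
    Long size = pending.remove(id);
    if (size == null) {
      return false; // already released by another code path
    }
    preAllocatedSize.addAndGet(-size);
    return true;
  }

  long preAllocated() {
    return preAllocatedSize.get();
  }
}
```

With a plain counter, two code paths that each subtract the same 100 bytes would leave the total at -100. Here the second `release` call returns false and the total stays at 0.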
|
common/src/main/java/org/apache/uniffle/common/ShufflePartitionedData.java
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
@xianjingfeng do you have other comments? |
|
@jerqi @xianjingfeng gently ping. |
|
I will wait on this PR for another half day. If there are no other objections, I will merge it.
What changes were proposed in this pull request?
preAllocatedSize atomically
Why are the changes needed?
preAllocatedSize could be negative in prod env, and this affects memory pressure calculation. This commit should fix #229 and #426.
Does this PR introduce any user-facing change?
How was this patch tested?
Existing UTs.