Conversation

@advancedxy
Contributor

@advancedxy advancedxy commented Dec 15, 2022

What changes were proposed in this pull request?

  • update preAllocatedSize atomically
  • add a new configuration to trigger ShuffleBufferManager's flushIfNecessary periodically (see the sketch below)
  • tweak existing UTs
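
For illustration, a minimal sketch of the periodic flush trigger described above; the class and method names here are assumptions rather than the actual ShuffleBufferManager code, and the interval corresponds to the new rss.server.shuffleBufferManager.trigger.flush.interval setting shown later in this thread:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: re-check memory pressure on a fixed schedule so flushes
// no longer depend solely on incoming cacheShuffleData calls.
class PeriodicFlushTrigger {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long intervalMs, Runnable flushIfNecessary) {
    scheduler.scheduleAtFixedRate(flushIfNecessary, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}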

Why are the changes needed?

preAllocatedSize could become negative in production environments, which affects the memory pressure calculation.
This commit should fix #229 and #426.

Does this PR introduce any user-facing change?

  • a new ShuffleServer configuration is introduced

How was this patch tested?

Existing UTs.

@advancedxy
Contributor Author

@zuston @xianjingfeng @jerqi please have a look

@xianjingfeng
Member

I think we can solve the problem with a lock, like a row lock.

ShuffleBuffer buffer = entry.getValue();
long size = buffer.append(spd);
updateSize(size, isPreAllocated);
if (!isPreAllocated) {
Member

isPreAllocated is always true.

Contributor Author

Not exactly... Some UTs pass isPreAllocated as false.

Member

We can modify the UTs.

Contributor Author

We can modify the UTs.

There are quite a few places that would need to be modified. Could you help create a new issue so we can address that in a new PR?

isPreAllocated has always been true since #159.

Contributor Author

@advancedxy advancedxy left a comment

Just updated with a possible solution.
If you have better ideas, they are appreciated. @xianjingfeng @zuston

@codecov-commenter

codecov-commenter commented Dec 16, 2022

Codecov Report

Merging #428 (d479d4b) into master (5321292) will increase coverage by 0.03%.
The diff coverage is 55.55%.

@@             Coverage Diff              @@
##             master     #428      +/-   ##
============================================
+ Coverage     58.62%   58.65%   +0.03%     
- Complexity     1642     1647       +5     
============================================
  Files           199      199              
  Lines         11173    11201      +28     
  Branches        989      996       +7     
============================================
+ Hits           6550     6570      +20     
- Misses         4231     4237       +6     
- Partials        392      394       +2     
Impacted Files Coverage Δ
.../apache/uniffle/common/ShufflePartitionedData.java 52.94% <0.00%> (-28.88%) ⬇️
...pache/uniffle/server/ShuffleServerGrpcService.java 0.80% <0.00%> (-0.01%) ⬇️
.../uniffle/server/buffer/PreAllocatedBufferInfo.java 100.00% <ø> (ø)
.../org/apache/uniffle/server/ShuffleTaskManager.java 75.88% <84.21%> (+1.22%) ⬆️
...a/org/apache/uniffle/server/ShuffleServerConf.java 99.25% <100.00%> (+0.01%) ⬆️
...he/uniffle/server/buffer/ShuffleBufferManager.java 82.28% <100.00%> (-0.47%) ⬇️
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java 90.90% <0.00%> (+1.51%) ⬆️

rss.jetty.corePool.size 64
rss.server.heartbeat.timeout 1
rss.server.write.timeout 2000
rss.server.shuffleBufferManager.trigger.flush.interval 500
Contributor

Add a new line.

Contributor Author

👌

@jerqi jerqi requested a review from xianjingfeng December 22, 2022 02:42
@xianjingfeng
Member

This solution will reduce the preAllocatedSize release speed, especially when we use LOCAL_ORDER. So I think we should do a performance test.

@advancedxy
Contributor Author

This solution will reduce the preAllocatedSize release speed, especially when we use LOCAL_ORDER. So I think we should do a performance test.

You mean the preAllocatedSize isn't decreased on each cacheShuffleData call?
For this new solution, I believe it's possible to do that and still maintain a correct preAllocatedSize. However, I am not sure that would make a big difference here. Could you elaborate a bit more on how LOCAL_ORDER would be influenced?

As for the performance test, do you have any example job for reference?

@xianjingfeng
Member

You mean the preAllocatedSize isn't decreased on each cacheShuffleData call?

Yes

Could you elaborate a bit more on how LOCAL_ORDER would be influenced?

It will reorder the blocks. #293

As for the performance test, do you have any example job for reference?

No special requirements. You can also refer to #293.

@jerqi
Contributor

jerqi commented Dec 22, 2022

I don't think this will be influenced by LOCAL_ORDER. LOCAL_ORDER doesn't sort when we cache data; it sorts when we flush data.

@xianjingfeng
Member

I don't think this will be influenced by LOCAL_ORDER. LOCAL_ORDER doesn't sort when we cache data; it sorts when we flush data.

If memory is not enough, it will trigger a flush.

@xianjingfeng
Member

@advancedxy
Contributor Author

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.

In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?
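
For reference, a rough sketch of how that water-mark check drives flushing, built from the fields quoted above; the surrounding method shape and the flush step are assumptions, not the actual ShuffleBufferManager code:

import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of the memory-pressure check under discussion.
class FlushCheckSketch {
  private final AtomicLong usedMemory = new AtomicLong(0);
  private final AtomicLong preAllocatedSize = new AtomicLong(0);
  private final AtomicLong inFlushSize = new AtomicLong(0);
  private final long highWaterMark;

  FlushCheckSketch(long highWaterMark) {
    this.highWaterMark = highWaterMark;
  }

  void flushIfNecessary() {
    // Pressure only counts data that is cached but neither pre-allocated nor
    // already in flight; a negative preAllocatedSize inflates this value and
    // makes flushes fire earlier than intended.
    long pressure = usedMemory.get() - preAllocatedSize.get() - inFlushSize.get();
    if (pressure > highWaterMark) {
      // pick buffers and hand them to the flush manager (omitted)
    }
  }
}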

@xianjingfeng
Member

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.

In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?

In the original logic, a flush would be triggered when caching data, but in this PR, it will be triggered by another RPC request.

@advancedxy
Contributor Author

As for a test for LOCAL_ORDER, @zuston do you have any suggestions?

@advancedxy
Contributor Author

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.
In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?

In the original logic, a flush would be triggered when caching data, but in this PR, it will be triggered by another RPC request.

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

@advancedxy advancedxy requested a review from zuston December 22, 2022 09:28
@xianjingfeng
Member

xianjingfeng commented Dec 22, 2022

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

  1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems.
  2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has.

In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

@advancedxy
Contributor Author

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems. 2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has. In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

  1. From my understanding of the ShuffleBufferManager code, preAllocatedSize won't back-pressure the shuffle write client; usedMemory does. However, usedMemory doesn't decrease when preAllocatedSize is decreased.
  2. preAllocatedSize may affect the flush behavior, as it's decreased slightly more slowly. As mentioned in the previous reply, it would cause flushes to happen less frequently. LOCAL_ORDER would be affected when caching data only if flushes were triggered more frequently.
  3. If in theory it may cause some performance issues, a simple test may be sufficient, as it requires complex server environments to simulate.

In short, I'm not convinced that this change would make a big difference here or affect LOCAL_ORDER. cc @zuston or @jerqi, do you have any other input on this?

But the behavior is indeed slightly different; I may make an amendment to sync back to the original behavior and make everyone happy.

@jerqi
Contributor

jerqi commented Dec 22, 2022

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems. 2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has. In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

  1. What's the problem that you encountered?
  2. LOCAL_ORDER is an option, and it won't have too much performance impact, because the blocks we sort won't be very many and we only compare the taskId, which isn't a big key.

@xianjingfeng
Member

2. As mentioned in the previous reply, it would cause flushes to happen less frequently

Maybe this is the reason why the client write speed would be affected: it may lead to usedMemory being released slowly. When I did #159, I tried to release memory after caching all data, but I found usedMemory was released slowly. Maybe I am wrong, I am not sure. This change is dangerous; I suggest you test it. Just a suggestion. @jerqi @advancedxy

@zuston
Member

zuston commented Dec 22, 2022

Sorry, I missed this thread. I need some time to go through the commit changelog timeline, and currently I can only give some of my thoughts.

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

LOCAL_ORDER is an option, and it won't have too much performance impact, because the blocks we sort won't be very many and we only compare the taskId, which isn't a big key.

+1

@xianjingfeng
Member

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

@zuston
Member

zuston commented Dec 22, 2022

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

If this is right, maybe we'd better have an abstraction to implement the throttle policy, instead of binding it to the ShuffleBufferManager.

Emm... This expands the scope of this PR. Feel free to discuss more.
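
As a rough illustration of that idea only (the interface and implementation below are hypothetical, not part of this PR or the existing codebase), the throttle policy could be pulled out of ShuffleBufferManager behind a small abstraction:

import java.util.function.LongSupplier;

// Hypothetical abstraction for the "dynamic token bucket" throttle described above.
interface MemoryThrottlePolicy {
  boolean tryAcquire(long size); // reserve size bytes; false if the bucket is exhausted
  void release(long size);       // return previously reserved bytes
}

// Bucket capacity tracks the currently free memory, so it grows and shrinks dynamically.
class FreeMemoryTokenBucket implements MemoryThrottlePolicy {
  private final LongSupplier freeMemory;
  private long reserved = 0;

  FreeMemoryTokenBucket(LongSupplier freeMemory) {
    this.freeMemory = freeMemory;
  }

  @Override
  public synchronized boolean tryAcquire(long size) {
    // One guard covers both acquire and release, mirroring the "global lock
    // when releasing or getting" point above.
    if (reserved + size > freeMemory.getAsLong()) {
      return false;
    }
    reserved += size;
    return true;
  }

  @Override
  public synchronized void release(long size) {
    reserved = Math.max(0, reserved - size);
  }
}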

@jerqi
Contributor

jerqi commented Dec 22, 2022

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

If this is right, maybe we'd better have an abstraction to implement the throttle policy, instead of binding it to the ShuffleBufferManager.

Emm... This expands the scope of this PR. Feel free to discuss more.

It's the hottest path... if we could avoid using a lock, why would we use it?

@zuston
Member

zuston commented Dec 22, 2022

It's the hottest path... if we could avoid using a lock, why would we use it?

Yes, we'd better not, but it's hard to avoid OOM if too many requests come at the same time without a lock.

@advancedxy
Contributor Author

advancedxy commented Dec 22, 2022

Maybe this is the reason why the client write speed would be affected: it may lead to usedMemory being released slowly.

@xianjingfeng ah, yeah, this is a valid concern. We may solve the flush frequency in other ways, but based on this, I will keep the decrease per cacheShuffleData call.

but it's hard to avoid OOM if too many requests come at the same time without a lock.

The preAllocated mechanism should already back-pressure the client side, so there won't be too many requests?
And the coordinator should do better assignments.

@jerqi
Contributor

jerqi commented Dec 22, 2022

It's the hottest path... if we could avoid using a lock, why would we use it?

Yes, we'd better not, but it's hard to avoid OOM if too many requests come at the same time without a lock.

Do we have OOM?

@advancedxy
Contributor Author

Hi @xianjingfeng @jerqi @zuston I have updated the code, please take another look when you have time.

@zuston
Member

zuston commented Dec 23, 2022

Do we have OOM?

Just supposing

@jerqi
Contributor

jerqi commented Dec 23, 2022

Do we have OOM?

Just supposing

Lock isn't the only mechinsm to avoid OOM.

@zuston
Member

zuston commented Dec 23, 2022

After going through the whole process of requiring/releasing the pre-allocated size, I have a question: why is the pre-allocated size added to two vars, ShuffleBufferManager.usedMemory and ShuffleBufferManager.preAllocatedSize, here (https://github.com/apache/incubator-uniffle/blob/53212923328194fafa5ae94aa216ab04f06fbd13/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java#L242-L255)?

And I still can't find the key part of this PR that solves the negative preAllocatedSize. Could you describe it in more detail? @advancedxy

A lock isn't the only mechanism to avoid OOM.

Emm.. Yes. But in the current codebase, the pre-allocation requiring operation is guarded by the global lock, as shown in the code snippet above. @jerqi
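
For context, a simplified sketch of the requiring path being referenced; the names are abbreviated and this is not copied from the linked code, but it suggests why both counters move together: usedMemory charges the overall budget up front, while preAllocatedSize tracks the reserved-but-not-yet-cached portion that the pressure check later subtracts back out.

import java.util.concurrent.atomic.AtomicLong;

class RequireMemorySketch {
  private final long capacity;
  private final AtomicLong usedMemory = new AtomicLong(0);
  private final AtomicLong preAllocatedSize = new AtomicLong(0);

  RequireMemorySketch(long capacity) {
    this.capacity = capacity;
  }

  // The whole check-and-reserve step runs under one lock, so concurrent
  // require requests cannot oversubscribe the budget.
  public synchronized boolean requireMemory(long size, boolean isPreAllocated) {
    if (capacity - usedMemory.get() >= size) {
      usedMemory.addAndGet(size);           // charge the overall budget immediately
      if (isPreAllocated) {
        preAllocatedSize.addAndGet(size);   // remember the reserved-but-not-cached portion
      }
      return true;
    }
    return false;
  }
}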

@jerqi
Copy link
Contributor

jerqi commented Dec 23, 2022

I just don't want to introduce an extra lock.

@advancedxy
Contributor Author

And I still can't find the key part of this PR that solves the negative preAllocatedSize. Could you describe it in more detail? @advancedxy

Sorry, let me add a bit of detail here. Before this PR, there are two places that decrease preAllocatedSize:

  1. sendShuffleData -> one of multiple cacheShuffleData calls -> decreases preAllocatedSize and also removes the require buffer id.
  2. preAllocatedBufferCheck in ShuffleTaskManager, which checks whether a preAllocated buffer is expired or not, then removes it in another check.

Some cases could cause a double release of preAllocatedSize (see the sketch below):

  1. Something goes wrong in one of the cacheShuffleData calls (except the first one), so preAllocatedSize is partially decreased, and then the whole size is decreased again in preAllocatedBufferCheck.
  2. preAllocatedBufferCheck sees a buffer as expired and adds it to removeIds, but the client's sendShuffleData finishes first, so both paths release the size.
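
A minimal sketch of one way to guard against that double release, with simplified names (a map keyed by require buffer id whose removal "claims" the release exactly once):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SingleReleaseSketch {
  private final AtomicLong preAllocatedSize = new AtomicLong(0);
  private final ConcurrentHashMap<Long, Long> requiredBuffers = new ConcurrentHashMap<>();

  void require(long requireBufferId, long size) {
    requiredBuffers.put(requireBufferId, size);
    preAllocatedSize.addAndGet(size);
  }

  // Called from both the cacheShuffleData path and preAllocatedBufferCheck.
  // Whichever caller removes the entry first performs the release; the other
  // sees null and does nothing, so preAllocatedSize cannot go negative.
  void releaseOnce(long requireBufferId) {
    Long size = requiredBuffers.remove(requireBufferId);
    if (size != null) {
      preAllocatedSize.addAndGet(-size);
    }
  }
}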

@advancedxy
Contributor Author

@xianjingfeng do you have other comments?

@advancedxy
Contributor Author

@jerqi @xianjingfeng gently ping.

@advancedxy
Contributor Author

I will wait another half day on this PR. If there are no other objections, I will merge it.

@advancedxy advancedxy merged commit 3aaac3e into apache:master Dec 27, 2022
Successfully merging this pull request may close these issues: Negative value of preAllocatedSize.