Conversation

@advancedxy
Contributor

@advancedxy advancedxy commented Dec 15, 2022

What changes were proposed in this pull request?

  • update preAllocatedSize atomically
  • add a new configuration to trigger ShuffleBufferManager's flushIfNecessary periodically (see the sketch below)
  • tweak existing UTs
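
For illustration, a minimal sketch of the periodic flush trigger described above; the class and method names here are assumptions rather than the actual ShuffleBufferManager code, and the interval corresponds to the new rss.server.shuffleBufferManager.trigger.flush.interval setting shown later in this thread:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: re-check memory pressure on a fixed schedule so flushes
// no longer depend solely on incoming cacheShuffleData calls.
class PeriodicFlushTrigger {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long intervalMs, Runnable flushIfNecessary) {
    scheduler.scheduleAtFixedRate(flushIfNecessary, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}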

Why are the changes needed?

preAllocatedSize could become negative in production environments, which affects the memory pressure calculation.
This commit should fix #229 and #426.

Does this PR introduce any user-facing change?

  • a new ShuffleServer configuration is introduced

How was this patch tested?

Existing UTs.

@advancedxy
Contributor Author

@zuston @xianjingfeng @jerqi please have a look

@xianjingfeng
Member

I think we can solve the problem with a lock, like a row lock.

ShuffleBuffer buffer = entry.getValue();
long size = buffer.append(spd);
updateSize(size, isPreAllocated);
if (!isPreAllocated) {
Member

isPreAllocated is always true.

Contributor Author

Not exactly... Some UTs pass isPreAllocated as false.

Member

We can modify the UTs.

Contributor Author

We can modify the UTs.

There are quite a few places that would need to be modified. Could you help create a new issue so we can address that in a new PR?

isPreAllocated has always been true since #159.

Contributor Author

@advancedxy advancedxy left a comment

Just updated with a possible solution.
If you have better ideas, they are appreciated. @xianjingfeng @zuston

@codecov-commenter

codecov-commenter commented Dec 16, 2022

Codecov Report

Merging #428 (d479d4b) into master (5321292) will increase coverage by 0.03%.
The diff coverage is 55.55%.

@@             Coverage Diff              @@
##             master     #428      +/-   ##
============================================
+ Coverage     58.62%   58.65%   +0.03%     
- Complexity     1642     1647       +5     
============================================
  Files           199      199              
  Lines         11173    11201      +28     
  Branches        989      996       +7     
============================================
+ Hits           6550     6570      +20     
- Misses         4231     4237       +6     
- Partials        392      394       +2     
Impacted Files Coverage Δ
.../apache/uniffle/common/ShufflePartitionedData.java 52.94% <0.00%> (-28.88%) ⬇️
...pache/uniffle/server/ShuffleServerGrpcService.java 0.80% <0.00%> (-0.01%) ⬇️
.../uniffle/server/buffer/PreAllocatedBufferInfo.java 100.00% <ø> (ø)
.../org/apache/uniffle/server/ShuffleTaskManager.java 75.88% <84.21%> (+1.22%) ⬆️
...a/org/apache/uniffle/server/ShuffleServerConf.java 99.25% <100.00%> (+0.01%) ⬆️
...he/uniffle/server/buffer/ShuffleBufferManager.java 82.28% <100.00%> (-0.47%) ⬇️
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java 90.90% <0.00%> (+1.51%) ⬆️

rss.jetty.corePool.size 64
rss.server.heartbeat.timeout 1
rss.server.write.timeout 2000
rss.server.shuffleBufferManager.trigger.flush.interval 500
Contributor

Add a new line.

Contributor Author

👌

@jerqi jerqi requested a review from xianjingfeng December 22, 2022 02:42
@xianjingfeng
Member

This solution will reduce the preAllocatedSize release speed, especially when we use LOCAL_ORDER. So I think we should do a performance test.

@advancedxy
Contributor Author

This solution will reduce the preAllocatedSize release speed, especially when we use LOCAL_ORDER. So I think we should do a performance test.

You mean the preAllocatedSize isn't decreased on each cacheShuffleData call?
For this new solution, I believe it's possible to do that and still maintain a correct preAllocatedSize. However, I am not sure that would make a big difference here. Could you elaborate a bit more on how LOCAL_ORDER would be influenced?

As for the performance test, do you have any example job for reference?

@xianjingfeng
Member

You mean the preAllocatedSize isn't decreased on each cacheShuffleData call?

Yes

Could you elaborate a bit more on how LOCAL_ORDER would be influenced?

It will reorder the blocks. #293

As for the performance test, do you have any example job for reference?

No special requirements. You can also refer to #293.

@jerqi
Contributor

jerqi commented Dec 22, 2022

I don't think this will be influenced by LOCAL_ORDER. LOCAL_ORDER doesn't sort when we cache data; it sorts when we flush data.

@xianjingfeng
Member

I don't think this will be influenced by LOCAL_ORDER. LOCAL_ORDER doesn't sort when we cache data; it sorts when we flush data.

If memory is not enough, it will trigger a flush.

@xianjingfeng
Member

@advancedxy
Contributor Author

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.

In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?
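
For reference, a rough sketch of how that water-mark check drives flushing, built from the fields quoted above; the surrounding method shape and the flush step are assumptions, not the actual ShuffleBufferManager code:

import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of the memory-pressure check under discussion.
class FlushCheckSketch {
  private final AtomicLong usedMemory = new AtomicLong(0);
  private final AtomicLong preAllocatedSize = new AtomicLong(0);
  private final AtomicLong inFlushSize = new AtomicLong(0);
  private final long highWaterMark;

  FlushCheckSketch(long highWaterMark) {
    this.highWaterMark = highWaterMark;
  }

  void flushIfNecessary() {
    // Pressure only counts data that is cached but neither pre-allocated nor
    // already in flight; a negative preAllocatedSize inflates this value and
    // makes flushes fire earlier than intended.
    long pressure = usedMemory.get() - preAllocatedSize.get() - inFlushSize.get();
    if (pressure > highWaterMark) {
      // pick buffers and hand them to the flush manager (omitted)
    }
  }
}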

@xianjingfeng
Member

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.

In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?

In the original logic, a flush would be triggered when caching data, but in this PR, it will be triggered by another RPC request.

@advancedxy
Contributor Author

As for a test for LOCAL_ORDER, @zuston do you have any suggestions?

@advancedxy
Contributor Author

if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {

If I understand this code correctly, usedMemory is not updated for the preAllocated buffer. inFlushSize is not related to this PR. preAllocatedSize is still decreased, just slightly more slowly.
In that sense, it will have less chance to trigger a flush. Therefore LOCAL_ORDER is not influenced?

In the original logic, a flush would be triggered when caching data, but in this PR, it will be triggered by another RPC request.

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

@advancedxy advancedxy requested a review from zuston December 22, 2022 09:28
@xianjingfeng
Member

xianjingfeng commented Dec 22, 2022

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

  1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems.
  2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has.

In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

@advancedxy
Contributor Author

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems. 2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has. In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

  1. From my understanding of the ShuffleBufferManager code, preAllocatedSize won't back-pressure the shuffle write client; usedMemory does. However, usedMemory doesn't decrease when preAllocatedSize is decreased.
  2. preAllocatedSize may affect the flush behavior, as it's decreased slightly more slowly. As mentioned in the previous reply, it would cause flushes to happen less frequently. LOCAL_ORDER would be affected when caching data only if flushes were triggered more frequently.
  3. If in theory it may cause some performance issues, a simple test may be sufficient, as it requires complex server environments to simulate.

In short, I'm not convinced that this change would make a big difference here or affect LOCAL_ORDER. cc @zuston or @jerqi, do you have any other input on this?

But the behavior is indeed slightly different; I may make an amendment to sync back to the original behavior and make everyone happy.

@jerqi
Contributor

jerqi commented Dec 22, 2022

Yeah, that's possible... But the flush trigger is non-deterministic anyway; it's triggered by outside requests, which may vary. I still don't get why this kind of change would affect the LOCAL_ORDER optimization or bring a performance penalty here.

1. If the preAllocatedSize release is slow, the client write speed will be affected. When I did #159, I encountered similar problems. 2. LOCAL_ORDER will affect performance when caching data, but I am not sure how much impact it has. In general, there will be a lot of performance impact if this place is not handled well. So I think a simple test is needed.

  1. What's the problem that you encountered?
  2. LOCAL_ORDER is an option, and it won't have too much performance impact, because the blocks we sort won't be very many and we only compare the taskId, which isn't a big key.

@xianjingfeng
Member

2. As mentioned in the previous reply, it would cause flushes to happen less frequently

Maybe this is the reason why the client write speed would be affected: it may lead to usedMemory being released slowly. When I did #159, I tried to release memory after caching all data, but I found usedMemory was released slowly. Maybe I am wrong, I am not sure. This change is dangerous; I suggest you test it. Just a suggestion. @jerqi @advancedxy

@zuston
Member

zuston commented Dec 22, 2022

Sorry, I missed this thread. I need some time to go through the commit changelog timeline, and currently I can only give some of my thoughts.

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

LOCAL_ORDER is an option, and it won't have too much performance impact, because the blocks we sort won't be very many and we only compare the taskId, which isn't a big key.

+1

@xianjingfeng
Member

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

@zuston
Member

zuston commented Dec 22, 2022

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

If this is right, maybe we'd better have an abstraction to implement the throttle policy, instead of binding it to the ShuffleBufferManager.

Emm... This expands the scope of this PR. Feel free to discuss more.
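
As a rough illustration of that idea only (the interface and implementation below are hypothetical, not part of this PR or the existing codebase), the throttle policy could be pulled out of ShuffleBufferManager behind a small abstraction:

import java.util.function.LongSupplier;

// Hypothetical abstraction for the "dynamic token bucket" throttle described above.
interface MemoryThrottlePolicy {
  boolean tryAcquire(long size); // reserve size bytes; false if the bucket is exhausted
  void release(long size);       // return previously reserved bytes
}

// Bucket capacity tracks the currently free memory, so it grows and shrinks dynamically.
class FreeMemoryTokenBucket implements MemoryThrottlePolicy {
  private final LongSupplier freeMemory;
  private long reserved = 0;

  FreeMemoryTokenBucket(LongSupplier freeMemory) {
    this.freeMemory = freeMemory;
  }

  @Override
  public synchronized boolean tryAcquire(long size) {
    // One guard covers both acquire and release, mirroring the "global lock
    // when releasing or getting" point above.
    if (reserved + size > freeMemory.getAsLong()) {
      return false;
    }
    reserved += size;
    return true;
  }

  @Override
  public synchronized void release(long size) {
    reserved = Math.max(0, reserved - size);
  }
}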

@jerqi
Contributor

jerqi commented Dec 22, 2022

From my side, I think the mechanism of requiring allocation looks like a token bucket algorithm; the main difference is that the bucket size depends on the free memory, which makes it a dynamic bucket. Based on this, we should use a global lock when releasing or acquiring, and this is the key to solving the problem mentioned in the description.

+1

If this is right, maybe we'd better have an abstraction to implement the throttle policy, instead of binding it to the ShuffleBufferManager.

Emm... This expands the scope of this PR. Feel free to discuss more.

It's the hottest path... if we could avoid using a lock, why would we use it?

@zuston
Member

zuston commented Dec 22, 2022

It's the hottest path... if we could avoid using a lock, why would we use it?

Yes, we'd better not, but it's hard to avoid OOM if too many requests come at the same time without a lock.

@advancedxy
Contributor Author

advancedxy commented Dec 22, 2022

Maybe this is the reason why the client write speed would be affected: it may lead to usedMemory being released slowly.

@xianjingfeng ah, yeah, this is a valid concern. We may solve the flush frequency in other ways, but based on this, I will keep the decrease per cacheShuffleData call.

but it's hard to avoid OOM if too many requests come at the same time without a lock.

The preAllocated mechanism should already back-pressure the client side, so there won't be too many requests?
And the coordinator should do better assignments.

@jerqi
Contributor

jerqi commented Dec 22, 2022

It's the hottest path... if we could avoid using a lock, why would we use it?

Yes, we'd better not, but it's hard to avoid OOM if too many requests come at the same time without a lock.

Do we have OOM?

@advancedxy
Contributor Author

Hi @xianjingfeng @jerqi @zuston I have updated the code, please take another look when you have time.

@zuston
Member

zuston commented Dec 23, 2022

Do we have OOM?

Just supposing

@jerqi
Contributor

jerqi commented Dec 23, 2022

Do we have OOM?

Just supposing

Lock isn't the only mechinsm to avoid OOM.

@zuston
Member

zuston commented Dec 23, 2022

After going through the whole process of requiring/releasing the pre-allocated size, I have a question: why is the pre-allocated size added to two vars, ShuffleBufferManager.usedMemory and ShuffleBufferManager.preAllocatedSize, here (https://github.com/apache/incubator-uniffle/blob/53212923328194fafa5ae94aa216ab04f06fbd13/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java#L242-L255)?

And I still can't find the key part of this PR that solves the negative preAllocatedSize. Could you describe it in more detail? @advancedxy

A lock isn't the only mechanism to avoid OOM.

Emm.. Yes. But in the current codebase, the pre-allocation requiring operation is guarded by the global lock, as shown in the code snippet above. @jerqi
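
For context, a simplified sketch of the requiring path being referenced; the names are abbreviated and this is not copied from the linked code, but it suggests why both counters move together: usedMemory charges the overall budget up front, while preAllocatedSize tracks the reserved-but-not-yet-cached portion that the pressure check later subtracts back out.

import java.util.concurrent.atomic.AtomicLong;

class RequireMemorySketch {
  private final long capacity;
  private final AtomicLong usedMemory = new AtomicLong(0);
  private final AtomicLong preAllocatedSize = new AtomicLong(0);

  RequireMemorySketch(long capacity) {
    this.capacity = capacity;
  }

  // The whole check-and-reserve step runs under one lock, so concurrent
  // require requests cannot oversubscribe the budget.
  public synchronized boolean requireMemory(long size, boolean isPreAllocated) {
    if (capacity - usedMemory.get() >= size) {
      usedMemory.addAndGet(size);           // charge the overall budget immediately
      if (isPreAllocated) {
        preAllocatedSize.addAndGet(size);   // remember the reserved-but-not-cached portion
      }
      return true;
    }
    return false;
  }
}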

@jerqi
Copy link
Contributor

jerqi commented Dec 23, 2022

I just don't want to introduce an extra lock.

@advancedxy
Contributor Author

And I still can't find the key part of this PR that solves the negative preAllocatedSize. Could you describe it in more detail? @advancedxy

Sorry, let me add a bit of detail here. Before this PR, there are two places that decrease preAllocatedSize:

  1. sendShuffleData -> one of multiple cacheShuffleData calls -> decreases preAllocatedSize and also removes the require buffer id.
  2. preAllocatedBufferCheck in ShuffleTaskManager, which checks whether a preAllocated buffer is expired or not, then removes it in another check.

Some cases could cause a double release of preAllocatedSize (see the sketch below):

  1. Something goes wrong in one of the cacheShuffleData calls (except the first one), so preAllocatedSize is partially decreased, and then the whole size is decreased again in preAllocatedBufferCheck.
  2. preAllocatedBufferCheck sees a buffer as expired and adds it to removeIds, but the client's sendShuffleData finishes first, so both paths release the size.
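
A minimal sketch of one way to guard against that double release, with simplified names (a map keyed by require buffer id whose removal "claims" the release exactly once):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SingleReleaseSketch {
  private final AtomicLong preAllocatedSize = new AtomicLong(0);
  private final ConcurrentHashMap<Long, Long> requiredBuffers = new ConcurrentHashMap<>();

  void require(long requireBufferId, long size) {
    requiredBuffers.put(requireBufferId, size);
    preAllocatedSize.addAndGet(size);
  }

  // Called from both the cacheShuffleData path and preAllocatedBufferCheck.
  // Whichever caller removes the entry first performs the release; the other
  // sees null and does nothing, so preAllocatedSize cannot go negative.
  void releaseOnce(long requireBufferId) {
    Long size = requiredBuffers.remove(requireBufferId);
    if (size != null) {
      preAllocatedSize.addAndGet(-size);
    }
  }
}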

@advancedxy
Contributor Author

@xianjingfeng do you have other comments?

@advancedxy
Contributor Author

@jerqi @xianjingfeng gently ping.

@advancedxy
Contributor Author

I will wait another half day on this PR. If there are no other objections, I will merge it.

@advancedxy advancedxy merged commit 3aaac3e into apache:master Dec 27, 2022
Successfully merging this pull request may close these issues: Negative value of preAllocatedSize.