write amplification #69

cpd85 · 2022-05-23T21:28:50Z

I'm noticing that some Spark apps that produce 11TB of shuffle data on the external shuffle service produce closer to 18TB of shuffle data on Remote Shuffle Service. Is some write amplification expected?

hiboyang · 2022-05-26T04:53:20Z

It may depend on how these metrics are calculated. Remote Shuffle Service does write some extra data for each shuffle record, such as the task attempt id and partition id, to track the record. But sometimes the metrics may also be off a little due to serialization/compression.
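
As a rough illustration of how per-record metadata could translate into write amplification, here is a minimal sketch. The 12-byte overhead (an 8-byte task attempt id plus a 4-byte partition id) is a hypothetical figure chosen for the arithmetic, not a value taken from the Remote Shuffle Service source, and the model ignores any difference in compression or serialization between the two services.

```python
def amplification(avg_record_bytes: float, overhead_bytes: float) -> float:
    """Ratio of bytes written with per-record metadata to bytes written without it."""
    return (avg_record_bytes + overhead_bytes) / avg_record_bytes


def implied_record_size(observed_ratio: float, overhead_bytes: float) -> float:
    """Average record size that would explain observed_ratio for a fixed overhead."""
    if observed_ratio <= 1.0:
        raise ValueError("observed_ratio must be greater than 1")
    return overhead_bytes / (observed_ratio - 1.0)


# Ratio reported in this issue: 18TB on remote shuffle vs 11TB on external shuffle.
observed = 18 / 11

# Hypothetical per-record overhead in bytes (assumption, see the text above).
ASSUMED_OVERHEAD = 12

if __name__ == "__main__":
    size = implied_record_size(observed, ASSUMED_OVERHEAD)
    print(f"observed amplification: {observed:.2f}x")
    print(f"implied average record size: {size:.1f} bytes")
```

Under these assumptions, fixed metadata alone would only account for the full difference if the average record were very small (tens of bytes), so for larger records the rest of the gap would have to come from how the metrics are measured or from compression.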

cpd85 · 2022-05-31T15:17:07Z

Got it. Looks like compression isn't supported on the server side at the moment? My workloads tend to stress the SSD rather than the CPU, so I think they could benefit from compression. I see this class https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/common/Compression.java but as far as I can tell it isn't actually used or configurable.
