write amplification #69

Open
cpd85 opened this issue May 23, 2022 · 2 comments


cpd85 commented May 23, 2022

I'm noticing that some Spark apps that produce 11 TB of shuffle data on the external shuffle service produce closer to 18 TB of shuffle data on the remote shuffle service. Is some write amplification expected?

hiboyang (Contributor) commented

It may depend on how these metrics are calculated. Remote Shuffle Service does write some extra data for each shuffle record, such as the task attempt ID and partition ID, to track the record. But sometimes the metrics may also be off a little due to serialization or compression.
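To get a rough feel for how much of the gap per-record metadata could explain, here is a back-of-the-envelope sketch. It assumes a simple model in which every record carries a fixed number of extra bytes; the function names, the 100-byte average record size, and the 12-byte overhead are hypothetical illustration values, not numbers taken from the Remote Shuffle Service code.

```python
def amplification(observed_bytes: float, baseline_bytes: float) -> float:
    """Ratio of bytes written by one shuffle backend to bytes written by another."""
    return observed_bytes / baseline_bytes


def amplification_from_overhead(avg_record_bytes: float, overhead_bytes: float) -> float:
    """Expected ratio if every record carries a fixed amount of extra metadata."""
    return (avg_record_bytes + overhead_bytes) / avg_record_bytes


def implied_overhead_per_record(avg_record_bytes: float, ratio: float) -> float:
    """Per-record overhead that would be needed to explain a given ratio."""
    return avg_record_bytes * (ratio - 1.0)


# Figures reported in this issue: 11 TB vs. 18 TB (units cancel out).
ratio = amplification(18.0, 11.0)

# Hypothetical workload: 100-byte records with 12 bytes of tracking metadata each.
modelled = amplification_from_overhead(100.0, 12.0)

# Overhead per 100-byte record that would be needed to reach the observed ratio.
needed = implied_overhead_per_record(100.0, ratio)

print(f"observed ratio: {ratio:.2f}")
print(f"modelled ratio: {modelled:.2f}")
print(f"overhead needed per 100-byte record: {needed:.1f} bytes")
```

Under this model, small records amplify much more than large ones, so the average serialized record size of the workload is worth checking before attributing the whole difference to metadata.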


cpd85 commented May 31, 2022

Got it. It looks like compression isn't supported on the server side at the moment? My workloads tend to stress the SSDs rather than the CPU, so I think they could benefit from compression. I see this class https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/common/Compression.java but as far as I can tell it isn't actually used or configurable.
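As a rough way to check whether compression would pay off for an SSD-bound workload, one could measure how well a sample of shuffle bytes compresses. The sketch below uses Python's standard `zlib` module purely for illustration; it does not use or reflect the `Compression` class linked above, and the helper names and the sample data are hypothetical.

```python
import zlib


def compressed_size(block: bytes, level: int = 6) -> int:
    """Size in bytes of the block after zlib compression at the given level."""
    return len(zlib.compress(block, level))


def savings(block: bytes, level: int = 6) -> float:
    """Fraction of bytes saved by compressing the block (0.0 for an empty block)."""
    if not block:
        return 0.0
    return 1.0 - compressed_size(block, level) / len(block)


# Hypothetical stand-in for serialized shuffle records; real data will
# compress differently, so measure on an actual sample before deciding.
sample = b"key-000123,value-abcdef;" * 4096

print(f"original:   {len(sample)} bytes")
print(f"compressed: {compressed_size(sample)} bytes")
print(f"saved:      {savings(sample):.1%}")
```

If the sample compresses well and the hosts have spare CPU, trading compute for fewer bytes written to SSD is likely worthwhile; if the data is already compressed by Spark on the client side, a second pass would save little.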
