Shuttle is designed as a standalone service; it does not depend on the Yarn NodeManager external shuffle service. The goal of Shuttle RSS is to convert small, random I/O into sequential I/O, which improves the performance and stability of applications.
Shuttle RSS has four parts: ShuffleMaster, ShuffleWorker, ShuffleWriter, and ShuffleReader. The ShuffleMaster provides the following functionality:
- manage shuffle workers:
  - monitor shuffle worker status and handle status transitions
  - balance load across shuffle workers
  - route applications to shuffle service clusters
- allocate shuffle workers to applications:
  - respond to shuffle worker allocation requests from application drivers
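The allocation step above can be sketched as follows. This is a hypothetical illustration, not Shuttle's actual code: the class, its fields, and the least-loaded selection policy are assumptions standing in for whatever load-balancing strategy the ShuffleMaster really uses.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: the master answers an allocation request by
// picking the least-loaded healthy workers (policy assumed, not from source).
public class WorkerAllocator {
    public record Worker(String host, int load, boolean healthy) {}

    // Return the hosts of the `count` healthy workers with the lowest load.
    public static List<String> allocate(List<Worker> workers, int count) {
        return workers.stream()
                .filter(Worker::healthy)
                .sorted(Comparator.comparingInt(Worker::load))
                .limit(count)
                .map(Worker::host)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Worker> cluster = List.of(
                new Worker("w1", 5, true),
                new Worker("w2", 1, true),
                new Worker("w3", 3, false),   // unhealthy, skipped
                new Worker("w4", 2, true));
        System.out.println(allocate(cluster, 2)); // [w2, w4]
    }
}
```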
The ShuffleWorker provides the following functionality:
- accept data and checksums from map tasks' shuffle writers
- group data and checksums belonging to the same partition and dump them into the same file
- run self-health checks and report status to the ShuffleMaster
- save shuffle data to a distributed file system (HDFS/CFS/S3/Alluxio)
- split large shuffle files into smaller files
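The grouping step can be sketched as below. This is a simplified, hypothetical model: real workers stream to files on the distributed file system, while this sketch only buckets incoming blocks by partition id in memory to show the idea.

```java
import java.util.*;

// Hypothetical sketch of "group same partition data": blocks arriving
// from many map tasks are bucketed by partition id so each partition
// can later be dumped into a single file.
public class PartitionGrouper {
    private final Map<Integer, List<byte[]>> buckets = new HashMap<>();

    public void accept(int partitionId, byte[] data) {
        buckets.computeIfAbsent(partitionId, k -> new ArrayList<>()).add(data);
    }

    // Total bytes buffered for one partition, i.e. the size of its future file.
    public int partitionSize(int partitionId) {
        return buckets.getOrDefault(partitionId, List.of())
                .stream().mapToInt(b -> b.length).sum();
    }

    public static void main(String[] args) {
        PartitionGrouper g = new PartitionGrouper();
        g.accept(0, new byte[]{1, 2});    // from map task 1
        g.accept(1, new byte[]{3});
        g.accept(0, new byte[]{4, 5, 6}); // from map task 2, same partition
        System.out.println(g.partitionSize(0)); // 5
    }
}
```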
The ShuffleWriter provides the following functionality:
- serialize and compress shuffle data, and group data with checksums
- send shuffle packages to shuffle workers through a Netty client
- switch to a new shuffle worker when sending fails
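The failover behavior can be sketched as follows. The class and the round-robin retry order are assumptions for illustration; `send` stands in for the Netty call, which the source does not detail.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical sketch of writer failover: if sending to the current
// worker fails, switch to the next candidate instead of failing the task.
public class FailoverSender {
    private final List<String> workers;
    private int current = 0;

    public FailoverSender(List<String> workers) { this.workers = workers; }

    // Returns the worker that accepted the package; `send` simulates the network call.
    public String sendWithFailover(byte[] pkg, Predicate<String> send) {
        for (int i = 0; i < workers.size(); i++) {
            String worker = workers.get(current);
            if (send.test(worker)) return worker;
            current = (current + 1) % workers.size(); // switch to the next worker
        }
        throw new IllegalStateException("all shuffle workers failed");
    }

    public static void main(String[] args) {
        FailoverSender s = new FailoverSender(List.of("w1", "w2", "w3"));
        // Simulate w1 being down: only w2 and w3 accept packages.
        String used = s.sendWithFailover(new byte[0], w -> !w.equals("w1"));
        System.out.println(used); // w2
    }
}
```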
The ShuffleReader provides the following functionality:
- read shuffle data and checksums from partition shuffle files asynchronously
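The asynchronous read can be sketched with a small thread pool: fetches for upcoming chunks are submitted up front, so later chunks download while earlier ones are consumed. The class and `fetch` callback are hypothetical; the real reader works against partition files on the distributed file system.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.IntFunction;

// Hypothetical sketch of asynchronous partition reads: all chunk fetches
// are submitted eagerly, then results are consumed in order.
public class AsyncReader {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    // `fetch` stands in for reading one chunk of a partition file.
    public List<String> readAll(List<Integer> chunks, IntFunction<String> fetch) {
        List<Future<String>> pending = new ArrayList<>();
        for (int c : chunks) {                           // pre-submit every fetch
            pending.add(pool.submit(() -> fetch.apply(c)));
        }
        List<String> out = new ArrayList<>();
        for (Future<String> f : pending) {               // consume in order
            try {
                out.add(f.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) {
        AsyncReader r = new AsyncReader();
        System.out.println(r.readAll(List.of(0, 1, 2), c -> "chunk-" + c));
        // [chunk-0, chunk-1, chunk-2]
    }
}
```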
The following architecture diagram illustrates how it works:
The basic workflow is as follows:
- the shuffle manager sends a request to the ShuffleMaster to obtain the shuffle workers and configuration to use
- shuffle writers send data and checksums to shuffle workers asynchronously; when all map tasks have finished, the shuffle worker marks the shuffle file as finalized
- shuffle workers group all maps' data and checksums for the same partition and dump them into one file on the distributed file system
- shuffle readers read the partition files and verify the checksums of the data
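The checksum verification in the last step can be sketched as below. CRC32 is an assumption here; the source does not name the checksum algorithm Shuttle uses.

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the reader-side check: each data block travels
// with a checksum, recomputed on read to detect corruption.
public class ChecksumCheck {
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }

    public static void main(String[] args) {
        byte[] block = {1, 2, 3, 4};
        long sum = checksum(block);           // stored alongside the data
        System.out.println(verify(block, sum)); // true
        block[0] = 9;                         // simulate corruption in transit
        System.out.println(verify(block, sum)); // false
    }
}
```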
The design for stability:
- use checksums to verify data from different maps, guaranteeing data correctness
- the shuffle writer switches shuffle workers when sent data is corrupted or times out
- apply flow control to the rate at which map data is sent, protecting the stability of the whole system
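The flow-control idea can be sketched as a simple in-flight window; the permit counter below is an assumed mechanism, not Shuttle's actual implementation.

```java
// Hypothetical sketch of writer-side flow control: a permit counter caps
// how many packages may be in flight at once, so a fast map task cannot
// overwhelm a shuffle worker.
public class FlowControl {
    private int permits;

    public FlowControl(int maxInFlight) { this.permits = maxInFlight; }

    // Returns false when the window is full; the caller backs off and retries.
    public synchronized boolean trySend() {
        if (permits == 0) return false;
        permits--;
        return true;
    }

    // Called when the worker acknowledges receipt, freeing a slot.
    public synchronized void ack() { permits++; }

    public static void main(String[] args) {
        FlowControl fc = new FlowControl(2);
        System.out.println(fc.trySend()); // true
        System.out.println(fc.trySend()); // true
        System.out.println(fc.trySend()); // false: window full
        fc.ack();
        System.out.println(fc.trySend()); // true again
    }
}
```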
The design for performance:
- the shuffle writer batches package data up to a size threshold to reduce the number of packages sent
- the shuffle writer uses direct memory to cache data
- the shuffle writer serializes and compresses data in parallel
- the shuffle reader reads data asynchronously and uses a pre-read mechanism
- shuffle workers process different partitions' data in parallel using a customized threading model
- shuffle workers buffer data to reduce the number of flushes
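The buffering point can be sketched as below: many small writes accumulate in memory and go out as one large sequential write once a threshold is reached, which is how random I/O becomes sequential I/O. The class and threshold are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of worker-side buffering: small blocks accumulate
// and are flushed as one large write, reducing the number of flushes.
public class BufferedDumper {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final int threshold;
    final List<Integer> flushedSizes = new ArrayList<>(); // stands in for file writes

    public BufferedDumper(int threshold) { this.threshold = threshold; }

    public void write(byte[] data) {
        buf.writeBytes(data);
        if (buf.size() >= threshold) flush();
    }

    public void flush() {
        if (buf.size() == 0) return;
        flushedSizes.add(buf.size()); // one sequential write to the file system
        buf.reset();
    }

    public static void main(String[] args) {
        BufferedDumper d = new BufferedDumper(4);
        d.write(new byte[]{1});
        d.write(new byte[]{2, 3});
        d.write(new byte[]{4});   // reaches the threshold: one flush of 4 bytes
        d.flush();                // nothing pending, no extra flush
        System.out.println(d.flushedSizes); // [4]
    }
}
```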