AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving #256
Summary

- key problem / workload: ML serving, with
- optimization goal: latency SLO attainment
- configurations to tune:
- scenario: request-response paradigm. Serving environment running in a datacenter, with homogeneous devices.
- technique: Alpa (DP + ILP) + greedy search
- dynamic workload? yes
- multi-tenant? yes; inference of multiple models is done on the same cluster
- implementation: the real system is implemented on top of an existing model-parallel training system, Alpa

Problem and motivation
Model parallelism has been well studied in the throughput-oriented training setting; however, its effect on model serving in latency-sensitive settings remains largely unexplored. Motivation study: [ch3] shows that model parallelism benefits serving multiple models (it reduces serving latency and improves resource utilization in the presence of bursty workloads) through statistical multiplexing, under these assumptions:
[ch3.3] and Fig 9 further analyze the effect of inter-op and intra-op parallelism in terms of throughput / latency:
Problem/challenge: the decision search space is large.

Main ideas and insights
xxxxxxxxx

Solution description
Planning phase: split models into buckets, then split devices into groups, then do placement within each bucket.
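A minimal sketch of the enumeration-plus-greedy placement idea, assuming a simulator function `simulate_slo_attainment` that replays the request trace against a candidate placement. All names here are hypothetical, and memory constraints and the model/device bucketing step are omitted for brevity; this is not the paper's exact algorithm.

```python
from itertools import product

def plan_placement(models, devices, group_sizes, parallel_configs,
                   simulate_slo_attainment):
    """Greedy, simulator-guided placement (hypothetical sketch).

    models: list of model names
    devices: list of device ids
    group_sizes: candidate group sizes (devices per group)
    parallel_configs: candidate model-parallel configs for a group
    simulate_slo_attainment: callable(placement, group_size, config) -> value in [0, 1]
    """
    best_placement, best_score = None, -1.0

    # Enumerate how to partition the cluster into equal-sized device groups
    # and which model-parallel configuration each group should use.
    for group_size, config in product(group_sizes, parallel_configs):
        if len(devices) % group_size != 0:
            continue
        num_groups = len(devices) // group_size
        placement = {g: [] for g in range(num_groups)}  # group -> replicated models

        # Greedily add the (model, group) replica that improves simulated
        # SLO attainment the most, until no addition helps.
        improved = True
        while improved:
            improved = False
            base = simulate_slo_attainment(placement, group_size, config)
            best_delta, best_choice = 0.0, None
            for m, g in product(models, range(num_groups)):
                if m in placement[g]:
                    continue
                trial = {k: v + [m] if k == g else list(v)
                         for k, v in placement.items()}
                delta = simulate_slo_attainment(trial, group_size, config) - base
                if delta > best_delta:
                    best_delta, best_choice = delta, (m, g)
            if best_choice is not None:
                m, g = best_choice
                placement[g].append(m)
                improved = True

        score = simulate_slo_attainment(placement, group_size, config)
        if score > best_score:
            best_score, best_placement = score, (group_size, config, placement)

    return best_placement, best_score
```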
Runtime scheduling: all requests are sent to a centralized controller. The controller dispatches each request to the group with the shortest queue length. Each group manages a first-come-first-served queue. When a group receives a request, it checks whether it can serve the request under the SLO and rejects it if it cannot (see Fig 11).
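A rough sketch of this dispatch-and-admission logic, with hypothetical class and method names (the real controller and per-group queues in AlpaServe are more involved; the admission check here simply estimates queued work from per-model latencies on the group):

```python
import time
from collections import deque

class Group:
    """One model-parallel device group with a FCFS queue (sketch)."""

    def __init__(self, latencies):
        # latencies: hypothetical per-model execution latency (seconds) on this group
        self.latencies = latencies
        self.queue = deque()          # pending (model, deadline) pairs

    def queue_length(self):
        return len(self.queue)

    def try_enqueue(self, model, deadline):
        # Admission control: estimate when this request would finish if
        # appended to the FCFS queue; reject it if that misses the SLO.
        wait = sum(self.latencies[m] for m, _ in self.queue)
        finish = time.time() + wait + self.latencies[model]
        if finish > deadline:
            return False              # would violate the SLO: reject
        self.queue.append((model, deadline))
        return True

class Controller:
    """Centralized controller: dispatch to the shortest queue (sketch)."""

    def __init__(self, groups):
        self.groups = groups

    def dispatch(self, model, slo_seconds):
        deadline = time.time() + slo_seconds
        # Only consider groups that actually host a replica of this model.
        candidates = [g for g in self.groups if model in g.latencies]
        if not candidates:
            return False
        target = min(candidates, key=Group.queue_length)
        return target.try_enqueue(model, deadline)
```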
Important results

Hardware: a cluster with 8 nodes and 64 GPUs; each node has 8 V100 GPUs.

Workloads:
Baselines to compare with:
Results:
Limitations and opportunities for improvement
when doesn't it work?
assumptions?
Closely related work
Follow-up research ideas (Optional)
https://arxiv.org/pdf/2302.11665.pdf