Efficiently Scaling Transformer Inference #348
summary

|  |  |
| --- | --- |
| key problem | efficient generative inference for Transformer models (whereas #256 applies generally to all DNN models) |
| workload | large, deep models with tight latency targets and long sequence lengths |
| optimization goal | depends on the requirements of the downstream application |
| configurations to tune | model parallelization: how to partition the Multi-Head Attention / FFN layers of a Transformer block (see the sketch after this table) |
| scenario | datacenter, on TPUs |
| technique | xxxxx |
| dynamic workload? | xxxxx |
| multi-tenant? | xxxxx |
| implementation | xxxxx |
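The "configurations to tune" row is the knob this paper is about. As a minimal, hedged sketch (not the paper's code) of what a partition layout looks like in practice, the JAX snippet below places one FFN weight matrix onto a 2D device mesh under one candidate layout; the mesh axis names "x"/"y", the toy shapes, and the assumption of a multi-device runtime are illustrative choices, not taken from the paper.

```python
# Hedged sketch (not the paper's code): lay one FFN weight over a 2D device mesh.
# Axis names "x"/"y" and the toy shapes are illustrative assumptions.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a multi-device runtime whose device count factors as 2 x N.
devices = np.array(jax.devices()).reshape(2, -1)
mesh = Mesh(devices, axis_names=("x", "y"))

d_model, d_ff = 1024, 4096
w_in = jnp.zeros((d_model, d_ff), dtype=jnp.bfloat16)

# One candidate partition layout: split d_model over "x" and d_ff over "y".
w_in_sharded = jax.device_put(w_in, NamedSharding(mesh, P("x", "y")))
print(w_in_sharded.sharding)  # per-device layout chosen above
```

Which tensor dimension maps to which mesh axis determines the communication collectives each matmul needs, which is exactly the tradeoff the paper formalizes.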
Problem and motivation

- challenges [ch1]
- metrics of the inference job [ch2]
- tradeoff space between latency, throughput, and cost [ch2.1, Fig. 1] (see the cost sketch below)
- problem formulation [ch2.2, 3.1]:
  - model
  - device layout [ch3.1]
  - tensor partition layouts [ch3.1]
  - communication collectives [ch3.1, Figure A.1]
- inference stages: an inference request is executed in a batch of sequences, in two stages: a prefill stage over the input prompt, then an autoregressive generation (decode) stage that emits one token per sequence per step (see the prefill/decode sketch below)
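For the latency/throughput/cost bullet, here is a hedged back-of-envelope helper (a generic approximation, not the paper's exact cost accounting). During decode, each step emits one token per sequence in the batch, so chip-seconds per token fall as the batch grows, while per-token latency is the step latency itself, and steps typically get slower as the batch grows.

```python
# Hedged back-of-envelope for the latency/throughput/cost tradeoff; a generic
# approximation, not the paper's exact accounting. One decode step over `batch`
# sequences on `n_chips` chips, taking `step_s` seconds:
def decode_step_metrics(n_chips: int, batch: int, step_s: float) -> dict:
    tokens_per_step = batch                                  # one new token per sequence
    return {
        "latency_per_token_s": step_s,                       # what an interactive user feels
        "throughput_tok_per_s": tokens_per_step / step_s,    # whole-system throughput
        "chip_seconds_per_token": n_chips * step_s / tokens_per_step,  # cost proxy
    }

# Larger batches amortize the chips (cheaper tokens) but each step typically
# takes longer, hurting latency -- the tradeoff behind the Fig. 1 reference above.
print(decode_step_metrics(n_chips=64, batch=512, step_s=0.030))
```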
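For the inference-stages bullet, a hedged toy sketch (not the paper's model or system) of the two stages: prefill runs one pass over the whole prompt and fills a cache, and decode then loops, emitting one token per step and appending to that cache. The toy "model", vocabulary size, and cache layout are made up for illustration.

```python
# Hedged toy sketch of the two inference stages (not the paper's model/system).
import jax
import jax.numpy as jnp

VOCAB, D = 100, 16
emb = jax.random.normal(jax.random.PRNGKey(0), (VOCAB, D))  # toy "model": embeddings only

def toy_step(token_ids, kv_cache):
    """Pretend Transformer step: returns next-token logits and the grown KV cache."""
    h = emb[token_ids]                                 # [num_tokens, D]
    kv_cache = jnp.concatenate([kv_cache, h], axis=0)  # append to the cache
    logits = kv_cache.mean(axis=0) @ emb.T             # toy mixing over the cache
    return logits, kv_cache

# Prefill: one pass over the full prompt (tokens processed in parallel).
prompt = jnp.array([3, 14, 15, 9, 2])
logits, cache = toy_step(prompt, jnp.zeros((0, D)))

# Decode: autoregressive loop, one new token per step, reusing the cache.
generated = []
for _ in range(8):
    next_tok = jnp.argmax(logits)
    generated.append(int(next_tok))
    logits, cache = toy_step(next_tok[None], cache)
print(generated)
```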
Main ideas and insights
The paper provides a set of engineering principles for how best to partition a model in order to scale Transformer inference.

Solution description
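As one concrete, hedged illustration of such a principle (my sketch, not the paper's implementation): under a 1D weight-stationary layout, the contraction dimension of a matmul is split across devices, each device computes a partial product, and an all-reduce combines the partials. The JAX `shard_map` sketch below shows that pattern; the mesh axis name and toy shapes are assumptions.

```python
# Hedged sketch (not the paper's implementation): a partitioned matmul where the
# contraction dimension is sharded across a 1D mesh axis "x"; each device computes
# a partial product and an all-reduce (psum) combines them.
from functools import partial
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

@partial(shard_map, mesh=mesh,
         in_specs=(P(None, "x"), P("x", None)),  # shard the shared (contraction) dim
         out_specs=P(None, None))                # result replicated on every device
def partitioned_matmul(x_shard, w_shard):
    local = x_shard @ w_shard                    # partial product on this device
    return jax.lax.psum(local, "x")              # all-reduce across mesh axis "x"

x = jnp.ones((8, 1024))      # [batch, d_ff] activations
w = jnp.ones((1024, 512))    # [d_ff, d_model] output-projection weight
print(jax.jit(partitioned_matmul)(x, w).shape)   # (8, 512)
```

Which dimensions are sharded determines which collective (all-reduce, all-gather, reduce-scatter) lands on the critical path; the paper's partitioning strategies differ mainly in that choice.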
Important results
Limitations and opportunities for improvement
Closely related work
Follow-up research ideas (Optional)
https://proceedings.mlsys.org/paper_files/paper/2023/file/523f87e9d08e6071a3bbd150e6da40fb-Paper-mlsys2023.pdf