Merged ONNX decoder next steps #784

Open · 4 of 6 tasks

fxmarty opened this issue Feb 15, 2023 · 1 comment
Labels: onnx (Related to the ONNX export), onnxruntime (Related to ONNX Runtime)

Comments

fxmarty (Contributor) commented Feb 15, 2023

Feature request

PR #647 was merged, adding support for exporting the decoder without/with past key values as a single merged ONNX file, along with inference support in ORTModelForCausalLM.
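For reference, a minimal sketch of using the merged decoder through ORTModelForCausalLM. The `gpt2` checkpoint and the prompt are purely illustrative, and the export flag name depends on the optimum version (`from_transformers=True` in older releases, `export=True` in newer ones):

```python
# Minimal sketch, assuming optimum[onnxruntime] and transformers are installed.
# ORTModelForCausalLM exports the decoder to ONNX; with the merged export, a
# single ONNX file serves both the first forward pass (no past) and the
# subsequent passes (with past key/values).
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True triggers the ONNX export on the fly from the PyTorch weights.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("ONNX Runtime is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```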

Some key steps still remain:

Motivation

Reduce memory usage

Your contribution

/

fxmarty added the onnxruntime (Related to ONNX Runtime) and onnx (Related to the ONNX export) labels on Feb 15, 2023
fxmarty (Contributor, Author) commented Feb 23, 2023

Hi @un-certainty, yes, if you are using CUDAExecutionProvider, using IO Binding is probably helpful. I don't have a proper benchmark at hand, though.

> Also I wonder, if the caches are preserved on GPU, will it potentially cause a memory explosion? When QPS is high and sequences are long, there will be a lot of intermediate tensors. I'm not sure if this could lead to OOM.

I would say it could, yes.
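For context, a sketch of what enabling IO Binding with the CUDA execution provider looks like. `provider` and `use_io_binding` are `from_pretrained` arguments in optimum's onnxruntime integration; the checkpoint, prompt, and generation length are illustrative:

```python
# Sketch: run the decoder on GPU with IO Binding, which keeps inputs, outputs,
# and past key/values on the device instead of copying them through CPU
# between generation steps. Requires onnxruntime-gpu.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = ORTModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    provider="CUDAExecutionProvider",
    use_io_binding=True,  # defaults to True when running on the CUDA EP
)

inputs = tokenizer("ONNX Runtime is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since the past key/value tensors then stay resident on the GPU across steps, memory use grows with batch size and sequence length, which is exactly the OOM concern raised above.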
