
Reuse session for 60% speedup #5

Open
samwillis wants to merge 2 commits into main

Conversation

samwillis

Hey @k9p5,

vits-web is really awesome!

I've started trying to speed it up a little. Currently predict() sets up a whole new ort session and loads the model on every call. If instead you split the process in two, so that you can create a "vits-web session" first, you get a 60% speedup on the basic example. This is particularly useful for repeated calls: set up the model once, then call it repeatedly with text chunks to generate more audio (which is what I want to use it for).
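A minimal sketch of the split I have in mind, assuming the @diffusionstudio/vits-web import and voiceId from the README examples; createSession, predict on the session, and dispose are illustrative names, not the final API:

```ts
import * as tts from '@diffusionstudio/vits-web';

// Pay the expensive part once: fetch the voice model and create the
// onnxruntime-web InferenceSession up front.
const session = await tts.createSession({ voiceId: 'en_US-hfc_female-medium' });

// Each subsequent call only runs inference, so it is much faster.
const first = await session.predict({ text: 'Hello from the reused session.' });
const second = await session.predict({ text: 'This call skips the model setup.' });

// Free the model and runtime when the session is no longer needed.
await session.dispose();
```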

As you can see in the video from my machine, it cuts the call time from ~2.7s to ~1.1s, with ~1.7s spent during initialisation of the session. Repeat calls then take only ~1s.

I've not added/modified any tests yet as it makes more sense to run this past you first.

Screen.Recording.2024-07-10.at.19.47.50.mov

k9p5 (Contributor) commented Jul 10, 2024

Thank you very much, good job :) I really did not expect it to make that much of a difference. Am I correct to assume that if you terminate the worker after a result has been generated, the time difference will diminish?

k9p5 (Contributor) commented Jul 10, 2024

I just noticed that you're also holding the model in memory; was that on purpose?

samwillis (Author) commented Jul 11, 2024

Yes. For my use case I want to do TTS progressively, essentially one sentence at a time, and start playing it as soon as the first chunk is available.

Holding the model in memory for the duration of the multi step session is ideal.

I tried the transformers.js TTS and it's far too slow, even when using multiple workers. Your packaging of Sherpa/ONNX/Piper is perfect, and with this change it makes real-time TTS possible in the browser.

This will need some more polish, particularly around disposing of the session after use.

k9p5 (Contributor) commented Jul 11, 2024

Makes sense; it really depends on the use case whether to keep the memory footprint as small as possible (my goal) or to minimise runtime. Effectively, the part that is missing from the library is environment variables. I'm going to copy this strategy from onnx, so you could just set something like tts.env.keepSessionInMemory = true, and maybe this also requires a tts.releaseMemory().
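Something along these lines (all names hypothetical, nothing of this exists yet; the predict call shape follows the current README):

```ts
import * as tts from '@diffusionstudio/vits-web';

// Hypothetical flag: keep the ort session and voice model cached between calls.
tts.env.keepSessionInMemory = true;

// The first call pays the full setup cost; later calls reuse the cached session.
await tts.predict({ text: 'Warm-up call.', voiceId: 'en_US-hfc_female-medium' });
await tts.predict({ text: 'Fast follow-up call.', voiceId: 'en_US-hfc_female-medium' });

// Hypothetical teardown to get the memory footprint back down.
await tts.releaseMemory();
```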

samwillis (Author)

Absolutely, and I think vits-web can cover both use cases well. On the minimum-runtime side, it's also important to be able to start the runtime+model before you need it, not just keep it around for the next call.

In my refactor the original predict function still exists and is just a thin wrapper around the session object, starting it for the single call, then discarding it. The largest part of the memory footprint is the voice model, so I think having a wrapper session object (started with the voiceId) works quite well. It seems to me this API is the most flexible, for example allowing someone to start and use multiple voices at once.

tts.env.keepSessionInMemory = true implies that there wouldn't be a way to start the runtime+model pre-emptively, which is very important for time to first output in some applications. For what I'm experimenting with (real-time voice generation of a streaming response from an LLM), I would want to start the model at application start, then produce voice output as fast as possible when required.
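Concretely, with a session object as sketched above (names still hypothetical), the pattern I'm after looks like this:

```ts
import * as tts from '@diffusionstudio/vits-web';

// At application start: pay the runtime + model cost (~1.7s here) up front.
const sessionPromise = tts.createSession({ voiceId: 'en_US-hfc_female-medium' });

// Later, as each sentence arrives from the LLM stream, only inference runs.
async function speak(sentence: string) {
  const session = await sessionPromise;
  const wav = await session.predict({ text: sentence });
  // hand `wav` to an <audio> element / AudioContext for playback
  return wav;
}
```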

Let me know if you are happy with my approach, and if so I can tidy it up and add tests.

avarayr commented Sep 20, 2024

Hey guys, sorry to revive this thread. Am I right to assume that the recent commit bdf7f36 addresses this issue?

mikebaldry commented Nov 11, 2024

> Hey guys, sorry to revive this thread. Am I right to assume that the recent commit bdf7f36 addresses this issue?

I think it addresses part of the issue, but it will still create a new session on every predict, and I'm not sure loading the imports is the slowest part. The real cost is creating the InferenceSession from the model (and maybe fetching the model blob, though that's probably cached anyway).
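For context, a rough sketch of where the time goes with plain onnxruntime-web (placeholder model bytes and feeds; not vits-web's internals verbatim):

```ts
import * as ort from 'onnxruntime-web';

declare const modelBytes: Uint8Array;            // the cached .onnx voice model
declare const feeds: Record<string, ort.Tensor>; // phoneme ids, lengths, scales

// Creating the InferenceSession parses and initialises the whole model;
// this is the slow step, and today it happens on every predict().
const session = await ort.InferenceSession.create(modelBytes);

// Running an existing session is comparatively cheap, so reusing it
// across predictions is where the speedup comes from.
const results = await session.run(feeds);
```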

This is very promising for me - I would like to create a session, run many predictions against the session, then close the session when it makes sense for me.
