Manual model warmup to resolve AOT model warmup performance degradation #126
Conversation
Do we need to update unit tests?
Quick questions on the description:
- We set the max pdbs when we start the server; since this value should be within the memory cap (based on a calculation with the devices used), it should not OOM, right?
- Why would a higher actual batch size have very slow detokenization? Could you share some investigation or profiles?
Unit tests do not need to be updated because the change is gated on the `engine.warm` condition.
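For concreteness, a minimal sketch of that kind of gating; aside from the `engine.warm` flag mentioned above, the names are illustrative rather than the project's exact code:

```python
# Hypothetical sketch: warmup only runs when the flag is explicitly enabled,
# so existing unit tests (which leave it unset) keep their current behavior.
if engine.warm:
    params, decode_state = warmup_model(engine, params)  # illustrative helper
```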
Yes, I think storing the compiled graphs from AOT and executing them from AOT is what takes up the memory. We observe the OOM at the generate request.
Yes, you can reference #92 for some of the investigations. I also shared the doc internally.
Did you figure out the root cause of the performance issue and the OOM for AOT?
Root-cause analysis has been attempted. The root cause of the OOM is potentially the added space needed to store the compiled graphs in executables, alongside saving the cache in the compilation cache directory. The performance issue has not been root-caused; it could be suboptimal AOT executables. I can share the investigation offline.
Use manual model warmup instead of the AOT-implemented model warmup, since with AOT we observe performance degradation at higher batch sizes of the maxtext configuration, as mentioned in #92:
We have verified that the detokenizing generate step time remains the same as JetStream's optimal behavior for all batch sizes.
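For illustration, here is a minimal sketch of what a manual model warmup can look like in a JAX serving setup: run a dummy prefill for each bucketed prefill length and a few generate steps once at startup, so JIT compilation happens before the first real request rather than during serving. The engine methods and signatures below are assumptions modeled loosely on JetStream's engine interface, not the exact code in this PR.

```python
import jax
import jax.numpy as jnp

def manual_warmup(engine, params, prefill_lengths, generate_steps=1):
    """Hypothetical manual warmup sketch (method names are illustrative).

    Runs a dummy prefill for each padded prefill length and a few generate
    steps so the JIT-compiled functions are built at server startup.
    """
    decode_state = engine.init_decode_state()
    for length in prefill_lengths:
        # Dummy token ids padded to the bucketed prefill length.
        dummy_tokens = jnp.zeros((length,), dtype=jnp.int32)
        prefix, _ = engine.prefill(
            params=params, padded_tokens=dummy_tokens, true_length=length
        )
        decode_state = engine.insert(prefix, decode_state, slot=0)
    for _ in range(generate_steps):
        decode_state, _ = engine.generate(params, decode_state)
    # Block until all warmup computations finish before taking real traffic.
    jax.block_until_ready(decode_state)
    return decode_state
```

In this sketch the warmup is invoked once at startup for the configured prefill-length buckets, and the compiled functions are then reused unchanged by the real serving path, which is what keeps the generate step time flat across batch sizes.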