Does anyone have rough numbers, on any hardware setup, for the inference runtime of standard models across tasks and various context lengths?
Using vLLM, for just 5 requests at the 131072 context length for NIAH_single_1, I'm currently seeing ~15 minutes on a single A100 for Llama 3.1 8B.
While I continue to experiment with parallelism configs, I'm wondering whether other folks have seen rough order-of-magnitude runtime differences across tasks, context lengths, and hardware setups. A minimal timing sketch of what I'm measuring is below.
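For reference, something along these lines is roughly how I'm timing it. This is only a sketch: the model id, filler prompt, token headroom, and sampling settings are placeholders standing in for the actual NIAH_single_1 inputs, not the benchmark's exact configuration.

```python
import time

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
CTX_LEN = 131072
NUM_REQUESTS = 5

# Single A100; max_model_len sized to the target context window.
llm = LLM(
    model=MODEL,
    max_model_len=CTX_LEN,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
)

# Build a synthetic prompt that roughly fills the context window,
# leaving headroom for generated tokens. Real NIAH_single_1 prompts
# would come from the benchmark's generated data files instead.
tok = AutoTokenizer.from_pretrained(MODEL)
filler_ids = tok.encode("needle in a haystack " * (CTX_LEN // 4),
                        add_special_tokens=False)[: CTX_LEN - 256]
prompt = tok.decode(filler_ids)
prompts = [prompt] * NUM_REQUESTS

params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

print(f"{NUM_REQUESTS} requests at ~{CTX_LEN} ctx tokens: {elapsed:.1f}s total, "
      f"{elapsed / NUM_REQUESTS:.1f}s per request")
```

With this kind of script it would be easy to compare numbers across context lengths (e.g. 4k, 32k, 131k) and hardware, which is the sort of table I was hoping someone already has.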