I'm wondering about the cost of the features mentioned in the TUTEL paper. It looks like the dynamic features, including top-anything routing as well as the dynamic capacity factor, introduce additional overhead. Do you have any analysis of this, especially the extra memory or computation cost and the corresponding model-performance benefit?
Hi. Your question involves two kinds of cost: "model-required cost" and "switching cost".
"Model-required cost" is the cost inherently needed to compute the model, regardless of whether you switched from another parallel configuration. It can usually be estimated as O(|capacity_factor| x |topK| x |model dim settings ..|). Thus, when you change capacity_factor or topK, the model-required cost changes accordingly.
"Switching cost" is the extra cost incurred when activating a change of parallel configuration from one to another (e.g., tensor migration cost, checkpointing cost, program exchange cost, ..).
TUTEL's design ensures the "switching cost" is always zero for any configuration change (aside from warm-up steps, which run only once), while keeping the "model-required cost" exactly as the model requires. For example, if you change the tensor-parallel method or the overlap granularity, the "model-required cost" stays the same; if you double the top-k sparsity, the "model-required cost" doubles as well.
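The scaling above can be sketched with a hypothetical back-of-envelope estimator. Everything here (the function name, the FLOP constants, the ignored gating cost) is illustrative and is not TUTEL's actual API; it only shows how the "model-required cost" grows linearly with top-k and capacity factor:

```python
# Hypothetical estimator of the "model-required cost" of one MoE layer,
# following the O(capacity_factor x top_k x model dims) scaling above.
# Gating/dispatch cost is ignored; names and constants are illustrative.

def moe_flops(tokens: int, model_dim: int, hidden_dim: int,
              top_k: int, capacity_factor: float) -> float:
    """Approximate FLOPs spent in the expert FFNs of one MoE layer."""
    # Each routed token copy passes through a 2-layer expert FFN:
    # two matmuls, each costing ~2 * model_dim * hidden_dim FLOPs.
    routed_tokens = tokens * top_k * capacity_factor
    return routed_tokens * (2 * model_dim * hidden_dim) * 2

base = moe_flops(tokens=4096, model_dim=1024, hidden_dim=4096,
                 top_k=1, capacity_factor=1.0)
doubled_k = moe_flops(tokens=4096, model_dim=1024, hidden_dim=4096,
                      top_k=2, capacity_factor=1.0)
assert doubled_k == 2 * base  # doubling top-k doubles the model-required cost
```

Under this model, switching between parallel configurations adds nothing on top: only the routing parameters (top_k, capacity_factor) and model dimensions enter the estimate.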