I'm wondering about the cost of the features mentioned in the TUTEL paper. It looks like the dynamic features, including top-anything routing as well as the dynamic capacity factor, introduce additional overhead. Do you have any analysis of this, especially the extra memory or computation cost and the corresponding model-performance benefit?
Hi. Your question involves two kinds of cost: "model-required cost" and "switching cost".
"Model-required cost" is the cost inherently needed to compute the model, regardless of whether you switched from another parallel configuration. It can usually be estimated as O(|capacity_factor| x |topK| x |model dim settings ..|). Thus, when you change capacity_factor or topK, the model-required cost changes accordingly.
"Switching cost" is the extra cost incurred when activating a change of parallel configuration from one to another (e.g., tensor migration cost, checkpointing cost, program exchange cost, ..).
TUTEL's design ensures the "switching cost" is always zero for any configuration change (aside from warm-up steps, which run only once), while keeping the "model-required cost" exactly as the model requires. For example, if you change the tensor-parallel method or the overlap granularity, the "model-required cost" stays the same; if you double the top-k sparsity, the "model-required cost" doubles as well.
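The scaling above can be sketched with a hypothetical back-of-envelope estimator. Everything here (the function name, the FLOP constants, the ignored gating cost) is illustrative and is not TUTEL's actual API; it only shows how the "model-required cost" grows linearly with top-k and capacity factor:

```python
# Hypothetical estimator of the "model-required cost" of one MoE layer,
# following the O(capacity_factor x top_k x model dims) scaling above.
# Gating/dispatch cost is ignored; names and constants are illustrative.

def moe_flops(tokens: int, model_dim: int, hidden_dim: int,
              top_k: int, capacity_factor: float) -> float:
    """Approximate FLOPs spent in the expert FFNs of one MoE layer."""
    # Each routed token copy passes through a 2-layer expert FFN:
    # two matmuls, each costing ~2 * model_dim * hidden_dim FLOPs.
    routed_tokens = tokens * top_k * capacity_factor
    return routed_tokens * (2 * model_dim * hidden_dim) * 2

base = moe_flops(tokens=4096, model_dim=1024, hidden_dim=4096,
                 top_k=1, capacity_factor=1.0)
doubled_k = moe_flops(tokens=4096, model_dim=1024, hidden_dim=4096,
                      top_k=2, capacity_factor=1.0)
assert doubled_k == 2 * base  # doubling top-k doubles the model-required cost
```

Under this model, switching between parallel configurations adds nothing on top: only the routing parameters (top_k, capacity_factor) and model dimensions enter the estimate.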