Free mem trigger with all2all for sync trigger eviction
Summary:
X-link: facebookresearch/FBGEMM#2067
X-link: meta-pytorch/torchrec#3442
Previously, KVZCH used the ID_COUNT and MEM_UTIL eviction trigger modes. Both are tricky: it is hard for model engineers to decide what value to use for the ID count or the memory utilization threshold. In addition, the eviction start time drifts out of sync after some time in training, which can cause a large QPS drop during eviction.
This diff adds support for free-memory-triggered eviction. Every N batches, each rank checks how much free memory is left; if free memory is below the threshold, eviction is triggered in all TBEs on all ranks using all reduce. This forces eviction to start at the same time on all ranks.
Reviewed By: emlin
Differential Revision: D83896528
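The trigger logic described in the summary can be illustrated with a small, self-contained sketch. It is not the code from this diff: `RankState`, `local_flag`, `simulated_all_reduce_max`, and `eviction_decisions` are hypothetical names, and the collective is simulated with a plain `max` over a list so the example runs in a single process. In a real distributed job, the reduction would presumably be a collective such as `torch.distributed.all_reduce` with `ReduceOp.MAX` over a one-element flag tensor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankState:
    """Hypothetical per-rank view; not a type from FBGEMM or torchrec."""
    free_mem_gb: float


def local_flag(state: RankState, threshold_gb: float) -> int:
    """Return 1 if this rank's free memory is below the threshold, else 0."""
    return 1 if state.free_mem_gb < threshold_gb else 0


def simulated_all_reduce_max(flags: List[int]) -> int:
    """Stand-in for a MAX all-reduce: every rank ends up with the same value."""
    return max(flags)


def eviction_decisions(
    ranks: List[RankState], threshold_gb: float, step: int, check_interval: int
) -> List[bool]:
    """Return the eviction decision each rank would reach at this step."""
    # All ranks share the step counter, so they all reach the check
    # (and the collective) on the same batch.
    if step % check_interval != 0:
        return [False] * len(ranks)
    flags = [local_flag(r, threshold_gb) for r in ranks]
    agreed = simulated_all_reduce_max(flags)
    # One rank running low is enough to start eviction everywhere.
    return [bool(agreed)] * len(ranks)


if __name__ == "__main__":
    ranks = [RankState(40.0), RankState(8.0), RankState(35.0)]
    print(eviction_decisions(ranks, threshold_gb=10.0, step=100, check_interval=50))
```

Reducing with MAX means a single rank below the threshold is enough to start eviction on every rank at the same step, which is what keeps the start times aligned.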
eviction_policy.eviction_step_intervals, # trigger_step_interval if trigger mode is iteration
  eviction_mem_threshold_gb, # mem_util_threshold_in_GB if trigger mode is mem_util
- self.kv_zch_params.eviction_policy.ttls_in_mins, # ttls_in_mins for each table if eviction strategy is timestamp
- self.kv_zch_params.eviction_policy.counter_thresholds, # counter_thresholds for each table if eviction strategy is counter
- self.kv_zch_params.eviction_policy.counter_decay_rates, # counter_decay_rates for each table if eviction strategy is counter
- self.kv_zch_params.eviction_policy.feature_score_counter_decay_rates, # feature_score_counter_decay_rates for each table if eviction strategy is feature score
- self.kv_zch_params.eviction_policy.training_id_eviction_trigger_count, # training_id_eviction_trigger_count for each table
- self.kv_zch_params.eviction_policy.training_id_keep_count, # training_id_keep_count for each table
- self.kv_zch_params.eviction_policy.l2_weight_thresholds, # l2_weight_thresholds for each table if eviction strategy is feature l2 norm
+ eviction_policy.ttls_in_mins, # ttls_in_mins for each table if eviction strategy is timestamp
+ eviction_policy.counter_thresholds, # counter_thresholds for each table if eviction strategy is counter
+ eviction_policy.counter_decay_rates, # counter_decay_rates for each table if eviction strategy is counter
+ eviction_policy.feature_score_counter_decay_rates, # feature_score_counter_decay_rates for each table if eviction strategy is feature score
+ eviction_policy.training_id_eviction_trigger_count, # training_id_eviction_trigger_count for each table
+ eviction_policy.training_id_keep_count, # training_id_keep_count for each table
+ eviction_policy.l2_weight_thresholds, # l2_weight_thresholds for each table if eviction strategy is feature l2 norm