Add optional gate activation histogram logging during eval #641

Aphoh · 2024-06-21T07:34:22Z

The main goal of this PR is to add the ability to log activation statistics of the MLPs of models.

In it's current state, this involves one big, slightly inconvenient change: every model's compute_loss function now returns a tuple of (loss: Array, extras: dict) where extras can contain any auxiliary data to log. Thus all the upstream code had to be modified to accomidate this change.

Currently there's code to measure the activation statistics of llama models during eval only, as computing the histograms is incredibly inefficient on TPUs. For LLaMa 7b, computing the histograms takes roughly 4x as long as the rest of the forward pass. AFAIK there's no faster way to do this, but it's just during eval so 🤷.

The code's a little messy, so some review would be appreciated.

dlwh

OK so I kinda want to not make a whole bunch of changes to the model API just yet, and would rather have a guide on how to hack this in, since these things tend to be special snowflakes.

I also wonder if we just should consider using a debug callback (see e.g. jit_log_metrics) which is a bit gross from a functional purity perspective, but for logging I think it's fine?

dlwh · 2024-06-21T16:35:39Z

src/levanter/models/gemma.py

- activation_function = hf_config.hidden_act
+ # This is the implementation in huggingface
+ # https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/gemma/modeling_gemma.py#L200
+ activation_function = "gelu_pytorch_tanh"


i swore we already did this

dlwh · 2024-06-21T16:37:12Z

src/levanter/eval.py

@@ -123,6 +125,17 @@ def eval_callback(step: StepInfo):
 _join_prefix(prefix, "loading_time"): result.total_eval_loading_time,
 _join_prefix(prefix, "total_time"): time_fn(),
 }
+ if (gate_hist := result.extras.get("gate_hist", None)) is not None:


so i think i'm gonna have a strong preference for

extracting this block (and the part in the loop) into a class (sort of like runningmean)

not actually checking the usage of it in taggedevaluator (or in the models) into main, but instead

making a little guide on how to add it in, since it's something that people want to play with sometimes but kinda adds a bunch of noise

dlwh · 2024-06-21T16:38:10Z

src/levanter/models/llama.py

@@ -193,16 +203,20 @@ def init(
 if isinstance(activation_fn, str):
 activation_fn = ACT2FN[activation_fn]
 act = activation_fn # type: ignore
- return LlamaMlp(gate_proj, up_proj, down_proj, act)
+ get_bins() # initialize bins


dlwh · 2024-06-21T16:39:29Z

src/levanter/eval.py

+ if extras:
+ for key in extras:
+ curr = total_extras.get(key, jnp.zeros_like(extras[key]))
+ total_extras[key] = extras[key] + curr


is summing always going to be the right reduction here?

dlwh · 2024-06-21T16:40:52Z