Miscellaneous fixes to the x-transformers implementation #79

Merged: 15 commits merged into main from feat/model_blowout on Nov 4, 2024

Conversation

@Waino (Collaborator) commented Oct 21, 2024

  • Validation no longer crashes (transposes were missing).
  • Added a distributed component covering the parameters of the TransformerWrapper object, most notably to_logits.
  • Arguments of TransformerWrapper can now be set through the config file.
  • Fixed the contents of state dicts, avoiding duplicate storage of some parameters.
  • Removed some obsolete opts.
  • Stats are now handled correctly both with and without accuracy computation: the type of the initial value is inferred from the preceding stats object (see the sketch after this list).
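
A minimal sketch of the last point, with the stats object modelled as a stand-in dataclass (the field names are assumptions, not the actual mammoth API):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Statistics:  # stand-in for the project's stats object (assumed shape)
        loss: float = 0.0
        n_words: int = 0
        n_correct: Optional[int] = None  # None means accuracy is not computed

    def make_initial_stats(prev: Statistics) -> Statistics:
        # Infer the type of the initial accuracy value from the preceding stats:
        # 0 if accuracy is being reported, None if it is not.
        return Statistics(n_correct=0 if prev.n_correct is not None else None)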

Waino added 9 commits October 7, 2024 12:15
Skip backward if loss is NaN.
Stop training if enough batches are skipped.
The default value must be either zero or None, depending on whether
accuracy is reported or not.
Parameters in the TransformerWrapper, e.g. to_logits, need their own
distributed component and optimizer.
The adapter injection code was causing parameter duplication.

Another issue: to normalize or not to normalize?
We compute a normalization based on either tokens or sents, but never
apply it. The effect can be compensated for using the learning rate, as
long as batches are approximately the same size. Too high learning rates
lead to gradient clipping, which is extra detrimental because each
component is individually clipped.

Clipping deterministically requires one of the following:
- access to gradients for all parameters of the entire model (infeasible)
- component local clipping (current approach)
- communicating a clipping factor across devices (maybe we should do
  this?)
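
A rough sketch of the third option, communicating a clipping factor across devices (the function and its use of torch.distributed here are illustrative, not the current component-local implementation):

    import torch
    import torch.distributed as dist

    def clip_grads_with_global_norm(local_params, max_norm, group=None):
        # Sum of squared gradient norms over the parameters held on this device.
        device = next(p.device for p in local_params if p.grad is not None)
        local_sq = torch.zeros(1, device=device)
        for p in local_params:
            if p.grad is not None:
                local_sq += p.grad.detach().float().norm(2) ** 2
        # Summing across devices yields the global gradient norm, so every
        # device derives the same clipping factor deterministically.
        dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=group)
        total_norm = local_sq.sqrt().item()
        clip_coef = min(1.0, max_norm / (total_norm + 1e-6))
        for p in local_params:
            if p.grad is not None:
                p.grad.mul_(clip_coef)
        return total_norm
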
@TimotheeMickus (Collaborator) left a comment

I mostly have nitpickery of the docstring variety.

Resolved review threads:
  • mammoth/model_builder.py
  • mammoth/distributed/components.py (4 threads, 1 outdated)
  • mammoth/opts.py (2 threads, both outdated)
src,
decoder_input,
src_lengths,
rearrange(src, 't b 1 -> b t'),
Collaborator:

in a perfect world we would just normalize the tensors to batch seq dim across the lib, but this world isn't perfect.

Collaborator (Author):

Indeed. #80
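
For reference, a minimal einops example of the layout conversion in the diff above (the shapes are illustrative):

    import torch
    from einops import rearrange

    src = torch.randint(0, 100, (7, 4, 1))   # (time, batch, 1): the legacy time-major layout
    src_bt = rearrange(src, 't b 1 -> b t')  # (batch, time): the batch-first layout x-transformers expects
    assert src_bt.shape == (4, 7)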

@@ -476,6 +493,9 @@ def _gradient_accumulation(

    try:
        if loss is not None:
            if torch.isnan(loss):
                raise Exception('Loss blowout')
Collaborator:

can we/should we type this?

@Waino (Collaborator, Author) commented Nov 4, 2024:

I think typed exceptions are mainly useful for recoverable problems that you may want to catch somewhere higher up. Blowing out of the training loop is not going to be recoverable.

The typed exception would be useful to separate NaN loss from OOM and other exceptions from backward.

Collaborator:

I express a strong disagreement.

  1. You do catch it (and everything else) on line 519.
  2. If you want a clean break, then make it a break / branching condition. If you want a weird stack-jump behavior, make it clear to the other devs that you're doing a custom hack around a potential problem.
  3. I would also argue that typing helps other devs: e.g., suppose my code crashes because of a typo producing an out-of-bounds error; not being caught by this try/except here would make my stack trace less of a pain.

The minimal thing is to just make it a RuntimeError so as to avoid cases as in 3, but I wouldn't be against finding something more exotic / custom; that way you could except ExoticError to deal with NaN blowouts and except RuntimeError for CUDA OOMs.
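
A minimal sketch of that suggestion (the exception name and counter handling are illustrative; the PR ultimately removes the try/except instead):

    import torch

    class LossBlowoutError(Exception):
        """NaN loss, kept separate from CUDA OOMs (which surface as RuntimeError)."""

    def backward_or_skip(loss, nan_counter, max_nan_batches=10):
        # Sketch: count NaN losses and skip backward for them; let everything
        # else (e.g. CUDA OOM) propagate and remain fatal.
        try:
            if loss is not None:
                if torch.isnan(loss):
                    raise LossBlowoutError('Loss blowout')
                loss.backward()
        except LossBlowoutError:
            nan_counter += 1
            if nan_counter > max_nan_batches:
                raise  # enough batches skipped: stop training
        return nan_counter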

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right.

I already forgot what I was trying to do here. The catch-everything-and-retry-forever was already there from before, and I tried to make it less bad, when I should have just removed it.

I'll remove the try/except, and the --max_nan_batches opt.

@@ -496,10 +516,13 @@ def _gradient_accumulation(
        total_stats.update(batch_stats)
        report_stats.update(batch_stats)
        report_stats.update_task_loss(batch_stats.loss, metadata)

    except Exception:
Collaborator:

I think this is also meant to catch CUDA OOMs, isn't it? In which case your NaN counter isn't valid?

Collaborator (Author):

Yeah, that's right. The current code may allow recovering from a single slightly too large batch by catching the OOM. This is quite unlikely, though, and wasn't intentional.

We could have a typed exception that increments the NaN counter, and let everything else propagate directly. That would make OOMs instantly fatal.

Waino added 6 commits November 4, 2024 10:25
The old transformer encoder classes are still needed for the obsolete
attention bridge. Removal is pending.
This makes the "all" language hack easier to use.
When using this hack to achieve shared embeddings, it is no longer
possible to use src_lang and tgt_lang to distinguish tasks from each
other (because all tasks are just "all-all").
The iterate_tasks.py helper also supports using the task id as a
template variable. Having the real src-tgt pair as the task id, without a
useless train_ prefix, makes file naming less painful.
@Waino force-pushed the feat/model_blowout branch from 223bed4 to c4ff0c1 on November 4, 2024 12:03
@Waino merged commit dd4e1ff into main on Nov 4, 2024
2 checks passed
@Waino deleted the feat/model_blowout branch on November 4, 2024 13:43
@Waino mentioned this pull request on Nov 4, 2024