Fixes lower train metrics when using Keras Masking (SequenceMaskRandom, SequenceMaskLast) #983

gabrielspmoreira · 2023-02-14T04:30:07Z

Fixes #961

Goals ⚽

This PR fixes an issue that caused the metrics reported by model.fit() to be much lower than the ones reported by model.evaluate() when Keras Masking is used.
The bug was observed when comparing training and evaluation metrics of a Transformer example (as described in #961), which uses Keras Masking (SequenceMaskRandom, SequenceMaskLast) to select the items of the sequence used for training / eval.
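
To illustrate why a mask matters for metrics, here is a minimal pure-Python sketch. It is not Merlin code, and it rests on an assumption this PR does not confirm: that a metric which ignores the mask ends up scoring positions that were never meant to be predicted. The `accuracy` helper and the toy values are hypothetical.

```python
def accuracy(predictions, targets, mask=None):
    """Fraction of correct predictions over the positions selected by `mask`.

    Hypothetical helper for illustration only. With mask=None every
    position is scored, which mimics a metric that ignores the mask.
    """
    if mask is None:
        mask = [True] * len(targets)
    selected = [(p, t) for p, t, m in zip(predictions, targets, mask) if m]
    correct = sum(1 for p, t in selected if p == t)
    return correct / len(selected)


# Toy sequence: only the first three positions are selected by the mask.
predictions = [7, 3, 9, 0, 0]
targets = [7, 3, 9, 5, 2]
mask = [True, True, True, False, False]

masked_accuracy = accuracy(predictions, targets, mask)
unmasked_accuracy = accuracy(predictions, targets)
print(masked_accuracy, unmasked_accuracy)
```

In this toy case the unmasked score is lower only because the two unselected positions are counted as errors.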

Implementation Details 🚧

After investigation, I found that the issue was caused by the @tf.function decorator we had on model.train_compute_metrics(). Once a Python condition inside that function was replaced by tf.cond(), the @tf.function decorator could be removed, which fixed the error when using Keras Masking (i.e., setting predictions._keras_mask).
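
The pattern of replacing a Python `if` with tf.cond() can be sketched as follows. This is an illustrative stand-in, not the actual body of model.train_compute_metrics(): the function name `compute_metrics_every_n_steps`, its parameters, and the -1.0 placeholder are assumptions made for the example.

```python
import tensorflow as tf


def compute_metrics_every_n_steps(step, n, predictions, targets):
    """Return batch accuracy on every n-th step and -1.0 on skipped steps.

    Hypothetical function for illustration; it is not decorated with
    @tf.function, yet it still works when traced into a graph.
    """

    def compute():
        # Fraction of positions where the prediction equals the target.
        matches = tf.cast(tf.equal(predictions, targets), tf.float32)
        return tf.reduce_mean(matches)

    def skip():
        # Placeholder returned on steps where metrics are not computed.
        return tf.constant(-1.0)

    # tf.cond takes a boolean tensor, so the branch is chosen correctly
    # both in eager execution and inside a traced graph.
    should_compute = tf.equal(step % n, 0)
    return tf.cond(should_compute, compute, skip)


predictions = tf.constant([1, 2, 3, 4])
targets = tf.constant([1, 2, 0, 4])

# Eager call on a step where metrics are computed.
eager_result = compute_metrics_every_n_steps(tf.constant(4), 2, predictions, targets)

# The same function traced into a graph by a caller.
graph_fn = tf.function(compute_metrics_every_n_steps)
graph_result = graph_fn(tf.constant(4), 2, predictions, targets)
skipped_result = graph_fn(tf.constant(3), 2, predictions, targets)
```

Because the condition is expressed with tf.cond(), the function itself does not need its own decorator to behave the same way in eager and graph mode.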

Testing Details 🔍

Added graph-mode tests to test_train_metrics_steps, to double-check that the logic inside model.train_compute_metrics() that skips steps when computing metrics continues to work in both eager and graph mode.

github-actions · 2023-02-14T04:39:41Z

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-983

gabrielspmoreira · 2023-02-14T11:49:39Z

rerun tests

edknv · 2023-02-14T22:40:53Z

The decorator was added to fix that dataloader issue. There is unsuitability with list columns in the dataloader and adding the decorator fixed it. Do we still have issues with metrics if we use both tf.cond and tf.function?

rnyak · 2023-02-15T15:27:04Z

@gabrielspmoreira I tested the PR and now I am getting more consistent results between model.fit() and model.evaluate().

rnyak · 2023-02-15T17:05:50Z

rerun tests

gabrielspmoreira · 2023-02-15T17:08:46Z

> The decorator was added to fix that dataloader issue. There is unsuitability with list columns in the dataloader and adding the decorator fixed it. Do we still have issues with metrics if we use both tf.cond and tf.function?

Hi Edward. I remember you added some @tf.function decorators to deal with list features.
For this fix, I removed @tf.function only from train_compute_metrics(). According to git blame, I added it myself a while ago so that train metrics could be computed every N steps, in order to speed up training. So I think that @tf.function decorator is not related to your additions to fix the dataloader issue.

edknv

Based on the discussion offline, it sounds like the CI failure is unrelated to tf.function. Please ignore my previous comment.

…m, SequenceMaskLast) (#983)
* Removed @tf.function from train_compute_metrics, as it is not needed and is causing a lower than real accuracy in model.fit() when using preds._keras_mask
* Turning if condition into tf.cond to remove tf.function decorator
* Making the should_compute_train_metrics_for_batch variable True by default

gabrielspmoreira requested a review from rnyak February 14, 2023 04:31

gabrielspmoreira self-assigned this Feb 14, 2023

gabrielspmoreira added the bug Something isn't working label Feb 14, 2023

gabrielspmoreira changed the title ~~Fixes lower train metrics when using Keras Masking~~ Fixes lower train metrics when using Keras Masking (SequenceMaskRandom, SequenceMaskLast) Feb 14, 2023

gabrielspmoreira force-pushed the tf/fix_training_smaller_accuracy branch from 02e35d6 to 0cc59fe Compare February 14, 2023 15:57

viswa-nvidia mentioned this pull request Feb 14, 2023

Fix/Simplify/Optimize the transformer API for causal and masked pre-training approach #981

Closed

8 tasks

gabrielspmoreira added 3 commits February 14, 2023 16:51

Removed @tf.function from train_compute_metrics, as it is not needed and is causing a lower than real accuracy in model.fit() when using preds._keras_mask (e6a6438)

Turning if condition into tf.cond to remove tf.function decorator (5c5624a)

Making the should_compute_train_metrics_for_batch variable True by default (08ff219)

gabrielspmoreira force-pushed the tf/fix_training_smaller_accuracy branch from 0cc59fe to 08ff219 Compare February 14, 2023 19:51

rnyak requested a review from sararb February 15, 2023 15:25

rnyak added this to the Merlin 23.02 milestone Feb 15, 2023

sararb approved these changes Feb 15, 2023

edknv approved these changes Feb 15, 2023

rnyak merged commit c5c0f03 into main Feb 15, 2023

gabrielspmoreira mentioned this pull request Feb 21, 2023

[RMP] Tensorflow support for session based recommendations integration in Merlin NVIDIA-Merlin/Merlin#433

Closed

37 tasks
