
Generate: low memory tests are flaky #29136

Closed
wants to merge 1 commit

Conversation

@gante (Member) commented Feb 20, 2024

What does this PR do?

As identified by @molbap -- generate tests with the low_memory flag are flaky. The full reason is the same as explained in this comment.

The error likelihood is low (~3%), but it is still quite disruptive for transformers devs.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment


Thanks for working on this!

As mentioned in my comment, I don't think using flaky makes sense here if the reason is the order of operations.

```diff
@@ -1503,6 +1503,7 @@ def test_contrastive_generate_dict_outputs_use_cache(self):
             for output in (output_contrastive, output_generate):
                 self._check_outputs(output, input_ids, model.config, use_cache=True)

+    @is_flaky(description="low_memory changes the order of FP operations, which may result in different tokens.")
```
@amyeroberts (Collaborator)

I don't think @flaky is appropriate here. This flag is for marking tests which occasionally fail due to randomness. Here it will fail for some models because of the change in operations, but that change is deterministic, i.e. this test will still fail. Instead, this should be explicitly skipped for those models it affects.

@gante (Member, Author), Feb 20, 2024

@amyeroberts It is a deterministic property, yes, but one that we can't control unless we control the inputs as well. It is similar to how time measurement tests or equivalence tests are flaky (because the environment is not fully isolated, and because random inputs may trigger an error above the threshold, respectively).

This means that we either control the inputs or add this flag. I have a strong preference for this flag -- otherwise, we may have to manually find a set of inputs that work for ALL models every time we make a modeling change in any language model (this is a mixin test) 👀

@amyeroberts (Collaborator)

My point is that if it's deterministic, this flag won't fix things: it only triggers a re-run n times. Even if there's randomness in the inputs, the difference in the order of operations won't be affected, i.e. I'd expect this to be very flaky for some models and not for others. It doesn't make sense to say the test "passes" for certain models, because the model doesn't actually have this property.
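For reference, a decorator of this kind conceptually just re-runs the test body a few times and only reports a failure if every attempt fails. A minimal sketch (illustration only, not the actual transformers implementation):

```python
# Minimal sketch of a retry-style decorator (illustration only; not the actual
# transformers `is_flaky` implementation). A deterministic failure stays red:
# every attempt fails, and the last error is re-raised.
import functools

def is_flaky(max_attempts: int = 5, description: str = ""):
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as err:
                    last_error = err  # remember the failure and retry
            raise last_error  # all attempts failed
        return wrapper
    return decorator
```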

@gante (Member, Author), Feb 20, 2024

@amyeroberts Let me explain the whole problem better, so we can decide together on how to proceed on this class of tests as a whole 😄 (And sorry for the wall of text. This may be useful for future reference!)

The point of this test is to confirm that passing the low_memory flag, which causes beam search to split a batch into several mini-batches at model inference time so as to save memory, results in roughly the same outputs. The cause of the numerical differences is a matmul-level property we can't control: a batched matmul and several vector-matrix multiplications produce slightly different outputs, due to the different shapes. The source and magnitude of the numerical mismatches are exactly the same as when a past_key_values cache is passed, which also results in slightly different outputs. Ditto for assisted decoding vs. non-assisted decoding and for left-padding vs. no padding (the root cause of the last one is slightly different: the attention masking). In all these cases, one of the following is happening:

  1. the equivalence of using the feature or not is untested (past_key_values); OR
  2. the feature has a test with this decorator (assisted decoding); OR
  3. the feature has a test with an internal loop doing something similar to this decorator (left-padding); OR
  4. the feature has a test that fails occasionally (low_memory).

Ideally, they should all follow the same pattern!

In a perfect world, we'd isolate and test the numerical property: the logits in these cases should be very close to one another ✅ [although we don't have a test for these properties in isolation, and we should!]
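For instance, an isolated check of that property could look roughly like the following (a minimal PyTorch sketch with hypothetical shapes):

```python
# Minimal sketch of the numerical property in isolation (hypothetical shapes):
# a batched matmul and the equivalent per-row vector-matrix products may use
# different reduction orders, so the results can differ by a tiny amount.
import torch

torch.manual_seed(0)
hidden = torch.randn(8, 256)    # e.g. 8 hidden states in one batch
weight = torch.randn(256, 512)  # e.g. a projection / LM head weight

batched = hidden @ weight                            # one batched matmul
per_row = torch.stack([h @ weight for h in hidden])  # 8 vector-matrix products

print((batched - per_row).abs().max())  # often non-zero (backend dependent), but tiny
torch.testing.assert_close(batched, per_row, atol=1e-4, rtol=1e-4)  # "close", not equal
```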

The catch is that these options (low_memory, past_key_values, left-padding, assisted decoding) do other things in generate in addition to triggering these numerical properties. So we should test that a) the resulting outputs are similar; AND b) the rest of the behavior associated with the option is correct (e.g. the reconstructed output tensors, such as scores, are correctly built). The following factors also influence how we write these tests:

  1. Some models have a modified generate (e.g. Whisper), and we want to check that these options work well there;
  2. We can't simply compare that the logits are similar: due to the fluctuation in the logits, different tokens may be selected in a certain generate step, causing all subsequent logits to be vastly different;
  3. Comparing that the logits are similar up to the token differences is also not an option. It may hide a bug;
  4. The test should be a mixin, to facilitate the job of adding new models while ensuring correctness;
  5. Hardcoding an input in the mixin to cherry-pick a situation where these numerical differences do not materialize into different tokens is brittle. We risk having to find a new input every once in a while, as we add new models and make modeling corrections. Stochastic inputs may also enable us to uncover funny situations (a test that has a planned failure rate of 3% failing 5 times should be checked).

Having explained the problem, I believe that repeating the test N times is the only way to ensure the test goals are met, while respecting the influencing factors. It's not that the test is flaky... it's that we have to verify the results of a non-linear operation over an approximation with surrounding additional logic, which is hard.

I'm open to suggestions, as long as we don't limit our ability to detect problems -- if anything, our test suite is small 🤗

Contributor

That's a super interesting discussion. It might be minor, but if @flaky is not the correct "explanation", maybe a new decorator would be justified here, such as @nondeterministic, at least for readability/future documentation purposes?
Second, for this kind of added-randomness test, what do we think about using a non-exact match metric?

By that I mean, if the generation differs slightly, would something like a Levenshtein-based metric kept above some threshold be acceptable? Only for these tests, that is.
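As a rough sketch of that idea (a hypothetical helper comparing generated token ids with a normalized edit-distance threshold; the 10% threshold is an arbitrary assumption):

```python
# Hypothetical sketch of a non-exact match check: accept two generated token
# sequences if their normalized Levenshtein (edit) distance is under a threshold.
def levenshtein(a, b):
    # classic dynamic-programming edit distance over token ids
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def sequences_roughly_match(ids_a, ids_b, threshold=0.1):
    # threshold = fraction of tokens allowed to differ (an assumption)
    distance = levenshtein(list(ids_a), list(ids_b))
    return distance / max(len(ids_a), len(ids_b), 1) <= threshold
```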

@gante (Member, Author)

I'd be up for non-exact match metrics, but I'm not sure how to do it 🤔

These tests use our tiny random models, whose output is gibberish from a human perspective. Their output distribution tends to be flatter as well, which makes them more susceptible to the numerical fluctuations materializing into different tokens 😞 We can see in the tests that the divergence can happen as early as the first generation iteration. Ignoring the tokens, the best I could come up with was to compare the logits up to the divergence. Not even the shapes are required to match unless we remove the EOS token 😅
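For concreteness, comparing "up to the divergence" could look roughly like this (a hypothetical helper over the per-step scores from generate(..., output_scores=True, return_dict_in_generate=True), assuming greedy selection for simplicity):

```python
# Hypothetical helper: compare per-step logits only until the first step where
# the two runs select different tokens (assuming greedy selection for simplicity).
# scores_a / scores_b: tuples of (batch, vocab) tensors, as returned by
# model.generate(..., output_scores=True, return_dict_in_generate=True).
import torch

def assert_close_until_divergence(scores_a, scores_b, atol=1e-3):
    for step_a, step_b in zip(scores_a, scores_b):
        # The inputs for this step were identical up to here, so the logits
        # should still be close even if the selected token is about to flip.
        torch.testing.assert_close(step_a, step_b, atol=atol, rtol=0)
        if not torch.equal(step_a.argmax(dim=-1), step_b.argmax(dim=-1)):
            break  # tokens diverged; later logits are expected to differ widely
```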

I'm strongly biased towards a solution that forces us to compare the whole output somehow, as I've been bitten by poor logic when iterating on these features. I couldn't detect the issue by comparing the model logits up to the divergence nor by checking the results qualitatively, only by comparing the full output against the unmodified case.

@molbap (Contributor)

As discussed offline, adding for traceability: it seems that on main, these tests are not flaky anymore? Or at least, not with a ~3% or 1% error rate; I can't reproduce the failures in a loop on CPU or GPU machines.

@amyeroberts (Collaborator)

@molbap Is this for all models, or just the ones which were previously failing?

@molbap (Contributor)

@amyeroberts I think all of them? Specifically, I ran

```bash
for i in {1..1000}; do
  RUN_SLOW=1 CUDA_VISIBLE_DEVICES= python -m pytest tests/ -k test_contrastive_generate_low_memory >> test_runs.log 2>&1
done
```

and looked at the logs, saw only green there!

@amyeroberts (Collaborator)

Huh, OK. Well, that's both good news and also perplexing. In this case, let's close this PR, and if they crop up again we can figure out how to handle them (and what might be causing this intermittent behaviour).

@gante (Member, Author) commented Feb 21, 2024

@amyeroberts #29109 seems to have fixed most of the issue (this test does compare batched vs unbatched generation under the hood, which the PR linked above fixes)

The root issue of this and other tests being non-deterministic persists, though :) I'm going to close the PR and move the discussion to Slack at a future time :)

@gante gante closed this Feb 21, 2024