Generate: low memory tests are flaky #29136
I don't think `@flaky` is appropriate here. This flag is for marking tests which occasionally fail due to randomness. Here it will fail for some models because of the change in operations, but that change is deterministic, i.e. this test will still fail. Instead, this should be explicitly skipped for the models it affects.
@amyeroberts It is a deterministic property, yes, but one that we can't control unless we control the inputs as well. Similar to how time measurement tests or equivalence tests are flaky (because the environment is not fully isolated, and because random inputs may trigger an error above the threshold, respectively).
This means that we either control the inputs or add this flag. I have a strong preference for this flag -- otherwise, we may have to manually find a set of inputs that work for ALL models every time we make a modeling change in any language model (this is a mixin test) 👀
My point is that if it's deterministic, this flag won't fix things - it only triggers a re-run n times. Even if there's randomness in the inputs, the difference in the order of operations won't be affected, i.e. I'd expect this to be very flaky for some models, and not for others. It doesn't make sense to say it "passes" for certain models, because the model doesn't actually have this property.
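(For concreteness, the re-run behaviour being discussed boils down to something like the sketch below. This is a simplified illustration, not the actual decorator used in the transformers test suite.)

```python
import functools

def flaky(max_attempts: int = 5):
    """Re-run a test up to `max_attempts` times; fail only if every attempt fails.

    Simplified sketch for illustration, not the decorator used in the test suite.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as err:
                    last_error = err
            # A deterministic failure fails every attempt and is re-raised here,
            # so the decorator only hides genuinely random failures.
            raise last_error
        return wrapper
    return decorator
```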
@amyeroberts Let me explain the whole problem better, so we can decide together on how to proceed on this class of tests as a whole 😄 (And sorry for the wall of text. This may be useful for future reference!)
The point of this test is to confirm that passing the `low_memory` flag, which causes beam search to separate a batch into several mini-batches at model inference time so as to save memory, results in roughly the same outputs. The cause of the numerical differences is a matmul-level property we can't control: a batched matmul vs several vector-matrix multiplications will have slightly different outputs, due to the different shapes. Exactly the same thing happens when passing a `past_key_values` cache: the source and magnitude of the numerical mismatches are the same. Ditto for assisted decoding vs non-assisted decoding, and for left-padded vs non-padded inputs (the root cause of the last one is slightly different, the attention masking). In all these cases, one of the following is happening:

1. a matrix multiplication is performed with different operand shapes (e.g. with `past_key_values`); OR
2. one large matrix multiplication is split into several smaller ones (e.g. with `low_memory`).

Ideally, they should have the same pattern! In a perfect world, we'd isolate and test the numerical property: the logits in these cases should be very close to one another ✅ [although we don't have a test for these properties in isolation, and we should!]

The catch is that these options (`low_memory`, `past_key_values`, left-padding, assisted decoding) do other things in `generate` in addition to triggering these numerical properties. So we should test that a) the resulting outputs are similar; AND b) the rest of the behavior associated with the option is correct (e.g. the reconstructed output tensors such as `scores` are correctly built). We also have the following influencing how we write these tests:

1. some models override `generate` (e.g. Whisper), and we want to check that these options work well there;
2. the numerical differences may change the token selected at a given `generate` step, causing all subsequent logits to be vastly different.

Having explained the problem, I believe that repeating the test N times is the only way to ensure the test goals are met, while respecting the influencing factors. It's not that the test is flaky... it's that we have to verify the results of a non-linear operation over an approximation, with surrounding additional logic, which is hard.
I'm open to suggestions, as long as we don't limit our ability to detect problems -- if anything, our test suite is small 🤗
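To make the matmul-level property above concrete, here is a small standalone PyTorch sketch (illustrative only, not part of the test suite): splitting one batched matmul into several vector-matrix products can produce bitwise-different, yet numerically very close, results.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(8, 256)    # a small "batch" of hidden states
weight = torch.randn(256, 512)  # a projection matrix

batched = hidden @ weight  # one batched matmul
split = torch.cat([row.unsqueeze(0) @ weight for row in hidden])  # 8 vector-matrix products

# The two results may not be bitwise identical (the kernel used depends on the shapes),
# but they agree within a small tolerance.
print(torch.equal(batched, split))
print((batched - split).abs().max())
print(torch.allclose(batched, split, atol=1e-5))
```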
That's a super interesting discussion: it might be minor, but if `@flaky` is indeed not the correct "explanation", maybe a new decorator would be justified here, such as `@nondeterministic`, at least for readability/future documentation purposes?

Second, for this kind of added-randomness test, what do we think about using a non-exact match metric?
By that I mean, if the generation differs slightly, would something like a Levenshtein-based metric kept above some threshold be acceptable? Only for these tests, that is.
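As a rough illustration of what such a metric could look like over generated token ids (the helper names and the 0.9 threshold are hypothetical, nothing here exists in the test suite):

```python
def levenshtein(a: list[int], b: list[int]) -> int:
    """Classic dynamic-programming edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def sequences_roughly_match(a: list[int], b: list[int], threshold: float = 0.9) -> bool:
    """True if the normalized similarity between two generations is above `threshold`."""
    if not a and not b:
        return True
    distance = levenshtein(a, b)
    return 1 - distance / max(len(a), len(b)) >= threshold

# e.g. assert sequences_roughly_match(low_memory_ids.tolist(), baseline_ids.tolist())
```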
I'd be up for non-exact match metrics, but I'm not sure how to do it 🤔
These tests use our tiny random models, whose output is gibberish from a human perspective. Their output distribution tends to be flatter as well, which makes them more susceptible to the numerical fluctuations materializing into different tokens 😞 We can see in the tests that the divergence can happen as early as the first generation iteration. Ignoring the tokens, the best I could come up with was to compare the logits up to the divergence. Not even the shapes are required to match, unless we remove the EOS token 😅
I'm strongly biased towards a solution that forces us to compare the whole output somehow, as I've been bitten by poor logic when iterating on these features. I couldn't detect the issue by comparing the model logits up to the divergence nor by checking the results qualitatively, only by comparing the full output against the unmodified case.
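For reference, the "compare the logits up to the divergence" idea looks roughly like the sketch below; all tensor names and the tolerance are hypothetical, and this is not the test's actual code.

```python
import torch

def compare_until_divergence(scores_a, scores_b, tokens_a, tokens_b, atol=1e-3):
    """Compare per-step logits of two generation runs up to the first token mismatch.

    `scores_*` are stacked per-step logits with shape (num_steps, vocab_size);
    `tokens_*` are the corresponding selected token ids.
    """
    num_steps = min(len(tokens_a), len(tokens_b))
    # Find the first generation step where the two runs picked different tokens.
    divergence = next((i for i in range(num_steps) if tokens_a[i] != tokens_b[i]), num_steps)
    # Up to that step, the logits should agree within a small tolerance.
    torch.testing.assert_close(scores_a[:divergence], scores_b[:divergence], atol=atol, rtol=0)
    return divergence
```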
As discussed offline, adding this for traceability: it seems that on main, these tests are not flaky anymore? Or at least, not with a ~3% or 1% error rate - I can't reproduce the failures in a loop on CPU or GPU machines.
@molbap Is this for all models, or just the ones which were previously failing?
@amyeroberts I think all of them? Specifically, I ran
and looked at the logs, saw only green there!
Huh, OK. Well, that's both good news and also perplexing. In this case, let's close this PR, and if they crop up again we can figure out how to handle them (and what might be causing this intermittent behaviour).