Fix generation doctests #30263
Conversation
| token | token string | log probability | probability |
|------:|:-------------|----------------:|------------:|
|   618 | when         |          -2.009 |      13.41% |
|   356 | we           |          -1.859 |      15.58% |
|   460 | can          |          -2.508 |       8.14% |
|   262 | the          |          -1.415 |      24.28% |
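These rows have the shape of a transition-scores table (token id, token string, log probability, probability), and the two score columns are consistent with each other: the probability column is just `exp(log probability)`, e.g. exp(-2.009) ≈ 0.1341, i.e. 13.41%. A minimal sketch of that relationship, using hypothetical logits rather than the real model outputs behind the table:

```python
import math

# Hypothetical logits for a 4-token vocabulary; NOT the model outputs
# behind the table above, just an illustration of the relationship.
logits = [2.0, 1.0, 0.5, -1.0]

# Log-softmax gives the log probability of each token.
log_z = math.log(sum(math.exp(x) for x in logits))
log_probs = [x - log_z for x in logits]

# The probability column is exp(log probability), shown as a percentage.
percentages = [100 * math.exp(lp) for lp in log_probs]

print([round(lp, 3) for lp in log_probs])
print([round(p, 2) for p in percentages])
```

Across a full vocabulary the percentages always sum to 100, which is a quick sanity check when updating doctest numbers by hand.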
These were small numerical precision errors, so I just rewrote those numbers as they are returned.
hmm, that's weird. The CI tests are showing slightly different numerical scores, should I just copy the ones returned by the CI?
Yes, I'd say stick with the CI values, provided they're similar to these. If they're very different, some investigation might be needed, but there are so many small numerical differences that can creep in because of differences in hardware, setup, etc. that it's probably not worth chasing. We've done the same for some model integration tests.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The failing
confirmed all passed, thanks a lot!
Let's not be bothered by `tests_pr_documentation_tests` here. It's likely a CPU vs GPU issue.
Thanks for working on fixing these tests!
The main question is why we're moving the `logits_processor`. It's entirely possible that a user defines their own logits processor which does something similar, and that wouldn't be properly handled in this case. It seems better to handle compatibility between the processors and the candidate generator's logits than to forcibly remove the processor.
```python
for processor in self.logits_processor:
    if type(processor) == MinLengthLogitsProcessor:
        self.main_model_min_length = getattr(processor, "min_length")
        self.logits_processor.remove(processor)
```
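As a side note (mine, not raised in the PR): besides the design concern, removing items from a list while iterating over it can silently skip elements, which makes this pattern fragile on its own. A minimal demonstration:

```python
items = ["a", "b", "b", "c"]

# Removing during iteration: after the first "b" is removed, the iterator's
# index advances past the second "b", which is therefore never inspected.
for item in items:
    if item == "b":
        items.remove(item)

print(items)  # the second "b" survives
```

With a single matching processor this happens to work, but iterating over a copy (`for processor in list(self.logits_processor):`) is the safe form of the pattern.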
hmmm - this seems pretty hacky. Why are we removing the logits processor like this?
This was causing errors because we are overriding the `min_length` and passing it as a kwarg into `generate` on every call. If we do not remove it, there will be two min lengths: one as a processor and one as a kwarg. I am not sure right now how to tackle this another way; I'll give it some thought. I did not think a user could rewrite `MinLengthLogitsProcessor` in their own way.
> I did not think user can rewrite MinLengthLogitsProcessor their own way
Why couldn't they define their own logits processor and pass that in? Do we only allow logits processors defined in transformers?
No, you are right, that should be possible. I'll check
After a bit of digging, I think that removing the `MinLengthLogitsProcessor` is the only solution, because we allow users to pass in the same thing either as a kwarg or as a transformers-provided `LogitsProcessor`. Regardless of how they pass it, the same generation is done, but using kwargs is the more straightforward way. Passing a `LogitsProcessorList` could be enabled only for custom processors, raising warnings/errors if a transformers-provided processor is passed.
@gante wdyt? This would be a breaking change but might ease maintenance in the long run.
I'd rather we just raised an exception if we detect that both the kwarg and the logits processor are set. Otherwise we can get unexpected behaviour where the logits processor's `min_length` is ignored.
This raises a more general question about the responsibilities of logits processors vs. kwargs: it seems there might be future cases where there's overlap.
For what it's worth, I'm not sure how much sense there is in having a min-length logits processor. I'd expect the processors to modify the scores, but not necessarily to control the generation logic.
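A minimal sketch of the raise-an-exception approach, using a stand-in class since this is not the actual transformers implementation:

```python
class MinLengthLogitsProcessor:
    """Stand-in for the transformers class of the same name."""
    def __init__(self, min_length, eos_token_id):
        self.min_length = min_length
        self.eos_token_id = eos_token_id

def validate_min_length_args(min_length_kwarg, logits_processor):
    """Raise if min_length arrives both as a kwarg and as a processor."""
    has_processor = any(
        isinstance(p, MinLengthLogitsProcessor) for p in logits_processor
    )
    if min_length_kwarg is not None and has_processor:
        raise ValueError(
            "min_length was passed both as a generate() kwarg and as a "
            "MinLengthLogitsProcessor instance; please use only one."
        )

# Passing only one of the two is fine:
validate_min_length_args(10, [])
validate_min_length_args(None, [MinLengthLogitsProcessor(10, eos_token_id=2)])
```

The function name and signature here are hypothetical; the point is only that detecting the conflict up front is a few lines, while silently reconciling the two settings is not.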
> For what it's worth - I'm not sure how much sense there is in having a min length logits processor? I'd expect the processors to modify the scores, but not necessarily to control the generation logic.

Agreed here, but I am not sure it's a stopping criterion either.

> This raises a more general question about the responsibilities of logit processors v.s. kwargs - it seems there might be future cases when there's overlap.

Hmm, overlap in the sense that both (kwargs and processors) perform the same change on the logits? I believe that using logits processors is for advanced users and custom cases, so I'm not sure about possible overlaps.
> Agreed here, but I am not sure it's a stopping criteria either.

I think it can be interpreted as a stopping criterion, i.e. if it hasn't been satisfied, the generation shouldn't stop.

> Hmm, overlap in a sense that both (kwargs and processors) perform same change in logits?

Overlap in the sense that they can both have parameters which control the same thing. I'm not sure about changing logits here - is that needed for specifying a min length?

> I believe that using logits processors is for advanced users and custom cases, so not sure about possible overlaps
It's more of a question of whether our API allows it - we should be wary of Hyrum's law!
@amyeroberts @zucchini-nlp Answering a question above
> For what it's worth - I'm not sure how much sense there is in having a min length logits processor? I'd expect the processors to modify the scores, but not necessarily to control the generation logic.

The min length processor does modify the scores: it sets the probability of EOS tokens to 0. Generation always stops when EOS is selected, so we have to modify the probabilities before selecting the token in order to enforce a minimum length. If we didn't modify the probabilities, the generation loop would have to become more complex (if EOS was selected and we had a minimum length, we would need to go back and pick a different token).
Remember also that there can be multiple stopping criteria in addition to the EOS token (maximum length, wall clock time, ...), and that generation stops when one of them is met -- preventing generation from stopping until `min_length` was reached would break this behavior 🤗
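The masking described above can be sketched at the score level (a simplified illustration, not the actual transformers implementation):

```python
import math

def mask_eos_before_min_length(scores, cur_len, min_length, eos_token_id):
    """Make EOS unselectable until min_length tokens have been generated,
    instead of second-guessing an already-selected EOS in the loop."""
    out = list(scores)
    if cur_len < min_length:
        out[eos_token_id] = -math.inf  # probability 0 after softmax
    return out

scores = [0.2, 1.3, 4.0]  # hypothetical logits; token 2 is EOS
masked = mask_eos_before_min_length(scores, cur_len=3, min_length=5, eos_token_id=2)
print(masked)  # EOS can no longer win argmax or be sampled
```

Once `cur_len >= min_length`, the function is a no-op and the other stopping criteria behave exactly as before.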
Regarding the code changes: `MinLengthLogitsProcessor` can only be present in `self.logits_processor` if a user manually passes a `MinLengthLogitsProcessor` instance to `generate` (as opposed to the documented `min_length` or `min_new_tokens` arguments). By being present as a custom processor, the candidate generation process generates more candidates than it should, hurting performance. Note that the overlap itself is okay (two minimum length processors are equivalent to only having the largest minimum length), but the custom `MinLengthLogitsProcessor` is not.
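The overlap claim (two minimum-length constraints are equivalent to keeping only the larger one) can be checked directly with a simplified masking function:

```python
import math

def apply_min_length(scores, cur_len, min_length, eos_token_id):
    # Simplified min-length masking: the EOS logit becomes -inf until
    # min_length tokens have been generated.
    out = list(scores)
    if cur_len < min_length:
        out[eos_token_id] = -math.inf
    return out

scores = [1.0, 2.0, 0.5]  # hypothetical logits; token 2 is EOS
for cur_len in range(12):
    both = apply_min_length(apply_min_length(scores, cur_len, 5, 2), cur_len, 9, 2)
    largest_only = apply_min_length(scores, cur_len, 9, 2)
    assert both == largest_only  # applying both == applying the larger one
```

Masking an already-masked logit is idempotent, so stacking the two constraints changes nothing beyond what the stricter one does alone.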
Having any form of logits processing in assisted generation is not a hard requirement, but rather a performance-enhancing operation -- it forces the assistant model to have the same logits processing as the main model. Sadly, Whisper relies on custom logits processors with assisted generation to secure the speedups announced in our DistilWhisper paper, so we can't simply remove support for them, which would have gotten rid of these (and future related) issues at the cost of a minor throughput hit.
As Amy wrote, exceptions > handling it ourselves. I would raise an exception if either `MinLengthLogitsProcessor` or `MinNewTokensLengthLogitsProcessor` is set as a custom logits processor.
> As Amy wrote, exceptions > handling ourselves. I would raise an exception if either MinLengthLogitsProcessor or MinNewTokensLengthLogitsProcessor are set as custom logits processors.

Yes, raising an error was the second option I had; it just seemed a bit unintuitive from the user's point of view to be able to pass a `LogitsProcessor` and then get an error. Okay, I will raise an exception and change the doctests then :)
@gante @amyeroberts Now we raise an error if a user tries to pass in a `LogitsProcessor` object instead of a kwarg for the assistant model. Tests are passing, except for the GPU vs CPU one, as confirmed by @ydshieh above.
Thanks for fixing and iterating on this!
I guess we still might hit issues if people add in their own logits processor which wouldn't pass the `isinstance` check - but there are infinitely many ways people can make their own classes, so we can't really prepare for all of them!
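The caveat in action: a processor that mimics `MinLengthLogitsProcessor` without subclassing it slips past an `isinstance` check (illustrative stand-in classes, not transformers code):

```python
class MinLengthLogitsProcessor:
    """Stand-in for the transformers class of the same name."""
    def __init__(self, min_length):
        self.min_length = min_length

class MyCustomMinLength:
    """Same attribute and same intent, but no shared base class."""
    def __init__(self, min_length):
        self.min_length = min_length

official = MinLengthLogitsProcessor(5)
custom = MyCustomMinLength(5)

print(isinstance(official, MinLengthLogitsProcessor))  # True: caught
print(isinstance(custom, MinLengthLogitsProcessor))    # False: slips past
```

Subclassing the official processor would be caught, but a duck-typed reimplementation like `MyCustomMinLength` can't be detected this way, which is exactly the residual risk being accepted here.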
Thank you for iterating 🤗
@ydshieh since the failing doctest is okay on your end, would you be able to merge? 🤗 (I don't have permissions to merge with red CI :D )
Yes, I will merge later today. Thank you for all the work/review you 3 have done ❤️
The run I triggered was 2 weeks ago, and since then there has been a commit changing something. Let's merge, as I am kind of confident it still works! (well, I only triggered the doctest) Thanks again!
What does this PR do?
Fixes some doctests that were failing in this run