[DOCS] Added docstring example for EpsilonLogitsWarper #24783 #25378
Conversation
In addition to the comments in the lines below, see also this comment: let's try to build shorter code for the example :)
>>> set_seed(100)
>>> # We can see that the model generates `J. Trump` as the next token
>>> outputs = model.generate(inputs["input_ids"], max_new_tokens=4)
I'd set do_sample=True here, for a direct comparison with the example below :)
You may also want to illustrate a case where this line does not generate the most common case (i.e. does NOT generate J. Trump Jr), where applying epsilon_cutoff would result in generating the common case.
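For illustration, a minimal sketch of that tweak, reusing the model, inputs, and set_seed from the quoted example above (placeholder values, not the wording that ended up in the docstring):
>>> set_seed(100)
>>> # Sample instead of decoding greedily, so this call is directly comparable
>>> # to the epsilon_cutoff call discussed later in the thread
>>> outputs = model.generate(inputs["input_ids"], do_sample=True, max_new_tokens=4)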
Hi @gante: Can you share some explanation for illustrating the second case you mentioned? I am trying to understand: how does a case where ordinary multinomial sampling doesn't generate the most common continuation, while applying epsilon_cutoff does, show that epsilon_cutoff is better at sampling from a variety of tokens? Thanks.
epsilon_cutoff is a variation of top_p/top_k, sharing a common point: it drops some low-probability options according to its algorithm. As such, we should be able to show an example where applying it to a sequence that has a low-probability token results in a modified sequence :)
But hold that thought: the example is a bit too long, and we need to think about a more concise way to showcase this processor.
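As a rough illustration of that filtering behaviour, here is a toy sketch that applies the warper directly to a hand-made distribution instead of going through generate (it assumes EpsilonLogitsWarper is importable from the top-level transformers namespace):
>>> import torch
>>> from transformers import EpsilonLogitsWarper

>>> # Toy scores over a 5-token vocabulary: two likely tokens and three unlikely ones
>>> probs = torch.tensor([[0.55, 0.40, 0.03, 0.015, 0.005]])
>>> scores = torch.log(probs)

>>> # With epsilon=0.05, only tokens whose probability is >= 0.05 keep their scores;
>>> # the rest are set to -inf and can no longer be sampled
>>> warper = EpsilonLogitsWarper(epsilon=0.05)
>>> filtered = warper(torch.tensor([[0]]), scores)
>>> print((filtered == float("-inf")).sum().item())
3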
Force-pushed from a6fb4e3 to 01efbb0
Hi @gante, addressed your comments.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
The example has 40 lines at the moment, which is too long. Our docs should be concise, otherwise users won't bother reading them :)
Let's get a case where we can showcase the processor with 2 generate calls (one with and another without epsilon_cutoff). Note that set_seed needs to be called before each generate call, otherwise we can't show that epsilon_cutoff had an impact (i.e. otherwise the difference could be attributed to sampling randomness, and not to epsilon_cutoff).
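A rough sketch of that structure, with a hypothetical checkpoint, prompt, seed, and epsilon value rather than the ones the final docstring settled on:
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> inputs = tokenizer("A number sequence: 1, 2", return_tensors="pt")

>>> # Without epsilon_cutoff: plain multinomial sampling
>>> set_seed(1)
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=4)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

>>> # Same seed, with epsilon_cutoff: tokens below the probability threshold are dropped
>>> # before sampling, so any change in the output is due to the cutoff, not the seed
>>> set_seed(1)
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=4, epsilon_cutoff=3e-4)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])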
Force-pushed from 01efbb0 to 28400c9
Hi @gante, made the suggested changes.
Thank you for iterating 👍
Left a very small nit! Thanks for working on this! 🤗
>>> # The use of the `epsilon_cutoff` parameter (best performing values between 3e-4 and 9e-4 from the paper mentioned above) generates tokens
>>> # by sampling from a variety of tokens with probabilities greater than or equal to epsilon value. The disadvantage of this sampling is that
>>> # if there are many possible tokens to sample from, the epsilon value has to be very small for sampling to occur from all the possible tokens.
These lines are a bit too long; they should be less than 119 characters.
@ArthurZucker: made the suggested change.
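To illustrate the caveat from the quoted docstring comment above (that epsilon has to be very small when many tokens are plausible), here is a small hypothetical check applying the warper to a broad, noisy distribution; the vocabulary size and epsilon value are just example numbers:
>>> import torch
>>> from transformers import EpsilonLogitsWarper

>>> # Noisy scores over a GPT-2-sized vocabulary: probability mass is spread so thinly
>>> # that almost every token ends up below the 3e-4 threshold
>>> torch.manual_seed(0)
>>> scores = torch.randn(1, 50257)

>>> warper = EpsilonLogitsWarper(epsilon=3e-4)
>>> filtered = warper(torch.tensor([[0]]), scores)
>>> # Only the handful of tokens whose probability reaches 3e-4 survive; the vast
>>> # majority of the vocabulary is masked to -inf
>>> print((~torch.isinf(filtered)).sum().item())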
Force-pushed from 28400c9 to 386801b
@sanjeevk-os Thank you for your contribution 💛
(huggingface#25378)
* [DOCS] Added docstring example for EpsilonLogitsWarper huggingface#24783
* minor code changes based on review comments
* set seed for both generate calls, reduced the example length
* fixed line length under 120 chars
What does this PR do?
See #24783
Added docstring example for EpsilonLogitsWarper
Who can review?
@gante