-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reserved IDs for token2id
in Dictionary
#2190
Comments
Hi @Froskekongen, so, right now this can be added in different ways:
What's a problem to add reserved tokens after dictionary build & filtered to the end? |
Hi @menshikh-iv, So patching the dictionary is what I do now, which I consider a hack. It would be much better to be able to specify "special" tokens a priori. The easiest way would, perheps, be a convenience method to patch the dictionary after building and filtering, passing a dict with reserved ID-token pairs. Now, if that patching method exists, it should be trivial to add it to the init method and after compactifying. |
@Froskekongen can you propose some interface for this feature and describe the mechanism, how this should work (in case if we want to update dictionary, add more "special" tokens)? CC: @piskvorky through? |
I can submit a PR for this, if you accept PRs from the outside. Would that be acceptable? |
We accept outside PRs, as long as you agree to transfer copyright, Developer instruction. This is needed so we can manage the code and grow the project at our will, without reaching out to all past contributors. |
@menshikh-iv: See the PR. If this is acceptable, it is trivial to accept special tokens as part of the init in the dictionary class, and use this function to patch the dictionary. |
The dictionary now stard indexing words at 0, which may be unwanted for several applications. E.g. in my application, I want the first word to have index 1, and reserve 0 for padding. Similarly, one may want to reserve certain IDs for unknown tokens, special delimiter tokens, etc, that will remain unaffected by, e.g.
filter_extremes
andcompactify
. What I propose, is to be able to supply reserved IDs in Dictionary that will remain unaffected by adding new documents and using different methods on Dictionary.The text was updated successfully, but these errors were encountered: