Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reserved IDs for token2id in Dictionary #2190

Open
Froskekongen opened this issue Sep 20, 2018 · 6 comments
Open

Allow reserved IDs for token2id in Dictionary #2190

Froskekongen opened this issue Sep 20, 2018 · 6 comments
Labels
feature Issue described a new feature

Comments

@Froskekongen
Copy link
Contributor

The dictionary now stard indexing words at 0, which may be unwanted for several applications. E.g. in my application, I want the first word to have index 1, and reserve 0 for padding. Similarly, one may want to reserve certain IDs for unknown tokens, special delimiter tokens, etc, that will remain unaffected by, e.g. filter_extremes and compactify. What I propose, is to be able to supply reserved IDs in Dictionary that will remain unaffected by adding new documents and using different methods on Dictionary.

@menshikh-iv menshikh-iv added feature Issue described a new feature need info Not enough information for reproduce an issue, need more info from author labels Sep 20, 2018
@menshikh-iv
Copy link
Contributor

Hi @Froskekongen, so, right now this can be added in different ways:

  1. manual patching Dictionary after building
  2. add padding token to the corpus in first place

What's a problem to add reserved tokens after dictionary build & filtered to the end?

@Froskekongen
Copy link
Contributor Author

Froskekongen commented Sep 25, 2018

Hi @menshikh-iv,

So patching the dictionary is what I do now, which I consider a hack. It would be much better to be able to specify "special" tokens a priori.

The easiest way would, perheps, be a convenience method to patch the dictionary after building and filtering, passing a dict with reserved ID-token pairs.

Now, if that patching method exists, it should be trivial to add it to the init method and after compactifying.

@menshikh-iv
Copy link
Contributor

@Froskekongen can you propose some interface for this feature and describe the mechanism, how this should work (in case if we want to update dictionary, add more "special" tokens)?

CC: @piskvorky through?

@menshikh-iv menshikh-iv removed the need info Not enough information for reproduce an issue, need more info from author label Sep 26, 2018
@Froskekongen
Copy link
Contributor Author

I can submit a PR for this, if you accept PRs from the outside. Would that be acceptable?

@piskvorky
Copy link
Owner

piskvorky commented Sep 26, 2018

We accept outside PRs, as long as you agree to transfer copyright, Developer instruction. This is needed so we can manage the code and grow the project at our will, without reaching out to all past contributors.

@Froskekongen
Copy link
Contributor Author

@menshikh-iv: See the PR. If this is acceptable, it is trivial to accept special tokens as part of the init in the dictionary class, and use this function to patch the dictionary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issue described a new feature
Projects
None yet
Development

No branches or pull requests

3 participants