Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method for patch corpora.Dictionary based on special tokens #2200

Merged
merged 9 commits into from
Jan 11, 2019
36 changes: 20 additions & 16 deletions gensim/corpora/dictionary.py
Original file line number Diff line number Diff line change
Expand Up @@ -597,13 +597,12 @@ def merge_with(self, other):
def patch_with_special_tokens(self, special_token_dict):
"""Patch token2id and id2token using a dictionary of special tokens.

Use case
--------
When doing sequence modeling (e.g. named entity recognition), one may
want to specify special tokens that behave differently than others.

**Usecase:** when doing sequence modeling (e.g. named entity recognition), one may want to specify
special tokens that behave differently than others.
One example is the "unknown" token, and another is the padding token.
It is usual to set the padding token to have index `0`, and patching
the dictionary with `{'<PAD>': 0}` would be one way to specify this.
It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
would be one way to specify this.

Parameters
----------
Expand All @@ -612,16 +611,21 @@ def patch_with_special_tokens(self, special_token_dict):

Examples
--------
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> special_tokens = {'pad': 0, 'space': 1}
>>> print(dct.token2id)
{'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
.. sourcecode:: pycon
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's pycon?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's special section name for using flake8 on docstrings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #2192

Copy link
Owner

@piskvorky piskvorky Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, OK. Can you update our Developer wiki page with such tricks?

Also (unrelated to this PR) with the workflow process you currently use to label and manage issues and PRs (the interesting PR label, time until closed as abandoned, CI tricks etc).


>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>>
>>> special_tokens = {'pad': 0, 'space': 1}
>>> print(dct.token2id)
{'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
>>>
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
Copy link
Contributor

@horpto horpto Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv
not critical at all, but id # 5 is lost.


"""
possible_ids = []
for token, idx in special_token_dict.items():
Expand Down