-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add method for patch corpora.Dictionary
based on special tokens
#2200
Add method for patch corpora.Dictionary
based on special tokens
#2200
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @Froskekongen 👍
I'm worried about 'self.dfs` and other internal counters: you don't fill it -> you never can filter this kind of tokens (probably this isn't an issue). Maybe this is OK and you no need this.
Also I'm worried, is this produce any side-effects (by same reason as mentioned before)?
d.patch_with_special_tokens(special_tokens) | ||
self.assertEqual(d.token2id['pad'], 0) | ||
self.assertEqual(d.token2id['space'], 1) | ||
self.assertEqual(len(d.token2id), 7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to extend this test (check also than you can transform document with special tokens)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extension added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't see it, where is doc2bow
call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't obvious to me that this was the indention - to test the doc2bow transformation. I have added tests for that now, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's strange because this is the main method for an apply dictionary to document to get BoW, thanks!
@menshikh-iv: I fixed the logic and updated the tests to also account for this case. For the case of internal counters - it should be OK that special tokens are not accounted for. IMO, we are not interested in the counts of these - we are only requesting specific placeholders for these tokens. |
@piskvorky wdyt about this PR? LGTM, are you agree? |
@menshikh-iv: Are you planning to accept this PR? |
I'd prefer to have the docstring expanded. What is the motivation for this function? Who would want to call it and why? If it's fully backward compatible, we can merge it. It just needs a better description and motivation, so it doesn't die forgotten in obscurity. |
@Froskekongen please do #2200 (comment) and I'll merge current PR. |
…:Froskekongen/gensim into patch_dictionary_based_on_special_tokens
@menshikh-iv: Updated documentation of the function. |
…_based_on_special_tokens
corpora.Dictionary
based on special tokens
>>> dct.patch_with_special_tokens(special_tokens) | ||
>>> print(dct.token2id) | ||
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1} | ||
.. sourcecode:: pycon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's pycon
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's special section name for using flake8
on docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #2192
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, OK. Can you update our Developer wiki page with such tricks?
Also (unrelated to this PR) with the workflow process you currently use to label and manage issues and PRs (the interesting PR
label, time until closed as abandoned, CI tricks etc).
thanks for PR @Froskekongen, congratz with first contribution 🥇 |
>>> | ||
>>> dct.patch_with_special_tokens(special_tokens) | ||
>>> print(dct.token2id) | ||
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv
not critical at all, but id # 5 is lost.
First part of fixing this issue: #2190
This function patches
token2id
in dictionary based on a dict with special tokens and their wanted indices.