Add method for patch `corpora.Dictionary` based on special tokens #2200

Froskekongen · 2018-09-27T08:17:05Z

First part of fixing issue #2190.

This function patches `token2id` in the dictionary based on a dict that maps special tokens to their wanted indices.
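To make the description concrete, here is a minimal pure-Python sketch of the patching idea. It is not the code from this PR: `patch_token2id` is a hypothetical helper that works on a plain dict, and the policy for relocating a displaced token (reuse the id the special token vacated, otherwise take the next unused id) is an assumption chosen for illustration.

```python
def patch_token2id(token2id, special_tokens):
    """Return a copy of `token2id` in which every special token has its wanted id.

    Hypothetical helper for illustration only; it does not touch any
    document-frequency counters.
    """
    patched = dict(token2id)
    id2token = {idx: tok for tok, idx in patched.items()}

    for token, wanted in special_tokens.items():
        current = patched.get(token)
        if current == wanted:
            continue  # already in place, nothing to do
        if current is not None:
            # The special token exists under another id: free that id.
            del id2token[current]

        displaced = id2token.get(wanted)  # token currently holding the wanted id
        patched[token] = wanted
        id2token[wanted] = token

        if displaced is not None:
            # Assumed policy: reuse the freed id if there is one,
            # otherwise append the displaced token at the next unused id.
            new_id = current if current is not None else max(id2token) + 1
            patched[displaced] = new_id
            id2token[new_id] = displaced

    return patched


token2id = {"máma": 0, "mele": 1, "maso": 2, "ema": 3, "má": 4}
patched = patch_token2id(token2id, {"pad": 0, "space": 1})
print(patched["pad"], patched["space"], len(patched))  # 0 1 7
```

With this policy the displaced tokens `máma` and `mele` move to ids 5 and 6, so the id range stays contiguous in this example. If a special token moves to an id that nobody held, its old id is left unused.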

menshikh-iv

Looks good @Froskekongen 👍
I'm worried about `self.dfs` and the other internal counters: you don't fill them, so you can never filter this kind of token (probably this isn't an issue). Maybe this is OK and you don't need it.

Also, I'm worried about whether this produces any side effects (for the same reason as mentioned above).

menshikh-iv · 2018-09-28T07:02:57Z

gensim/test/test_corpora_dictionary.py

+        d.patch_with_special_tokens(special_tokens)
+        self.assertEqual(d.token2id['pad'], 0)
+        self.assertEqual(d.token2id['space'], 1)
+        self.assertEqual(len(d.token2id), 7)


Need to extend this test (also check that you can transform a document with special tokens).

Froskekongen · Oct 3, 2018

Extension added.

menshikh-iv · Oct 4, 2018

I still don't see it. Where is the `doc2bow` call?

Froskekongen · Oct 4, 2018

It wasn't obvious to me that this was the intention: to test the `doc2bow` transformation. I have added tests for that now, however.

menshikh-iv · Oct 4, 2018

That's strange, because this is the main method for applying a dictionary to a document to get a BoW. Thanks!
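The check requested in this thread can be sketched without gensim. The snippet below is an assumption-laden illustration: `to_bow` is a hypothetical stand-in that mimics the general shape of a bag-of-words conversion (sorted `(token_id, count)` pairs, unknown tokens skipped), and the `token2id` mapping is an example value, not output from the PR's tests.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Hypothetical stand-in for a bag-of-words conversion.

    Counts tokens by id, ignores tokens missing from `token2id`,
    and returns (token_id, count) pairs sorted by id.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Example mapping after patching: special tokens hold ids 0 and 1.
token2id = {"pad": 0, "space": 1, "maso": 2, "ema": 3, "má": 4, "máma": 5, "mele": 6}

doc = ["máma", "mele", "maso", "pad", "pad", "unknown"]
bow = to_bow(doc, token2id)
print(bow)  # [(0, 2), (2, 1), (5, 1), (6, 1)]
```

A test along these lines confirms that the special tokens are counted under their reserved ids and that the relocated tokens still resolve to a valid id.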

Froskekongen · 2018-10-02T20:32:15Z

@menshikh-iv: I fixed the logic and updated the tests to also account for this case.

For the case of internal counters - it should be OK that special tokens are not accounted for. IMO, we are not interested in the counts of these - we are only requesting specific placeholders for these tokens.

menshikh-iv · 2018-10-04T05:49:25Z

@piskvorky wdyt about this PR? LGTM, do you agree?

Froskekongen · 2018-10-12T14:19:30Z

@menshikh-iv: Are you planning to accept this PR?

piskvorky · 2018-10-12T17:51:27Z

I'd prefer to have the docstring expanded. What is the motivation for this function? Who would want to call it and why?

If it's fully backward compatible, we can merge it. It just needs a better description and motivation, so it doesn't die forgotten in obscurity.

menshikh-iv · 2018-10-15T04:58:06Z

@Froskekongen please do #2200 (comment) and I'll merge current PR.

Froskekongen · 2018-11-04T17:20:25Z

@menshikh-iv: Updated documentation of the function.

piskvorky · 2019-01-11T13:19:36Z

gensim/corpora/dictionary.py

-        >>> dct.patch_with_special_tokens(special_tokens)
-        >>> print(dct.token2id)
-        {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
+        .. sourcecode:: pycon


What's pycon?

That's a special section name that lets us run flake8 on docstrings.

Aha, OK. Can you update our Developer wiki page with such tricks?

Also (unrelated to this PR), please add the workflow you currently use to label and manage issues and PRs (the "interesting PR" label, time until closed as abandoned, CI tricks, etc.).

menshikh-iv · 2019-01-11T19:32:06Z

Thanks for the PR @Froskekongen, congrats on your first contribution 🥇

horpto · 2019-01-11T19:42:48Z

gensim/corpora/dictionary.py

+            >>>
+            >>> dct.patch_with_special_tokens(special_tokens)
+            >>> print(dct.token2id)
+            {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}


@menshikh-iv
Not critical at all, but id 5 is lost.
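The gap is easy to confirm from the mapping shown in the docstring. The following sketch uses a hypothetical helper, `missing_ids`, to list ids in the range `0..max_id` that no token maps to.

```python
def missing_ids(token2id):
    """Return the sorted ids in 0..max_id that no token maps to (hypothetical helper)."""
    used = set(token2id.values())
    if not used:
        return []
    return sorted(set(range(max(used) + 1)) - used)


# Mapping copied from the docstring example in the diff above.
token2id = {"maso": 6, "mele": 7, "máma": 2, "ema": 3, "má": 4, "pad": 0, "space": 1}
print(missing_ids(token2id))  # [5]
```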

Froskekongen added 3 commits September 27, 2018 10:11

Function to patch dictionary

e3e2aeb

Typo.

d75e2a1

Fix bug

af975bd

menshikh-iv suggested changes Sep 28, 2018

Fix tests and logic

c4d4a06

Added doc2bow test.

6fade67

Froskekongen added 2 commits November 4, 2018 18:19

Code review

8045716

Merge branch 'patch_dictionary_based_on_special_tokens' of github.com…

1b4a43b

…:Froskekongen/gensim into patch_dictionary_based_on_special_tokens

menshikh-iv added 2 commits January 10, 2019 17:32

Merge remote-tracking branch 'upstream/develop' into patch_dictionary…

d2451ee

…_based_on_special_tokens

fix doc

782b69b

menshikh-iv changed the title ~~Patch dictionary based on special tokens~~ Add method for patch corpora.Dictionary based on special tokens Jan 11, 2019

piskvorky reviewed Jan 11, 2019

menshikh-iv approved these changes Jan 11, 2019

menshikh-iv merged commit 2d8b389 into piskvorky:develop Jan 11, 2019

horpto reviewed Jan 11, 2019
