Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method for patch corpora.Dictionary based on special tokens #2200

Merged
merged 9 commits into from
Jan 11, 2019
Merged

Add method for patch corpora.Dictionary based on special tokens #2200

merged 9 commits into from
Jan 11, 2019

Conversation

Froskekongen
Copy link
Contributor

First part of fixing this issue: #2190

This function patches token2id in dictionary based on a dict with special tokens and their wanted indices.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @Froskekongen 👍
I'm worried about 'self.dfs` and other internal counters: you don't fill it -> you never can filter this kind of tokens (probably this isn't an issue). Maybe this is OK and you no need this.

Also I'm worried, is this produce any side-effects (by same reason as mentioned before)?

d.patch_with_special_tokens(special_tokens)
self.assertEqual(d.token2id['pad'], 0)
self.assertEqual(d.token2id['space'], 1)
self.assertEqual(len(d.token2id), 7)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to extend this test (check also than you can transform document with special tokens)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extension added.

Copy link
Contributor

@menshikh-iv menshikh-iv Oct 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see it, where is doc2bow call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't obvious to me that this was the indention - to test the doc2bow transformation. I have added tests for that now, however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange because this is the main method for an apply dictionary to document to get BoW, thanks!

@Froskekongen
Copy link
Contributor Author

Froskekongen commented Oct 2, 2018

@menshikh-iv: I fixed the logic and updated the tests to also account for this case.

For the case of internal counters - it should be OK that special tokens are not accounted for. IMO, we are not interested in the counts of these - we are only requesting specific placeholders for these tokens.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Oct 4, 2018

@piskvorky wdyt about this PR? LGTM, are you agree?

@Froskekongen
Copy link
Contributor Author

@menshikh-iv: Are you planning to accept this PR?

@piskvorky
Copy link
Owner

piskvorky commented Oct 12, 2018

I'd prefer to have the docstring expanded. What is the motivation for this function? Who would want to call it and why?

If it's fully backward compatible, we can merge it. It just needs a better description and motivation, so it doesn't die forgotten in obscurity.

@menshikh-iv
Copy link
Contributor

@Froskekongen please do #2200 (comment) and I'll merge current PR.

…:Froskekongen/gensim into patch_dictionary_based_on_special_tokens
@Froskekongen
Copy link
Contributor Author

@menshikh-iv: Updated documentation of the function.

@menshikh-iv menshikh-iv changed the title Patch dictionary based on special tokens Add method for patch corpora.Dictionary based on special tokens Jan 11, 2019
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
.. sourcecode:: pycon
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's pycon?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's special section name for using flake8 on docstrings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #2192

Copy link
Owner

@piskvorky piskvorky Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, OK. Can you update our Developer wiki page with such tricks?

Also (unrelated to this PR) with the workflow process you currently use to label and manage issues and PRs (the interesting PR label, time until closed as abandoned, CI tricks etc).

@menshikh-iv
Copy link
Contributor

thanks for PR @Froskekongen, congratz with first contribution 🥇

@menshikh-iv menshikh-iv merged commit 2d8b389 into piskvorky:develop Jan 11, 2019
>>>
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
Copy link
Contributor

@horpto horpto Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv
not critical at all, but id # 5 is lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants