Add common terms phrases model #1263
@alexgarel Thanks for the new feature and the tests. Apologies for the late review.
I'm concerned that this isn't quite the Elasticsearch "common grams" approach, but something else, only described/demonstrated/named in the related issue and source code. I can believe it'd be useful in some cases, but have no idea what they are. Without careful/detailed docs of why it does what it does, my thought is it'd be even better further separated from the existing Phrases functionality – for example, marked as experimental in a separate file. Also, I'm still wondering about what does (or should) happen in alternating common-uncommon-common-uncommon-etc cases like the example in my question on the issue.
@gojomo you're right, it's not "common grams"; it only takes inspiration from that approach. While managing stop words may be less significant in English, because most expressions do not contain them, it is a greater concern in French, for example, where more expressions use them ("bain de bouche", "don de sang", "prêt à porter", etc.).
@tmylk we have to make a decision, and it's not easy:
Options 1 and 2 include paying a potential performance penalty, while option 3 really means a higher maintenance cost. We could benchmark solutions 1 and 2 to see the cost. Do you have a preferred data set for running a benchmark? I'm OK to work more on this and to add some more explanation of the behaviour, as @gojomo pointed out.
Thanks for the clarification & offer of more integration/doc work! Is there an article or paper on this technique anywhere? Is it a common/popular technique in other languages, with any existing name? I'm still a bit unclear about what could possibly happen with certain patterns, like If a trigram like Will this approach do any of the automatic COMMON_UNCOMMON bigram combining, as in the example on the Elasticsearch Common Grams TokenFilter page? Or is it just a matter of making the COMMON words invisible to phrase-analysis, then attaching them internally (never on the ends) in any phrases still detected?
I think I'm starting to get the gist of it. Regarding the interaction of your answers (3) & (4): (5) Are any of the co-occurrence counts tracked with regard to combined common-words – for example, is the creation of (5a) If the former – common-words that are auto-bigrammed become part of the co-occurrence stat keys – will an original text like (5b) If the latter – only UNCOMMON to UNCOMMON co-occurrences are truly tracked – it seems that even if
OK, I think I've almost got it. Regarding (5b), does (6) Are simple bigrams of one common and one uncommon word ever combined? (From what I now understand of what stats are kept, it seems this would have to be an 'always' or 'never' decision.)
@gojomo, first, thanks for your questions; this conversation will help in writing clear documentation.
Hello @gojomo and @tmylk, I am finally continuing my work on this. I created a branch with an implementation of the common terms capability directly in the Phrases / Phraser classes. I then ran some tests to see what impact this has, with and without common terms.
So, according to my non-scientific test, performance on bigrams if you use it (without common terms) is:
This branch is not meant to be merged, of course, but just to serve the discussion. If the performance impact seems OK, I will open a new pull request with this implementation.
Ping @gojomo @alexgarel, what's the status of this PR?
Hello @menshikh-iv, I will come back soon with a new PR, so I am closing this one.
@alexgarel this sounds like a really great feature -- please let me know if you need any assistance with the new PR.
This follows feature request #1258.
This adds a new model, CommonTermsPhrases, and its companion, CommonTermsPhraser.
These classes take common terms (a.k.a. stop words) into account when searching for bigrams. Common terms are not counted in the frequency computation, but they are kept in the final bigram.
This makes it possible to catch expressions like "eye of the beholder" or "lack of interest" (in languages like French, expressions containing common terms are quite frequent).
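To make the counting rule concrete, here is a minimal pure-Python sketch of the idea. It is not the code from this PR: the function name `count_bigrams`, its signature, and the sample sentences are assumptions made only for illustration. It assumes that a candidate phrase always starts and ends on an uncommon word, and that common terms are only ever attached in between.

```python
from collections import Counter


def count_bigrams(sentences, common_terms):
    """Count candidate bigrams whose two ends are uncommon words.

    Hypothetical helper, for illustration only. Common terms found
    between two uncommon words are kept in the bigram key, but they
    are never counted as words of their own.
    """
    common_terms = frozenset(common_terms)
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        last_uncommon = None  # most recent uncommon word, if any
        in_between = []       # common terms seen since last_uncommon
        for word in sentence:
            if word in common_terms:
                # A leading common term cannot start a phrase, so it is
                # only buffered once an uncommon word has been seen.
                if last_uncommon is not None:
                    in_between.append(word)
                continue
            unigrams[word] += 1
            if last_uncommon is not None:
                bigrams[(last_uncommon, *in_between, word)] += 1
            last_uncommon = word
            in_between = []
    return unigrams, bigrams


sentences = [
    ["the", "eye", "of", "the", "beholder"],
    ["lack", "of", "interest"],
    ["a", "lack", "of", "interest", "in", "the", "eye", "of", "the", "beholder"],
]
unigrams, bigrams = count_bigrams(sentences, {"a", "of", "the", "in"})

print(bigrams[("eye", "of", "the", "beholder")])  # 2
print(bigrams[("lack", "of", "interest")])        # 2
print("the" in unigrams)                          # False
```

In this sketch a score for "eye of the beholder" would be computed from the counts of "eye" and "beholder" and of the combined key only, which is one way to read "not counted in the frequency computation, but kept in the final bigram".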