Hey, first of all, thanks for the great library!
I just encountered a bug in the SumBasicSummarizer: the method appears to look up the document frequency of a stemmed word, but the word_freq_in_doc dictionary only stores frequencies for unstemmed words.
I believe the culprit is that content words are normalized differently in _get_content_words_in_sentence() and in _get_all_content_words_in_doc(): the former performs stemming, whereas the latter does not.
I would have opened a PR myself, but I don't know which fix is "more correct" (IMO, consistent stemming should be the way to go?).
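To illustrate the mismatch described above, here is a minimal, self-contained sketch. It is not the library's actual code: `toy_stem`, `normalize`, and `word_freq` are hypothetical helpers, and the toy stemmer only strips a trailing "s". The point is just to show how stemmed lookups miss a dictionary keyed by unstemmed words, and how stemming on both sides would avoid that.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def normalize(word, stem):
    """Lowercase the word and optionally apply the toy stemmer."""
    word = word.lower()
    return toy_stem(word) if stem else word


def word_freq(words, stem):
    """Relative frequency of each normalized word in the document."""
    counts = Counter(normalize(w, stem) for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


doc_words = ["Cats", "chase", "cats", "dogs"]
sentence_words = ["Cats", "dogs"]

# Mismatch: frequencies are keyed by unstemmed words,
# but the sentence words are stemmed before the lookup.
unstemmed_freq = word_freq(doc_words, stem=False)
missing = [
    normalize(w, stem=True)
    for w in sentence_words
    if normalize(w, stem=True) not in unstemmed_freq
]

# Consistent stemming on both sides: every lookup succeeds.
stemmed_freq = word_freq(doc_words, stem=True)
found = {
    normalize(w, stem=True): stemmed_freq[normalize(w, stem=True)]
    for w in sentence_words
}

print(missing)
print(found)
```

With inconsistent normalization, both sentence words are missing from the frequency table; with stemming applied in both places, each one resolves to a frequency. Dropping stemming on both sides would also be consistent, so this sketch does not settle which fix is preferable.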
FWIW, I used this with German texts, although capitalization etc. does not seem to be an issue here.