Add StopWordFilter #78

Open: wants to merge 6 commits into base: main

rth
Owner

@rth rth commented Jun 14, 2020

Add the StopWordFilter struct to filter stop words, as an example of a TokenProcessor trait implementation that takes in an iterator and returns an iterator of strings (following the discussion in #21).
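For readers following the thread, here is a minimal sketch of what an iterator-in, iterator-out stop word filter could look like. The trait definition, the `transform` method name, the `new` constructor, and the use of `&str` items are assumptions made for illustration; the actual `TokenProcessor` trait in this PR may be shaped differently.

```rust
use std::collections::HashSet;

/// Hypothetical trait shape: the real `TokenProcessor` in this PR may differ.
pub trait TokenProcessor {
    /// Takes an iterator of tokens and returns a lazy iterator of tokens.
    fn transform<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a;
}

/// Drops every token that appears in a user-provided stop word set.
pub struct StopWordFilter {
    stop_words: HashSet<String>,
}

impl StopWordFilter {
    /// Hypothetical constructor: builds the filter from an explicit word list.
    pub fn new(words: &[&str]) -> Self {
        StopWordFilter {
            stop_words: words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl TokenProcessor for StopWordFilter {
    fn transform<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a,
    {
        // Exact, case-sensitive match; tokens are filtered lazily as they are consumed.
        Box::new(tokens.filter(move |tok| !self.stop_words.contains(*tok)))
    }
}
```

Because the result is itself an iterator, a filter like this could be chained after a tokenizer without collecting intermediate vectors. Matching here is case-sensitive, so callers would need to lowercase tokens first if that is the desired behaviour.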

TODO

  • decide what the default stop word list should be: either take an English stop word list from somewhere (e.g. spaCy), or ask users to explicitly provide one.

@joshlk
Collaborator

joshlk commented Jun 15, 2020

decide what the default stop word list should be: either take an English stop word list from somewhere (e.g. spaCy), or ask users to explicitly provide one.

I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

How about having separate defaults for different languages? For example:

StopWordFilter::default("en")

Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).
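As a rough sketch of the per-language idea, the constructor below looks up a bundled list by two-letter language code (ISO 639-1). Everything here is hypothetical: the `for_language` and `is_stop_word` names are invented for illustration, and the word lists are tiny placeholders rather than real stop word lists from spaCy or elsewhere. A separate name is used instead of `default("en")` because Rust's `Default::default()` takes no arguments, so an inherent `default` with a parameter could be confusing.

```rust
use std::collections::HashSet;

pub struct StopWordFilter {
    stop_words: HashSet<&'static str>,
}

impl StopWordFilter {
    /// Hypothetical constructor keyed by a two-letter language code.
    /// Returns `None` when no list is bundled for that language.
    pub fn for_language(code: &str) -> Option<Self> {
        // Placeholder lists for illustration only, not real stop word lists.
        let words: &[&'static str] = match code {
            "en" => &["a", "and", "of", "the"],
            "fr" => &["de", "et", "la", "le"],
            "de" => &["der", "die", "das", "und"],
            _ => return None,
        };
        Some(StopWordFilter {
            stop_words: words.iter().copied().collect(),
        })
    }

    /// Returns true if `token` is in the list (exact, case-sensitive match).
    pub fn is_stop_word(&self, token: &str) -> bool {
        self.stop_words.contains(token)
    }
}
```

Returning `Option` rather than panicking makes an unsupported language an explicit case for the caller to handle, which fits the goal of not being English-centric: users of other languages get a clear signal to supply their own list.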

@rth
Owner Author

rth commented Jun 15, 2020

I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

Absolutely. It's just that there is no clear consensus on what a stop word list should include or exclude, and when one is provided people tend to use it without thinking too much (see e.g. this paper). I agree we can include stop word lists for a few common world languages.

Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).

+1

@joshlk
Collaborator

joshlk commented Jun 15, 2020

Interesting paper. It might be worth including a standard stop word list from spaCy, with a note in the documentation that refers to the paper.
