Add German Stopwords #638
This is great, thank you!
One question: I noticed that the list includes `a` and `b` – are these pre-processing artifacts, or are they there for a reason? I guess `a` makes sense as an alternate spelling of `á` (which should probably be in there as well). Not sure about `b`, though.
I totally hadn't realised this list was missing from the German data, btw. I have a few more ideas, so will be adding those later! 👍
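A quick way to surface entries like the ones questioned above is to filter the list for single-character words. This is only a minimal sketch: `STOP_WORDS` here is a small hypothetical sample, not the actual list from this PR, and `single_char_entries` is an illustrative helper name.

```python
# Hypothetical sample for illustration; the real list in the PR is much longer.
STOP_WORDS = {"a", "b", "aber", "als", "und", "der", "die", "das"}


def single_char_entries(stop_words):
    """Return the one-character entries of a stop word collection, sorted."""
    return sorted(word for word in stop_words if len(word) == 1)


if __name__ == "__main__":
    # Lists the entries a reviewer may want to double-check by hand.
    print(single_char_entries(STOP_WORDS))
```

Whether such entries should stay is still a judgment call; the check only flags candidates for review.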
@ines Thanks for the review. The |
@souravsingh, how did you compile this list? Is it taken from a single source, or multiple? To what extent have you personally tweaked that data by adding words you noticed were missing or removing words you thought were inappropriate? This list looks similar, but not identical, to https://github.com/wgpsutherland/stopwords/blob/master/dist/de.json; I couldn't find any other plausible-looking sources on Google. (Note that I'm not a project maintainer, just a random guy from the internet, but if I were in @ines's place I'd personally want to know what the source of the data was before merging.)
@ExplodingCabbage The list was compiled from multiple sources: the website at http://codingwiththomas.blogspot.in/2012/01/german-stop-words.html and the stopwords list from Apache Lucene. I have some knowledge of German, so identifying which ones are plausible stopwords wasn't really difficult.
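For anyone wanting to check how a list merged from several sources differs from each of them, a set comparison is usually enough. The sketch below is an assumption-laden illustration: the two sample lists are made up, and `compare_sources` is a hypothetical helper, not part of spaCy or Lucene.

```python
def compare_sources(source_a, source_b):
    """Compare two stop word lists after trimming whitespace and lowercasing."""
    words_a = {w.strip().lower() for w in source_a if w.strip()}
    words_b = {w.strip().lower() for w in source_b if w.strip()}
    return {
        "only_a": sorted(words_a - words_b),
        "only_b": sorted(words_b - words_a),
        "shared": sorted(words_a & words_b),
    }


if __name__ == "__main__":
    # Made-up samples standing in for two real source lists.
    blog_list = ["aber", "als", "Und", "b"]
    lucene_list = ["aber", "und", "oder"]
    print(compare_sources(blog_list, lucene_list))
```

The `only_a` and `only_b` entries are the ones worth reviewing by hand before merging.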
Thanks for the info! Will merge this now and make a few edits and additions. Btw, @honnibal and I went over the current state of the language data earlier and it could definitely use some better organisation. Starting with basic formatting, but also more complex stuff – for example, having a global module for emoticons that can be imported across languages (instead of having the same data live in each language). So I'll be taking this on over the next week or so 😃 English and German will be easy (native German speaker here), but we might post a call for native speakers for the other languages soon, just to have another pair of eyes making sure it's all good. We have such a great community from all over the world, so we should be able to do this! 💪
Add a list of stopwords for the German language.
Fixes Issue #364