German - gender neutral language #153
Comments
Note the stemming is algorithmic, not dictionary based. If I follow you, whether this matters depends on how the text is split into words, which is something external to the Snowball algorithms. Typically though, words are split by finding spans of "word characters", which are usually letters, or letters and numbers. I'd expect For example, your first case is handled by the JavaScript demo as two words with a * between so stems to Maybe it's useful to add rules to remove such suffixes for the
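To illustrate the word-splitting point, here is a minimal sketch in Python. It is not Snowball code: the `split_words` helper and its regular expression are assumptions chosen only to show how a tokenizer that treats runs of letters as words would split the gender-gap forms before any stemmer sees them.

```python
import re

# Hypothetical tokenizer, not part of Snowball: a "word" is a maximal run
# of letter-like characters. [^\W\d_] matches a word character that is
# neither a digit nor an underscore, so "*", ":" and "_" all act as
# separators.
WORD_RE = re.compile(r"[^\W\d_]+")


def split_words(text):
    """Return the spans of letter-like characters found in text."""
    return WORD_RE.findall(text)


for form in ["Arbeiter*innen", "Arbeiter:innen", "Arbeiter_innen"]:
    # Each form is split into two tokens, "Arbeiter" and "innen".
    print(form, "->", split_words(form))
```

With a splitter like this, the stemmer only ever receives "Arbeiter" and "innen" as separate words. Note that a pattern such as `\w+` would behave differently for "Arbeiter_innen", because `\w` also matches the underscore.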
This was opened against snowball-data, which contains only test data. The code of the stemmers is in the snowball repo, so I'm going to move this ticket there. (The test data is in a separate repo because it's very large. This way people who just want to build the code from git don't have to download a lot of extra data that they probably don't want.)
I worked out why the demo wasn't working (we need to specify the https://snowballstem.org/demo.html?text=Arbeiter*innen%0aArbeiter%23innen%0aArbeiter_innen#German More generally though
Closing: as I explained above, such cases will usually already work, and the submitter hasn't responded for over a year, so I can only assume they were satisfied with that.
It appears that this dictionary for stemming doesn't deal properly with gender-neutral word forms. German texts often use forms such as "Arbeiter*innen", "Arbeiter:innen" or "Arbeiter_innen" (aka gender gap) in order to include persons of all genders, while most conservative authors just use "Arbeiter" (aka generic masculine). In my understanding, these word forms should all be reduced to the same stem.