German - gender neutral language #153

Closed
eest9 opened this issue Jul 14, 2021 · 4 comments

Comments


eest9 commented Jul 14, 2021

It appears that this dictionary for stemming doesn't deal properly with gender-neutral word forms. German texts often use forms such as "Arbeiter*innen", "Arbeiter:innen" or "Arbeiter_innen" (the so-called gender gap) to include people of all genders, while more conservative authors just use "Arbeiter" (the generic masculine). As I understand it, these word forms should all be reduced to the same stem.

Member

ojwb commented Jul 23, 2021

Note the stemming is algorithmic, not dictionary based.

If I follow you, whether this matters depends on how the text is split into words, which is external to the Snowball algorithms. Typically, though, words are split by finding spans of "word characters", which are usually letters, or letters and digits. I'd expect * and : to be treated as non-word characters and so act as word breaks; _ is sometimes included as a word character and sometimes not, depending on what's being searched. If the punctuation before innen is treated as a word break, then the stemming algorithm actually gets called separately for arbeiter and then for innen, which produces the stems arbeit (as you want) and inn.

For example, your first case is handled by the JavaScript demo as two words with a * between them, so it stems to arbeit and inn (see https://snowballstem.org/demo.html?text=Arbeiter*innen#German). The other two cases are intended to be handled similarly by the demo, but the regexp used to split the text into words doesn't seem to handle them as I'd expect (which is a bug in the demo I wasn't previously aware of).
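To make the word-splitting point concrete, here is a minimal sketch in Python. It is not the demo's code and does not call any Snowball stemmer; the names `split_words`, `WORD_WITH_UNDERSCORE` and `WORD_WITHOUT_UNDERSCORE` are made up for illustration. It only shows which tokens a stemmer would be handed under two plausible definitions of a "word character".

```python
import re

# Hypothetical tokenizer patterns, for illustration only.
# \w matches letters, digits and "_", so "_" does NOT break words here.
WORD_WITH_UNDERSCORE = re.compile(r"\w+")
# [^\W_] is \w with "_" excluded, so "_" acts as a word break too.
WORD_WITHOUT_UNDERSCORE = re.compile(r"[^\W_]+")


def split_words(text, pattern):
    """Return the lowercased spans of word characters found in text."""
    return [word.lower() for word in pattern.findall(text)]


if __name__ == "__main__":
    for sample in ["Arbeiter*innen", "Arbeiter:innen", "Arbeiter_innen"]:
        # Each token printed here would be passed to the stemmer separately.
        print(sample,
              split_words(sample, WORD_WITH_UNDERSCORE),
              split_words(sample, WORD_WITHOUT_UNDERSCORE))
```

With either pattern, the * and : forms split into arbeiter and innen. Only the _ form differs: it stays a single token when _ counts as a word character, which is the case where the stemmer would see the whole string.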

Maybe it would be useful to add rules to remove such suffixes for the _ case. If you think this is worth pursuing in light of the above, could you propose a patch?

Member

ojwb commented Sep 3, 2021

This was opened against snowball-data, which is just test data; the code of the stemmers is in the snowball repo, so I'm going to move this ticket there.

(The testdata is in a separate repo because it's very large - this way people who just want to build the code from git don't have to download a lot of extra data that they probably don't want.)

@ojwb ojwb transferred this issue from snowballstem/snowball-data Sep 3, 2021
Member

ojwb commented Oct 6, 2021

I worked out why the demo wasn't working (we need to specify the u flag so the regexp works on Unicode characters) and now it works as intended:

https://snowballstem.org/demo.html?text=Arbeiter*innen%0aArbeiter%23innen%0aArbeiter_innen#German

More generally though _ might be a word character (as I noted above).
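The demo is JavaScript and its actual regexp isn't shown in this thread, so the following is only a loose analogy in Python: restricting `\w` to ASCII with `re.ASCII` shows how a word-splitting regexp that isn't Unicode-aware can break words in unexpected places. The sample word "Bürger*innen" and the function name `split_words` are assumptions chosen for illustration.

```python
import re


def split_words(text, unicode_aware=True):
    """Split text into spans of word characters.

    With unicode_aware=False, \\w is limited to [a-zA-Z0-9_] (re.ASCII),
    so letters such as "ü" are treated as word breaks.
    """
    flags = 0 if unicode_aware else re.ASCII
    return re.findall(r"\w+", text, flags=flags)


if __name__ == "__main__":
    sample = "Bürger*innen"  # made-up example, not taken from the issue
    print(split_words(sample, unicode_aware=True))
    print(split_words(sample, unicode_aware=False))
```

In the ASCII-only case the umlaut splits the first word into two fragments, so a stemmer downstream would receive pieces that are not real words.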

Member

ojwb commented Nov 9, 2022

Closing: as I explained above, such cases will usually already work, and the submitter hasn't responded in over a year, so I can only assume they were satisfied with that.

@ojwb ojwb closed this as completed Nov 9, 2022