German - gender neutral language #153

eest9 · 2021-07-14T20:57:27Z

It appears that this dictionary for stemming doesn't deal properly with gender-neutral word forms. German texts often use forms such as "Arbeiter*innen", "Arbeiter:innen" or "Arbeiter_innen" (aka gender gap) in order to include persons of all genders, while most conservative authors just use "Arbeiter" (aka generic masculine). In my understanding, these word forms should all be reduced to the same stem.

ojwb · 2021-07-23T03:40:20Z

Note the stemming is algorithmic, not dictionary based.

If I follow you, whether this matters depends on how the text is split into words, which is something external to the Snowball algorithms. Typically, though, words are split by finding spans of "word characters", which are usually letters, or letters and numbers. I'd expect * and : to be treated as non-word characters, so each would be a word break; _ is sometimes included as a word character and sometimes not, depending on what's being searched. If the punctuation before innen is treated as a word break, then the stemming algorithm actually gets called separately for arbeiter and then innen, which produces the stems arbeit (as you want) and then inn.

For example, your first case is handled by the JavaScript demo as two words with a * between them, so it stems to arbeit and inn (see https://snowballstem.org/demo.html?text=Arbeiter*innen#German). The other two cases are intended to be handled similarly by the demo, but the regexp used to word-split the text doesn't seem to handle them as I'd expect for some reason (which is a bug in the demo I wasn't previously aware of).
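To make the word-splitting point concrete, here is a minimal Python sketch of the two tokenizing choices described above. It is not the demo's code and not part of Snowball; the names STRICT_WORD, LOOSE_WORD and split_words are made up for illustration, and it only shows which spans would be handed to a stemmer, without running one.

```python
import re

# Assumption: "word characters" are letters and digits only.
# [^\W_] matches everything \w matches except the underscore.
STRICT_WORD = re.compile(r"[^\W_]+")

# Alternative assumption: the underscore also counts as a word character.
LOOSE_WORD = re.compile(r"\w+")


def split_words(text, pattern=STRICT_WORD):
    """Return the lowercased spans of word characters found in text."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


for form in ("Arbeiter*innen", "Arbeiter:innen", "Arbeiter_innen"):
    # Each span in the list would be passed to the stemmer separately.
    print(form, split_words(form), split_words(form, LOOSE_WORD))

# Arbeiter*innen ['arbeiter', 'innen'] ['arbeiter', 'innen']
# Arbeiter:innen ['arbeiter', 'innen'] ['arbeiter', 'innen']
# Arbeiter_innen ['arbeiter', 'innen'] ['arbeiter_innen']
```

Only the last case differs: when _ counts as a word character, the stemmer sees arbeiter_innen as a single word, which is the case where an extra suffix rule might matter.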

Maybe it's useful to add rules to remove such suffixes for the _ case. If you think this is worth pursuing in light of the above, please can you propose a patch?

ojwb · 2021-09-03T03:19:58Z

This was opened against snowball-data, which is just test data; the code of the stemmers is in the snowball repo, so I'm going to move this ticket there.

(The testdata is in a separate repo because it's very large - this way people who just want to build the code from git don't have to download a lot of extra data that they probably don't want.)

ojwb · 2021-10-06T05:42:38Z

I worked out why the demo wasn't working (we need to specify the u flag so the regexp works on Unicode characters) and now it works as intended:

https://snowballstem.org/demo.html?text=Arbeiter*innen%0aArbeiter%23innen%0aArbeiter_innen#German

More generally though _ might be a word character (as I noted above).
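On the u flag mentioned above: the demo is JavaScript and its regexp isn't shown in this thread, so the following is only a rough Python analogue of why Unicode-aware matching matters for word splitting. Python 3 patterns are Unicode-aware by default, so the sketch uses re.ASCII to imitate a pattern that isn't; the names UNICODE_WORD and ASCII_WORD and the sample word are made up for illustration.

```python
import re

# Unicode-aware (the Python 3 default for str patterns):
# letters such as "ä" count as word characters.
UNICODE_WORD = re.compile(r"[^\W_]+")

# ASCII-only: with re.ASCII, \w is limited to [a-zA-Z0-9_].
ASCII_WORD = re.compile(r"[^\W_]+", re.ASCII)

text = "B\u00e4cker:innen"  # "Bäcker:innen", written with an escape for clarity

print(UNICODE_WORD.findall(text))  # ['Bäcker', 'innen']
print(ASCII_WORD.findall(text))    # ['B', 'cker', 'innen']
```

With the ASCII-only pattern the umlaut itself acts as a word break, so the stemmer would be called on fragments rather than on the actual words.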

ojwb · 2022-11-09T03:51:31Z

Closing - as I explained above, such cases will usually actually already work, and the submitter hasn't responded for over a year so I can only assume they were satisfied with that.

ojwb transferred this issue from snowballstem/snowball-data Sep 3, 2021

OlgaGuselnikova mentioned this issue Dec 22, 2021

German stemmer possible improvements #161

Open

ojwb closed this as completed Nov 9, 2022
