Replace diacritics when doing fuzzy searches #3236
I tried this branch out with a few preset search terms that have been inconvenient in the past in Vietnamese. The change produces a slight improvement over master. I haven’t found any real problems yet. As a further test, I took this branch and removed the manually diacritic-folded Vietnamese terms using the regular expression |
@1ec5 Thanks for digging into this!
Yes, this change definitely won't solve all problems, but it should be possible to work around the more common issues by using the "preset terms".
Exact matching on the leading part of the name and the leading part of the terms is something you can control by adjusting the strings in Transifex. Exact matching of the leading tag value might cause more problems in other languages. Maybe we should disable that unless the locale is
I think the diacritic replacement has a more pronounced effect on 'ß' -> 'ss', because it normalizes the Levenshtein distance between the search strings. For example, before this change, "grass" looks 3 characters different from "glaß", and after this change, they differ by only 1 character.
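To make the "grass" / "glaß" numbers concrete, here is a minimal sketch of the distance calculation with and without folding. The helper names foldDiacritics and editDistance are hypothetical, and the folding uses the built-in String.prototype.normalize rather than whatever replacement table the branch actually uses, so treat it as an illustration only.

```javascript
// Hypothetical helper (not the PR's code): fold diacritics using the
// built-in Unicode normalizer. 'ß' has no canonical decomposition,
// so it is mapped to 'ss' by hand before stripping combining marks.
function foldDiacritics(str) {
    return str
        .replace(/ß/g, 'ss')
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '');
}

// Plain Levenshtein distance: insert, delete and substitute all cost 1.
function editDistance(a, b) {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution
            );
        }
        prev = curr;
    }
    return prev[b.length];
}

// Without folding: r->l, s->ß, and one deleted s.
console.log(editDistance('grass', 'glaß'));  // 3

// With folding: "grass" vs "glass", only r->l remains.
console.log(editDistance(foldDiacritics('grass'), foldDiacritics('glaß')));  // 1
```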
To clarify, the Vietnamese localization is already (ab)using the preset terms to include the main preset name, any synonyms, the main name diacritic-folded, and the synonyms diacritic-folded, in that order. I believe that's why this change has little effect. Removing the diacritic-folded terms makes some results a lot better and others a lot worse, which in my opinion shows that the language-agnostic diacritic folding may be weighted too high for Vietnamese. (It would ideally count less toward the edit distance than base letter changes, whereas for other languages it should count more or the same.) To the extent that the workaround works, it's because we've specified a lot of synonyms in the Vietnamese presets. So I'll probably keep the workaround in the Vietnamese localization (despite the bloat) and hold out for a more sophisticated solution in the future.
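The idea of counting a diacritic change less than a base-letter change could be expressed as a weighted edit distance. The following is only a rough sketch of that idea, not something this PR implements: weightedDistance, baseLetter and the diacriticCost parameter are hypothetical names, and the 0.25 weight is an arbitrary value chosen for illustration.

```javascript
// Hypothetical helper: strip combining marks to get the base letter(s).
function baseLetter(ch) {
    return ch.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Weighted Levenshtein distance. Substituting two characters that share
// a base letter (e.g. 'o' and 'ở') costs diacriticCost instead of 1.
// diacriticCost is an assumed per-locale setting, not an existing option.
function weightedDistance(a, b, diacriticCost) {
    // Compose first so each accented letter is a single array element.
    const ca = Array.from(a.normalize('NFC'));
    const cb = Array.from(b.normalize('NFC'));
    let prev = Array.from({ length: cb.length + 1 }, (_, j) => j);
    for (let i = 1; i <= ca.length; i++) {
        const curr = [i];
        for (let j = 1; j <= cb.length; j++) {
            let sub = 1;
            if (ca[i - 1] === cb[j - 1]) {
                sub = 0;
            } else if (baseLetter(ca[i - 1]) === baseLetter(cb[j - 1])) {
                sub = diacriticCost;
            }
            curr[j] = Math.min(
                prev[j] + 1,       // deletion
                curr[j - 1] + 1,   // insertion
                prev[j - 1] + sub  // substitution (possibly discounted)
            );
        }
        prev = curr;
    }
    return prev[cb.length];
}

console.log(weightedDistance('pho', 'phở', 0.25));  // 0.25 (diacritics only)
console.log(weightedDistance('pho', 'pha', 0.25));  // 1 (base letter changed)
console.log(weightedDistance('pho', 'phở', 1));     // 1 (same as plain Levenshtein)
```

With a weight below 1, a query that differs only in diacritics ranks ahead of one that changes a base letter; setting the weight to 1 gives the ordinary unweighted distance.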
tldr: it means that strings like "fussball" will fuzzy match strings like "fußball"
more details:
In collection.js search results are returned in the following order:
Diacritical marks are only replaced when calculating the "similar" search results string distance, so this is a fallback strategy from strict string matches.
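The fallback behaviour described here can be sketched roughly as follows. This is not the code in collection.js: the search function, the single leading-match tier, and the maxDistance cutoff are all assumptions made for illustration. The point is only that strict matches never see folded strings, while the "similar" tier compares folded strings.

```javascript
// Hypothetical folding helper; 'ß' is mapped by hand because it has
// no canonical decomposition.
function fold(str) {
    return str.replace(/ß/g, 'ss').normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Plain Levenshtein distance.
function editDistance(a, b) {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            curr[j] = Math.min(
                prev[j] + 1,
                curr[j - 1] + 1,
                prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
            );
        }
        prev = curr;
    }
    return prev[b.length];
}

// Assumed two-tier search: strict leading matches first, then a
// "similar" fallback that folds diacritics before measuring distance.
function search(names, query, maxDistance = 1) {
    const q = query.toLowerCase();
    // Tier 1: strict match on the leading part, no folding applied.
    const strict = names.filter(n => n.toLowerCase().startsWith(q));
    // Tier 2: fuzzy fallback, diacritics replaced on both sides.
    const similar = names
        .filter(n => !strict.includes(n))
        .filter(n => editDistance(fold(n.toLowerCase()), fold(q)) <= maxDistance);
    return strict.concat(similar);
}

const names = ['Fußball', 'Fussgängerzone', 'Basketball'];
console.log(search(names, 'fussball'));  // [ 'Fußball' ] via the fuzzy fallback
console.log(search(names, 'fuß'));       // [ 'Fußball' ] via the strict tier
```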
(closes #3159)
I also
cc @1ec5