
Diacritic-Insensitive Search Support (Czech characters) #288

Open
janroz opened this issue May 28, 2023 · 4 comments


janroz commented May 28, 2023

Hello,

I'm using TNTSearch for a project and encountered an issue with diacritic-insensitive searching. Specifically, the library seems unable to find matches when the search term differs in diacritics (e.g., searching for "OSVC" does not match "OSVČ").

I was wondering if there is a known workaround for this, or whether a future update might address it. This functionality is very important for applications dealing with languages that use diacritics.

Adding this to loadConfig didn't work:

'charset' => 'utf8'

Thank you for your assistance.

janroz changed the title from "Diacritic-Insensitive Search Support" to "Diacritic-Insensitive Search Support (Czech characters)" on May 28, 2023

janroz commented Jun 11, 2023

Nobody?


janroz commented Jun 11, 2023

For now I solved it by using the EdgeNgramTokenizer; after some tweaks it works well:

'tokenizer' => \TeamTNT\TNTSearch\Support\EdgeNgramTokenizer::class,
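
For context, here is a minimal sketch of where that 'tokenizer' line goes in loadConfig. The driver, connection details, storage path, index name, and SQL query below are placeholders, not from this thread; only the 'tokenizer' key comes from the comment above, and the index has to be rebuilt so the new tokenizer is applied at index time.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig([
    'driver'    => 'mysql',            // placeholder connection details
    'host'      => 'localhost',
    'database'  => 'my_database',
    'username'  => 'user',
    'password'  => 'secret',
    'storage'   => __DIR__ . '/indexes/',
    'tokenizer' => \TeamTNT\TNTSearch\Support\EdgeNgramTokenizer::class,
]);

// Rebuild the index so the EdgeNgramTokenizer is used when documents are tokenized.
$indexer = $tnt->createIndex('articles.index');
$indexer->query('SELECT id, title, body FROM articles;');
$indexer->run();

// Queries against the rebuilt index go through the same tokenizer.
$tnt->selectIndex('articles.index');
$results = $tnt->search('OSVC');
```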

@somegooser

What did you tweak with EdgeNgram?

  1. Did you try utf8mb4 as the charset?

  2. You could also try transliteration: translate all special characters to their plain English equivalents, for example (see the sketch after this list):

```php
['À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'È' => 'E', 'É' => 'E']
```
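
A sketch of the transliteration idea from point 2, using PHP's intl Transliterator instead of a hand-maintained map; this is an assumption on my part, the map above works the same way, it is just more verbose. The folding has to be applied both to the text before indexing and to the search term, otherwise one side still contains diacritics.

```php
<?php
// Fold accented characters to plain ASCII before indexing and before searching.
// Requires ext-intl; iconv('UTF-8', 'ASCII//TRANSLIT', $text) is a rougher,
// locale-dependent alternative.

function stripDiacritics(string $text): string
{
    static $transliterator = null;
    $transliterator ??= \Transliterator::create('Any-Latin; Latin-ASCII');

    // Fall back to the original text if transliteration fails.
    return $transliterator->transliterate($text) ?: $text;
}

var_dump(stripDiacritics('OSVČ'));           // string(4) "OSVC"
var_dump(stripDiacritics('Žluťoučký kůň'));  // string(13) "Zlutoucky kun"
```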

@igor-kamil

I was also looking for a way to handle searches with and without diacritics, but I didn't find a working solution.
So I tried to make my own tokenizer, and for now it seems to work nicely.
You can check it out here: AccentInsensitiveTokenizer.php.
To see how it works, check the test AccentInsensitiveTokenizerTest.php.

I hope this might be helpful to others.
(I’m considering opening a PR to add it to the TNTSearch repo, but right now it relies on Laravel helpers.)
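
For anyone who cannot pull in the Laravel helpers, here is a rough, framework-free sketch of the same idea: fold diacritics before the usual word splitting so that "OSVČ" and "OSVC" produce identical tokens. This is not the linked AccentInsensitiveTokenizer.php; it swaps Laravel's Str::ascii() for the intl Transliterator and assumes TNTSearch's TokenizerInterface only requires a tokenize($text, $stopwords = []) method.

```php
<?php

namespace App\Search;

use TeamTNT\TNTSearch\Support\TokenizerInterface;

class AccentFoldingTokenizer implements TokenizerInterface
{
    public function tokenize($text, $stopwords = [])
    {
        // Fold "Č" -> "C", "é" -> "e", etc. (requires ext-intl).
        $folded = \Transliterator::create('Any-Latin; Latin-ASCII')
            ->transliterate((string) $text) ?: (string) $text;

        // Lowercase and split on anything that is not a letter or digit,
        // roughly what the default tokenizer does.
        $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($folded), -1, PREG_SPLIT_NO_EMPTY);

        return array_values(array_diff($tokens, $stopwords));
    }
}
```

Registered under the 'tokenizer' config key (and with the index rebuilt), both the indexed text and the query terms should be folded the same way, which is what makes the search accent-insensitive.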
