Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transliterate Song artist and title to ASCII for search #795

Merged

Conversation

DeinAlptraum
Copy link
Contributor

@DeinAlptraum DeinAlptraum commented Dec 27, 2023

Transliterate all song text fields for search (Title, Artist, Edition etc.) as well as search queries themselves.

This allows matching e.g. 'Ż' by typing 'z', 'ł' with 'l' or the typographers apostrophe with the straight one '.
The one downside is that we break everything down into ASCII, if you e.g. search for 'ö' you will see all results that match the normal 'o'.

The transliteration uses the GNU utility iconv (iconv -f UTF-8 -t ASCII//TRANSLIT) for transliteration, I've added the iconvenv unit to the repository with some modifications to make it compatible with Windows, as well as the iconv.dll for Windows.

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Dec 27, 2023

iconv -f utf-8 -t ASCII//TRANSLIT converts "Mötley Crüe" to "Moetley Cruee" when running on my system. The band was not aware how ö and ü were pronounced and transliterated when they chose that name.

And as you can see in the failed CI build, no iconvenc on Windows.

@DeinAlptraum
Copy link
Contributor Author

iconv -f utf-8 -t ASCII//TRANSLIT converts "Mötley Crüe" to "Moetley Cruee" when running on my system.

Do you think this is a problem? The plan is to also transliterate the search query, so you should be able to find Mötley Crüe by actually entering Mötley etc. While this isn't optimal as you can't pinpoint actual 'ö' etc. anymorw by entering them, I think it's worth the overall improvement.

And as you can see in the failed CI build, no iconvenc on Windows.

I will add the iconv.dll for Windows later

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Dec 27, 2023

The result of this conversion also depends on the system settings. With LANG=de_DE.utf8 I get ö -> oe, but with LANG=en_US.utf8 I get ö -> o.

Please build the iconv.dll in a way that it does not depend on any other DLLs except maybe those provided by Windows (msvcrt.dll). And it is not just the DLL that is missing. You need to provide the Free Pascal unit.

@DeinAlptraum
Copy link
Contributor Author

The result of this conversion also depends on the system settings. With LANG=de_DE.utf8 I get ö -> oe, but with LANG=en_US.utf8 I get ö -> o.

Oh, I didn't know. I guess that might not be a bad thing though, and as long as both Title/Artist and search query use the same conversion... will see if we can also force the locale setting through the Pascal interface

it is not just the DLL that is missing. You need to provide the Free Pascal unit.

Sorry, in more detail: I plan to change to xmliconv and xmliconv_windows. Both use iconv, the latter will need the .dll to work.

@DeinAlptraum
Copy link
Contributor Author

DeinAlptraum commented Dec 28, 2023

Turns out xmliconv doesn't really offer an interface to iconv unfortunately, I took xmliconv_windows as inspiration to adapt iconvenc, tested and it works both on Linux and Windows for me. I can't test this on Mac, since I don't have one.

Also, I can't tell why AppVeyor failed... or rather, didn't even try, on the last commit...
I've removed the GetStringWithNoAccents function since it was used only as a bootleg version of proper transliteration to ASCII anyway as far as I could tell. I replaced its usage with a new utility that does exactly that. This also means transliterating the search query, so if you type e.g. "ß" that now matches songs with "ß" in their names again. It also matches "ss" though, which is the one downside of this approach imo.
I've also renamed the -NoAccent fields to -ASCII since that seems more sensible now.

@DeinAlptraum DeinAlptraum marked this pull request as ready for review December 28, 2023 18:48
src/ultrastardx.dpr Outdated Show resolved Hide resolved
game/iconv.dll Outdated Show resolved Hide resolved
@s09bQ5
Copy link
Collaborator

s09bQ5 commented Dec 29, 2023

I added my own build of the DLL including debug data and license text, squashed your commits and made three tiny fixes (uses in UUnicodeUtils.pas, wrong dll name in files_main_uninstall.nsh, unused initc unit in inconvenc_windows.pas):
https://github.com/UltraStar-Deluxe/USDX/tree/uc-normalize-search
If you agree with these changes, please force push them to your branch.

Btw., GNU libiconv transliterates Mötley Crüe to M"otley Cr"ue regardless of the system's language settings. People searching for Motley Crue will never find the band's songs.

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Dec 29, 2023

Is https://github.com/anyascii/anyascii maybe better?

@DeinAlptraum
Copy link
Contributor Author

tiny fixes (uses in UUnicodeUtils.pas

I don't see any difference for that first one? Pushing your changes though.

Is https://github.com/anyascii/anyascii maybe better?

Looks closer to what we want, but doesn't have a Pascal implementation, right? I don't really feel like getting into this with my current knowledge of Pascal, but if you really prefer this I'll take a look. Otherwise I'd rather go with what we have now, except...

Btw., GNU libiconv transliterates Mötley Crüe to M"otley Cr"ue regardless of the system's language settings. People searching for Motley Crue will never find the band's songs.

Ugh, that sucks. I looked for ways to pass a locale setting to iconv, but couldn't find one. Windows does have a setlocale implementation, but that corresponds to a registry edit every time it is called apparently, so not that great.
So I guess it's either manually filtering quotation marks in TransliterateToASCII (and hoping we've covered everything), or trying to use AnyASCII.

@DeinAlptraum
Copy link
Contributor Author

Addendum:

I looked for ways to pass a locale setting to iconv, but couldn't find one. Windows does have a setlocale implementation, but that corresponds to a registry edit every time it is called apparently, so not that great.

Just found SetThreadLocale in JwaWindows, but like you said, libiconv ignores the settings on Windows...

So I guess it's either manually filtering quotation marks in TransliterateToASCII (and hoping we've covered everything), or [...]

And that does not cover everything, e.g. "Pokémon" is transliterated to "pok'emon"... so can't really go with this unless we're fine with purging all quotation marks, apostrophes etc. from titles, artists, search queries etc.

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Dec 30, 2023

tiny fixes (uses in UUnicodeUtils.pas

I don't see any difference for that first one? Pushing your changes though.

You added iconvenc a second time at the end of the list of used units so that it was also requested on Windows.

Is https://github.com/anyascii/anyascii maybe better?

Looks closer to what we want, but doesn't have a Pascal implementation, right? I don't really feel like getting into this with my current knowledge of Pascal, but if you really prefer this I'll take a look. Otherwise I'd rather go with what we have now, except...

I wrote a small program to convert the arrays of anyascii.c to pascal and converted the tiny anyascii function by hand.

There is also ICU for transliteration and I expect it to do a better job than anyascii, but the library is huge. Anyascii adds "only" half a megabyte.

@DeinAlptraum
Copy link
Contributor Author

You added iconvenc a second time at the end of the list of used units so that it was also requested on Windows.

Then why does the force push have no diff on that file? ^^

I wrote a small program to convert the arrays of anyascii.c to pascal and converted the tiny anyascii function by hand.

Oh that's cool! Should I close this PR then, or how would you like to continue?

@bohning
Copy link
Collaborator

bohning commented Jan 29, 2024

@s09bQ5 What's your suggestion on how to proceed here?

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Feb 3, 2024

I added anyascii to a branch in my repo: s09bQ5/USDX@de03e83. It is not yet used, just compiled and linked. @DeinAlptraum, if you want, you can add your work on top of that.

@DeinAlptraum
Copy link
Contributor Author

Will take a look, but I am very busy this month so not sure when I'll get around to that.

@DeinAlptraum DeinAlptraum force-pushed the uc-normalize-search branch 2 times, most recently from 6e405fc to e93ab4e Compare February 10, 2024 22:23
@DeinAlptraum
Copy link
Contributor Author

Changed from iconv to use your AnyASCII bindings instead @s09bQ5, and that seems to work like a charm as far as I've tested. Thanks for your work!

Just one minor thing: compiling anyascii takes more than 3 minutes on my laptop on Windows. Linux takes only a few seconds, and the CI also doesn't seem to take significantly longer than before, but just FYI...

@s09bQ5
Copy link
Collaborator

s09bQ5 commented Feb 10, 2024

Changed from iconv to use your AnyASCII bindings instead

Looks good!

compiling anyascii takes more than 3 minutes on my laptop on Windows.

Can you check if that is still the case when converting the \n line endings to \r\n? Maybe the Windows version thinks the whole file is a single line.

@DeinAlptraum
Copy link
Contributor Author

Can you check if that is still the case when converting the \n line endings to \r\n?

Changed it to CRLF, but that didn't make a (notable) difference.
If you think this isn't a reason to hold this back, one of you will have to merge since I don't have the permissions to do so.
Otherwise, would be cool if someone can check whether they have the same problem. Waiting > 3mins for every compile must be painful if you develop on Windows (though I only tried this in Lazarus)

@barbeque-squared
Copy link
Member

I won't have time to check on the Windows compile process (I use the make approach) until next weekend. But the code itself looks fine, so if I forget (but please remind me on Discord) just merge it anyway 👍 Same if someone else tries it in the meantime and doesn't experience issues/it's Lazarus-specific.

@DeinAlptraum
Copy link
Contributor Author

I also just tested compiling via MSYS2 on Windows, which takes about half as long... so "only" about 90 seconds for AnyASCII

@barbeque-squared
Copy link
Member

I was finally able to look at this. If I apply the patch I get the following output:

$ git apply 795.patch
795.patch:158865: trailing whitespace.
				
795.patch:158961: trailing whitespace.
			} else 
warning: 2 lines add whitespace errors.

I suspect these correspond to anyascii.pas line 95 and anyascii_c_to_pas.c line 79? I know that it's purely cosmetic, but some editors want to automatically 'fix' things like this so it's better to not have it at all.


As for the compilation time issue: it only takes a few seconds on my Windows computer, and it only compiles it once anyway. It's probably some compiler optimization thing somewhere?
As for the feature itself: I don't really have all that many songs with weird characters, but it doesn't break anything either so once the whitespace is fixed it'll get merged.

@DeinAlptraum
Copy link
Contributor Author

@barbeque-squared okay, that's great!
Done with the whitespace changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants