False characters shown #199

Closed
Dragodraki opened this issue Jan 17, 2024 · 8 comments

Comments

@Dragodraki

Dragodraki commented Jan 17, 2024

On the website https://buerohaus-ahner.bueroshops.de, all € characters turn into ÿ characters when your addon is enabled.
Can you make Wingman Jr. allow this font type and display the €-characters on those websites again?

@wingman-jr-addon
Owner

Thanks @Dragodraki, I'll start looking into the root cause for this specific site. I've hit so many cases already that I'm curious to see what the new one is.
Here's a specific page that has a euro symbol on it:
https://buerohaus-ahner.bueroshops.de/artikeldetails/standard/SCA226002/abfallbehaelter-metall-20-liter-wei-wandmontage-moeglich.html
Without the addon: [screenshot]
With the addon: [screenshot]

@wingman-jr-addon
Owner

Well @Dragodraki, this one was interesting, and it has likely been a source of problems for any character in the byte range 0x80 to 0x9F. See the reference here: https://www.i18nqa.com/debug/bug-iso8859-1-vs-windows-1252.html

Here's what happened.
I looked into the matter and found that the character encoding declared by the web page in its header was clearly iso-8859-1. Unlike many of the other bugs, the character set detection logic was working correctly, so this wasn't a new edge case for detection.
However, I noticed that Firefox was listing the Text Encoding as Windows 1252.
So, it turns out that iso-8859-1 is special for legacy reasons: browsers actually treat the label iso-8859-1 as an alias for Windows 1252. Windows 1252 differs from true iso-8859-1 in the byte range 0x80 to 0x9F, so this seemed like the problem. But why was this failing?
Well, as it turns out, Firefox has a TextDecoder and a TextEncoder. The TextDecoder can be instantiated with any valid label, such as 'windows-1252', but the TextEncoder only supports UTF-8.
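A minimal sketch of that asymmetry, using only the standard `TextDecoder` and `TextEncoder` interfaces. The variable names and the sample byte are illustrative and not taken from the addon. Per the WHATWG Encoding Standard, the decoder created from the `iso-8859-1` label should report `windows-1252` as its encoding and decode byte 0x80 to U+20AC (€).

```javascript
// TextDecoder accepts many encoding labels. 'iso-8859-1' is one of the
// labels the Encoding Standard assigns to windows-1252.
const decoder = new TextDecoder('iso-8859-1');
console.log(decoder.encoding);

// Decode the single byte 0x80 and print the resulting code point in hex.
const decoded = decoder.decode(new Uint8Array([0x80]));
console.log(decoded.codePointAt(0).toString(16));

// TextEncoder takes no encoding argument: it always produces UTF-8.
const encoder = new TextEncoder();
const utf8Bytes = encoder.encode('\u20AC'); // the euro sign
console.log(encoder.encoding, Array.from(utf8Bytes)); // utf-8 [ 226, 130, 172 ]
```

So a decode/re-encode round trip through these two APIs turns one windows-1252 byte into three UTF-8 bytes, which is why a separate encoding routine is needed to write the text back in the page's own charset.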
At a macro level, the addon needs to decode/re-encode text in the target charset. So not being able to encode in a specified target charset is a problem. I had a special routine that would encode iso-8859-1.
Well, I was assuming that this routine would only be used on text that had already been decoded from iso-8859-1 input, in which case, with true iso-8859-1, all Unicode character codes would be in the range 0-255. As a safety measure, I clamped anything greater to 255. In reality, however, when windows-1252 masquerading as iso-8859-1 was decoded, it generated Unicode character codes greater than 255 for the special cases where windows-1252 and iso-8859-1 differ.
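The effect of that clamp can be sketched as follows. This is a hypothetical reconstruction, not the addon's actual routine, and the function name `encodeLatin1Clamped` is made up for illustration. The euro sign is U+20AC, which is greater than 255, so it is clamped to byte 0xFF, and 0xFF is ÿ in both iso-8859-1 and windows-1252. That matches the symptom reported at the top of this issue.

```javascript
// Hypothetical clamping encoder: one byte per UTF-16 code unit,
// with anything above 255 forced down to 255.
function encodeLatin1Clamped(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = Math.min(text.charCodeAt(i), 255);
  }
  return bytes;
}

// "5 €" as it looks after the browser-style decode: '5', ' ', U+20AC.
const clamped = encodeLatin1Clamped('5 \u20AC');
console.log(Array.from(clamped)); // [ 53, 32, 255 ]
console.log(String.fromCharCode(clamped[2])); // ÿ
```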
So, now that I know my actual input to the text encoding was windows-1252, I added a lookup table to convert those special characters back into Windows 1252 byte encodings, and now it looks like it's working.
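A lookup-table encoder along those lines might look like the sketch below. It is an illustration under stated assumptions rather than the code in the addon: the names `CP1252_SPECIALS` and `encodeWindows1252` are hypothetical, and replacing unmappable characters with `?` is an assumed fallback. The table lists the 27 code points that windows-1252 assigns to bytes in 0x80 to 0x9F. Code points up to 0xFF are passed through unchanged, which is a simplification for the C1 control range.

```javascript
// Code point -> windows-1252 byte for the 0x80-0x9F range.
// Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no printable character assigned.
const CP1252_SPECIALS = new Map([
  [0x20AC, 0x80], [0x201A, 0x82], [0x0192, 0x83], [0x201E, 0x84],
  [0x2026, 0x85], [0x2020, 0x86], [0x2021, 0x87], [0x02C6, 0x88],
  [0x2030, 0x89], [0x0160, 0x8A], [0x2039, 0x8B], [0x0152, 0x8C],
  [0x017D, 0x8E], [0x2018, 0x91], [0x2019, 0x92], [0x201C, 0x93],
  [0x201D, 0x94], [0x2022, 0x95], [0x2013, 0x96], [0x2014, 0x97],
  [0x02DC, 0x98], [0x2122, 0x99], [0x0161, 0x9A], [0x203A, 0x9B],
  [0x0153, 0x9C], [0x017E, 0x9E], [0x0178, 0x9F],
]);

const FALLBACK_BYTE = 0x3F; // '?', an assumed replacement for unmappable input

function encodeWindows1252(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code <= 0xFF) {
      bytes[i] = code; // ASCII and Latin-1 share their byte values
    } else {
      bytes[i] = CP1252_SPECIALS.get(code) ?? FALLBACK_BYTE;
    }
  }
  return bytes;
}

console.log(Array.from(encodeWindows1252('5 \u20AC'))); // [ 53, 32, 128 ]
```

With this, the euro sign goes back out as byte 0x80 instead of being clamped to 0xFF.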
I'd like to poke at this further, but at least the page above now renders properly.

@Dragodraki
Author

Thanks for your very quick support, again! :)
Since the problem is not solved yet, I suppose you'll fix it in your next version of Wingman, won't you?

@wingman-jr-addon
Owner

Yes, that's the plan, @Dragodraki. Right now I'm putting much of my effort into a next-generation detection model over at wingman-jr-addon/model#7.

In the meantime if you're feeling adventurous you can give the branch a try. Setting it up is as easy as 1) cloning, 2) checking out the branch, 3) going to about:debugging -> This Firefox -> Load Temporary Addon and picking the manifest.json file.

Thanks for your continued bug reports - they will make the addon so much better for international users!

@Dragodraki
Author

So, no fix yet. That's okay, since it's only about this single site.

You mentioned you are working on a next-generation detection model. Now I'm curious: would you mind telling me a bit about it? So many addons and general programs are forks of others or get pushed aside by standard apps from big manufacturers that there is barely anything new out there. That's why I like to try new programs.

@wingman-jr-addon
Owner

@Dragodraki Well, I'm posting some progress on the new model over at the other issue, wingman-jr-addon/model#7, but in short I'm looking at research from the last 2 years that combines the advances of vision transformers with convolutional networks, as well as some of the robustness pretraining approaches like CLIP and DINO.

@wingman-jr-addon
Owner

(I'm going to leave this open until I have the PR merged)

@wingman-jr-addon
Owner

@Dragodraki I added a test file, see #200 for visual differences in character translation.
