[Bug] Issue with french characters: é, è, ç #33

Closed
Gryzounours opened this issue Aug 21, 2023 · 16 comments
Labels
bug Something isn't working priority: high This needs fixing ASAP

@Gryzounours

After running subclean on .srt files containing é, è, ç and other accented characters, those characters are replaced by strange � characters. Is there a way to fix this?

Thanks a lot

@Gryzounours Gryzounours added bug Something isn't working priority: medium Medium Priority labels Aug 21, 2023
@DrKain
Owner

DrKain commented Aug 21, 2023

Thanks for reporting, do you have a sample of the subtitle file before being modified so I can look into this?
The version number would also help.

@DrKain DrKain added the need info Need more information before looking into this label Aug 21, 2023
@Gryzounours
Author

Hey,
here it is
dh1998.zip

version 1.5.0

@DrKain DrKain removed the need info Need more information before looking into this label Aug 21, 2023
@DrKain
Owner

DrKain commented Aug 21, 2023

Thanks for the sample. The issue appears to be caused by the file encoding (related: #8). The file you provided is ANSI and the output, once cleaned, is UTF-8.
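For readers who want to see the mismatch concretely, here is a minimal Node.js sketch of what happens when ANSI (Windows-1252) bytes are read as UTF-8. It is an illustration only and not subclean's code; the variable names are made up, and `TextDecoder("windows-1252")` assumes a Node.js build with full ICU (the default for official binaries).

```javascript
// "é è ç" as stored in a Windows-1252 ("ANSI") file: one byte per accented letter.
const ansiBytes = Buffer.from([0xe9, 0x20, 0xe8, 0x20, 0xe7]);

// Reading those bytes as UTF-8 fails: 0xE9, 0xE8 and 0xE7 each start a
// multi-byte sequence that never completes, so each one becomes U+FFFD (�).
const wrong = ansiBytes.toString("utf8");

// Decoding with the encoding the file was actually written in keeps the text.
const right = new TextDecoder("windows-1252").decode(ansiBytes);

console.log(JSON.stringify(wrong));
console.log(right);
```

Once the bytes have been decoded the wrong way and written back, the original letters cannot be recovered from the � characters, which is why the conversion has to happen before cleaning.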

If you're using Bazarr you can enable an option in settings to automatically convert these files to UTF-8 without breaking characters:
Settings → Subtitles → Post-Processing → Encode Subtitles To UTF8

Or you can fix the current file in Notepad++ by clicking "Convert to UTF-8".

There is an open issue for this (#8) that will be resolved when I get the time. Sorry for the inconvenience.
I will leave this issue open until the linked issue is closed.

@DrKain
Owner

DrKain commented Aug 21, 2023

Here's the cleaned file you provided with the correct format: dh1998-utf8.zip

@Gryzounours
Author

I don't use Bazarr, I use: https://github.com/Valyreon/Subloader

Great little tool.

Can't wait until you fix it ;)

Have a nice day

@Arecsu

Arecsu commented Jan 19, 2024

Woops, I think this bug is sort of critical. I made the mistake of downloading the latest binary and confidently running the tool against my whole HTPC library. I don't have Bazarr, I used the tool directly.

The Spanish subtitles that were not encoded as UTF-8 ended up full of weird characters, and the original files were overwritten. Now I'm trying to figure out which subtitles were corrupted in the process so I can search for them again :/

Great tool by the way, it does the job! Although, this bug killed 30% of my library 🥲

@DrKain
Owner

DrKain commented Jan 19, 2024

Sorry to hear that @Arecsu. I'll try prioritize a fix when I can. I've been in and out of hospital for the last few months so I've not had a lot of time to work on this.

DrKain added a commit that referenced this issue Jan 19, 2024
This should be a fix for #33. All files will now be re-encoded to UTF-8 if any other encoding is found.
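The re-encoding described in that commit message could look roughly like this in Node.js. This is a sketch only: `decodeSubtitle` and the `windows-1252` fallback are assumptions made for illustration, not the code from the commit, and a real tool would need a smarter guess than a single fallback.

```javascript
// Hypothetical helper, not subclean's actual code.
// Tries strict UTF-8 first and falls back to a legacy single-byte encoding.
function decodeSubtitle(bytes, fallback = "windows-1252") {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently inserting U+FFFD.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf-8" };
  } catch {
    // Assumption: anything that is not valid UTF-8 is treated as the fallback.
    return { text: new TextDecoder(fallback).decode(bytes), encoding: fallback };
  }
}

// "¿Qué?" as Windows-1252 bytes.
const cp1252 = Buffer.from([0xbf, 0x51, 0x75, 0xe9, 0x3f]);
const { text, encoding } = decodeSubtitle(cp1252);

// What would be written back to disk: the same text, now as UTF-8 bytes.
const utf8Bytes = Buffer.from(text, "utf8");

console.log(encoding, text, utf8Bytes.length);
```

Checking for valid UTF-8 first means files that are already UTF-8 pass through unchanged, and only files that fail the strict decode are converted.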
@DrKain DrKain closed this as completed Jan 19, 2024
@DrKain
Owner

DrKain commented Jan 19, 2024

If anyone has more problems with the latest version please open a new issue

DrKain added a commit that referenced this issue Jan 19, 2024
@DrKain
Owner

DrKain commented Jan 19, 2024

Final comment just for comparison so I don't forget it later on.

Code_RGYxOIw38x

@Arecsu

Arecsu commented Jan 19, 2024

> Sorry to hear that @Arecsu. I'll try prioritize a fix when I can. I've been in and out of hospital for the last few months so I've not had a lot of time to work on this.

Hey! I didn't know you were going through a difficult time, I hope you and everyone else is doing well now 🙏

I went to sleep, and just woke up to find out you've managed to fix the issue. Wooow. Highly appreciate it. Will test it later. Thank you so much!!

@Arecsu

Arecsu commented Jan 22, 2024

Hey! Re-opening this issue because there are still encoding issues.

Here is the source:

1
00:02:10,058 --> 00:02:11,530
- Howard.
- Buenos días.

2
00:02:11,600 --> 00:02:14,493
- Entrega de McGill.
- ¿Qué haces aquí?

3
00:02:14,563 --> 00:02:16,995
No te he visto.
Quise ver cómo estabas.

Here is the result:

1
00:02:10,058 --> 00:02:11,530
- Howard.
- Buenos d}as.

2
00:02:11,600 --> 00:02:14,493
- Entrega de McGill.
- }Qu} haces aqu}?

3
00:02:14,563 --> 00:02:16,995
No te he visto.
Quise ver c}mo estabas.

This is the log:

[Info] Encoding: cp1252, Language: spanish
[Info] Language is spanish, using ascii

The source encoding is indeed cp1252, but the tool then seems to use ASCII to process the subtitles, and ASCII doesn't have the needed characters. Wouldn't it be better to process the files using UTF-8?
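A small Node.js sketch of why a 7-bit ASCII decode cannot keep these characters. It is an illustration under assumptions, not a reproduction of what subclean does internally: the exact replacement character in the result above depends on how the tool performs the conversion, and the variable names are made up.

```javascript
const source = "Buenos d\u00edas."; // "Buenos días."

// In cp1252 (and Latin-1) the letter í is the single byte 0xED.
const cp1252Bytes = Buffer.from(source, "latin1");

// Node's "ascii" decoding clears the high bit of every byte,
// so 0xED becomes 0x6D and the accent is lost for good.
const asAscii = cp1252Bytes.toString("ascii");

// Decoding as cp1252 keeps the text, which can then be stored as UTF-8.
const asCp1252 = new TextDecoder("windows-1252").decode(cp1252Bytes);
const asUtf8Bytes = Buffer.from(asCp1252, "utf8");

console.log(asAscii);
console.log(asCp1252, asUtf8Bytes.length);
```

UTF-8 can represent every character that cp1252 can, so decoding with the detected encoding and then working in UTF-8 loses nothing.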

Here is the file attached in TXT format: example.txt

Thank you so much!

@DrKain
Owner

DrKain commented Jan 22, 2024

Thanks for reporting, I'll look more into this later on in the week.
The different encodings can be tricky. UTF-8 was the original target, but that was breaking some subtitles, as reported in this issue. Some encodings break the parser, meaning the tool can't read each node and process the text, which is why I originally started using UTF-8. I'll work something out eventually.
Thanks for the example file too.
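As one illustration of how an encoding can break a parser, the sketch below reads a UTF-16LE cue as if it were UTF-8. The `timing` pattern is a simplified, hypothetical stand-in; subclean's real parser is not shown in this thread.

```javascript
// Hypothetical, simplified pattern for an SRT timing line.
const timing = /^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$/m;

const cue = "1\n00:02:10,058 --> 00:02:11,530\n- Howard.\n";
const utf16Bytes = Buffer.from(cue, "utf16le");

// Read with the wrong encoding: every ASCII character is followed by a NUL
// byte, so no two digits are ever adjacent and the timing line is not found.
const misread = utf16Bytes.toString("utf8");

// Read with the right encoding: the cue parses normally.
const decoded = utf16Bytes.toString("utf16le");

console.log(timing.test(misread));
console.log(timing.test(decoded));
```

When the timing lines cannot be found, there are no nodes to clean, so the decode step has to be right before any parsing starts.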

@DrKain DrKain reopened this Jan 22, 2024
@DrKain
Owner

DrKain commented Jan 22, 2024

If you have the time, I've whipped up a test build for you to try. The file is too large for GitHub chat so I had to use Dropbox. This build changes the default encoding to utf16le and lets you pass your own encoding with --encoding utf8, so you can see what I mean about the broken characters. I'll still look into this further when I get the time, but it has been a very busy week. Thanks for reporting the issue though.

This is no longer the case: the --encoding and --encodefile parameters will be removed in the next update, as 1.7.0 added support for a bunch more formats, so they should not be required anymore.

@Gryzounours
Author

Let's imagine a test case with a lot of French and Spanish subtitles in a folder and various subfolders, some encoded as UTF-8 and others as ANSI. If we run subclean --sweep, what would happen? Would it convert the files to UTF-8 first and then run the cleaning algorithm?

@DrKain
Owner

DrKain commented Jan 22, 2024

The --sweep rule will still obey all regular parameters so the character encoding will be changed.
Currently the best way I've found to convert files is Notepad++: open the file, click "Encoding" at the top and click "Convert to UTF-8".

Clearly there's something odd going on with how NodeJS handles character encodings so I'll need to look more into this when I get the time. Or, if another user wants to fork the repo and take a shot they're more than welcome.
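For a folder with mixed encodings like the one described in the question, it can help to check which files are not valid UTF-8 before running anything that overwrites them. The following Node.js sketch does only that check. `isUtf8` and `auditSubtitles` are hypothetical helpers written for this example, and nothing here is part of subclean or its --sweep option.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: true when the bytes are well-formed UTF-8.
function isUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: recursively list .srt files under `dir`,
// flagging whether each one is valid UTF-8. Files are only read, never written.
function auditSubtitles(dir) {
  const report = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      report.push(...auditSubtitles(full));
    } else if (entry.name.toLowerCase().endsWith(".srt")) {
      report.push({ file: full, utf8: isUtf8(fs.readFileSync(full)) });
    }
  }
  return report;
}
```

Calling `auditSubtitles(folder).filter((r) => !r.utf8)` would give the files that need converting (or backing up) first. Note that a pure-ASCII file also counts as valid UTF-8, which is harmless because such a file has no accented characters to lose.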

@DrKain DrKain added priority: high This needs fixing ASAP and removed priority: medium Medium Priority labels Jan 22, 2024
@DrKain
Owner

DrKain commented Jan 22, 2024

Took 3-4 hours, but I think I've finally fixed the issue. I'm publishing a new version in a few minutes. If an error pops up for another encoding that is not supported, please open a new issue; I'll need to add custom support for the unique cases.

Thank you all for providing subtitles to test with too, they helped a lot.

@DrKain DrKain closed this as completed in b4fd008 Jan 22, 2024
3 participants