[Bug] Issue with french characters: é, è, ç #33
Comments
Thanks for reporting, do you have a sample of the subtitle file before being modified so I can look into this? |
Hey, version 1.5.0 |
Thanks for the sample. The issue appears to be caused by the file encoding (related to #8): the file you provided is ANSI, and the output, once cleaned, is UTF-8. If you're using Bazarr, you can enable an option in settings to automatically convert these files to UTF-8 without breaking characters. Or you can fix the current file in Notepad++ by clicking "Convert to UTF-8". There is an open issue for this (#8) that will be resolved when I get the time. Sorry for the inconvenience. |
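To illustrate the mismatch described above, here is a minimal Node.js sketch. It assumes "ANSI" means Windows-1252 (cp1252) and that the runtime is a standard Node.js build with full ICU, which `TextDecoder("windows-1252")` needs. It is not taken from the subclean source; the variable names are made up for the example.

```javascript
// "é è ç" as stored in a Windows-1252 ("ANSI") file: one byte per character.
const ansiBytes = Buffer.from([0xe9, 0x20, 0xe8, 0x20, 0xe7]);

// Reading those bytes as UTF-8 goes wrong: 0xE9, 0xE8 and 0xE7 are not valid
// UTF-8 on their own, so each one is replaced by U+FFFD (shown as "�").
const misread = ansiBytes.toString("utf8");

// Decoding with the encoding the file was actually written in keeps the text.
const decoded = new TextDecoder("windows-1252").decode(ansiBytes);

// Re-encoding the decoded string gives the UTF-8 bytes to write back to disk.
const utf8Bytes = Buffer.from(decoded, "utf8");

console.log(JSON.stringify(misread));
console.log(decoded, utf8Bytes.length);
```

Once the bytes have been misread and the file overwritten, the original characters are gone, which is why the conversion has to happen before any cleaning.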
Here's the cleaned file you provided with the correct format: dh1998-utf8.zip |
I don't use Bazarr, I use: https://github.com/Valyreon/Subloader Great little tool. Can't wait until you fix it ;) Have a nice day |
Whoops, I think this bug is sort of critical. I made the mistake of downloading the latest binary and confidently running the tool against my whole HTPC library. I don't have Bazarr; I used the tool directly. The Spanish subtitles that were not encoded as UTF-8 ended up full of weird characters and were overwritten. Now I'm trying to figure out which subtitles were corrupted in the process so I can search for them again :/ Great tool by the way, it does the job! Although this bug killed 30% of my library 🥲 |
Sorry to hear that @Arecsu. I'll try to prioritize a fix when I can. I've been in and out of hospital for the last few months, so I've not had a lot of time to work on this. |
If anyone has more problems with the latest version please open a new issue |
Hey! I didn't know you were going through a difficult time, I hope you and everyone else is doing well now 🙏 I went to sleep, and just woke up to find out you've managed to fix the issue. Wooow. Highly appreciate it. Will test it later. Thank you so much!! |
hey! re-opening this issue because there's still encoding issues. Here is the source:
Here is the result:
This is the log:
The source encoding is indeed cp1252. But then it seems to use ASCII to process the subtitles, and ASCII doesn't have the needed characters. Hmm, wouldn't it be better to process the files using UTF-8? Here is the file attached in TXT format: example.txt Thank you so much! |
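For reference, a small Node.js sketch of why ASCII cannot carry these characters. It only shows how Node's own `"ascii"` decoding treats a cp1252 byte; whether subclean took this exact path is an assumption based on the comment above, not something confirmed in the thread.

```javascript
// "café" as cp1252 bytes; the final 0xE9 is "é".
const cp1252Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// ASCII is 7-bit, so the byte 0xE9 has no meaning in it. Node's "ascii"
// decoding drops the high bit, which silently turns "é" into another letter.
const asAscii = cp1252Bytes.toString("ascii");

// Decoding as windows-1252 (needs a Node.js build with full ICU) keeps it.
const asCp1252 = new TextDecoder("windows-1252").decode(cp1252Bytes);

console.log(asAscii, asCp1252);
```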
Thanks for reporting, I'll look more into this later on in the week. |
This is no longer the case, |
Let's imagine a test case with a lot of French and Spanish subtitles in a folder and various subfolders, some encoded as UTF-8 and others as ANSI. If we run subclean --sweep, what would happen? Would it convert the files to UTF-8 first and then run the cleaning algorithm? |
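One way the "convert first, then clean" flow asked about here could look, as a rough Node.js sketch. The helper names `decodeSubtitle` and `normalizeToUtf8` are hypothetical, and the fallback to Windows-1252 is an assumption that suits French and Spanish files; this is not how subclean is known to work.

```javascript
// Hypothetical helper: decode subtitle bytes without knowing the encoding.
// Strict UTF-8 is tried first; if the bytes are not valid UTF-8, the file is
// assumed to be Windows-1252 ("ANSI"), which needs a full-ICU Node.js build.
function decodeSubtitle(bytes) {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // quietly inserting U+FFFD replacement characters.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf-8" };
  } catch (err) {
    const text = new TextDecoder("windows-1252").decode(bytes);
    return { text, encoding: "windows-1252" };
  }
}

// Hypothetical helper: return UTF-8 bytes regardless of the input encoding,
// so that a later cleaning step only ever sees UTF-8.
function normalizeToUtf8(bytes) {
  return Buffer.from(decodeSubtitle(bytes).text, "utf8");
}

// Mixed inputs, as in a folder holding both kinds of files.
const utf8Input = Buffer.from("\u00e9t\u00e9", "utf8"); // "été" in UTF-8
const ansiInput = Buffer.from([0xe9, 0x74, 0xe9]); // "été" in cp1252

console.log(decodeSubtitle(utf8Input).encoding);
console.log(decodeSubtitle(ansiInput).encoding);
```

The fallback is a guess, not real detection: bytes that are not valid UTF-8 could come from another legacy code page, so a tool doing this in bulk would be safer keeping a backup of each file it overwrites.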
Clearly there's something odd going on with how Node.js handles character encodings, so I'll need to look more into this when I get the time. Or, if another user wants to fork the repo and take a shot, they're more than welcome. |
It took 3-4 hours, but I think I've finally fixed the issue. I'm publishing a new version in a few minutes. If another encoding error pops up for a case that is not supported, please open a new issue; I'll need to add custom support for the unique cases. Thank you all for providing subtitles to test with too, they helped a lot. |
After running subclean on srt files with é, è, ç and many more, they are replaced by strange � characters. Is there a way to solve it?
Thanks a lot