Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

latin chars discarded #43

Open
shei-pi opened this issue May 13, 2020 · 6 comments
Open

latin chars discarded #43

shei-pi opened this issue May 13, 2020 · 6 comments

Comments

@shei-pi
Copy link
Contributor

shei-pi commented May 13, 2020

The detect function can't always guess the right encoding, so the decompress methods falls back to 'utf-8' and ignoring the errors. Therefore the latin chars get discarded.
There should be an optional param to force the decoding to be used when decoding the data.

@CosmicHorrorDev
Copy link

It seems a bit silly to always fall back to utf-8 when no character detection libraries are installed. data actually contains the encoding specified when subtitles are uploaded (even though this can be wrong if the author set it wrong) with the key "SubEncoding" so I feel like that should really be the first choice to fall back on.

@agonzalezro
Copy link
Owner

That's a really good point @LovecraftianHorror, I didn't realise we had that info on the response.

@jprabino could you change your PR #44 for testing this?

Thanks to both!

@shei-pi
Copy link
Contributor Author

shei-pi commented Jun 2, 2020

Got it. Thanks @LovecraftianHorror!

@shei-pi
Copy link
Contributor Author

shei-pi commented Jun 17, 2020

@LovecraftianHorror can you give an example?
I've been trying to test this but couldn't find where the "SubEncoding" key is supposed to be.
We get self.data using:
self.data = self.xmlrpc.DownloadSubtitles(self.token, ids)

Here's the result of that.

`self.data

'status':'200 OK'
'data':[{'data': 'H4sIAAAAAAAAA5V9zY4c...zzOeZotQAA', 'idsubtitlefile': '1955339152'}]
'seconds':0.06
len():3`

'data' inside self.data has the subtitles in base64, by the time we convert from base64 we should have the encoding already.
In any case I checked and couldn't find the "SubEncoding" anywhere.

@CosmicHorrorDev
Copy link

CosmicHorrorDev commented Jun 17, 2020

Oh sorry I should have looked into the library more to see how downloading actually worked. The "SubEncoding" is returned on the results from SearchSubtitles not on the download information. You might be able to add another parameter for passing in encodings for the subtitle ids with a dictionary to make them optional? I'm not really sure about the best way to do this then.

If you want another reference there's me getting the "SubEncoding" here for a search result and using it here for downloaded subtitles. I recently changed downloading to just directly download the content instead of decoding it which avoids the error entirely. It seems like you should be able to do that here except for when return_decoded_data is True which would force you to decode it. Maybe changing to that would be the best course of option since I doubt many people use return_decoded_data, but it would still be good to have a way to specify the decoding.

Hopefully, that gives some options. I should have looked into the code a bit more before commenting the first time.

@tiago-portugal
Copy link

tiago-portugal commented Aug 29, 2022

Thanks for the insights. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants