latin chars discarded #43

shei-pi · 2020-05-13T02:14:04Z

The detect function can't always guess the right encoding, so the decompress methods falls back to 'utf-8' and ignoring the errors. Therefore the latin chars get discarded.
There should be an optional param to force the decoding to be used when decoding the data.

CosmicHorrorDev · 2020-06-01T03:57:44Z

It seems a bit silly to always fall back to utf-8 when no character detection libraries are installed. data actually contains the encoding specified when subtitles are uploaded (even though this can be wrong if the author set it wrong) with the key "SubEncoding" so I feel like that should really be the first choice to fall back on.

agonzalezro · 2020-06-02T15:35:32Z

That's a really good point @LovecraftianHorror, I didn't realise we had that info on the response.

@jprabino could you change your PR #44 for testing this?

Thanks to both!

shei-pi · 2020-06-02T19:00:24Z

Got it. Thanks @LovecraftianHorror!

shei-pi · 2020-06-17T02:31:56Z

@LovecraftianHorror can you give an example?
I've been trying to test this but couldn't find where the "SubEncoding" key is supposed to be.
We get self.data using:
self.data = self.xmlrpc.DownloadSubtitles(self.token, ids)

Here's the result of that.

`self.data

'status':'200 OK'
'data':[{'data': 'H4sIAAAAAAAAA5V9zY4c...zzOeZotQAA', 'idsubtitlefile': '1955339152'}]
'seconds':0.06
len():3`

'data' inside self.data has the subtitles in base64, by the time we convert from base64 we should have the encoding already.
In any case I checked and couldn't find the "SubEncoding" anywhere.

CosmicHorrorDev · 2020-06-17T02:50:54Z

Oh sorry I should have looked into the library more to see how downloading actually worked. The "SubEncoding" is returned on the results from SearchSubtitles not on the download information. You might be able to add another parameter for passing in encodings for the subtitle ids with a dictionary to make them optional? I'm not really sure about the best way to do this then.

If you want another reference there's me getting the "SubEncoding" here for a search result and using it here for downloaded subtitles. I recently changed downloading to just directly download the content instead of decoding it which avoids the error entirely. It seems like you should be able to do that here except for when return_decoded_data is True which would force you to decode it. Maybe changing to that would be the best course of option since I doubt many people use return_decoded_data, but it would still be good to have a way to specify the decoding.

Hopefully, that gives some options. I should have looked into the code a bit more before commenting the first time.

tiago-portugal · 2022-08-29T22:09:39Z

Thanks for the insights. 👍

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

latin chars discarded #43

latin chars discarded #43

shei-pi commented May 13, 2020

CosmicHorrorDev commented Jun 1, 2020

agonzalezro commented Jun 2, 2020

shei-pi commented Jun 2, 2020

shei-pi commented Jun 17, 2020

CosmicHorrorDev commented Jun 17, 2020 •

edited

Loading

tiago-portugal commented Aug 29, 2022 •

edited

Loading

latin chars discarded #43

latin chars discarded #43

Comments

shei-pi commented May 13, 2020

CosmicHorrorDev commented Jun 1, 2020

agonzalezro commented Jun 2, 2020

shei-pi commented Jun 2, 2020

shei-pi commented Jun 17, 2020

CosmicHorrorDev commented Jun 17, 2020 • edited Loading

tiago-portugal commented Aug 29, 2022 • edited Loading

CosmicHorrorDev commented Jun 17, 2020 •

edited

Loading

tiago-portugal commented Aug 29, 2022 •

edited

Loading