Encoding is not taken into account when parsing file #116
Comments
Hi @edevil! Sorry for the delay. Today Floki uses Mochiweb as the HTML parser. As mentioned in the Mochiweb project, it does not support encodings other than UTF-8. Please try @mischov's suggestion to convert your document to UTF-8: rusterlium/html5ever_elixir#6 (comment) Thanks!
Ok, thanks!
Hi,
Hey @nuno84 👋 But what @mischov suggested there is that you could use the Codepagex Hex package to convert from your encoding to UTF-8.

    html = :unicode.characters_to_binary(your_html, :latin1)
    Floki.parse_document!(html)

Since
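For reference, the conversion step on its own might look like the sketch below. `EncodingSketch` and `latin1_to_utf8/1` are hypothetical names used only for illustration and are not part of Floki; the only library call is Erlang's `:unicode.characters_to_binary/2`, which returns either the converted binary or an `{:error, _, _}` / `{:incomplete, _, _}` tuple.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper (not part of Floki): convert Latin-1 bytes to UTF-8.
  # :unicode.characters_to_binary/2 returns a binary on success, or an
  # {:error, _, _} / {:incomplete, _, _} tuple when conversion fails.
  def latin1_to_utf8(bytes) when is_binary(bytes) do
    case :unicode.characters_to_binary(bytes, :latin1) do
      utf8 when is_binary(utf8) -> {:ok, utf8}
      {:error, converted, rest} -> {:error, {:invalid, converted, rest}}
      {:incomplete, converted, rest} -> {:error, {:incomplete, converted, rest}}
    end
  end
end

# 0xE9 is "é" in Latin-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
{:ok, html} = EncodingSketch.latin1_to_utf8(<<"<p>caf", 0xE9, "</p>">>)
true = String.valid?(html)
# html is now valid UTF-8 and can be handed to Floki.parse_document!/1.
```

This only helps when the input really is Latin-1. Other encodings need a different source encoding or a package such as Codepagex.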
Hi again Filipe,
I was thinking about this.
Ok, for future reference, I found that that conversion is not complete.
Any suggestion? At least it seems to be working now.
Sorry, I swapped the things. Actually
No, unfortunately it is not that simple. See the algorithm description here: https://html.spec.whatwg.org/#determining-the-character-encoding
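To give a rough idea of why it is not that simple, here is a heavily simplified sketch of two of the steps in that algorithm: checking for a byte order mark, then prescanning the first 1024 bytes for a `charset` declaration. `SniffSketch` is a hypothetical module, the regex is a crude stand-in for the spec's prescan rules, and the `"windows-1252"` fallback is an assumption (the spec leaves the default to the implementation). It ignores transport-layer headers, user overrides, and most edge cases.

```elixir
defmodule SniffSketch do
  # Simplified sketch, NOT the full WHATWG algorithm: it only checks a byte
  # order mark and then looks for a charset label in the first 1024 bytes.
  @prescan_bytes 1024

  # Byte order marks take precedence over anything declared in the markup.
  def sniff(<<0xEF, 0xBB, 0xBF, _::binary>>), do: "utf-8"
  def sniff(<<0xFE, 0xFF, _::binary>>), do: "utf-16be"
  def sniff(<<0xFF, 0xFE, _::binary>>), do: "utf-16le"

  def sniff(bytes) when is_binary(bytes) do
    head = binary_part(bytes, 0, min(byte_size(bytes), @prescan_bytes))

    # The regex has no unicode flag, so it matches raw bytes and does not
    # require the input to be valid UTF-8.
    case Regex.run(~r/<meta[^>]*charset\s*=\s*["']?([a-zA-Z0-9_\-]+)/i, head) do
      [_, label] -> String.downcase(label)
      # Assumed fallback for this sketch; the real default is locale dependent.
      nil -> "windows-1252"
    end
  end
end

# Returns the lowercased label found in the meta tag.
SniffSketch.sniff(~s(<meta charset="ISO-8859-1"><p>hello</p>))
```

The returned label still has to be mapped to an actual decoder, which is where a package like Codepagex or a native binding would come in.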
I see. This is because that dependency is using Rustler, but without precompilation. I think a solution would be to propose the usage of Rustler Precompiled there. I can help with that if you want :) I'm also planning to create another package for that, but I haven't been able to focus on that. But I have one question: are you trying to parse random pages from the internet? Or do you have some specific target that uses this specific encoding (windows-1252)?
I also figured that if it were simple, it would have been done a long time ago.
mix deps throws an error:
Added module:
Now I can call the function as usual? Is this the process or am I missing something? I read the blog post and example you did. The deps are failing but I don't know if I should try a lower version of rustler_precompiled?
For reference,
If we're parsing an XML file with an encoding:
Example:
This example was taken from: "http://manybooks.net/index.xml"
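As a possible workaround until the parser handles this itself, the declared encoding could be read from the XML declaration and the bytes converted before parsing. The sketch below is illustrative only: `XmlDeclSketch` and its functions are hypothetical names, the sample document is made up rather than taken from the feed above, and only UTF-8 and ISO-8859-1 are handled. Defaulting to UTF-8 when nothing is declared follows the XML rule for documents without a byte order mark or encoding declaration.

```elixir
defmodule XmlDeclSketch do
  # Hypothetical helper: read the encoding from an XML declaration such as
  # <?xml version="1.0" encoding="ISO-8859-1"?>. Falls back to "utf-8".
  def declared_encoding(bytes) when is_binary(bytes) do
    regex = ~r/\A<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']/

    case Regex.run(regex, bytes) do
      [_, name] -> String.downcase(name)
      nil -> "utf-8"
    end
  end

  # Convert to UTF-8 based on the declaration; unknown encodings are reported
  # instead of guessed.
  def to_utf8(bytes) when is_binary(bytes) do
    case declared_encoding(bytes) do
      "utf-8" -> {:ok, bytes}
      "iso-8859-1" -> {:ok, :unicode.characters_to_binary(bytes, :latin1)}
      other -> {:error, {:unsupported_encoding, other}}
    end
  end
end

# Made-up sample: 0xE9 is "é" in ISO-8859-1.
xml = <<"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf", 0xE9, "</t>">>
{:ok, utf8} = XmlDeclSketch.to_utf8(xml)
true = String.valid?(utf8)
```

Note that the converted text still carries the original declaration, so a parser that honors the declaration would need it rewritten or removed.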