Encoding is not taken into account when parsing file #116

Closed
edevil opened this issue May 30, 2017 · 10 comments

edevil commented May 30, 2017

If we're parsing an XML file that declares an encoding, Floki does not take that encoding into account:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Example:

> Enum.at(Floki.find(body, "description"), 7)
{"description", [],
 [<<60, 112, 62, 65, 32, 66, 97, 114, 98, 97, 114, 97, 32, 77, 97, 114, 114, 32,
    77, 117, 114, 100, 101, 114, 32, 77, 121, 115, 116, 101, 114, 121, 44, 32,
    66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62, 60, 112, 62, ...>>]}

This example was taken from: "http://manybooks.net/index.xml"

Owner

philss commented Jun 1, 2017

Hi @edevil! Sorry for the delay. Floki currently uses Mochiweb as its HTML parser. As mentioned in the Mochiweb project, it does not support encodings other than UTF-8.

Please try @mischov's suggestion to convert your document to UTF-8: rusterlium/html5ever_elixir#6 (comment)

Thanks!

philss closed this as completed Jun 1, 2017
Author

edevil commented Jun 1, 2017

Ok, thanks!

Contributor

nuno84 commented Jul 27, 2022

Hi,
A few years passed by since this thread was posted... Is there any update to this issue?
I also need to parse a charset=windows-1252 page and am not sure how to do it.
The linked reply is not very clear either, but is that the way to go?
Thank you

Owner

philss commented Jul 27, 2022

Hey @nuno84 👋
This is still a problem, since Floki does not implement the algorithm for detecting the encoding of the page.

But what @mischov suggested there is that you could use the Codepagex Hex package to convert from your encoding to UTF-8.
I think you can achieve the same result without that package, by using the :unicode module from Erlang:

html = :unicode.characters_to_binary(your_html, :latin1)

Floki.parse_document!(html)

Since latin1 (or ISO 8859-1) is a superset of windows-1252, this should work.

Contributor

nuno84 commented Jul 28, 2022

Hi again Filipe,
I tested and both solutions solve the issue. I will keep yours as it is one less dependency 👍
One last question: Is there any way to detect if I need to decode the HTML? Do I need to do some regex on the HTML head to look for encoding property or something similar? Any thoughts on that?
Thank you very much.

Contributor

nuno84 commented Jul 28, 2022

I was thinking about this.
Isn't the encoding on the meta in the head?
It doesn't seem that difficult to run a regex and apply that transform for a list of encodings, or am I oversimplifying it?
I could work on it... maybe through an apply_auto_encode option to make it optional?
What are your thoughts?

Contributor

nuno84 commented Jul 28, 2022

OK, for future reference, I found that the conversion is not complete.
The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.
I installed the package: {:tds_encoding, "~> 1.1"}
Tds.Encoding.decode(body, encoding)
And now it works.
But this installed a lot of stuff, and I am not that happy with such a big dependency:

==> toml
Compiling 10 files (.ex)
Generated toml app
==> rustler
Compiling 7 files (.ex)
Generated rustler app
==> tds_encoding
Compiling 1 file (.ex)
    Updating crates.io index
Compiling lib/tds_encoding.ex (it's taking more than 10s)
  Downloaded quote v1.0.9
  Downloaded void v1.0.2
  Downloaded unicode-xid v0.2.2
  Downloaded lazy_static v1.4.0
  Downloaded encoding-index-simpchinese v1.20141219.5
  Downloaded unicode-segmentation v1.6.0
  Downloaded encoding-index-singlebyte v1.20141219.5
  Downloaded rustler_sys v2.1.1
  Downloaded heck v0.3.1
  Downloaded rustler v0.22.0
  Downloaded rustler_codegen v0.22.0
  Downloaded encoding v0.2.33
  Downloaded proc-macro2 v1.0.29
  Downloaded syn v1.0.77
  Downloaded encoding_index_tests v0.1.4
  Downloaded unreachable v1.0.0
  Downloaded encoding-index-korean v1.20141219.5
  Downloaded encoding-index-japanese v1.20141219.5
  Downloaded encoding-index-tradchinese v1.20141219.5
  Downloaded 19 crates (1.1 MB) in 1.53s
Compiling crate tds_encoding in release mode (native/tds_encoding)
   Compiling encoding_index_tests v0.1.4
   Compiling proc-macro2 v1.0.29
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.77
   Compiling unicode-segmentation v1.6.0
   Compiling rustler_sys v2.1.1
   Compiling void v1.0.2
   Compiling rustler v0.22.0
   Compiling lazy_static v1.4.0
   Compiling encoding-index-tradchinese v1.20141219.5
   Compiling encoding-index-simpchinese v1.20141219.5
   Compiling encoding-index-korean v1.20141219.5
   Compiling encoding-index-japanese v1.20141219.5
   Compiling encoding-index-singlebyte v1.20141219.5
   Compiling unreachable v1.0.0
   Compiling heck v0.3.1
   Compiling encoding v0.2.33
   Compiling quote v1.0.9
   Compiling rustler_codegen v0.22.0
   Compiling tds_encoding v0.2.0 (C:\Users\Asus\Documents\Business\Phoenix\Projects\stageagenda_umbrella\deps\tds_encoding\native\tds_encoding)
    Finished release [optimized] target(s) in 23.12s
Generated tds_encoding app

Any suggestion? At least it seems to be working now.
Thank you

Owner

philss commented Jul 28, 2022

> OK, for future reference, I found that the conversion is not complete.
> The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.

Sorry, I got those swapped. windows-1252 is actually a superset of ISO 8859-1.
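
To illustrate why the € symbol breaks with the `:latin1` conversion, here is a minimal sketch (the variable names are only for illustration). In windows-1252 the byte 0x80 is €, while ISO 8859-1 assigns that byte to a C1 control character, so `:unicode.characters_to_binary/2` produces U+0080 instead of U+20AC. Bytes that both encodings share, such as 0xE9 (é), convert as expected.

```elixir
# 0x80 is the euro sign in windows-1252, but a C1 control character in ISO 8859-1.
windows_1252_euro = <<0x80>>

# Treating the byte as latin1 maps it to code point U+0080, not U+20AC.
as_latin1 = :unicode.characters_to_binary(windows_1252_euro, :latin1)

IO.inspect(as_latin1, binaries: :as_binaries)  # <<194, 128>>
IO.inspect("€", binaries: :as_binaries)        # <<226, 130, 172>>
IO.inspect(as_latin1 == "€")                   # false

# 0xE9 is "é" in both encodings, so this byte converts correctly.
e_acute = :unicode.characters_to_binary(<<0xE9>>, :latin1)
IO.inspect(e_acute == "é")                     # true
```

So the `:latin1` shortcut is probably only safe when the page does not use the windows-1252 characters in the 0x80 to 0x9F range (€, curly quotes, and so on).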

> Isn't the encoding on the meta in the head?
> It doesn't seem that difficult to run a regex and apply that transform for a list of encodings, or am I oversimplifying it?

No, unfortunately it is not that simple. See the algorithm description here: https://html.spec.whatwg.org/#determining-the-character-encoding
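
For comparison, the naive regex approach from the question might look like the sketch below. `CharsetSniff` and `guess/1` are hypothetical names, not part of Floki, and the 1024-byte cutoff is an assumption. It only looks for a `charset` inside a `<meta>` tag near the start of the document. The spec's algorithm also considers things this sketch ignores, such as a byte order mark and the encoding declared by the transport layer (for example the HTTP `Content-Type` header), so treat this as a rough guess and not as an implementation of that algorithm.

```elixir
defmodule CharsetSniff do
  # Hypothetical helper: a naive guess at the declared encoding.
  # It does NOT implement the WHATWG encoding sniffing algorithm.
  def guess(html) when is_binary(html) do
    # Only inspect the beginning of the document (assumed cutoff).
    head = binary_part(html, 0, min(byte_size(html), 1024))

    # Matches both <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="text/html; charset=...">.
    regex = ~r/<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i

    case Regex.run(regex, head) do
      [_, label] -> {:ok, String.downcase(label)}
      nil -> :error
    end
  end
end

html = ~s(<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head>)
IO.inspect(CharsetSniff.guess(html))
```

The returned label could then be passed to whatever decoding function you use before calling `Floki.parse_document!/1`.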

> I installed the package: {:tds_encoding, "~> 1.1"}
> Tds.Encoding.decode(body, encoding)
> And now it works.
> But this installed a lot of stuff, and I am not that happy with such a big dependency:

I see. This is because that dependency uses Rustler, but without precompilation. I think a solution would be to propose using Rustler Precompiled there. I can help with that if you want :)
But it should be really straightforward if you follow the examples.

I'm also planning to create another package for that, but I haven't been able to focus on that.

But I have one question: are you trying to parse random pages from the internet? Or do you have some specific target that uses this specific encoding (windows-1252)?

Contributor

nuno84 commented Jul 29, 2022

I also figured that if it were simple, it would have been done a long time ago.
I am parsing random pages, which is why some will eventually have "weird" encodings, but I can specify each encoding by hand. That is no problem, since each page will always be processed individually.
OK, I can try the precompiled lib you made, but I don't understand it.
I will add to deps:

...
      {:rustler_precompiled, "~> 0.5"},
      {:rustler, "~> 0.23.0", optional: true},
      {:tds_encoding, "~> 1.1"}
...

mix deps throws an error:

Failed to use "rustler" because
  apps/my_app/mix.exs requires ~> 0.22.0
  rustler_precompiled (version 0.5.1) requires ~> 0.23
  mix.lock specifies 0.22.2
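
For reference, the conflict can be reproduced with Elixir's `Version.match?/2`, assuming the requirements shown in the error message above. `~> 0.22.0` allows only 0.22.x, while `~> 0.23` allows 0.23.0 up to (but not including) 1.0.0, so no single rustler version can satisfy both.

```elixir
# Version pinned in mix.lock, taken from the error message.
locked = "0.22.2"

# The app's requirement accepts the locked version...
app_ok = Version.match?(locked, "~> 0.22.0")
IO.inspect(app_ok)           # true

# ...but the requirement from rustler_precompiled does not.
precompiled_ok = Version.match?(locked, "~> 0.23")
IO.inspect(precompiled_ok)   # false

# And a 0.23.x release would not satisfy the app's requirement.
newer_ok = Version.match?("0.23.0", "~> 0.22.0")
IO.inspect(newer_ok)         # false
```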

Added module:

defmodule MyApp.RustlerNative do
  version = Mix.Project.config()[:version]

  use RustlerPrecompiled,
    otp_app: :my_app,
    base_url:
      "https://github.com/philss/rustler_precompilation_example/releases/download/v#{version}",
    force_build: System.get_env("RUSTLER_PRECOMPILATION_EXAMPLE_BUILD") in ["1", "true"],
    version: version

  # When your NIF is loaded, it will override this function.
  def add(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end

Now I can call the function as usual?
Tds.Encoding.decode(body, encoding)

Is this the process, or am I missing something? I read the blog post and the example you made. The deps are failing, but I don't know if I should try a lower version of rustler_precompiled.
Is the work of using precompiled worth it? Is it because CI tests start everything from the ground up on every single pass? Am I understanding it right?
Thank you once again Philip.

Contributor

nuno84 commented Aug 4, 2022

For reference,
This is solved by using the lib: {:excoding, "~> 0.1.2"},
Excoding.decode(body, encoding)
More info https://github.com/philss/floki/issues/116#issuecomment-1205577086
Thank you once again.
