Encoding is not taken into account when parsing file #116

edevil · 2017-05-30T17:08:52Z

If we're parsing an XML file that declares an encoding, the declared encoding is not taken into account:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Example:

> Enum.at(Floki.find(body, "description"), 7)
{"description", [],
 [<<60, 112, 62, 65, 32, 66, 97, 114, 98, 97, 114, 97, 32, 77, 97, 114, 114, 32,
    77, 117, 114, 100, 101, 114, 32, 77, 121, 115, 116, 101, 114, 121, 44, 32,
    66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62, 60, 112, 62, ...>>]}

This example was taken from: "http://manybooks.net/index.xml"


philss · 2017-06-01T03:49:17Z

Hi @edevil! Sorry for the delay. Floki currently uses Mochiweb as its HTML parser. As mentioned in the Mochiweb project, it does not support encodings other than UTF-8.

Please try @mischov's suggestion to convert your document to UTF-8: rusterlium/html5ever_elixir#6 (comment)

Thanks!

edevil · 2017-06-01T11:22:25Z

Ok, thanks!

nuno84 · 2022-07-27T16:59:33Z

Hi,
A few years passed by since this thread was posted... Is there any update to this issue?
I also need to parse a charset=windows-1252 page and am not sure how to do it.
The linked reply is not very clear either, but is that the way to go?
Thank you

philss · 2022-07-27T22:26:08Z

Hey @nuno84 👋
This is still a problem, since Floki does not implement the algorithm for detecting the encoding of the page.

But what @mischov suggested there is that you could use the Codepagex Hex package to convert from your encoding to UTF-8.
I think you can achieve the same result without that package, by using the :unicode module from Erlang:

html = :unicode.characters_to_binary(your_html, :latin1)

Floki.parse_document!(html)

Since latin1 (or ISO 8859-1) is a superset of windows-1252, this should work.
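As a small illustration of the conversion above, here is a minimal sketch. The module name `Latin1Sketch` and the helper `to_utf8/1` are made up for this example, and the `String.valid?/1` guard is an assumption added to avoid re-encoding input that is already valid UTF-8; it is not something Floki does.

```elixir
defmodule Latin1Sketch do
  # Hypothetical helper, not part of Floki.
  # Treats the input as latin1 only when it is not already valid UTF-8.
  def to_utf8(binary) when is_binary(binary) do
    if String.valid?(binary) do
      # Already valid UTF-8: return unchanged to avoid double encoding.
      binary
    else
      # Each latin1 byte maps to the code point with the same number.
      :unicode.characters_to_binary(binary, :latin1)
    end
  end
end

# "café" encoded as latin1: the "é" is the single byte 0xE9.
latin1_bytes = <<"caf", 0xE9>>
utf8 = Latin1Sketch.to_utf8(latin1_bytes)
```

The guard is only a heuristic: a latin1 document whose bytes happen to form valid UTF-8 would be passed through unchanged, so specifying the encoding explicitly is safer when it is known.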

nuno84 · 2022-07-28T06:30:54Z

Hi again Filipe,
I tested and both solutions solve the issue. I will keep yours as it is one less dependency 👍
One last question: Is there any way to detect if I need to decode the HTML? Do I need to do some regex on the HTML head to look for encoding property or something similar? Any thoughts on that?
Thank you very much.

nuno84 · 2022-07-28T07:33:24Z

I was thinking about this.
Isn't the encoding on the meta in the head?
It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it?
I could work on it... maybe through an apply_auto_encode option to make it optional?
what are your thoughts?

nuno84 · 2022-07-28T12:52:11Z

Ok, for future reference, I found that the conversion is not complete.
The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.
I installed the package: {:tds_encoding, "~> 1.1"}
Tds.Encoding.decode(body, encoding)
And now it worked.
But this installed a lot of stuff and I am not that happy with such a big dependency:

==> toml
Compiling 10 files (.ex)
Generated toml app
==> rustler
Compiling 7 files (.ex)
Generated rustler app
==> tds_encoding
Compiling 1 file (.ex)
    Updating crates.io index
Compiling lib/tds_encoding.ex (it's taking more than 10s)B/s
  Downloaded quote v1.0.9
  Downloaded void v1.0.2
  Downloaded unicode-xid v0.2.2
  Downloaded lazy_static v1.4.0
  Downloaded encoding-index-simpchinese v1.20141219.5
  Downloaded unicode-segmentation v1.6.0
  Downloaded encoding-index-singlebyte v1.20141219.5
  Downloaded rustler_sys v2.1.1
  Downloaded heck v0.3.1
  Downloaded rustler v0.22.0
  Downloaded rustler_codegen v0.22.0
  Downloaded encoding v0.2.33
  Downloaded proc-macro2 v1.0.29
  Downloaded syn v1.0.77
  Downloaded encoding_index_tests v0.1.4
  Downloaded unreachable v1.0.0
  Downloaded encoding-index-korean v1.20141219.5
  Downloaded encoding-index-japanese v1.20141219.5
  Downloaded encoding-index-tradchinese v1.20141219.5
  Downloaded 19 crates (1.1 MB) in 1.53s
Compiling crate tds_encoding in release mode (native/tds_encoding)
   Compiling encoding_index_tests v0.1.4
   Compiling proc-macro2 v1.0.29
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.77
   Compiling unicode-segmentation v1.6.0
   Compiling rustler_sys v2.1.1
   Compiling void v1.0.2
   Compiling rustler v0.22.0
   Compiling lazy_static v1.4.0
   Compiling encoding-index-tradchinese v1.20141219.5
   Compiling encoding-index-simpchinese v1.20141219.5
   Compiling encoding-index-korean v1.20141219.5
   Compiling encoding-index-japanese v1.20141219.5
   Compiling encoding-index-singlebyte v1.20141219.5
   Compiling unreachable v1.0.0
   Compiling heck v0.3.1
   Compiling encoding v0.2.33
   Compiling quote v1.0.9
   Compiling rustler_codegen v0.22.0
   Compiling tds_encoding v0.2.0 (C:\Users\Asus\Documents\Business\Phoenix\Projects\stageagenda_umbrella\deps\tds_encoding\native\tds_encoding)
    Finished release [optimized] target(s) in 23.12s
Generated tds_encoding app

Any suggestion? At least it seems to be working now.
Thank you

philss · 2022-07-28T19:00:06Z

> Ok, for future reference, I found that the conversion is not complete.
> The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.

Sorry, I had those swapped. Actually windows-1252 is a superset of ISO 8859-1.
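To make the difference concrete: windows-1252 assigns printable characters to most bytes in the 0x80..0x9F range (for example, 0x80 is €), while latin1 treats those bytes as control characters. A dependency-free sketch of a windows-1252 decoder could look like the one below. The module name `Windows1252Sketch` is hypothetical, and mapping the five unassigned bytes to the control characters with the same number is an assumption borrowed from how browsers handle them.

```elixir
defmodule Windows1252Sketch do
  # Hypothetical module, not part of Floki or any package in this thread.
  # Code points for the 0x80..0x9F bytes that differ from latin1.
  @high %{
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
  }

  # Decodes a windows-1252 binary into a UTF-8 binary, byte by byte.
  # Bytes missing from @high (ASCII, 0xA0..0xFF, and the unassigned
  # 0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the code point of the same number.
  def decode(binary) when is_binary(binary) do
    for <<byte <- binary>>, into: "" do
      <<Map.get(@high, byte, byte)::utf8>>
    end
  end
end

# With latin1, byte 0x80 becomes the control character U+0080, not "€".
as_latin1 = :unicode.characters_to_binary(<<0x80>>, :latin1)
as_cp1252 = Windows1252Sketch.decode(<<0x80>>)
```

This only covers windows-1252; other legacy encodings need their own tables, which is what packages like the ones mentioned in this thread provide.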

> Isn't the encoding on the meta in the head?
> It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it?

No, unfortunately it is not that simple. See the algorithm description here: https://html.spec.whatwg.org/#determining-the-character-encoding
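For a sense of what the naive approach looks like, here is a deliberately simplified sketch. It is not the WHATWG algorithm linked above: it only checks for a byte order mark and then runs a regex over the first 1024 bytes, ignoring HTTP headers, comments, attribute parsing rules, and label normalization. The module name `CharsetSniffSketch`, the function `sniff/1`, and the return shapes are all made up for this example.

```elixir
defmodule CharsetSniffSketch do
  # Hypothetical helper; a rough approximation, not the WHATWG algorithm.
  @prescan_bytes 1024

  # A byte order mark, when present, takes priority.
  def sniff(<<0xEF, 0xBB, 0xBF, _::binary>>), do: {:bom, "utf-8"}
  def sniff(<<0xFE, 0xFF, _::binary>>), do: {:bom, "utf-16be"}
  def sniff(<<0xFF, 0xFE, _::binary>>), do: {:bom, "utf-16le"}

  def sniff(html) when is_binary(html) do
    # Only look at the start of the document.
    head = binary_part(html, 0, min(byte_size(html), @prescan_bytes))

    # Matches both <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="text/html; charset=...">.
    regex = ~r/<meta[^>]+charset\s*=\s*["']?([a-z0-9_-]+)/i

    case Regex.run(regex, head, capture: :all_but_first) do
      [label] -> {:meta, String.downcase(label)}
      nil -> :unknown
    end
  end
end

CharsetSniffSketch.sniff(~s(<html><head><meta charset="windows-1252"></head></html>))
```

A regex like this can be fooled by a `<meta>` tag inside a comment or script, and a declared label can disagree with the actual bytes, which is part of why the full algorithm is longer.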

> I installed the package: {:tds_encoding, "~> 1.1"}
> Tds.Encoding.decode(body, encoding)
> And now it worked.
> But this installed a lot of stuff and I am not that happy with such a big dependency:

I see. This is because that dependency uses Rustler without precompilation. I think a solution would be to propose using Rustler Precompiled there. I can help with that if you want :)
But it should be really straightforward if you follow the examples.

I'm also planning to create another package for that, but I haven't been able to focus on that.

But I have one question: are you trying to parse random pages from the internet? Or do you have some specific target that uses this specific encoding (windows-1252)?

nuno84 · 2022-07-29T08:37:44Z

I figured that if it were simple, it would have been done a long time ago.
I am parsing random pages, which is why some will eventually have "weird" encodings. But I can specify each encoding by hand; that is no problem, as each page is always processed individually.
Ok, I can try the precompiled lib you made, but I don't understand it:
I will add to deps:

...
      {:rustler_precompiled, "~> 0.5"},
      {:rustler, "~> 0.23.0", optional: true},
      {:tds_encoding, "~> 1.1"}
...

mix deps throws an error:

Failed to use "rustler" because
  apps/my_app/mix.exs requires ~> 0.22.0
  rustler_precompiled (version 0.5.1) requires ~> 0.23
  mix.lock specifies 0.22.2

Added module:

defmodule MyApp.RustlerNative do
  version = Mix.Project.config()[:version]

  use RustlerPrecompiled,
    otp_app: :my_app,
    base_url:
      "https://github.com/philss/rustler_precompilation_example/releases/download/v#{version}",
    force_build: System.get_env("RUSTLER_PRECOMPILATION_EXAMPLE_BUILD") in ["1", "true"],
    version: version

  # When your NIF is loaded, it will override this function.
  def add(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end

Now I can call the function as usual?
Tds.Encoding.decode(body, encoding)

Is this the process or am I missing something? I read the blog post and the example you made. The deps are failing, but I don't know if I should try a lower version of rustler_precompiled?
Is using precompiled worth the work? Is it because CI tests start everything from the ground up on every single run? Am I understanding it right?
Thank you once again Philip.

nuno84 · 2022-08-04T17:46:14Z

For reference,
This is solved by using the lib: {:excoding, "~> 0.1.2"},
Excoding.decode(body, encoding)
More info https://github.com/philss/floki/issues/116#issuecomment-1205577086
Thank you once again.

philss closed this as completed Jun 1, 2017

philss mentioned this issue Aug 1, 2022

Consider using RustlerPrecompiled mjaric/tds-encoding#5

Closed
