
additional meta Content-Type is added to HTML5 #1008

Closed
stefanneculai opened this issue Nov 21, 2013 · 6 comments

@stefanneculai

Hello guys,

I have the following HTML file, test.html:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
    <meta charset="UTF-8">
  </head>
  <body>
  </body>
</html>

I open the file with Nokogiri and then write it back out to another file:

require 'nokogiri'

# File.read closes the file handle once the contents are loaded.
doc = Nokogiri::HTML::Document.parse(File.read('test.html'))
File.open('output.html', 'w') do |f|
  f.puts doc.to_s
end

The output file contains <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> even though the HTML5 <meta charset> tag is already set.

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
<meta charset="UTF-8">
</head>
<body>
  </body>
</html>

Any idea how I can stop it from adding that meta tag, which is meant for HTML4 rather than HTML5?

I am using nokogiri-1.6.0.

Any help is appreciated. Thanks!

@jyw2116

jyw2116 commented Dec 9, 2013

It looks like the following commit (86d1bfb) might have fixed the issue. Can you still reproduce this on master?

@pipboy3000

I have the same issue.

require "nokogiri"

doc = Nokogiri::HTML::Document.parse <<-EOHTML
<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
    <meta charset="UTF-8">
  </head>
  <body>
  </body>
</html>
EOHTML

puts doc.to_s

Result:

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
<meta charset="UTF-8">
</head>
<body>
  </body>
</html>

My environment:

# Nokogiri (1.6.1)
    ---
    warnings: []
    nokogiri: 1.6.1
    ruby:
      version: 2.0.0
      platform: i386-mingw32
      description: ruby 2.0.0p247 (2013-06-27) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      source: system
      compiled: 2.8.0
      loaded: 2.8.0

@denisdefreyne

Also having this issue.

@hnakamur

I also have the same issue.
I found that it is caused by libxml2, so I created a patch and filed a bug report:
Bug 728436 – [PATCH] Write meta charset tag instead of meta http-equiv content-type
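
As a stopgap while the serializer still injects the tag, one option is to post-process the serialized string. The sketch below is only an illustration under stated assumptions: `strip_injected_content_type` is a hypothetical helper, not part of Nokogiri or libxml2, and it assumes the injected tag sits on its own line exactly as in the output shown in this thread. It uses plain string matching, so it should be applied to the result of `doc.to_s`.

```ruby
# Hypothetical helper (not part of Nokogiri): removes the
# <meta http-equiv="Content-Type" ...> tag that the serializer injects,
# but only when the document already declares <meta charset="...">.
HTTP_EQUIV_META = /^[ \t]*<meta\s+http-equiv=["']Content-Type["'][^>]*>[ \t]*\r?\n?/i
META_CHARSET    = /<meta\s+charset=["']?[^"'\s>]+["']?\s*\/?>/i

def strip_injected_content_type(html)
  # Leave the markup alone if there is no HTML5 charset declaration,
  # because then the http-equiv tag is the only encoding hint.
  return html unless html.match?(META_CHARSET)

  # Remove the first matching tag together with its line break.
  html.sub(HTTP_EQUIV_META, '')
end

# Serialized output as reported in this issue.
serialized = <<~HTML
  <!DOCTYPE html>
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Test</title>
  <meta charset="UTF-8">
  </head>
  <body>
    </body>
  </html>
HTML

cleaned = strip_injected_content_type(serialized)
puts cleaned
```

In the original example this would replace `f.puts doc.to_s` with `f.puts strip_injected_content_type(doc.to_s)`. A regex pass like this is fragile by design; it is meant for output whose shape is known, not for arbitrary HTML.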

rgrove added a commit to rgrove/sanitize that referenced this issue May 20, 2014
The version of libxml2 used by Nokogiri forcibly adds a content-type meta
tag to all documents with a <head> element during serialization, which is
stupid.

See also: sparklemotion/nokogiri#1008
CaseyLeask added a commit to CaseyLeask/developers.whatwg.org that referenced this issue Jan 16, 2016
We need to remove the extra charset specification, since we're adding
our own from html/head.html. We can't remove the <meta
http-equiv="Content-Type">, since there's a nokogiri bug coming from
libxml2 that just re-adds it, even when we specify a valid <meta
charset> sparklemotion/nokogiri#1008
@XhmikosR

Any news on this?

@flavorjones
Member

Hi, apologies for the extremely slow reply. Nokogiri does not support HTML5. You may want to check out the Nokogumbo project, which aims for HTML5 compatibility with the Gumbo parser.

mysociety-pusher pushed a commit to mysociety/alaveteli that referenced this issue Aug 3, 2017
`:prune` removes unknown/unsafe tags and their contents (including their
subtrees):

    unsafe_html = "ohai! <div>div is safe</div> <foo>but foo is
    <b>not</b></foo>"

    Loofah.fragment(unsafe_html).scrub!(:prune)
    # => "ohai! <div>div is safe</div> "

* Adds a `DOCTYPE` to the fixture file so that Nokogiri doesn't insert a
  HTML 4 `DOCTYPE` automatically, making comparison in the spec uglier
* Nokogiri also adds a `meta` tag to the output. Not much we can do
  about this: sparklemotion/nokogiri#1008
mysociety-pusher pushed a commit to mysociety/alaveteli that referenced this issue Aug 3, 2017
We can't use `#sanitize` here because it operates on a Loofah fragment
instead of a loofah document [1]. This results in the `<head>` and
`<body>` tags getting stripped and returning an invalid HTML page.

With Loofah's built in `:prune` scrubber we retain the old behaviour of
stripping out script tags.

`:prune` removes unknown/unsafe tags and their contents (including their
subtrees):

    unsafe_html = "ohai! <div>div is safe</div> <foo>but foo is
    <b>not</b></foo>"

    Loofah.fragment(unsafe_html).scrub!(:prune)
    # => "ohai! <div>div is safe</div> "

* Adds a `DOCTYPE` to the fixture file so that Nokogiri doesn't insert a
  HTML 4 `DOCTYPE` automatically, making comparison in the spec uglier
* Nokogiri also adds a `meta` tag to the output. Not much we can do
  about this: sparklemotion/nokogiri#1008

[1] https://github.com/flavorjones/loofah#side-note-fragments-vs-documents
lizconlan pushed a commit to mysociety/alaveteli that referenced this issue Aug 4, 2017
We can't use `#sanitize` here because it operates on a Loofah fragment
instead of a loofah document [1]. This results in the `<head>` and
`<body>` tags getting stripped and returning an invalid HTML page.

With Loofah's built in `:prune` scrubber we retain the old behaviour of
stripping out script tags.

`:prune` removes unknown/unsafe tags and their contents (including their
subtrees):

    unsafe_html = "ohai! <div>div is safe</div> <foo>but foo is
    <b>not</b></foo>"

    Loofah.fragment(unsafe_html).scrub!(:prune)
    # => "ohai! <div>div is safe</div> "

* Adds a `DOCTYPE` to the fixture file so that Nokogiri doesn't insert a
  HTML 4 `DOCTYPE` automatically, making comparison in the spec uglier
* Nokogiri also adds a `meta` tag to the output. Not much we can do
  about this: sparklemotion/nokogiri#1008

[1] https://github.com/flavorjones/loofah#side-note-fragments-vs-documents
adunkman added a commit to adunkman/dctech.tv that referenced this issue Dec 9, 2017
This is a little gross; the tag is forcibly added by nokogiri now: sparklemotion/nokogiri#1008

If it’s going to be forcibly added, at least it should be indented properly. :'(

Let’s add it to the layout for now, but if we can remove it in the future and just rely on the modern standard (<meta charset="utf-8">), let’s do it!