additional meta Content-Type is added to HTML5 #1008

stefanneculai · 2013-11-21T22:51:12Z

Hello guys,

I have the following HTML file, test.html:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
    <meta charset="UTF-8">
  </head>
  <body>
  </body>
</html>

I open the file with Nokogiri and then write it back to another file:

require 'nokogiri'

# File.read closes the file handle, unlike File.open(...).read
doc = Nokogiri::HTML::Document.parse(File.read('test.html'))
File.open('output.html', 'w') do |f|
  f.puts doc.to_s
end

The output file contains <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, even though the HTML5 <meta charset> is already set:

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
<meta charset="UTF-8">
</head>
<body>
  </body>
</html>

Any idea how I can stop it from adding that meta tag, which is for HTML4 and not for HTML5?

I am using nokogiri-1.6.0.

Any help is appreciated. Thanks!


jyw2116 · 2013-12-09T01:12:47Z

It looks like the following commit (86d1bfb) might have fixed the issue. Can you still reproduce this on master?

pipboy3000 · 2014-02-19T03:31:09Z

I have the same issue.

require "nokogiri"

doc = Nokogiri::HTML::Document.parse <<-EOHTML
<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
    <meta charset="UTF-8">
  </head>
  <body>
  </body>
</html>
EOHTML

puts doc.to_s

Result:

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
<meta charset="UTF-8">
</head>
<body>
  </body>
</html>

My environment:

# Nokogiri (1.6.1)
    ---
    warnings: []
    nokogiri: 1.6.1
    ruby:
      version: 2.0.0
      platform: i386-mingw32
      description: ruby 2.0.0p247 (2013-06-27) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      source: system
      compiled: 2.8.0
      loaded: 2.8.0

denisdefreyne · 2014-03-30T17:22:22Z

Also having this issue.

hnakamur · 2014-04-17T16:50:21Z

I also have the same issue.
I found that this is a problem in libxml2.
I created a patch to fix it and filed a bug report:
Bug 728436 – [PATCH] Write meta charset tag instead of meta http-equiv content-type

The version of libxml2 used by Nokogiri forcibly adds a content-type meta tag to all documents with a <head> element during serialization, which is stupid. See also: sparklemotion/nokogiri#1008
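Since the tag is added during serialization, one possible workaround is to post-process the serialized string rather than the parsed document. The following is only a sketch under that assumption: `strip_injected_content_type_meta` is a hypothetical helper, not part of Nokogiri or libxml2, and it relies on a regular expression matching the tag in the form shown in the outputs in this thread. It works on plain strings, so the `serialized` sample stands in for the result of `doc.to_s`.

```ruby
# Hypothetical helper (not a Nokogiri API): removes the
# <meta http-equiv="Content-Type" ...> tag from already-serialized HTML,
# but only when the markup also carries an HTML5 <meta charset=...> tag.
HTTP_EQUIV_META = /[ \t]*<meta\s+http-equiv=["']Content-Type["'][^>]*>[ \t]*\r?\n?/i
CHARSET_META    = /<meta\s+charset=["']?[^"'\s>]+["']?\s*\/?>/i

def strip_injected_content_type_meta(html)
  # Without a <meta charset>, the http-equiv tag is the only encoding
  # declaration in the markup, so leave the string untouched.
  return html unless html.match?(CHARSET_META)

  # Remove the first http-equiv tag together with its line break.
  html.sub(HTTP_EQUIV_META, '')
end

# Sample input copied from the serialized output reported in this issue.
serialized = <<~HTML
  <!DOCTYPE html>
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Test</title>
  <meta charset="UTF-8">
  </head>
  <body>
    </body>
  </html>
HTML

puts strip_injected_content_type_meta(serialized)
```

With Nokogiri loaded, the same helper could be applied as `strip_injected_content_type_meta(doc.to_s)` before writing the file. A regex over HTML is fragile, so this is best limited to output whose shape is known.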

We need to remove the extra charset specification, since we're adding our own from html/head.html. We can't remove the <meta http-equiv="Content-Type">, since there's a nokogiri bug coming from libxml2 that just re-adds it, even when we specify a valid <meta charset> sparklemotion/nokogiri#1008
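The opposite approach described here, keeping the forced http-equiv tag and dropping the layout's own declaration, could look roughly like the sketch below. `drop_redundant_meta_charset` is a hypothetical name, and the regular expressions assume the tag shapes seen in this thread. It only removes `<meta charset>` when the http-equiv tag declares the same charset, so the document never ends up without an encoding declaration.

```ruby
# Hypothetical helper (not a Nokogiri API): drops <meta charset=...> from
# serialized HTML when an http-equiv Content-Type tag already declares
# the same charset.
HTTP_EQUIV_CHARSET = /<meta\s+http-equiv=["']Content-Type["'][^>]*charset=([\w-]+)/i
META_CHARSET_LINE  = /[ \t]*<meta\s+charset=["']?([\w-]+)["']?\s*\/?>[ \t]*\r?\n?/i

def drop_redundant_meta_charset(html)
  # Charset named by the http-equiv tag, or nil if that tag is absent.
  declared = html[HTTP_EQUIV_CHARSET, 1]
  return html unless declared

  html.sub(META_CHARSET_LINE) do |tag|
    # Only remove the tag when both declarations agree (case-insensitive).
    Regexp.last_match(1).casecmp?(declared) ? '' : tag
  end
end

# Sample input shaped like the serialized output reported in this issue.
page = <<~HTML
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Test</title>
  <meta charset="utf-8">
  </head>
HTML

puts drop_redundant_meta_charset(page)
```

If the two declarations disagree, the string is returned unchanged, on the assumption that a mismatch is something a person should look at rather than something to resolve silently.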

XhmikosR · 2016-07-20T06:23:51Z

Any news on this?

flavorjones · 2017-02-10T10:03:01Z

Hi, apologies for the extremely slow reply. Nokogiri does not support HTML5. You may want to check out the Nokogumbo project, which aims for HTML5 compatibility with the Gumbo parser.


We can't use `#sanitize` here because it operates on a Loofah fragment instead of a loofah document [1]. This results in the `<head>` and `<body>` tags getting stripped and returning an invalid HTML page.

With Loofah's built in `:prune` scrubber we retain the old behaviour of stripping out script tags. `:prune` removes unknown/unsafe tags and their contents (including their subtrees):

    unsafe_html = "ohai! <div>div is safe</div> <foo>but foo is <b>not</b></foo>"
    Loofah.fragment(unsafe_html).scrub!(:prune) # => "ohai! <div>div is safe</div> "

* Adds a `DOCTYPE` to the fixture file so that Nokogiri doesn't insert a HTML 4 `DOCTYPE` automatically, making comparison in the spec uglier
* Nokogiri also adds a `meta` tag to the output. Not much we can do about this: sparklemotion/nokogiri#1008

[1] https://github.com/flavorjones/loofah#side-note-fragments-vs-documents

This is a little gross; the tag is forcibly added by nokogiri now: sparklemotion/nokogiri#1008 If it’s going to be forcibly added, at least it should be indented properly. :'( Let’s add it to the layout for now, but if we can remove it in the future and just rely on the modern standard (<meta charset="utf-8">), let’s do it!

This was referenced Jul 20, 2016

Use jekyll-mentions will add extra meta header jekyll/jekyll-mentions#37

Closed

site: fix validation errors. jekyll/jekyll#5118

Merged

flavorjones closed this as completed Feb 10, 2017

RubenVerborgh mentioned this issue Apr 12, 2017

Remove <meta> tag inserted by Nokogiri. nanoc/nanoc#1152

Closed

izawa mentioned this issue Apr 11, 2021

Made rake meetup:gen_index append links to the files under _layouts when it runs kanazawarb/meetup#1209

Merged
