-
-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug][jruby] Strikethrough tag causes incorrect HTML parsing in JRuby #2436
Comments
@davue Thanks for reporting this! That's certainly strange behavior, I'll take a look later today. |
OK, so here's a trick I use a lot to diagnose parser behavior like this: check the #! /usr/bin/env ruby
require "nokogiri"
xml = "<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>"
doc = Nokogiri::HTML4::Document.parse(xml)
doc.to_xml
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<!DOCTYPE html >\n" +
# "<html><head/><body><strong/><s><strong>Test</strong></s></body></html>"
doc.errors
# => [#<Nokogiri::XML::SyntaxError: End element <s> automatically closes element <strong>.>] So we see that the parser used by JRuby (NekoHTML) thinks that So then the question becomes: is this broken markup in HTML4? It doesn't appear to be, at least to me (but I am not an expert despite playing one on TV). Here are the MDN docs and W3C spec:
Relevant, if we change This all leads me to believe this is a bug in NekoHTML, and unfortunately there's nothing we can easily do to change that behavior without patching or upstreaming a fix. NekoHTML is not really well-maintained upstream right now. Our dirty secret, though, is that Nokogiri has already patched NekoHTML to work around other bugs in the past, so we can probably look into fixing this, too. Related, there has been some work going on to use Maven to better-define the Java dependencies (see, for example, #2432) and as part of that we are planning to upload our version of NekoHTML and NekoDTD to maven: #2437 I'll keep this open and mark it as "blocked" on #2437 for now |
OK! Thanks for your patience, the work to use maven-distributed dependencies is done, which is good news. More good news, we're now using a well-maintained fork of nekohtml from https://github.com/HtmlUnit/htmlunit-neko. The bad news is that the htmlunit-neko library exhibits this same behavior. I'm going to report this upstream. |
Err, sorry, please ignore my previous update. This is fixed in the current version of nekohtml, and in fact was fixed in v1.13.4 (when we updated nekohtml in 0feac5a). Yay! Closing. |
Please describe the bug
If you parse an HTML string containing a
<s></s>
tag. The elements above the<s></s>
tag are duplicated into a seperate child element without content and the other child with the actual content will have inverted order. Strangely this only seems to happen with<s></s>
tags, all other tags are parsed just fine.Help us reproduce what you're seeing
Open a JRuby Rails console and run:
The output will show that the parser created an additional
strong
child even though it is not present in the HTML input and the order of tags is inverted for the other child:Expected behavior
I expect the output to look like in the CRuby implementation, where only one child element is present and the order is correct:
Environment
The text was updated successfully, but these errors were encountered: