[bug][jruby] Strikethrough tag causes incorrect HTML parsing in JRuby #2436

davue · 2022-01-26T10:39:42Z

Please describe the bug
If you parse an HTML string containing an <s></s> tag, the elements above the <s></s> tag are duplicated into a separate child element without content, and the other child, which holds the actual content, has its nesting order inverted. Strangely, this only seems to happen with <s></s> tags; all other tags are parsed just fine.

Help us reproduce what you're seeing
Open a JRuby Rails console and run:

Nokogiri::HTML4::DocumentFragment.parse("<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>")

The output will show that the parser created an additional strong child even though it is not present in the HTML input and the order of tags is inverted for the other child:

#(DocumentFragment:0x888 {
  name = "#document-fragment",
  children = [
    #(Element:0x88a { name = "strong" }),
    #(Element:0x88c { name = "s", children = [ #(Element:0x88e { name = "strong", children = [ #(Text "Test")] })] })]
  })

Expected behavior
I expect the output to match the CRuby implementation, where only one child element is present and the order is correct:

#(DocumentFragment:0x13254 {
  name = "#document-fragment",
  children = [
    #(Element:0x13268 { name = "strong", children = [ #(Element:0x1327c { name = "s", children = [ #(Text "Test")] })] })]
  })
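
To compare the two outputs above without reading inspect dumps, the nesting can be reduced to plain hashes. This is only a sketch: `tree_shape` is a hypothetical helper (not part of Nokogiri), and it assumes text nodes report the name "text". The live parse is skipped when nokogiri is not installed.

```ruby
# Summarize a parsed tree as nested hashes so two parser outputs are easy to compare.
# Works on anything that responds to #name and #children (Nokogiri nodes do).
def tree_shape(node)
  child_shapes = node.children.map { |child| tree_shape(child) }
  child_shapes.empty? ? node.name : { node.name => child_shapes }
end

# Shape of the expected (CRuby) result from the report.
expected_shape = { "#document-fragment" => [{ "strong" => [{ "s" => ["text"] }] }] }

begin
  require "nokogiri"
rescue LoadError
  warn "nokogiri is not installed; skipping the live parse"
end

if defined?(Nokogiri::HTML4)
  html = "<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>"
  actual = tree_shape(Nokogiri::HTML4::DocumentFragment.parse(html))
  puts(actual == expected_shape ? "nesting preserved" : "unexpected nesting: #{actual.inspect}")
end
```

Running the same script under CRuby and JRuby gives a quick yes/no answer on whether the installed version shows the inverted nesting.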

Environment

# Nokogiri (1.13.1)
    ---
    warnings: []
    nokogiri:
      version: 1.13.1
    ruby:
      version: 2.6.8
      platform: java
      gem_platform: universal-java-1.8
      description: jruby 9.3.2.0 (2.6.8) 2021-12-01 0b8223f905 OpenJDK 64-Bit Server VM
        25.302-b08 on 1.8.0_302-b08 +jit [linux-x86_64]
      engine: jruby
      jruby: 9.3.2.0
    other_libraries:
      xerces: Xerces-J 2.12.0
      nekohtml: NekoHTML 1.9.21

flavorjones · 2022-01-26T15:17:06Z

@davue Thanks for reporting this! That's certainly strange behavior, I'll take a look later today.

flavorjones · 2022-01-26T18:12:51Z

OK, so here's a trick I use a lot to diagnose parser behavior like this: check doc.errors to see what warnings and errors the parser noticed along the way.

#! /usr/bin/env ruby

require "nokogiri"

# The input is HTML, parsed with the HTML4 parser
html = "<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>"
doc = Nokogiri::HTML4::Document.parse(html)
doc.to_xml
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
#    "<!DOCTYPE html >\n" +
#    "<html><head/><body><strong/><s><strong>Test</strong></s></body></html>"

doc.errors
# => [#<Nokogiri::XML::SyntaxError: End element <s> automatically closes element <strong>.>]

So we see that the parser used by JRuby (NekoHTML) thinks that <s> should close an open <strong>, and it closes it. Later, it likely has some logic to wrap a preceding text node when it sees a closing </strong> tag. This kind of behavior emerges when HTML parsers try to "fix up" what they consider to be broken markup.
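
As a rough sketch of that diagnostic trick, the "fix up" messages can be filtered out of the error list. `auto_close_messages` is a hypothetical helper, and the phrase it matches is taken from the NekoHTML message shown above; other parsers word their errors differently, so this is an assumption rather than a general rule.

```ruby
# Pick out the parser's "automatically closes" fix-up messages from a list of errors.
# Accepts any objects that respond to #message (Nokogiri::XML::SyntaxError does).
def auto_close_messages(errors)
  errors.map(&:message).select { |message| message.include?("automatically closes") }
end

begin
  require "nokogiri"
rescue LoadError
  warn "nokogiri is not installed; skipping the live parse"
end

if defined?(Nokogiri::HTML4)
  html = "<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>"
  doc = Nokogiri::HTML4::Document.parse(html)
  # Print every warning and error the parser recorded, then count the fix-ups.
  doc.errors.each { |error| puts error.message }
  puts "fix-ups reported: #{auto_close_messages(doc.errors).length}"
end
```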

So then the question becomes: is this broken markup in HTML4?

It doesn't appear to be, at least to me (but I am not an expert despite playing one on TV). Here are the MDN docs and W3C spec:

Relevantly, if we change <s> to <strike> (or any of the other "font style" elements like <tt>) in the repro, it works fine and we get <body><strong><strike>Test</strike></strong></body>.
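
One way to repeat that comparison is to generate the repro for several tags and print what each parse produces. This is a sketch only: `repro_html` is a hypothetical helper, and the tag list is just the elements mentioned in this thread.

```ruby
# Build the repro document from the report with a given inline tag in place of <s>.
def repro_html(tag)
  "<!DOCTYPE HTML><html><body><strong><#{tag}>Test</#{tag}></strong></body></html>"
end

begin
  require "nokogiri"
rescue LoadError
  warn "nokogiri is not installed; skipping the live parse"
end

if defined?(Nokogiri::HTML4)
  %w[s strike tt].each do |tag|
    body = Nokogiri::HTML4::Document.parse(repro_html(tag)).at_css("body")
    # An affected parser shows <strong> and the tag in swapped order for "s" only.
    puts "#{tag}: #{body.inner_html}"
  end
end
```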

This all leads me to believe this is a bug in NekoHTML, and unfortunately there's nothing we can easily do to change that behavior without patching or upstreaming a fix.

NekoHTML is not really well-maintained upstream right now. Our dirty secret, though, is that Nokogiri has already patched NekoHTML to work around other bugs in the past, so we can probably look into fixing this, too.

Related, there has been some work going on to use Maven to better define the Java dependencies (see, for example, #2432) and as part of that we are planning to upload our version of NekoHTML and NekoDTD to Maven: #2437

I'll keep this open and mark it as "blocked" on #2437 for now

flavorjones · 2022-05-10T21:56:14Z

OK! Thanks for your patience, the work to use maven-distributed dependencies is done, which is good news. More good news, we're now using a well-maintained fork of nekohtml from https://github.com/HtmlUnit/htmlunit-neko. The bad news is that the htmlunit-neko library exhibits this same behavior.

I'm going to report this upstream.

flavorjones · 2022-05-10T22:06:51Z

Err, sorry, please ignore my previous update. This is fixed in the current version of nekohtml, and in fact was fixed in Nokogiri v1.13.4 (when we updated nekohtml in 0feac5a). Yay!

Closing.
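
For anyone checking whether an installation predates the fix, a minimal sketch follows. It assumes, based only on the comment above, that 1.13.4 is the first fixed Nokogiri release and that only JRuby is affected; `affected_by_s_tag_bug?` is a hypothetical helper, not a Nokogiri API.

```ruby
# Returns true when the given Nokogiri version on the given engine predates the fix.
# Assumption from this thread: fixed in Nokogiri 1.13.4, JRuby only.
def affected_by_s_tag_bug?(nokogiri_version, engine = RUBY_ENGINE)
  engine == "jruby" && Gem::Version.new(nokogiri_version) < Gem::Version.new("1.13.4")
end

puts affected_by_s_tag_bug?("1.13.1", "jruby") # => true (the version from the report)
puts affected_by_s_tag_bug?("1.13.4", "jruby") # => false
puts affected_by_s_tag_bug?("1.13.1", "ruby")  # => false (CRuby was never affected)
```

In a real application the first argument would typically be `Nokogiri::VERSION`.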

davue added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 26, 2022

flavorjones mentioned this issue Jan 26, 2022

publish patched NekoDTD to Maven Central #2437

Closed

5 tasks

flavorjones added blocked platform/jruby and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 26, 2022

flavorjones mentioned this issue Mar 22, 2022

Bump xerces version to 2.12.2 #2482

Merged

flavorjones added vendored/nekohtml and removed blocked labels May 10, 2022

flavorjones closed this as completed May 10, 2022

flavorjones added this to the v1.13.x patch releases milestone May 10, 2022
