Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug][jruby] Strikethrough tag causes incorrect HTML parsing in JRuby #2436

Closed
davue opened this issue Jan 26, 2022 · 4 comments
Closed

[bug][jruby] Strikethrough tag causes incorrect HTML parsing in JRuby #2436

davue opened this issue Jan 26, 2022 · 4 comments

Comments

@davue
Copy link

davue commented Jan 26, 2022

Please describe the bug
If you parse an HTML string containing a <s></s> tag. The elements above the <s></s> tag are duplicated into a seperate child element without content and the other child with the actual content will have inverted order. Strangely this only seems to happen with <s></s> tags, all other tags are parsed just fine.

Help us reproduce what you're seeing
Open a JRuby Rails console and run:

Nokogiri::HTML4::DocumentFragment.parse("<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>")

The output will show that the parser created an additional strong child even though it is not present in the HTML input and the order of tags is inverted for the other child:

#(DocumentFragment:0x888 {
  name = "#document-fragment",
  children = [
    #(Element:0x88a { name = "strong" }),
    #(Element:0x88c { name = "s", children = [ #(Element:0x88e { name = "strong", children = [ #(Text "Test")] })] })]
  })

Expected behavior
I expect the output to look like in the CRuby implementation, where only one child element is present and the order is correct:

#(DocumentFragment:0x13254 {
  name = "#document-fragment",
  children = [
    #(Element:0x13268 { name = "strong", children = [ #(Element:0x1327c { name = "s", children = [ #(Text "Test")] })] })]
  })

Environment

# Nokogiri (1.13.1)
    ---
    warnings: []
    nokogiri:
      version: 1.13.1
    ruby:
      version: 2.6.8
      platform: java
      gem_platform: universal-java-1.8
      description: jruby 9.3.2.0 (2.6.8) 2021-12-01 0b8223f905 OpenJDK 64-Bit Server VM
        25.302-b08 on 1.8.0_302-b08 +jit [linux-x86_64]
      engine: jruby
      jruby: 9.3.2.0
    other_libraries:
      xerces: Xerces-J 2.12.0
      nekohtml: NekoHTML 1.9.21
@davue davue added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 26, 2022
@flavorjones
Copy link
Member

@davue Thanks for reporting this! That's certainly strange behavior, I'll take a look later today.

@flavorjones
Copy link
Member

OK, so here's a trick I use a lot to diagnose parser behavior like this: check the doc.errors to see what warnings and errors the parser noticed along the way.

#! /usr/bin/env ruby

require "nokogiri"

xml = "<!DOCTYPE HTML><html><body><strong><s>Test</s></strong></body></html>"
doc = Nokogiri::HTML4::Document.parse(xml)
doc.to_xml
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
#    "<!DOCTYPE html >\n" +
#    "<html><head/><body><strong/><s><strong>Test</strong></s></body></html>"

doc.errors
# => [#<Nokogiri::XML::SyntaxError: End element <s> automatically closes element <strong>.>]

So we see that the parser used by JRuby (NekoHTML) thinks that <s> should close an open <strong>, and it closes it. Later, it likely has some logic to wrap a preceding text node when it see a closing </strong> tag. This kind of behavior emerges when HTML parsers try to "fix up" what they consider to be broken markup.

So then the question becomes: is this broken markup in HTML4?

It doesn't appear to be, at least to me (but I am not an expert despite playing one on TV). Here are the MDN docs and W3C spec:

Relevant, if we change <s> to <strike> (or any of the other "font style" elements like tt) in the repro, it works fine and we get <body><strong><strike>Test</strike></strong></body>.

This all leads me to believe this is a bug in NekoHTML, and unfortunately there's nothing we can easily do to change that behavior without patching or upstreaming a fix.

NekoHTML is not really well-maintained upstream right now. Our dirty secret, though, is that Nokogiri has already patched NekoHTML to work around other bugs in the past, so we can probably look into fixing this, too.

Related, there has been some work going on to use Maven to better-define the Java dependencies (see, for example, #2432) and as part of that we are planning to upload our version of NekoHTML and NekoDTD to maven: #2437

I'll keep this open and mark it as "blocked" on #2437 for now

@flavorjones flavorjones added blocked platform/jruby and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 26, 2022
@flavorjones
Copy link
Member

OK! Thanks for your patience, the work to use maven-distributed dependencies is done, which is good news. More good news, we're now using a well-maintained fork of nekohtml from https://github.com/HtmlUnit/htmlunit-neko. The bad news is that the htmlunit-neko library exhibits this same behavior.

I'm going to report this upstream.

@flavorjones
Copy link
Member

Err, sorry, please ignore my previous update. This is fixed in the current version of nekohtml, and in fact was fixed in v1.13.4 (when we updated nekohtml in 0feac5a). Yay!

Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants