Replace HTML cleaner/parser with more robust one #3967

momijizukamori · 2021-01-27T02:45:51Z

Upgrade nokogumbo and sanitize
Add fixup processor based on nokogumbo HTML5 processor
Replace home-rolled renderer with one based on the redcarpet markdown renderer
Tweak a handful of unit tests to reflect fixed output
Add new unit tests for bugfixes

Fixes the following issues:

AO3-4599 - Parser: PRE tags insert extra spaces
AO3-3769 - HTML editor's parser won't give an open-paragraph tag to the paragraph after text enclosed in "center" tags, causing weird line spacing
AO3-3234 - using blockquote tags with paragraphs tags inside results in strange output
AO3-3565 - HTML parser will not allow formatting tags (italics, bold, emphasis, etc.) to span a line break tag
AO3-3313 - Using an image as a link causes the HTML parser to move the "close anchor" tag to eliminate that or another link in the same paragraph under certain circumstances
AO3-3464 - HTML parser eats text after a missing quote upon preview or posting (partial)
AO3-3916 - Carriage returns won't create line or paragraph breaks in the HTML editor if text has already been through the parser
AO3-3710 - Blockquote tags develop extra blank paragraphs before them if a work is edited

Pull Request Checklist

Have you read "How to write the perfect pull request"?
Have you read the contributing guidelines?
Have you added tests for any changed functionality?
Have you added the Jira issue number
as the first thing in your pull request title (e.g. AO3-1234 Fix thing) - Fixes multiple issues
Do you have fewer than 5 pull requests already open? If not, please wait
until they are reviewed and merged before creating new pull requests.

Issue

https://otwarchive.atlassian.net/browse/AO3-4599
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3234
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3313
https://otwarchive.atlassian.net/browse/AO3-3464
https://otwarchive.atlassian.net/browse/AO3-3916
https://otwarchive.atlassian.net/browse/AO3-3710

Purpose

This is more or less a ground-up rewrite of the formatter/cleaner code, which uses the combination of the nokogumbo HTML5 parser and the fast, lightweight redcarpet Markdown processor. The HTML5 parser is uses to correct what errors we can, and then the document is processed into Markdown compatible format, markdown special characters are escaped so the processor doesn't convert them, and then the processor is used to convert linebreaks to HTML paragraph and linebreak elements. It may seem a little silly to use a markdown processor while ignoring most of markdown syntax, but doing it this way gives a significant boost to reliability. Markdown is widely used and has a well-defined syntax. This means output is reliable, and that the libraries generating it are extensively tested, both for output and for security issues.

There are three kind of messy hacks in this, of varying severity. The first is that the Redcarpet library has some incorrect categorization of element types - namely, it thinks center is an inline element, and ins and del are block level ones. I've opened an MR with the project ( vmg/redcarpet#702 ) and have a workaround in place for testing, but if the redcarpet devs are slow about merging I think it would be better to vendorize the fixed version of the library instead of letting the workaround go to prod.

The second is the HTML fixer, which is fully half of this code. I put it in place because the existing version had it, and if we were going to do it, we should do it better, but trying to repair HTML programmatically is very fragile and error prone. The cleaner solution would be to return an error to the user, along with details about that error (nokogumbo's errors are largely fairly informative) so that they can make the fix themselves. I'd be happy to give that work a go, but it's out of scope for this, and ties into the last issue.

The last issue is the anti-pattern of sanitizing and formatting fields before they're saved to database. This results in errors introduced by buggy versions of the sanitzer or formatter being unfixable, because we can't reconstruct the original input, causes issues with editing due to multiple roundtrips through the sanitizer/formatter, results in text being improperly encoded for the context (such as HTML entities in email headers), and in combination with the auto-correcting attempts means we end up losing data sometimes. Standard best pratices are to save the raw user input, and parse/sanitize it as necessary for output, which may vary based on where it's being used - the edit field would show the raw input, with only enough escaping to keep it from breaking the field, text used in email subjects would be escaped to plaintext, and text in an HTML page would be fully HTML-sanitized and encoded. We would want to implement to bad HTML fix first, so that we're only putting well-formed user input into the db.

I realize this is a much bigger production, both in terms of making sure encoding/sanitizing is performance-optimized, and in terms of making sure we handle both old pre-sanitized data and new raw data properly. I've included an unparse function that replaces HTML linebreaks/paragraphs with raw linebreak characters, so that multiple edit/save cycles doesn't generate infinitely nested linebreak elements, but it can't do anything about other HTML changes we make, and it will process any user-input linebreak elements the same way, as it can't tell what the user added versus what the parser added.

Testing Instructions

I've done testing with a bunch of fics I have sitting around, along with some text files from Project Gutenberg. What this really needs more testing with is stuff with 'weird' mark-up - extensive use of divs/classes, tables, lots of embedded images, etc.

Credit

momijizukamori & they/them is fine :)

- Upgrade nokogumbo and sanitize - Add fixup processor based on nokogumbo HTML5 processor - Replace home-rolled renderer with one based on the redcarpet markdown renderer - Tweak a handful of unit tests to reflect fixed output - Add new unit tests for bugfixes Fixes the following issues: - AO3-4599 - Parser: PRE tags insert extra spaces - AO3-3769 - HTML editor's parser won't give an open-paragraph tag to the paragraph after text enclosed in "center" tags, causing weird line spacing - AO3-3234 - using blockquote tags with paragraphs tags inside results in strange output - AO3-3565 - HTML parser will not allow formatting tags (italics, bold, emphasis, etc.) to span a line break tag - AO3-3313 - Using an image as a link causes the HTML parser to move the "close anchor" tag to eliminate that or another link in the same paragraph under certain circumstances - AO3-3464 - HTML parser eats text after a missing quote upon preview or posting (partial) - AO3-3916 - Carriage returns won't create line or paragraph breaks in the HTML editor if text has already been through the parser - AO3-3710 - Blockquote tags develop extra blank paragraphs before them if a work is edited

hound · 2021-01-27T02:46:00Z

spec/lib/html_cleaner_spec.rb

@@ -978,33 +887,35 @@

    %w(&gt; &lt; &amp;).each do |entity|
      it "should handle #{entity}" do
-        result = add_paragraphs_to_text("#{entity}")
+        result = format_linebreaks("#{entity}")


Style/RedundantInterpolation: Prefer to_s over string interpolation.

hound · 2021-01-27T02:46:01Z

spec/lib/html_cleaner_spec.rb

      expect(doc.xpath("./p/strong/em").children.to_s.strip).to eq("unclosed")
    end
+
+    it "should close mismatched tags that wrap the same tag" do


RSpec/ExampleWording: Do not use should when describing your tests.

hound · 2021-01-27T02:46:01Z

spec/lib/html_cleaner_spec.rb

+    # AO3-3565
+    %w(b big cite code del dfn em i ins kbd q s samp
+      small span strike strong sub sup tt u var).each do |tag|
+       it "should handle #{tag} inline tags spanning <br> tags" do


RSpec/ExampleWording: Do not use should when describing your tests.

hound · 2021-01-27T02:46:01Z

spec/lib/html_cleaner_spec.rb

        doc = Nokogiri::HTML.fragment(result)
        expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some")
        expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text")
      end
    end

+    # AO3-3565
+    %w(b big cite code del dfn em i ins kbd q s samp
+      small span strike strong sub sup tt u var).each do |tag|


Layout/ArrayAlignment: Align the elements of an array literal if they span more than one line.

hound · 2021-01-27T02:46:01Z

spec/lib/html_cleaner_spec.rb

        doc = Nokogiri::HTML.fragment(result)
        expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some")
        expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text")
      end
    end

+    # AO3-3565
+    %w(b big cite code del dfn em i ins kbd q s samp


Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].

momijizukamori · 2021-01-27T02:56:32Z

Oh, as a quick note, I also ran some benchmarking between the old and new versions of the code, and they come out around the same - differences were between 2x as fast and 2x as slow. The old parser is faster at very long plaintext files (stuff around the per-chapter character max), the new one is faster at very long html-formatted files. If the recursive nokogumbo formatting step is removed, performance is the same or better as the old code.

- Fix failing unit tests - Remove now-unnecessary test for conversion of '<3' (handled by sanitzer/parser) - Re-add confusingly named 'strip_html_breaks_simple' function - Fix issue where indented lines were being parsed as codeblocks.

hound · 2021-01-30T19:46:57Z

Gemfile

-gem 'nokogumbo', '1.4.9'
+gem 'sanitize', '>= 5.0.0'
+gem 'nokogumbo', '> 2.0.0'
+gem 'redcarpet'


Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

hound · 2021-01-30T19:46:57Z

Gemfile

-# https://github.com/rubys/nokogumbo/issues/50
-gem 'nokogumbo', '1.4.9'
+gem 'sanitize', '>= 5.0.0'
+gem 'nokogumbo', '> 2.0.0'


Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

hound · 2021-01-30T19:46:57Z

Gemfile

-# https://otwarchive.atlassian.net/browse/AO3-4957
-# https://github.com/rubys/nokogumbo/issues/50
-gem 'nokogumbo', '1.4.9'
+gem 'sanitize', '>= 5.0.0'


Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

hound · 2021-01-30T19:46:58Z

lib/html_cleaner.rb

+    doc = MismatchedTagFixer.new.read(input)
+
+    # If the doc doesn't have p tags, convert br to newline for processing
+    if doc.children.all? { |n| n.name != "p"}


Style/IfUnlessModifier: Favor modifier if usage when having a single-line body. Another good alternative is the usage of control flow &&/||.
Layout/SpaceInsideBlockBraces: Space missing inside }.

hound · 2021-01-30T19:46:58Z

lib/html_cleaner.rb

+
+              content = node.to_html
+              chomped = node.to_html.chomp("</#{node.name}>")        
+              content = chomped.sub(/\n\n/, "</#{node.name}>\n\n") if content.include?("\n\n")


Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

hound · 2021-01-30T19:46:58Z

lib/html_cleaner.rb

+            # Gotta do this the hard way - search the tag nodes for one
+            # that contains our ending line
+            doc.search("#{bad_tag}").each do |node|
+              next unless node.inner_html.include?(end_block.chomp)


Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

… that is failing) - More Hound linting

hound · 2021-01-30T23:16:32Z

features/step_definitions/generic_steps.rb

@@ -212,6 +212,11 @@
  page.body.should =~ /#{Regexp.escape(text)}/m
 end

+# Unescaped match - needed to match linebreaks in selections.
+Then /^I should see the text with raw tags "(.*)"$/ do |text|
+  page.body.should =~ %r{#{text}}m


Style/RegexpLiteral: Use // around regular expression.

hound · 2021-01-30T23:16:33Z

lib/html_cleaner.rb

+
+          updated = true
+
+


Layout/EmptyLines: Extra blank line detected.

hound · 2021-01-30T23:16:33Z

lib/html_cleaner.rb

+              fixed = segment[cut..]
+
+              fixed_line = input.sub(segment, fixed)
+              input = input.sub(block, fixed_line)


Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.

hound · 2021-01-30T23:16:33Z

lib/html_cleaner.rb

+              tags_to_close = tags[(tag_i + 1)..]
+              close_str = "</#{tags_to_close.join('></')}>"
+              fixed_line = block.sub(segment, close_str + segment)
+              input = input.sub(block, fixed_line)


Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.

hound · 2021-01-30T23:16:33Z

lib/html_cleaner.rb

+            bad_tag = segment.sub(%r{</(\w*)(.|\n)*}, '\1')
+            tag_i = tags.rindex(bad_tag)
+
+            if tag_i


Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

hound · 2021-01-30T23:16:34Z

spec/lib/html_cleaner_spec.rb

      doc = Nokogiri::HTML.fragment(result)
      expect(doc.xpath("./p").size).to eq(3)
      expect(doc.xpath("./p[1]").children.to_s.strip).to eq("A")
      expect(doc.xpath("./p[2]").children.to_s.strip).to eq("B")
      expect(doc.xpath("./p[3]").children.to_s.strip).to eq("C")
    end

-    it "should not leave p inside i" do
-      result = add_paragraphs_to_text("<i><p>foo</p><p>bar</p></i>")
+   it "does not leave p inside i" do


Layout/IndentationConsistency: Inconsistent indentation detected.

hound · 2021-01-30T23:16:34Z

spec/lib/html_cleaner_spec.rb

+         doc = Nokogiri::HTML.fragment(result)
+         expect(doc.xpath("./p/#{tag}").children.to_s.strip).to eq("some<br>\ntext")
+       end
+     end


Layout/BlockAlignment: end at 771, 5 is not aligned with %w(b big cite code del dfn em i ins kbd q s samp at 764, 4 or small span strike strong sub sup tt u var).each do |tag| at 765, 6.

hound · 2021-01-30T23:16:34Z

spec/lib/html_cleaner_spec.rb

        doc = Nokogiri::HTML.fragment(result)
        expect(doc.xpath(".//p").size).to eq(2)
        expect(doc.xpath(".//#{tag}").children.to_s.strip).to eq("foo")
      end
    end

+      it "does not wrap tables in p tags" do


Layout/IndentationConsistency: Inconsistent indentation detected.

hound · 2021-01-30T23:16:34Z

spec/lib/html_cleaner_spec.rb

      doc = Nokogiri::HTML.fragment(result)
      expect(doc.xpath(".//p").size).to eq(3)
      expect(doc.xpath(".//br")).to be_empty
    end

-    %w(dl h1 h2 h3 h4 h5 h6 ol pre table ul).each do |tag|
+    %w(dl h1 h2 h3 h4 h5 h6 ol pre ul).each do |tag|


Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].

hound · 2021-01-30T23:16:34Z

spec/lib/html_cleaner_spec.rb

        expect(doc.xpath(".//br")).to be_empty
      end
    end

+    %w(blockquote center div).each do |tag|
+      it "should wrap text after #{tag} tags in paragraphs" do


RSpec/ExampleWording: Do not use should when describing your tests.

sarken · 2021-01-30T23:41:25Z

Hi, momijizukamori!

Thanks for taking a shot at fixing these issues! A rewrite of the parser is something that we've been exploring off and on as time permits, but we're still in the process of figuring out exactly what sort of functionality we want to offer (e.g. Markdown support, options regarding how whitespace is interpreted) and how best to implement it (e.g. by allowing both parsers to coexist for a time). Once we decide how to tackle a full rewrite, we'll have an issue -- probably several -- specifically for that.

However, for the time being, we'd prefer a piecemeal approach to fixing bugs to a total rewrite. If you'd like to take a shot at fixing any of the issues while keeping the current setup, or to work on something else from the issue tracker, you're welcome to create a Jira account and we'll give you permissions to claim and transition issues. (If you do that, we ask that the Full Name on the Jira account either match or include the name you wish to be credited by -- it helps us keep things organized.)

potpotkettle · 2023-06-02T10:33:56Z

FWIW, a fairly large, different rewrite was done in 2022: #4391

hound bot reviewed Jan 27, 2021

View reviewed changes

redsummernight added the Awaiting Review label Jan 29, 2021

- Linting to make Hound happy

a6716e5

- Fix failing unit tests - Remove now-unnecessary test for conversion of '<3' (handled by sanitzer/parser) - Re-add confusingly named 'strip_html_breaks_simple' function - Fix issue where indented lines were being parsed as codeblocks.

hound bot reviewed Jan 30, 2021

View reviewed changes

- Fixing failing cucumber tests for parser (no idea why stuff outside…

ad379f7

… that is failing) - More Hound linting

hound bot reviewed Jan 30, 2021

View reviewed changes

sarken closed this Jan 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace HTML cleaner/parser with more robust one #3967

Replace HTML cleaner/parser with more robust one #3967

momijizukamori commented Jan 27, 2021 •

edited

Loading

hound bot Jan 27, 2021

hound bot Jan 27, 2021

hound bot Jan 27, 2021

hound bot Jan 27, 2021

hound bot Jan 27, 2021

momijizukamori commented Jan 27, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

hound bot Jan 30, 2021

sarken commented Jan 30, 2021

potpotkettle commented Jun 2, 2023

Replace HTML cleaner/parser with more robust one #3967

Replace HTML cleaner/parser with more robust one #3967

Conversation

momijizukamori commented Jan 27, 2021 • edited Loading

Pull Request Checklist

Issue

Purpose

Testing Instructions

Credit

hound bot Jan 27, 2021

Choose a reason for hiding this comment

hound bot Jan 27, 2021

Choose a reason for hiding this comment

hound bot Jan 27, 2021

Choose a reason for hiding this comment

hound bot Jan 27, 2021

Choose a reason for hiding this comment

hound bot Jan 27, 2021

Choose a reason for hiding this comment

momijizukamori commented Jan 27, 2021

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

hound bot Jan 30, 2021

Choose a reason for hiding this comment

sarken commented Jan 30, 2021

potpotkettle commented Jun 2, 2023

momijizukamori commented Jan 27, 2021 •

edited

Loading