Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace HTML cleaner/parser with more robust one #3967

Closed
wants to merge 3 commits into from

Conversation

momijizukamori
Copy link

@momijizukamori momijizukamori commented Jan 27, 2021

  • Upgrade nokogumbo and sanitize
  • Add fixup processor based on nokogumbo HTML5 processor
  • Replace home-rolled renderer with one based on the redcarpet markdown renderer
  • Tweak a handful of unit tests to reflect fixed output
  • Add new unit tests for bugfixes

Fixes the following issues:

  • AO3-4599 - Parser: PRE tags insert extra spaces
  • AO3-3769 - HTML editor's parser won't give an open-paragraph tag to the paragraph after text enclosed in "center" tags, causing weird line spacing
  • AO3-3234 - using blockquote tags with paragraphs tags inside results in strange output
  • AO3-3565 - HTML parser will not allow formatting tags (italics, bold, emphasis, etc.) to span a line break tag
  • AO3-3313 - Using an image as a link causes the HTML parser to move the "close anchor" tag to eliminate that or another link in the same paragraph under certain circumstances
  • AO3-3464 - HTML parser eats text after a missing quote upon preview or posting (partial)
  • AO3-3916 - Carriage returns won't create line or paragraph breaks in the HTML editor if text has already been through the parser
  • AO3-3710 - Blockquote tags develop extra blank paragraphs before them if a work is edited

Pull Request Checklist

Issue

https://otwarchive.atlassian.net/browse/AO3-4599
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3234
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3313
https://otwarchive.atlassian.net/browse/AO3-3464
https://otwarchive.atlassian.net/browse/AO3-3916
https://otwarchive.atlassian.net/browse/AO3-3710

Purpose

This is more or less a ground-up rewrite of the formatter/cleaner code, which uses the combination of the nokogumbo HTML5 parser and the fast, lightweight redcarpet Markdown processor. The HTML5 parser is uses to correct what errors we can, and then the document is processed into Markdown compatible format, markdown special characters are escaped so the processor doesn't convert them, and then the processor is used to convert linebreaks to HTML paragraph and linebreak elements. It may seem a little silly to use a markdown processor while ignoring most of markdown syntax, but doing it this way gives a significant boost to reliability. Markdown is widely used and has a well-defined syntax. This means output is reliable, and that the libraries generating it are extensively tested, both for output and for security issues.

There are three kind of messy hacks in this, of varying severity. The first is that the Redcarpet library has some incorrect categorization of element types - namely, it thinks center is an inline element, and ins and del are block level ones. I've opened an MR with the project ( vmg/redcarpet#702 ) and have a workaround in place for testing, but if the redcarpet devs are slow about merging I think it would be better to vendorize the fixed version of the library instead of letting the workaround go to prod.

The second is the HTML fixer, which is fully half of this code. I put it in place because the existing version had it, and if we were going to do it, we should do it better, but trying to repair HTML programmatically is very fragile and error prone. The cleaner solution would be to return an error to the user, along with details about that error (nokogumbo's errors are largely fairly informative) so that they can make the fix themselves. I'd be happy to give that work a go, but it's out of scope for this, and ties into the last issue.

The last issue is the anti-pattern of sanitizing and formatting fields before they're saved to database. This results in errors introduced by buggy versions of the sanitzer or formatter being unfixable, because we can't reconstruct the original input, causes issues with editing due to multiple roundtrips through the sanitizer/formatter, results in text being improperly encoded for the context (such as HTML entities in email headers), and in combination with the auto-correcting attempts means we end up losing data sometimes. Standard best pratices are to save the raw user input, and parse/sanitize it as necessary for output, which may vary based on where it's being used - the edit field would show the raw input, with only enough escaping to keep it from breaking the field, text used in email subjects would be escaped to plaintext, and text in an HTML page would be fully HTML-sanitized and encoded. We would want to implement to bad HTML fix first, so that we're only putting well-formed user input into the db.

I realize this is a much bigger production, both in terms of making sure encoding/sanitizing is performance-optimized, and in terms of making sure we handle both old pre-sanitized data and new raw data properly. I've included an unparse function that replaces HTML linebreaks/paragraphs with raw linebreak characters, so that multiple edit/save cycles doesn't generate infinitely nested linebreak elements, but it can't do anything about other HTML changes we make, and it will process any user-input linebreak elements the same way, as it can't tell what the user added versus what the parser added.

Testing Instructions

I've done testing with a bunch of fics I have sitting around, along with some text files from Project Gutenberg. What this really needs more testing with is stuff with 'weird' mark-up - extensive use of divs/classes, tables, lots of embedded images, etc.

Credit

momijizukamori & they/them is fine :)

- Upgrade nokogumbo and sanitize
- Add fixup processor based on nokogumbo HTML5 processor
- Replace home-rolled renderer with one based on the redcarpet markdown renderer
- Tweak a handful of unit tests to reflect fixed output
- Add new unit tests for bugfixes

Fixes the following issues:
- AO3-4599 - Parser: PRE tags insert extra spaces
- AO3-3769 - HTML editor's parser won't give an open-paragraph tag to the paragraph after text enclosed in "center" tags, causing weird line spacing
- AO3-3234 - using blockquote tags with paragraphs tags inside results in strange output
- AO3-3565 - HTML parser will not allow formatting tags (italics, bold, emphasis, etc.) to span a line break tag
- AO3-3313 - Using an image as a link causes the HTML parser to move the "close anchor" tag to eliminate that or another link in the same paragraph under certain circumstances
- AO3-3464 - HTML parser eats text after a missing quote upon preview or posting (partial)
- AO3-3916 - Carriage returns won't create line or paragraph breaks in the HTML editor if text has already been through the parser
- AO3-3710 - Blockquote tags develop extra blank paragraphs before them if a work is edited
@@ -978,33 +887,35 @@

%w(> < &).each do |entity|
it "should handle #{entity}" do
result = add_paragraphs_to_text("#{entity}")
result = format_linebreaks("#{entity}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/RedundantInterpolation: Prefer to_s over string interpolation.

expect(doc.xpath("./p/strong/em").children.to_s.strip).to eq("unclosed")
end

it "should close mismatched tags that wrap the same tag" do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/ExampleWording: Do not use should when describing your tests.

# AO3-3565
%w(b big cite code del dfn em i ins kbd q s samp
small span strike strong sub sup tt u var).each do |tag|
it "should handle #{tag} inline tags spanning <br> tags" do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/ExampleWording: Do not use should when describing your tests.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some")
expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text")
end
end

# AO3-3565
%w(b big cite code del dfn em i ins kbd q s samp
small span strike strong sub sup tt u var).each do |tag|
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/ArrayAlignment: Align the elements of an array literal if they span more than one line.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some")
expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text")
end
end

# AO3-3565
%w(b big cite code del dfn em i ins kbd q s samp
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].

@momijizukamori
Copy link
Author

Oh, as a quick note, I also ran some benchmarking between the old and new versions of the code, and they come out around the same - differences were between 2x as fast and 2x as slow. The old parser is faster at very long plaintext files (stuff around the per-chapter character max), the new one is faster at very long html-formatted files. If the recursive nokogumbo formatting step is removed, performance is the same or better as the old code.

- Fix failing unit tests
- Remove now-unnecessary test for conversion of '<3' (handled by sanitzer/parser)
- Re-add confusingly named 'strip_html_breaks_simple' function
- Fix issue where indented lines were being parsed as codeblocks.
gem 'nokogumbo', '1.4.9'
gem 'sanitize', '>= 5.0.0'
gem 'nokogumbo', '> 2.0.0'
gem 'redcarpet'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

# https://github.com/rubys/nokogumbo/issues/50
gem 'nokogumbo', '1.4.9'
gem 'sanitize', '>= 5.0.0'
gem 'nokogumbo', '> 2.0.0'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

# https://otwarchive.atlassian.net/browse/AO3-4957
# https://github.com/rubys/nokogumbo/issues/50
gem 'nokogumbo', '1.4.9'
gem 'sanitize', '>= 5.0.0'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

doc = MismatchedTagFixer.new.read(input)

# If the doc doesn't have p tags, convert br to newline for processing
if doc.children.all? { |n| n.name != "p"}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/IfUnlessModifier: Favor modifier if usage when having a single-line body. Another good alternative is the usage of control flow &&/||.
Layout/SpaceInsideBlockBraces: Space missing inside }.


content = node.to_html
chomped = node.to_html.chomp("</#{node.name}>")
content = chomped.sub(/\n\n/, "</#{node.name}>\n\n") if content.include?("\n\n")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

# Gotta do this the hard way - search the tag nodes for one
# that contains our ending line
doc.search("#{bad_tag}").each do |node|
next unless node.inner_html.include?(end_block.chomp)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

@@ -212,6 +212,11 @@
page.body.should =~ /#{Regexp.escape(text)}/m
end

# Unescaped match - needed to match linebreaks in selections.
Then /^I should see the text with raw tags "(.*)"$/ do |text|
page.body.should =~ %r{#{text}}m
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/RegexpLiteral: Use // around regular expression.


updated = true


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLines: Extra blank line detected.

fixed = segment[cut..]

fixed_line = input.sub(segment, fixed)
input = input.sub(block, fixed_line)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.

tags_to_close = tags[(tag_i + 1)..]
close_str = "</#{tags_to_close.join('></')}>"
fixed_line = block.sub(segment, close_str + segment)
input = input.sub(block, fixed_line)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.

bad_tag = segment.sub(%r{</(\w*)(.|\n)*}, '\1')
tag_i = tags.rindex(bad_tag)

if tag_i
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockNesting: Avoid more than 3 levels of block nesting.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath("./p").size).to eq(3)
expect(doc.xpath("./p[1]").children.to_s.strip).to eq("A")
expect(doc.xpath("./p[2]").children.to_s.strip).to eq("B")
expect(doc.xpath("./p[3]").children.to_s.strip).to eq("C")
end

it "should not leave p inside i" do
result = add_paragraphs_to_text("<i><p>foo</p><p>bar</p></i>")
it "does not leave p inside i" do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/IndentationConsistency: Inconsistent indentation detected.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath("./p/#{tag}").children.to_s.strip).to eq("some<br>\ntext")
end
end
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/BlockAlignment: end at 771, 5 is not aligned with %w(b big cite code del dfn em i ins kbd q s samp at 764, 4 or small span strike strong sub sup tt u var).each do |tag| at 765, 6.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath(".//p").size).to eq(2)
expect(doc.xpath(".//#{tag}").children.to_s.strip).to eq("foo")
end
end

it "does not wrap tables in p tags" do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/IndentationConsistency: Inconsistent indentation detected.

doc = Nokogiri::HTML.fragment(result)
expect(doc.xpath(".//p").size).to eq(3)
expect(doc.xpath(".//br")).to be_empty
end

%w(dl h1 h2 h3 h4 h5 h6 ol pre table ul).each do |tag|
%w(dl h1 h2 h3 h4 h5 h6 ol pre ul).each do |tag|
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].

expect(doc.xpath(".//br")).to be_empty
end
end

%w(blockquote center div).each do |tag|
it "should wrap text after #{tag} tags in paragraphs" do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSpec/ExampleWording: Do not use should when describing your tests.

@sarken
Copy link
Collaborator

sarken commented Jan 30, 2021

Hi, momijizukamori!

Thanks for taking a shot at fixing these issues! A rewrite of the parser is something that we've been exploring off and on as time permits, but we're still in the process of figuring out exactly what sort of functionality we want to offer (e.g. Markdown support, options regarding how whitespace is interpreted) and how best to implement it (e.g. by allowing both parsers to coexist for a time). Once we decide how to tackle a full rewrite, we'll have an issue -- probably several -- specifically for that.

However, for the time being, we'd prefer a piecemeal approach to fixing bugs to a total rewrite. If you'd like to take a shot at fixing any of the issues while keeping the current setup, or to work on something else from the issue tracker, you're welcome to create a Jira account and we'll give you permissions to claim and transition issues. (If you do that, we ask that the Full Name on the Jira account either match or include the name you wish to be credited by -- it helps us keep things organized.)

@sarken sarken closed this Jan 30, 2021
@potpotkettle
Copy link
Contributor

FWIW, a fairly large, different rewrite was done in 2022: #4391

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants