-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace HTML cleaner/parser with more robust one #3967
Conversation
- Upgrade nokogumbo and sanitize - Add fixup processor based on nokogumbo HTML5 processor - Replace home-rolled renderer with one based on the redcarpet markdown renderer - Tweak a handful of unit tests to reflect fixed output - Add new unit tests for bugfixes Fixes the following issues: - AO3-4599 - Parser: PRE tags insert extra spaces - AO3-3769 - HTML editor's parser won't give an open-paragraph tag to the paragraph after text enclosed in "center" tags, causing weird line spacing - AO3-3234 - using blockquote tags with paragraphs tags inside results in strange output - AO3-3565 - HTML parser will not allow formatting tags (italics, bold, emphasis, etc.) to span a line break tag - AO3-3313 - Using an image as a link causes the HTML parser to move the "close anchor" tag to eliminate that or another link in the same paragraph under certain circumstances - AO3-3464 - HTML parser eats text after a missing quote upon preview or posting (partial) - AO3-3916 - Carriage returns won't create line or paragraph breaks in the HTML editor if text has already been through the parser - AO3-3710 - Blockquote tags develop extra blank paragraphs before them if a work is edited
@@ -978,33 +887,35 @@ | |||
|
|||
%w(> < &).each do |entity| | |||
it "should handle #{entity}" do | |||
result = add_paragraphs_to_text("#{entity}") | |||
result = format_linebreaks("#{entity}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/RedundantInterpolation: Prefer to_s over string interpolation.
expect(doc.xpath("./p/strong/em").children.to_s.strip).to eq("unclosed") | ||
end | ||
|
||
it "should close mismatched tags that wrap the same tag" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSpec/ExampleWording: Do not use should when describing your tests.
# AO3-3565 | ||
%w(b big cite code del dfn em i ins kbd q s samp | ||
small span strike strong sub sup tt u var).each do |tag| | ||
it "should handle #{tag} inline tags spanning <br> tags" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSpec/ExampleWording: Do not use should when describing your tests.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some") | ||
expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text") | ||
end | ||
end | ||
|
||
# AO3-3565 | ||
%w(b big cite code del dfn em i ins kbd q s samp | ||
small span strike strong sub sup tt u var).each do |tag| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Layout/ArrayAlignment: Align the elements of an array literal if they span more than one line.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath("./p[1]/#{tag}").children.to_s.strip).to eq("some") | ||
expect(doc.xpath("./p[2]/#{tag}").children.to_s.strip).to eq("text") | ||
end | ||
end | ||
|
||
# AO3-3565 | ||
%w(b big cite code del dfn em i ins kbd q s samp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].
Oh, as a quick note, I also ran some benchmarking between the old and new versions of the code, and they come out around the same - differences were between 2x as fast and 2x as slow. The old parser is faster at very long plaintext files (stuff around the per-chapter character max), the new one is faster at very long html-formatted files. If the recursive nokogumbo formatting step is removed, performance is the same or better as the old code. |
- Fix failing unit tests - Remove now-unnecessary test for conversion of '<3' (handled by sanitzer/parser) - Re-add confusingly named 'strip_html_breaks_simple' function - Fix issue where indented lines were being parsed as codeblocks.
gem 'nokogumbo', '1.4.9' | ||
gem 'sanitize', '>= 5.0.0' | ||
gem 'nokogumbo', '> 2.0.0' | ||
gem 'redcarpet' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.
# https://github.com/rubys/nokogumbo/issues/50 | ||
gem 'nokogumbo', '1.4.9' | ||
gem 'sanitize', '>= 5.0.0' | ||
gem 'nokogumbo', '> 2.0.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.
# https://otwarchive.atlassian.net/browse/AO3-4957 | ||
# https://github.com/rubys/nokogumbo/issues/50 | ||
gem 'nokogumbo', '1.4.9' | ||
gem 'sanitize', '>= 5.0.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/StringLiterals: Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.
doc = MismatchedTagFixer.new.read(input) | ||
|
||
# If the doc doesn't have p tags, convert br to newline for processing | ||
if doc.children.all? { |n| n.name != "p"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/IfUnlessModifier: Favor modifier if usage when having a single-line body. Another good alternative is the usage of control flow &&/||.
Layout/SpaceInsideBlockBraces: Space missing inside }.
|
||
content = node.to_html | ||
chomped = node.to_html.chomp("</#{node.name}>") | ||
content = chomped.sub(/\n\n/, "</#{node.name}>\n\n") if content.include?("\n\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics/BlockNesting: Avoid more than 3 levels of block nesting.
# Gotta do this the hard way - search the tag nodes for one | ||
# that contains our ending line | ||
doc.search("#{bad_tag}").each do |node| | ||
next unless node.inner_html.include?(end_block.chomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics/BlockNesting: Avoid more than 3 levels of block nesting.
… that is failing) - More Hound linting
@@ -212,6 +212,11 @@ | |||
page.body.should =~ /#{Regexp.escape(text)}/m | |||
end | |||
|
|||
# Unescaped match - needed to match linebreaks in selections. | |||
Then /^I should see the text with raw tags "(.*)"$/ do |text| | |||
page.body.should =~ %r{#{text}}m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/RegexpLiteral: Use // around regular expression.
|
||
updated = true | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Layout/EmptyLines: Extra blank line detected.
fixed = segment[cut..] | ||
|
||
fixed_line = input.sub(segment, fixed) | ||
input = input.sub(block, fixed_line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.
tags_to_close = tags[(tag_i + 1)..] | ||
close_str = "</#{tags_to_close.join('></')}>" | ||
fixed_line = block.sub(segment, close_str + segment) | ||
input = input.sub(block, fixed_line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/IdenticalConditionalBranches: Move input = input.sub(block, fixed_line) out of the conditional.
bad_tag = segment.sub(%r{</(\w*)(.|\n)*}, '\1') | ||
tag_i = tags.rindex(bad_tag) | ||
|
||
if tag_i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics/BlockNesting: Avoid more than 3 levels of block nesting.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath("./p").size).to eq(3) | ||
expect(doc.xpath("./p[1]").children.to_s.strip).to eq("A") | ||
expect(doc.xpath("./p[2]").children.to_s.strip).to eq("B") | ||
expect(doc.xpath("./p[3]").children.to_s.strip).to eq("C") | ||
end | ||
|
||
it "should not leave p inside i" do | ||
result = add_paragraphs_to_text("<i><p>foo</p><p>bar</p></i>") | ||
it "does not leave p inside i" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Layout/IndentationConsistency: Inconsistent indentation detected.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath("./p/#{tag}").children.to_s.strip).to eq("some<br>\ntext") | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Layout/BlockAlignment: end at 771, 5 is not aligned with %w(b big cite code del dfn em i ins kbd q s samp at 764, 4 or small span strike strong sub sup tt u var).each do |tag| at 765, 6.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath(".//p").size).to eq(2) | ||
expect(doc.xpath(".//#{tag}").children.to_s.strip).to eq("foo") | ||
end | ||
end | ||
|
||
it "does not wrap tables in p tags" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Layout/IndentationConsistency: Inconsistent indentation detected.
doc = Nokogiri::HTML.fragment(result) | ||
expect(doc.xpath(".//p").size).to eq(3) | ||
expect(doc.xpath(".//br")).to be_empty | ||
end | ||
|
||
%w(dl h1 h2 h3 h4 h5 h6 ol pre table ul).each do |tag| | ||
%w(dl h1 h2 h3 h4 h5 h6 ol pre ul).each do |tag| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style/PercentLiteralDelimiters: %w-literals should be delimited by [ and ].
expect(doc.xpath(".//br")).to be_empty | ||
end | ||
end | ||
|
||
%w(blockquote center div).each do |tag| | ||
it "should wrap text after #{tag} tags in paragraphs" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSpec/ExampleWording: Do not use should when describing your tests.
Hi, momijizukamori! Thanks for taking a shot at fixing these issues! A rewrite of the parser is something that we've been exploring off and on as time permits, but we're still in the process of figuring out exactly what sort of functionality we want to offer (e.g. Markdown support, options regarding how whitespace is interpreted) and how best to implement it (e.g. by allowing both parsers to coexist for a time). Once we decide how to tackle a full rewrite, we'll have an issue -- probably several -- specifically for that. However, for the time being, we'd prefer a piecemeal approach to fixing bugs to a total rewrite. If you'd like to take a shot at fixing any of the issues while keeping the current setup, or to work on something else from the issue tracker, you're welcome to create a Jira account and we'll give you permissions to claim and transition issues. (If you do that, we ask that the Full Name on the Jira account either match or include the name you wish to be credited by -- it helps us keep things organized.) |
FWIW, a fairly large, different rewrite was done in 2022: #4391 |
Fixes the following issues:
Pull Request Checklist
as the first thing in your pull request title (e.g.
AO3-1234 Fix thing
) - Fixes multiple issuesuntil they are reviewed and merged before creating new pull requests.
Issue
https://otwarchive.atlassian.net/browse/AO3-4599
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3234
https://otwarchive.atlassian.net/browse/AO3-3565
https://otwarchive.atlassian.net/browse/AO3-3313
https://otwarchive.atlassian.net/browse/AO3-3464
https://otwarchive.atlassian.net/browse/AO3-3916
https://otwarchive.atlassian.net/browse/AO3-3710
Purpose
This is more or less a ground-up rewrite of the formatter/cleaner code, which uses the combination of the nokogumbo HTML5 parser and the fast, lightweight redcarpet Markdown processor. The HTML5 parser is uses to correct what errors we can, and then the document is processed into Markdown compatible format, markdown special characters are escaped so the processor doesn't convert them, and then the processor is used to convert linebreaks to HTML paragraph and linebreak elements. It may seem a little silly to use a markdown processor while ignoring most of markdown syntax, but doing it this way gives a significant boost to reliability. Markdown is widely used and has a well-defined syntax. This means output is reliable, and that the libraries generating it are extensively tested, both for output and for security issues.
There are three kind of messy hacks in this, of varying severity. The first is that the Redcarpet library has some incorrect categorization of element types - namely, it thinks
center
is an inline element, andins
anddel
are block level ones. I've opened an MR with the project ( vmg/redcarpet#702 ) and have a workaround in place for testing, but if the redcarpet devs are slow about merging I think it would be better to vendorize the fixed version of the library instead of letting the workaround go to prod.The second is the HTML fixer, which is fully half of this code. I put it in place because the existing version had it, and if we were going to do it, we should do it better, but trying to repair HTML programmatically is very fragile and error prone. The cleaner solution would be to return an error to the user, along with details about that error (nokogumbo's errors are largely fairly informative) so that they can make the fix themselves. I'd be happy to give that work a go, but it's out of scope for this, and ties into the last issue.
The last issue is the anti-pattern of sanitizing and formatting fields before they're saved to database. This results in errors introduced by buggy versions of the sanitzer or formatter being unfixable, because we can't reconstruct the original input, causes issues with editing due to multiple roundtrips through the sanitizer/formatter, results in text being improperly encoded for the context (such as HTML entities in email headers), and in combination with the auto-correcting attempts means we end up losing data sometimes. Standard best pratices are to save the raw user input, and parse/sanitize it as necessary for output, which may vary based on where it's being used - the edit field would show the raw input, with only enough escaping to keep it from breaking the field, text used in email subjects would be escaped to plaintext, and text in an HTML page would be fully HTML-sanitized and encoded. We would want to implement to bad HTML fix first, so that we're only putting well-formed user input into the db.
I realize this is a much bigger production, both in terms of making sure encoding/sanitizing is performance-optimized, and in terms of making sure we handle both old pre-sanitized data and new raw data properly. I've included an
unparse
function that replaces HTML linebreaks/paragraphs with raw linebreak characters, so that multiple edit/save cycles doesn't generate infinitely nested linebreak elements, but it can't do anything about other HTML changes we make, and it will process any user-input linebreak elements the same way, as it can't tell what the user added versus what the parser added.Testing Instructions
I've done testing with a bunch of fics I have sitting around, along with some text files from Project Gutenberg. What this really needs more testing with is stuff with 'weird' mark-up - extensive use of divs/classes, tables, lots of embedded images, etc.
Credit
momijizukamori & they/them is fine :)