Implement support for htmllaundry/lxml.cleaner
By passing the HTML code through both htmllaundry and bleach
we can achieve a much better result than with either of these
libraries individually.
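
The two-pass idea can be sketched as a simple pipeline (a minimal, hypothetical sketch; the add-on's actual wiring and cleaner configuration live in `html_cleaner/main.py`):

```python
# Hypothetical sketch of the two-pass cleaning described above:
# each pass can fix artifacts the previous one left behind.

def chain_cleaners(markup, *cleaners):
    """Run markup through each cleaner callable in order."""
    for clean in cleaners:
        markup = clean(markup)
    return markup

# In the add-on the passes would be htmllaundry's sanitize() followed
# by bleach.clean(); any string -> string callables work here.
```

With the real libraries installed, the call would look roughly like `chain_cleaners(html, sanitize, bleach.clean)`, with bleach's whitelist configured separately.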
glutanimate committed May 26, 2017
1 parent 05ec38c commit 88eff17
Showing 177 changed files with 79,353 additions and 25 deletions.
13 changes: 10 additions & 3 deletions .gitattributes
@@ -2,9 +2,16 @@
.gitattributes export-ignore
.gitignore export-ignore
docs export-ignore
/screenshots export-ignore
/tools export-ignore
screenshots export-ignore
tools export-ignore
html_cleaner/test.py export-ignore
obsolete export-ignore
# Adjust GitHub linguist settings:
ANKIWEB.md linguist-documentation
html_cleaner/bleach/* linguist-vendored
html_cleaner/html5lib/* linguist-vendored
html_cleaner/htmllaundry/* linguist-vendored
html_cleaner/LICENSES/* linguist-vendored
html_cleaner/lxml/* linguist-vendored
html_cleaner/webencodings/* linguist-vendored
html_cleaner/six.py linguist-vendored
8 changes: 8 additions & 0 deletions README.md
@@ -14,6 +14,12 @@ The add-on comes with a button and two hotkeys:

The add-on's HTML processing is highly configurable. All options can be accessed by editing the configuration section of `html_cleaner/main.py`.

## Platform Support

HTML processing is provided by the Bleach library on all platforms. Additionally, the add-on utilizes the [`htmllaundry` library](https://github.com/wichert/htmllaundry), which can improve the cleaning results under some circumstances.

`htmllaundry` depends on `lxml`, which Anki unfortunately does not ship with. In contrast to the other libraries included in this add-on, `lxml` cannot easily be packaged for all platforms because it needs to be compiled. For that reason, `htmllaundry` support is only available on Windows and Linux right now.
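
The fallback described above could be gated at runtime along these lines (an illustrative sketch, not the add-on's actual code; the function name is made up):

```python
import importlib.util

# lxml is a compiled extension, so it may simply be absent on some platforms.
LXML_AVAILABLE = importlib.util.find_spec("lxml") is not None

def pick_pipeline():
    """Use the htmllaundry/lxml pass when lxml can be imported,
    otherwise fall back to bleach-only cleaning."""
    return "htmllaundry+bleach" if LXML_AVAILABLE else "bleach-only"
```

Checking `find_spec` avoids importing `lxml` just to discover it is missing, and keeps the decision in one place.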

## License and Credits

*Cloze Overlapper* is *Copyright © 2016-2017 [Aristotelis P.](https://github.com/Glutanimate)*
@@ -24,5 +30,7 @@ This add-on would not have been possible without the following open-source libraries:

- [Bleach](https://github.com/mozilla/bleach) 2.0.0. Copyright (c) 2014-2017 Mozilla Foundation. Licensed under the Apache License 2.0.
- [html5lib](https://github.com/html5lib/) 0.999999999. Copyright (c) 2006-2013 James Graham and other contributors. Licensed under the MIT license.
- [htmllaundry](https://github.com/wichert/htmllaundry) 2.1. Copyright (c) 2010-2016 Wichert Akkerman. Licensed under the BSD license.
- [lxml](http://lxml.de/) 3.7.3. Copyright (c) Infrae. Licensed under the BSD license.
- [webencodings](https://github.com/gsnedders/python-webencodings) 0.5.1. Copyright (c) 2012-2017 Geoffrey Sneddon. Licensed under the BSD license.
- [six](https://github.com/benjaminp/six) 1.10.0. Copyright (c) 2010-2015 Benjamin Peterson. Licensed under the MIT license.
26 changes: 26 additions & 0 deletions html_cleaner/LICENSES/LICENSE_HTMLLAUNDRY
@@ -0,0 +1,26 @@
Copyright (c) 2010-2016, Wichert Akkerman
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the FreeBSD Project.
29 changes: 29 additions & 0 deletions html_cleaner/LICENSES/LICENSE_LXML
@@ -0,0 +1,29 @@
lxml is copyright Infrae and distributed under the BSD license (see
doc/licenses/BSD.txt), with the following exceptions:

Some code, such a selftest.py, selftest2.py and
src/lxml/_elementpath.py are derived from ElementTree and
cElementTree. See doc/licenses/elementtree.txt for the license text.

lxml.cssselect and lxml.html are copyright Ian Bicking and distributed
under the BSD license (see doc/licenses/BSD.txt).

test.py, the test-runner script, is GPL and copyright Shuttleworth
Foundation. See doc/licenses/GPL.txt. It is believed the unchanged
inclusion of test.py to run the unit test suite falls under the
"aggregation" clause of the GPL and thus does not affect the license
of the rest of the package.

The isoschematron implementation uses several XSL and RelaxNG resources:
* The (XML syntax) RelaxNG schema for schematron, copyright International
Organization for Standardization (see
src/lxml/isoschematron/resources/rng/iso-schematron.rng for the license
text)
* The skeleton iso-schematron-xlt1 pure-xslt schematron implementation
xsl stylesheets, copyright Rick Jelliffe and Academia Sinica Computing
Center, Taiwan (see the xsl files here for the license text:
src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/)
* The xsd/rng schema schematron extraction xsl transformations are unlicensed
and copyright the respective authors as noted (see
src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl and
src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl)
6 changes: 6 additions & 0 deletions html_cleaner/htmllaundry/__init__.py
@@ -0,0 +1,6 @@
from htmllaundry.utils import sanitize
from htmllaundry.utils import strip_markup
from htmllaundry.utils import StripMarkup


__all__ = ['sanitize', 'strip_markup', 'StripMarkup']
77 changes: 77 additions & 0 deletions html_cleaner/htmllaundry/cleaners.py
@@ -0,0 +1,77 @@
from lxml.html.clean import Cleaner
from lxml.html.clean import _find_external_links


marker = []


class LaundryCleaner(Cleaner):
link_target = marker

def __call__(self, doc):
super(LaundryCleaner, self).__call__(doc)
if self.link_target is not marker:
self.force_link_target(doc, self.link_target)

def force_link_target(self, doc, target):
for el in _find_external_links(doc):
if target is None:
if 'target' in el.attrib:
del el.attrib['target']
else:
el.set('target', target)


DocumentCleaner = LaundryCleaner(
page_structure=False,
remove_unknown_tags=False,
allow_tags=['blockquote', 'a', 'img', 'em', 'p', 'strong',
'h3', 'h4', 'h5', 'ul', 'ol', 'li', 'sub', 'sup',
'abbr', 'acronym', 'dl', 'dt', 'dd', 'cite',
'dft', 'br', 'table', 'tr', 'td', 'th', 'thead',
'tbody', 'tfoot'],
safe_attrs_only=True,
add_nofollow=True,
scripts=True,
javascript=True,
comments=False,
style=True,
links=False,
meta=False,
processing_instructions=False,
frames=False,
annoying_tags=False)


# Useful for line fields such as titles
LineCleaner = LaundryCleaner(
page_structure=False,
safe_attrs_only=True,
remove_unknown_tags=False, # Weird API..
allow_tags=['em', 'strong'],
add_nofollow=True,
scripts=True,
javascript=True,
comments=False,
style=True,
processing_instructions=False,
frames=False,
annoying_tags=False)

CommentCleaner = LaundryCleaner(
page_structure=False,
safe_attrs_only=True,
remove_unknown_tags=False, # Weird API..
allow_tags=['blockquote', 'a', 'em', 'p', 'strong'],
add_nofollow=True,
scripts=False,
javascript=True,
comments=False,
style=True,
processing_instructions=False,
frames=False,
annoying_tags=False,
link_target="_blank")


__all__ = ['DocumentCleaner', 'LineCleaner', 'CommentCleaner']
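
Worth noting: `marker = []` above is a sentinel object, used so that an explicit `link_target=None` (meaning "strip `target` attributes") can be distinguished from "option not set". The pattern in isolation, as a stdlib-only sketch with illustrative names:

```python
_UNSET = object()  # unique sentinel; identity comparison cannot collide

class LinkOptions:
    link_target = _UNSET

    def effective_target(self, default="keep"):
        # None is a meaningful value here ("strip target attributes");
        # only the sentinel triggers the default.
        if self.link_target is _UNSET:
            return default
        return self.link_target
```

This is why `LaundryCleaner.__call__` tests `self.link_target is not marker` rather than a truthiness check: `None` and `""` are both legitimate configured values.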
178 changes: 178 additions & 0 deletions html_cleaner/htmllaundry/utils.py
@@ -0,0 +1,178 @@
import re
import six
from lxml import etree
from lxml import html
from lxml.html import defs
from htmllaundry.cleaners import DocumentCleaner


INLINE_TAGS = defs.special_inline_tags | defs.phrase_tags | defs.font_style_tags
TAG = re.compile(six.u('<.*?>'))
ANCHORS = etree.XPath('descendant-or-self::a | descendant-or-self::x:a',
namespaces={'x': html.XHTML_NAMESPACE})
ALL_WHITESPACE = re.compile(r'^\s*$', re.UNICODE)


def is_whitespace(txt):
"""Utility method to test if txt is all whitespace or None."""
return txt is None or bool(ALL_WHITESPACE.match(txt))


def strip_markup(markup):
"""Strip all markup from a HTML fragment."""
return TAG.sub(six.u(""), markup)


StripMarkup = strip_markup # BBB for htmllaundry <2.0


def remove_element(el):
parent = el.getparent()
if el.tail:
previous = el.getprevious()
if previous is not None:
if previous.tail:
previous.tail += el.tail
else:
previous.tail = el.tail
else:
if parent.text:
parent.text += el.tail
else:
parent.text = el.tail

parent.remove(el)


def remove_empty_tags(doc, extra_empty_tags=[]):
"""Removes all empty tags from a HTML document. Javascript editors
and browsers have a nasty habit of leaving stray tags around after
their contents have been removed. This function removes all such
empty tags, leaving only valid empty tags.
In addition consecutive <br/> tags are folded into a single tag.
This forces whitespace styling to be done using CSS instead of via an
editor, which almost always produces better and more consistent results.
"""
empty_tags = set(['br', 'hr', 'img', 'input'])
empty_tags.update(set(extra_empty_tags))
legal_empty_tags = frozenset(empty_tags)

if hasattr(doc, 'getroot'):
doc = doc.getroot()

def clean(doc):
victims = []
for el in doc.iter():
if el.tag == 'br':
preceding = el.getprevious()
parent = el.getparent()

if (preceding is None and not parent.text) or \
(preceding is not None and preceding.tag == el.tag
and not preceding.tail) or \
(not el.tail and el.getnext() is None):
victims.append(el)
continue

if el.tag in legal_empty_tags:
continue

# Empty <a> can be used as anchor.
if (el.tag == 'a') and (('name' in el.attrib) or ('id' in el.attrib)):
continue

if len(el) == 0 and is_whitespace(el.text):
victims.append(el)
continue

if victims and victims[0] == doc:
doc.clear()
return 0
else:
for victim in victims:
remove_element(victim)

return len(victims)

while clean(doc):
pass

return doc


def strip_outer_breaks(doc):
"""Remove any toplevel break elements."""
victims = []

for i in range(len(doc)):
el = doc[i]
if el.tag == 'br':
victims.append(el)

for victim in victims:
remove_element(victim)


MARKER = 'LAUNDRY-INSERT'


def wrap_text(doc, element='p'):
"""Make sure there is no unwrapped text at the top level. Any bare text
found is wrapped in a `<p>` element.
"""
def par(text):
el = etree.Element(element, {MARKER: ''})
el.text = text
return el

if doc.text:
doc.insert(0, par(doc.text))
doc.text = None

while True:
for (i, el) in enumerate(doc):
if html._nons(el.tag) in INLINE_TAGS and i and MARKER in doc[i - 1].attrib:
doc[i - 1].append(el)
break
if not is_whitespace(el.tail):
doc.insert(i + 1, par(el.tail))
el.tail = None
break
else:
break

for el in doc:
if MARKER in el.attrib:
del el.attrib[MARKER]


def sanitize(input, cleaner=DocumentCleaner, wrap='p'):
"""Cleanup markup using a given cleanup configuration.
Unwrapped text will be wrapped with wrap parameter.
"""
if 'body' not in cleaner.allow_tags:
cleaner.allow_tags.append('body')

input = six.u("<html><body>%s</body></html>") % input
document = html.document_fromstring(input)
bodies = [e for e in document if html._nons(e.tag) == 'body']
body = bodies[0]

cleaned = cleaner.clean_html(body)
remove_empty_tags(cleaned)
strip_outer_breaks(cleaned)

if wrap is not None:
if wrap in html.defs.tags:
wrap_text(cleaned, wrap)
else:
raise ValueError(
'Invalid html tag provided for wrapping the sanitized text')

output = six.u('').join([etree.tostring(fragment, encoding=six.text_type)
for fragment in cleaned.iterchildren()])
if wrap is None and cleaned.text:
output = cleaned.text + output

return output
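
The `<br>` folding rule in `remove_empty_tags` above can be illustrated with a stdlib-only sketch (using `xml.etree` on well-formed fragments, so behavior on malformed HTML will differ from the lxml original):

```python
import xml.etree.ElementTree as ET

def fold_breaks(fragment):
    """Drop leading, trailing, and consecutive <br> elements, mirroring
    the victim rules in remove_empty_tags (stdlib approximation)."""
    root = ET.fromstring(fragment)
    for parent in list(root.iter()):
        changed = True
        while changed:
            changed = False
            children = list(parent)
            for i, el in enumerate(children):
                if el.tag != "br":
                    continue
                prev = children[i - 1] if i else None
                leading = prev is None and not (parent.text or "").strip()
                doubled = (prev is not None and prev.tag == "br"
                           and not (prev.tail or "").strip())
                trailing = (not (el.tail or "").strip()
                            and i == len(children) - 1)
                if leading or doubled or trailing:
                    if el.tail:  # keep trailing text, like remove_element()
                        if prev is not None:
                            prev.tail = (prev.tail or "") + el.tail
                        else:
                            parent.text = (parent.text or "") + el.tail
                    parent.remove(el)
                    changed = True
                    break
    return ET.tostring(root, encoding="unicode")
```

As in the vendored code, the text that trailed a removed `<br>` is reattached to the preceding sibling (or the parent), so only the break itself disappears.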