-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace lxml.html.clean_html with bleach; drop lxml dependency #1854
Conversation
I'm a bit concerned about whether it is a drop in replacement. There is one test that is failing that looks to be a result of different filtering. |
@blink1073 I'll take a look. Thing is the filter isn't quite documented in any way. |
@blink1073 The single test failure was due to a different quoting; |
I changed the setting so you won't need approval to run the CI workflows. If there is a slight tweak that can be made to the expected output that doesn't affect security then that seems reasonable. |
5ecd89d
to
be116fa
Compare
@blink1073 Thanks – that makes running CI a little easier. Please see the updated PR description, too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, thank you!
element = str(element) | ||
return bleach.clean( | ||
element, | ||
tags=[*bleach.ALLOWED_TAGS, "div", "pre", "code", "span"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clean_html
function is clearly unsuitable to be called on image/svg+xml
elements, since it will escape all xml tags. Unfortunately the base.html.j2
templates call clean_html
in this context. See #1849.
Rather than listing tags that are allowed, would it be possible to list tags that are disallowed? (Like I suspect <script>
and <iframe>
are unwelcome. )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a bummer. No tests clearly covered this, so that's a regression. Sorry for the trouble... SVG tags are well-known though so they could be added to the allowlist...
It looks like ![]() With the workaround specified in #1892 (comment), it looks ok again: ![]() |
pygments emits line numbers via tables, with tr and td elements (https://pygments.org/docs/formatters/#HtmlFormatter). lxml's clean_html considers table, tr and td as safe elements, but with jupyter#1854 they are now considered unsafe. So instead of displaying line numbers, the table, tr and td elements are escaped, and show up as literal HTML if trying to enable line numbers via the method introduced in jupyter#1683. This PR adds table, tr and td as safe elements so that line numbers can continue to work. I know that there are probably plans to move away from bleach (jupyter#1892), but this is a small and focused change so hopefully doesn't need to block on
Opened #2083 to add those :) |
There's no need to pull in all of lxml for just
clean_html
, especially since nbconvert already depends on bleach anyway.This PR:
notebook1.html
file fromnbconvert/tests/files
lxml
dependency in favor ofbleach
; see below.So turns out
bleach
was a bit stricter by default than lxml'sclean_html
, so I had to tweak the allowed tags rules a little:div
,pre
,code
,span
tags (used by highlighting) are explicitly allowedclass
andid
attributes are explicitly allowed.Beyond that, I manually checked that all of the .ipynb files under
nbconvert/examples/tests
are equivalently generated aside from"
being"
and'
being'
.