Replace lxml.html.clean_html with bleach; drop lxml dependency #1854

akx · 2022-09-01T10:32:26Z

There's no need to pull in all of lxml for just clean_html, especially since nbconvert already depends on bleach anyway.

This PR:

removes the unused, outdated, auto-generated notebook1.html file from nbconvert/tests/files
simplifies a test that would fail with the next commit
drops the lxml dependency in favor of bleach; see below.

So turns out bleach was a bit stricter by default than lxml's clean_html, so I had to tweak the allowed tags rules a little:

The div, pre, code, span tags (used by highlighting) are explicitly allowed
The class and id attributes are explicitly allowed.
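
For illustration, the allowlist tweak described above could look roughly like the sketch below. It builds on the bleach.clean call and tags list visible in the diff for nbconvert/filters/strings.py; the function name clean_html_sketch, the bytes handling, and the attributes mapping (bleach's defaults plus class and id on every tag) are assumptions, not the merged code.

```python
import bleach

# Extra tags used by syntax highlighting, on top of bleach's defaults.
EXTRA_TAGS = ["div", "pre", "code", "span"]


def clean_html_sketch(element):
    """Sanitize an HTML fragment with bleach (illustrative sketch only)."""
    # Assumption: accept bytes as well as str-like input.
    if isinstance(element, bytes):
        element = element.decode()
    else:
        element = str(element)
    return bleach.clean(
        element,
        tags=[*bleach.ALLOWED_TAGS, *EXTRA_TAGS],
        # "*" applies to every allowed tag; this mapping is an assumption.
        attributes={**bleach.ALLOWED_ATTRIBUTES, "*": ["class", "id"]},
    )


highlighted = '<div class="highlight"><pre><span class="k">import</span> os</pre></div>'
kept = clean_html_sketch(highlighted)
blocked = clean_html_sketch("<script>alert(1)</script>")
print(kept)     # highlighting markup keeps its tags and class attributes
print(blocked)  # the script tag is escaped instead of being passed through
```

With bleach's default strip=False, tags outside the allowlist are escaped rather than removed, which is why a missing tag shows up as literal markup in the rendered page.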

Beyond that, I manually checked that all of the .ipynb files under nbconvert/examples/tests are generated equivalently, aside from " and ' now being emitted as HTML entity references.

blink1073 · 2022-09-01T15:31:11Z

I'm a bit concerned about whether it is a drop in replacement. There is one test that is failing that looks to be a result of different filtering.

akx · 2022-09-05T08:07:40Z

@blink1073 I'll take a look. Thing is the filter isn't quite documented in any way.

akx · 2022-09-05T09:19:37Z

@blink1073 The single test failure was due to a different quoting; bleach doesn't let a bare ' through but converts it into an entity reference.

blink1073 · 2022-09-05T13:28:44Z

I changed the setting so you won't need approval to run the CI workflows. If there is a slight tweak that can be made to the expected output that doesn't affect security then that seems reasonable.

akx · 2022-09-06T12:11:30Z

@blink1073 Thanks – that makes running CI a little easier. Please see the updated PR description, too.

blink1073 · 2022-09-06

Nice work, thank you!

adl · 2022-09-23T15:14:14Z

nbconvert/filters/strings.py

+        element = str(element)
+    return bleach.clean(
+        element,
+        tags=[*bleach.ALLOWED_TAGS, "div", "pre", "code", "span"],


This clean_html function is clearly unsuitable to be called on image/svg+xml elements, since it will escape all xml tags. Unfortunately the base.html.j2 templates call clean_html in this context. See #1849.

Rather than listing tags that are allowed, would it be possible to list tags that are disallowed? (I suspect <script> and <iframe> are unwelcome, for instance.)

akx · 2022-10-16

Ah, that's a bummer. No tests clearly covered this, so that's a regression. Sorry for the trouble... SVG tags are well-known though, so they could be added to the allowlist...
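
A small sketch of the behavior discussed in this exchange, assuming bleach is installed: the tags argument of bleach.clean is an allowlist, so any tag that is not listed gets escaped rather than passed through. The two-tag SVG subset and the variable names are purely illustrative; they are not the list that was eventually added.

```python
import bleach

# Allowlist as in the reviewed diff: bleach defaults plus highlighting tags.
BASE_TAGS = [*bleach.ALLOWED_TAGS, "div", "pre", "code", "span"]

svg = '<svg width="10" height="10"><circle r="4"></circle></svg>'

# Without SVG tags in the allowlist, the markup is escaped into text.
escaped = bleach.clean(svg, tags=BASE_TAGS)

# Hypothetical minimal extension: allow just the two tags used above.
# Attributes are left at bleach's defaults here, so width/height/r are dropped.
with_svg = bleach.clean(svg, tags=[*BASE_TAGS, "svg", "circle"])

print(escaped)
print(with_svg)
```

A real fix would also need to allow the SVG attributes, since an allowed tag whose attributes are stripped will not render as intended.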

yuvipanda · 2023-12-18T05:17:52Z

It looks like table and friends are also something that bleach maybe filters out while lxml passes them through. Since pygments supports line numbers via tables (see #1683), with a new enough version of nbconvert, enabling line numbers as described in the linked PR makes them display like this:

With the workaround specified in #1892 (comment), it looks ok again:

pygments emits line numbers via tables, with tr and td elements (https://pygments.org/docs/formatters/#HtmlFormatter). lxml's clean_html considers table, tr and td as safe elements, but with jupyter#1854 they are now considered unsafe. So instead of displaying line numbers, the table, tr and td elements are escaped, and show up as literal HTML if trying to enable line numbers via the method introduced in jupyter#1683. This PR adds table, tr and td as safe elements so that line numbers can continue to work. I know that there are probably plans to move away from bleach (jupyter#1892), but this is a small and focused change so hopefully doesn't need to block on

yuvipanda · 2023-12-18T05:53:27Z

Opened #2083 to add those :)
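
The SVG and table reports above both came down to tags missing from the allowlist. One way to spot such gaps early is to list the tags in rendered output that fall outside it. The helper below is a hypothetical sketch (tags_outside_allowlist and TagCollector are not part of nbconvert or bleach); it uses the standard library's html.parser only to collect tag names, and the markup is a simplified stand-in for pygments' table-based line numbers.

```python
from html.parser import HTMLParser

import bleach


class TagCollector(HTMLParser):
    """Collect the (lowercased) names of all start tags seen."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_outside_allowlist(html, allowed):
    """Return the sorted tag names in `html` that are not in `allowed`."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.tags - set(allowed))


base_tags = [*bleach.ALLOWED_TAGS, "div", "pre", "code", "span"]

# Simplified stand-in for pygments' line-number table markup.
snippet = (
    '<table class="highlighttable"><tr>'
    '<td class="linenos"><pre>1</pre></td>'
    '<td class="code"><pre>x = 1</pre></td>'
    "</tr></table>"
)

missing_before = tags_outside_allowlist(snippet, base_tags)
missing_after = tags_outside_allowlist(snippet, [*base_tags, "table", "tr", "td"])
print(missing_before)  # ['table', 'td', 'tr']
print(missing_after)   # []
```

A check like this could run over the example notebooks' rendered output in a test, so that a newly escaped tag is reported by name instead of being noticed in a browser.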

blink1073 added the maintenance label Sep 1, 2022

akx force-pushed the no-lxml branch from 8b2d35f to 8f060e3 on September 5, 2022 09:19

akx added 2 commits September 6, 2022 14:56

Remove and ignore generated HTML file

c3bee04

Deduplicate test_no_input

765285e

akx force-pushed the no-lxml branch 2 times, most recently from 5ecd89d to be116fa on September 6, 2022 12:09

Replace lxml.html.clean_html with bleach; drop lxml dependency

32bb6e8

akx force-pushed the no-lxml branch from a6baab6 to 32bb6e8 on September 6, 2022 12:59

blink1073 approved these changes Sep 6, 2022

blink1073 merged commit b40bb13 into jupyter:main Sep 6, 2022

adl reviewed Sep 23, 2022

adl mentioned this pull request Sep 23, 2022

Incorrect conversion of matplotlib SVG plots #1849

Closed

jhancke mentioned this pull request Sep 29, 2022

Cell type raw in notebooks cause Document is empty exception #1873

Closed

akx added a commit to akx/nbconvert that referenced this pull request Oct 26, 2022

clean_html: allow SVG tags and SVG attributes

d83b26b

Fixes jupyter#1849 Refs jupyter#1854

akx mentioned this pull request Oct 26, 2022

clean_html: allow SVG tags and SVG attributes #1890

Merged

akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022

clean_html: allow SVG tags and SVG attributes

9809c53

Fixes jupyter#1849 Refs jupyter#1854

akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022

clean_html: allow SVG tags and SVG attributes

c7ae18d

Fixes jupyter#1849 Refs jupyter#1854

akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022

clean_html: allow SVG tags and SVG attributes

67718c9

Fixes jupyter#1849 Refs jupyter#1854

MgenGlder mentioned this pull request Oct 27, 2022

Bleach seems to be significantly slower than lxml in 7.1.x+ #1892

Open

yuvipanda mentioned this pull request Dec 18, 2023

Add table, td, tr to allowed list of tags #2083

Merged

kloczek mentioned this pull request Apr 6, 2024

RFE: drop use bleach as this module s marked as deprecated #1952

Open
