Replace `strip_html` with `markdown_excerpt` #1406

charmander · 2024-05-15T00:01:03Z

`markdown_excerpt` is more complete (it renders Markdown and normalizes whitespace), more correct (it produces plain text), and more robust (it uses `lxml.html`, which doesn’t crash on the malformed input that `html.parser` crashes on).

Even when I originally wrote this in [2013-09-25 :/Replaced BBCode with Markdown], the output of `strip_html` was used escaped, so it wasn’t correct then. Making the parser class local to the call with a `text_parts` class variable was also questionable style. As of Python 3.5, `HTMLParser` never calls `handle_entityref` or `handle_charref` by default, because `convert_charrefs` now defaults to `True`.
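The `HTMLParser` behavior mentioned above can be checked with a small standalone sketch. `RecordingParser` and `record_calls` are hypothetical names used only for illustration; they are not part of this PR or the Weasyl codebase. The only standard-library feature assumed is the `convert_charrefs` constructor parameter of `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical parser that records which handler methods fire."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.calls = []

    def handle_data(self, data):
        self.calls.append(("data", data))

    def handle_entityref(self, name):
        self.calls.append(("entityref", name))

    def handle_charref(self, name):
        self.calls.append(("charref", name))


def record_calls(markup, *, convert_charrefs=True):
    """Parse `markup` and return the list of (handler, argument) pairs."""
    parser = RecordingParser(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.calls


# Default since Python 3.5: references are decoded and delivered via handle_data.
default_calls = record_calls("a &amp; b &#33;")

# Opting out restores the separate entity/character reference callbacks.
legacy_calls = record_calls("a &amp; b &#33;", convert_charrefs=False)
```

With the default setting, only `handle_data` fires and receives already-decoded text, so any logic placed in `handle_entityref` or `handle_charref` is dead code unless `convert_charrefs=False` is passed explicitly.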

charmander added 6 commits May 14, 2024 12:30

Add basic test for strip_html

5aec7c2

Replace strip_html with a slightly better attempt at rendering HTML to text

a438d21

Avoid generating redundant alt text when rendering user links with icons in Markdown

ef5aac4

Add failing test for parsing invalid HTML with standard library `html.parser`

Seen in the wild.

19a6150

Reuse markdown_excerpt for <meta name="description">s on profiles and submissions

bfe1277

charmander added the cleanup label May 15, 2024

charmander merged commit 4d0aa2c into Weasyl:main May 15, 2024
4 checks passed

charmander deleted the strip-html-refactor branch May 15, 2024 00:05
