How to strip CSS/style nodes? #100

GeeCastro · 2023-12-20T16:02:16Z

>>> from markdownify import markdownify as md
>>> md('<!DOCTYPE html><html><head><style>body {background-color: powderblue;}h1 {color: blue;}p {color: red;}</style></head><body><p>This is text</p></body></html>') 
'body {background-color: powderblue;}h1 {color: blue;}p {color: red;}This is text\n\n'

I may be misusing the strip option, but I can't get rid of the style bits.

chrispy-snps · 2024-01-14T14:32:37Z

@Chichilele - It looks like the implementation for strip is incomplete. It prevents markup generation for the containing element, but not for its contents:

>>> from markdownify import markdownify as md
>>> md('<pre>1</pre><pre>2</pre>')
'\n```\n1\n```\n\n```\n2\n```\n'
>>> md('<pre>1</pre><pre>2</pre>', strip=['pre'])
'12'

>>> md('<ol><li>a</li><li>b</li></ol>')
'1. a\n2. b\n'
>>> md('<ol><li>a</li><li>b</li></ol>', strip=['ol', 'li'])
'ab'

so I think you are just getting the text content of your <style> elements.

Perhaps simply stripping the markup of a tag is the intended behavior, and we need a new ignore argument to completely ignore a tag and its contents.
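To make the distinction concrete, here is a minimal standard-library sketch of the two behaviors: removing only a tag's markup versus dropping the tag together with its contents. It does not use markdownify at all; the `TextExtractor` class, the `extract_text` helper, and the `ignore` parameter are hypothetical names chosen for illustration, not an existing markdownify option.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, optionally dropping everything inside some tags."""

    def __init__(self, ignore=()):
        super().__init__()
        self.ignore = set(ignore)  # hypothetical: tag names to drop with their contents
        self.depth = 0             # > 0 while we are inside an ignored element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Note: void elements (e.g. <br>) are not handled in this sketch.
        if tag in self.ignore:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside an ignored element.
        if not self.depth:
            self.parts.append(data)


def extract_text(html, ignore=()):
    parser = TextExtractor(ignore)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<style>p {color: red;}</style><p>This is text</p>"

stripped = extract_text(html)            # markup removed, CSS text kept
ignored = extract_text(html, ["style"])  # <style> and its contents removed

print(repr(stripped))
print(repr(ignored))
```

The first call mirrors what strip appears to do in the examples above (the CSS survives as text); the second is the behavior a separate option would need to provide.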

aytey · 2024-02-16T14:19:42Z

I had exactly the same problem as this; I hacked it as follows:

diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index bf765ec..cb27487 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -117,6 +117,9 @@ class MarkdownConverter(object):
                                       'table', 'thead', 'tbody', 'tfoot',
                                       'tr', 'td', 'th']

+        if node.name == "style":
+            return ""
+
         if is_nested_node(node):
             for el in node.children:
                 # Only extract (remove) whitespace-only text node if any of the

This is by no means perfect, but it worked correctly for all of the inputs I was trying to convert.

tlk · 2024-02-16T21:27:47Z

Another simple workaround:

from markdownify import markdownify as md
import re
my_html = '...'
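# Non-greedy match so that content between two separate <style> blocks is kept;
# [^>]* also allows attributes such as <style type="text/css">.
stripped_html = re.sub(r'<style\b[^>]*>.*?</style>', '', my_html, flags=re.DOTALL | re.IGNORECASE)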
markdown = md(stripped_html)

https://docs.python.org/3/library/re.html#re.DOTALL
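One caveat with regex preprocessing is worth spelling out: with re.DOTALL, a greedy pattern such as `<style>(.*)</style>` matches from the first `<style>` to the last `</style>`, so any content between two style blocks is removed as well. The sketch below compares that with a non-greedy, attribute-tolerant pattern. The `strip_style_blocks` helper is a hypothetical name for illustration, and regex-based HTML handling remains a rough workaround rather than a robust parser.

```python
import re

html = (
    "<style>p {color: red;}</style>"
    "<p>Keep me</p>"
    '<style type="text/css">h1 {color: blue;}</style>'
    "<p>And me</p>"
)

# Greedy: .* runs from the first <style> to the last </style>,
# taking the first paragraph with it.
greedy = re.sub(r"<style>(.*)</style>", "", html, flags=re.DOTALL)

# Non-greedy (.*?), tolerant of attributes ([^>]*) and of tag case.
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.DOTALL | re.IGNORECASE)


def strip_style_blocks(text):
    """Remove each <style>...</style> block individually."""
    return STYLE_RE.sub("", text)


print(greedy)                    # <p>And me</p>
print(strip_style_blocks(html))  # <p>Keep me</p><p>And me</p>
```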

matthewwithanm · 2024-02-17T04:23:09Z

@aytey @tlk Couldn't you just use a custom converter with a convert_style method, like shown in the README?

matthewwithanm · 2024-02-17T04:37:35Z

> Perhaps simply stripping the markup of a tag is the intended behavior, and we need a new ignore argument to completely ignore a tag and its contents.

Yeah, it's not about stripping content, but stripping tags. Generally, the body of the elements is actual content being marked up; style is a bit of an outlier because the body isn't really content.

A new option makes sense. We could probably also do with a README update, though; I don't think people realize that you can write custom convert_blah functions instead of resorting to preprocessing the input.

tlk · 2024-02-18T17:29:07Z

@matthewwithanm I suggest you apply a bugfix along the lines of what was suggested by @aytey.

I read the readme section about custom converters and it starts with "If you have a special usecase". The thing is, I do not think this style-issue is a special use case at all - in fact, I cannot think of a use case where the current behaviour is desirable. Maybe I am missing something?

matthewwithanm · 2024-02-18T20:11:18Z

@tlk Yeah, I'm not saying the library shouldn't handle this.

All I'm saying is that, if you find a case where a tag isn't being handled like you want, you don't need a patch like @aytey's suggestion, or to preprocess like you suggested. You just need to subclass and add a convert_style method. And this should be made clearer in the README so people know they don't have to modify the library or do regex stuff.

GeeCastro · 2024-02-18T22:40:31Z

I'm glad this opened a conversation. Although I went down another route for my project, here is what I would have used to solve this issue (based on the suggestions above):

from markdownify import MarkdownConverter

class MarkdownConverterNoStyle(MarkdownConverter):
    def convert_style(self, el, text, convert_as_inline):
        return ''

html = '<!DOCTYPE html><html><head><style>body {background-color: powderblue;}h1 {color: blue;}p {color: red;}</style></head><body><p>This is text</p></body></html>'
MarkdownConverterNoStyle().convert(html)

Returns a nice 'This is text\n\n'.

The convert_<tag name> template wasn't obvious to me but is in fact very handy @matthewwithanm. Perhaps adding this example to the readme would be good? Or replacing the existing?

@tlk I share your opinion that style should be left out by default

tlk · 2024-02-19T20:32:02Z

@matthewwithanm Got it - good point!

Come to think of it, the script tag would probably also need to be dealt with.

PR #112 will make the markdownify library ignore both the script tag and the style tag.

chrispy-snps · 2024-04-13T10:36:50Z

Fixed in 0.12.1.

GeeCastro mentioned this issue Feb 18, 2024

Add no css example to readme #111

Merged

chrispy-snps closed this as completed Apr 13, 2024
