How to strip CSS/style nodes? #100

Closed
GeeCastro opened this issue Dec 20, 2023 · 10 comments

@GeeCastro
Contributor

>>> from markdownify import markdownify as md
>>> md('<!DOCTYPE html><html><head><style>body {background-color: powderblue;}h1 {color: blue;}p {color: red;}</style></head><body><p>This is text</p></body></html>') 
'body {background-color: powderblue;}h1 {color: blue;}p {color: red;}This is text\n\n'

I may be misusing the strip option, but I can't get rid of the style bits.

@chrispy-snps
Collaborator

chrispy-snps commented Jan 14, 2024

@Chichilele - It looks like the implementation for strip is incomplete. It prevents markup generation for the containing element, but not for its contents:

>>> from markdownify import markdownify as md
>>> md('<pre>1</pre><pre>2</pre>')
'\n```\n1\n```\n\n```\n2\n```\n'
>>> md('<pre>1</pre><pre>2</pre>', strip=['pre'])
'12'

>>> md('<ol><li>a</li><li>b</li></ol>')
'1. a\n2. b\n'
>>> md('<ol><li>a</li><li>b</li></ol>', strip=['ol', 'li'])
'ab'

so I think you are just getting the text content of your <style> elements.

Perhaps simply stripping the markup of a tag is the intended behavior, and we need a new ignore argument to completely ignore a tag and its contents.

@aytey

aytey commented Feb 16, 2024

I had exactly the same problem as this; I hacked it as follows:

diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index bf765ec..cb27487 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -117,6 +117,9 @@ class MarkdownConverter(object):
                                       'table', 'thead', 'tbody', 'tfoot',
                                       'tr', 'td', 'th']

+        if node.name == "style":
+            return ""
+
         if is_nested_node(node):
             for el in node.children:
                 # Only extract (remove) whitespace-only text node if any of the

This is by no means perfect, but it worked correctly for all of the inputs I was trying to convert.

@tlk
Contributor

tlk commented Feb 16, 2024

Another simple workaround:

from markdownify import markdownify as md
import re
my_html = '...'
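# Non-greedy .*? stops at the nearest </style>; [^>]* allows attributes on the opening tag
stripped_html = re.sub(r'<style\b[^>]*>.*?</style>', '', my_html, flags=re.DOTALL | re.IGNORECASE)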
markdown = md(stripped_html)

https://docs.python.org/3/library/re.html#re.DOTALL
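
Greediness matters when preprocessing HTML with a regular expression like this. The sketch below uses only the standard library and a made-up HTML snippet to compare a greedy pattern with a non-greedy one that also tolerates attributes on the opening tag. A regex is still a rough tool for HTML, so treat this as an illustration of the pitfall rather than a robust sanitizer.

```python
import re

# Made-up input with two <style> blocks and real content between them
html = (
    '<style>h1 {color: blue;}</style>'
    '<p>Kept paragraph</p>'
    '<style type="text/css">p {color: red;}</style>'
    '<p>Also kept</p>'
)

# Greedy: with re.DOTALL, ".*" runs from the first <style> to the LAST </style>,
# so the paragraph between the two blocks is removed as well
greedy = re.sub(r'<style>(.*)</style>', '', html, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the nearest </style>, and "[^>]*" accepts
# attributes such as type="text/css" on the opening tag
lazy = re.sub(r'<style\b[^>]*>.*?</style>', '', html,
              flags=re.DOTALL | re.IGNORECASE)

print(greedy)  # <p>Also kept</p>
print(lazy)    # <p>Kept paragraph</p><p>Also kept</p>
```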

@matthewwithanm
Owner

matthewwithanm commented Feb 17, 2024

@aytey @tlk Couldn't you just use a custom converter with a convert_style method, like shown in the README?

@matthewwithanm
Owner

"Perhaps simply stripping the markup of a tag is the intended behavior, and we need a new ignore argument to completely ignore a tag and its contents."

Yeah, it's not about stripping content, but stripping tags. Generally, the body of the elements is actual content being marked up; style is a bit of an outlier because the body isn't really content.

A new option makes sense. We could probably also do with a README update, though; I don't think people realize that you can write custom convert_blah functions instead of resorting to preprocessing the input.

@tlk
Contributor

tlk commented Feb 18, 2024

@matthewwithanm I suggest you apply a bugfix along the lines of what was suggested by @aytey.

I read the readme section about custom converters and it starts with "If you have a special usecase". The thing is, I do not think this style-issue is a special use case at all - in fact, I cannot think of a use case where the current behaviour is desirable. Maybe I am missing something?

@matthewwithanm
Owner

@tlk Yeah, I'm not saying the library shouldn't handle this.

All I'm saying is that, if you find a case where a tag isn't being handled like you want, you don't need a patch like @aytey's suggestion, or to preprocess like you suggested. You just need to subclass and add a convert_style method. And this should be made clearer in the README so people know they don't have to modify the library or do regex stuff.

@GeeCastro
Contributor Author

I'm glad this opened a conversation. Although I went down another route for my project, here is what I would have used to solve this issue (based on the suggestions above):

from markdownify import MarkdownConverter

class MarkdownConverterNoStyle(MarkdownConverter):
    def convert_style(self, el, text, convert_as_inline):
        return ''

html = '<!DOCTYPE html><html><head><style>body {background-color: powderblue;}h1 {color: blue;}p {color: red;}</style></head><body><p>This is text</p></body></html>'
MarkdownConverterNoStyle().convert(html)

Returns a nice 'This is text\n\n'.

The convert_<tag name> template wasn't obvious to me but is in fact very handy @matthewwithanm. Perhaps adding this example to the readme would be good? Or replacing the existing?

@tlk I share your opinion that style should be left out by default.
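
For readers wondering why a method named convert_<tag name> is enough, here is a toy illustration of name-based dispatch. It is not markdownify's actual code: the class names, the two-argument handler signature, and the use of xml.etree.ElementTree (which needs well-formed markup) are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET


class ToyConverter:
    """Toy illustration of name-based dispatch; not markdownify's actual code."""

    def convert(self, markup):
        return self._process(ET.fromstring(markup))

    def _process(self, node):
        # Convert the node's own text and its children first
        text = node.text or ''
        for child in node:
            text += self._process(child) + (child.tail or '')
        # Then look for a method named after the tag, e.g. convert_p
        handler = getattr(self, 'convert_' + node.tag, None)
        return handler(node, text) if handler else text

    def convert_p(self, node, text):
        return text + '\n\n'


class ToyConverterNoStyle(ToyConverter):
    def convert_style(self, node, text):
        return ''  # discard the element together with its contents


markup = ('<html><head><style>p {color: red;}</style></head>'
          '<body><p>This is text</p></body></html>')
base_output = ToyConverter().convert(markup)
no_style_output = ToyConverterNoStyle().convert(markup)
print(repr(base_output))      # 'p {color: red;}This is text\n\n'
print(repr(no_style_output))  # 'This is text\n\n'
```

Because the handler receives the already-converted text of the element and decides what to return, returning an empty string drops the element's contents, which is what a strip-style option that only removes the tag's own markup cannot do.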

@tlk
Contributor

tlk commented Feb 19, 2024

@matthewwithanm Got it - good point!

Come to think of it, the script tag would probably also need to be dealt with.

PR #112 will make the markdownify library ignore both the script tag and the style tag.

@chrispy-snps
Collaborator

Fixed in 0.12.1.
