How to strip CSS/style nodes? #100
Comments
@Chichilele - It looks like the implementation for
so I think you are just getting the text content of your Perhaps simply stripping the markup of a tag is the intended behavior, and we need a new |
I had exactly the same problem as this; I hacked it as follows:
diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index bf765ec..cb27487 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -117,6 +117,9 @@ class MarkdownConverter(object):
                                       'table', 'thead', 'tbody', 'tfoot',
                                       'tr', 'td', 'th']

+        if node.name == "style":
+            return ""
+
         if is_nested_node(node):
             for el in node.children:
                 # Only extract (remove) whitespace-only text node if any of the
This is likely by no means perfect, but it worked 100% correctly for the multiple inputs I was trying to convert.
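Another simple workaround:
from markdownify import markdownify as md
import re

my_html = '...'
# Non-greedy match removes each <style> block on its own; [^>]* allows attributes on the tag.
stripped_html = re.sub(r'<style\b[^>]*>.*?</style>', '', my_html, flags=re.DOTALL | re.IGNORECASE)
markdown = md(stripped_html)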
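A note on the regex workaround: a greedy pattern such as `<style>(.*)</style>` combined with `re.DOTALL` runs from the first opening tag to the last closing tag, so any content between two style blocks disappears as well, and a tag with attributes is not matched at all. The sketch below uses only the standard library and a made-up sample document to show the difference. Regular expressions are not a full HTML parser, so treat this as a quick preprocessing hack rather than a robust solution.

```python
import re

# Made-up sample input with two style blocks around real content.
html = (
    "<style>p {color: red;}</style>"
    "<p>Keep me</p>"
    '<style type="text/css">h1 {color: blue;}</style>'
    "<h1>Title</h1>"
)

# Greedy: .* spans from the first <style> to the LAST </style>,
# swallowing the paragraph in between.
greedy = re.sub(r"<style>(.*)</style>", "", html, flags=re.DOTALL)

# Non-greedy, attribute-tolerant, case-insensitive: each block is
# removed separately and the content between them survives.
careful = re.sub(
    r"<style\b[^>]*>.*?</style\s*>",
    "",
    html,
    flags=re.DOTALL | re.IGNORECASE,
)

print(greedy)   # <h1>Title</h1>
print(careful)  # <p>Keep me</p><h1>Title</h1>
```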
Yeah, it's not about stripping content, but stripping tags. Generally, the body of the elements is actual content being marked up; style is a bit of an outlier because the body isn't really content. A new option makes sense. We could probably also do with a README update, though; I don't think people realize that you can write custom |
@matthewwithanm I suggest you apply a bugfix along the lines of what was suggested by @aytey. I read the readme section about custom converters and it starts with "If you have a special usecase". The thing is, I do not think this style-issue is a special use case at all - in fact, I cannot think of a use case where the current behaviour is desirable. Maybe I am missing something? |
@tlk Yeah, I'm not saying the library shouldn't handle this. All I'm saying is that, if you find a case where a tag isn't being handled like you want, you don't need a patch like @aytey's suggestion, or to preprocess like you suggested. You just need to subclass and add a |
I'm glad this opened a conversation. Although I went down another route for my project, here is what I would have used to solve this issue (based on the suggestions above):
from markdownify import MarkdownConverter

class MarkdownConverterNoStyle(MarkdownConverter):
    # Return an empty string for <style> so its CSS body is dropped from the output.
    def convert_style(self, el, text, convert_as_inline):
        return ''

html = '<!DOCTYPE html><html><head><style>body {background-color: powderblue;}h1 {color: blue;}p {color: red;}</style></head><body><p>This is text</p></body></html>'
MarkdownConverterNoStyle().convert(html)
Returns a nice The @tlk I share your opinion that style should be left out by default
@matthewwithanm Got it - good point! Come to think of it, the script tag would probably also need to be dealt with. PR #112 will make the markdownify library ignore both the script tag and the style tag. |
Fixed in 0.12.1. |
I may be misusing the strip option, but I can't get rid of the style bits.