Add option to retrieve text content #128
Comments
Hey @frederik-elwert! This is being worked on here: #127 :) |
Please consider this a basic feature and add it. |
+1 |
Any progress on this issue? |
Not much, but I've merged master to #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup - more docs and tests. |
Any progress on this issue? |
This still hasn't been addressed? |
One working option is to chain css calls:
from parsel import Selector
text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''
sel = Selector(text=text)
# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']
print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']
print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']
It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases (without digging deep into lxml internals).
I just realized that Selector.root is lxml's HTML element object, so its text_content() method is available:
print(sel.root.text_content())
'''
This is the new trend!
Published by newbieon Sept 17
'''
For cases where a Selector query returns a list of selectors:
print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']
print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']
Applying bind to lxml's
As far as I understand, both options mentioned above were technically applicable in 2018, when this ticket was created. |
Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its |
As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to use either the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior.
With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often frustrates new users. There is a series of questions on Stack Overflow as well as on the scrapy bug tracker.
lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:
1. Separate .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
2. A parameter to .extract*()/.get(), similar to the proposal in Add format_as to extract() methods #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.
Would such an addition be welcome? I could prepare a patch.