Will this work on xml documents? #73

Open
chovyprognos opened this issue Nov 24, 2021 · 6 comments

Comments

@chovyprognos

I will be parsing RSS feeds and am wondering whether I can use querySelectorAll on an XML document.

@b-fuze
Owner

b-fuze commented Nov 24, 2021

I've never tried, but if anything, it will parse the XML as if it were HTML, which you may not want. I plan to add a proper XML parser at some point, but I haven't gotten around to it yet. You can try it, and it may be good enough for what you need.

@b-fuze
Owner

b-fuze commented Nov 24, 2021

I should add that parseFromString only accepts text/html for now.

@amundo

amundo commented Jun 2, 2022

Thanks for the explanation; I'm interested in this too. I have routinely parsed XML with .parseFromString(xml, 'text/html') and it often seems to work. Could you explain what it means to “parse XML as if it is HTML”?

Thanks for this awesome library.

@0kku
Collaborator

0kku commented Jun 2, 2022

There are some subtle differences. For example, HTML doesn't support self-closing syntax on anything other than void HTML elements, while XML supports it on every element. Also, if the XML looks like HTML, the HTML parser might move nodes around without warning to make them fit the rules of HTML, which an XML parser should never do.
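
To make the void-element point concrete, here is a minimal sketch in TypeScript. It assumes the void-element list from the HTML Living Standard; the helper names isHtmlVoid and tagsAtRisk are hypothetical and not part of any library. The idea is that an XML tag whose name collides with an HTML void element (for instance link or source, both of which appear in RSS) is likely to end up as an empty element when parsed as text/html, with its text landing outside it.

```typescript
// Void elements as listed in the HTML Living Standard.
const HTML_VOID_ELEMENTS: ReadonlySet<string> = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper (not a library API): true when an HTML parser would
// treat this tag as a void element. HTML tag names are ASCII
// case-insensitive, so the name is lowercased before the lookup.
function isHtmlVoid(tagName: string): boolean {
  return HTML_VOID_ELEMENTS.has(tagName.toLowerCase());
}

// Hypothetical helper: given the tag names used in an XML document, return
// those that collide with HTML void elements and so are likely to lose
// their children and text when the document is parsed as text/html.
function tagsAtRisk(xmlTagNames: string[]): string[] {
  return xmlTagNames.filter((name) => isHtmlVoid(name));
}

console.log(tagsAtRisk(["title", "link", "pubDate", "description", "source"]));
// → [ "link", "source" ]
```

This only checks for name collisions. It does not model the other HTML parsing rules mentioned above, such as nodes being moved to fit HTML's content model.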

@amundo

amundo commented Jun 3, 2022

I see, thanks.

@Siltaar

Siltaar commented Jul 9, 2024

Hi, I also need to parse RSS feeds, for a CLI version of Meta-Press.es (a press meta-search engine).

Parsing RSS XML files with the text/html content type lets me get the content of some elements (such as titles), but not of links or pubDates (which aren't HTML).

In my use case, as long as I can querySelector() elements and read their textContent, that's enough; I don't need strict parsing or preserved element order.

(Well, to be honest, across the nearly 1000 scraped newspaper websites, I sometimes need to read attributes and sometimes use XPath.)
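
For the simple "get the text of title, link and pubDate" case described above, one possible stopgap until a real XML parser is available is to pull the fields out with regular expressions. The sketch below is an assumption-laden workaround, not this library's API. The names FeedItem, unwrapCdata, firstText and extractItems are hypothetical. It assumes plain RSS 2.0 item elements without namespaced or nested same-name tags, and it does not decode entities such as &amp;amp;, read attributes, or support XPath.

```typescript
// Hypothetical shape for the fields of interest; not a library type.
interface FeedItem {
  title: string;
  link: string;
  pubDate: string;
}

// Strip a surrounding CDATA wrapper if there is one, then trim whitespace.
function unwrapCdata(text: string): string {
  const m = text.match(/^\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*$/);
  return (m ? (m[1] ?? "") : text).trim();
}

// Text between the first <tag ...> and the next </tag> in the fragment,
// or "" when the tag is absent. Matching is case-insensitive.
function firstText(fragment: string, tag: string): string {
  const re = new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`, "i");
  const m = fragment.match(re);
  return m ? unwrapCdata(m[1] ?? "") : "";
}

// Collect title, link and pubDate from every <item> in the feed text.
function extractItems(xml: string): FeedItem[] {
  const items: FeedItem[] = [];
  const itemRe = /<item(?:\s[^>]*)?>([\s\S]*?)<\/item>/gi;
  let m: RegExpExecArray | null;
  while ((m = itemRe.exec(xml)) !== null) {
    const body = m[1] ?? "";
    items.push({
      title: firstText(body, "title"),
      link: firstText(body, "link"),
      pubDate: firstText(body, "pubDate"),
    });
  }
  return items;
}
```

Calling extractItems(feedText) returns one object per item, with an empty string for any field that is missing. Regular expressions are not a substitute for an XML parser, so this is only reasonable for feeds known to be simple.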
