No matches when searching by tagName in XMLDOM's text/html DOM #27

moll · 2015-02-17T18:47:19Z

Hey,

Giving Xpath a spin with XMLDOM. I can't seem to get it to match any elements by tag name when the DOM's MIME type is set to text/html. Is it supposed to work?

var Xpath = require("xpath")
var DomParser = require("xmldom").DOMParser

var html = "<html><head><title>Harry Potter</title></head></html>"
var doc = new DomParser().parseFromString(html, "text/html")
console.log(Xpath.select("//title", doc))

The above returns an empty array. Add a random attribute to an element, try to match by that instead (//*[@foo=123]) and it works.

Thanks!

The text was updated successfully, but these errors were encountered:

moll · 2015-02-17T19:59:20Z

I've tracked this down to the difference between namespaceURI. DOM elements all have it set to http://www.w3.org/1999/xhtml, but the Xpath query doesn't. Xpath also doesn't apply the default namespace to query tag names.

Would you mind changing Xpath so it would? Prefixing every tag in the query with the XHTML namespace is rather cumbersome. Thanks!

JLRishe · 2015-02-18T17:46:56Z

@moll What is it exactly that you are suggesting be changed? If the nodes have a namespace, then named node tests for them must use a namespace prefix:

http://www.w3.org/TR/xpath/#node-tests

A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. _This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded)._

moll · 2015-02-18T18:02:13Z

Perhaps a way to set a default namespace for unqualified names. The useNamespaces feature is already there. It's just not available for queries.

Try the example above. It's rather cumbersome to prefix every tag name with either the full namespace URI or an alias -> //div/h3/span turns into //html:div/html:h3/html:span in the best case.

JLRishe · 2015-02-18T18:19:00Z

As someone who uses XPath a lot, I don't see it as all that cumbersome to use prefixes most of the time, but you definitely have a valid point in that the original API had a useNamespaces option.

(Incidentally, the best case is //h:div/h:h3/h:span; you should not be limited to using any particular prefix)

I don't currently have collaborator access to this repo and the owner doesn't seem to be checking pull requests, but I have actually just sent an e-mail to the owner asking whether they could make me a collaborator. If that happens, I would be willing to expose the useNamespaces option via the external wrapper.

moll · 2015-02-18T18:36:33Z

TheuseNamespaces is already available here, too. It's just that its support for the default namespace isn't enabled for query names. I created a wrapper with a regexp replace to auto-prefix names, but it's all rather silly when all you've got are nodes in the (x)HTML namespace.

JLRishe · 2015-02-18T20:20:10Z

I see, I confused myself. The XPath evaluator has a "case insensitive mode" (which was never exposed externally), and I had misremembered this as an "ignore namespaces mode".

I would not be in favor of outright supporting non-null default namespaces, because as silly as it may seem, that would violate the XPath 1.0 spec. Most professional XPath implementations (.NET, most XSLT engines, etc.) do not allow using a default namespace because the spec is quite clear on the point that XPath 1.0 has no such capacity.

moll · 2015-02-18T22:19:16Z

What's the default namespace feature for then? Querying HTML DOMs with Xpath here is simply too cumbersome out of the box. Why not make it more convenient is what I'm proposing. That's what computers are for — to handle repetitive tasks for us. ;-)

JLRishe · 2015-02-19T05:59:33Z

What's the default namespace feature for then?

Which feature are you referring to? I see that the Utilities.resolveQName method has a useDefault parameter, but it is always passed in as false. Perhaps the original developer was using this at some point and then realized that doing so was incorrect.

Querying HTML DOMs with Xpath here is simply too cumbersome out of the box.

I don't really see what's so terribly cumbersome about adding an extra two characters to each step of an XPath. People who work with XML do this all the time.

Why not make it more convenient is what I'm proposing.

Because, as I already said, it would violate the XPath 1.0 spec. I think spec compliance is a worthwhile thing, especially with a library like this. It could also certainly have unintended consequences for someone using the library and expecting it to behave according to the spec.

That's what computers are for — to handle repetitive tasks for us.

Yes, but not at the expense of spec compliance.

blahah · 2015-05-06T09:40:27Z

I don't really see what's so terribly cumbersome about adding an extra two characters to each step of an XPath. People who work with XML do this all the time.

It's only practical to add namespaces to queries when working with tightly controlled corpora. An example scenario is when you have a set of XPath selectors that you want to apply to a large set of HTML documents. Those documents may or may not be explicitly namespaced, and namespaces may differ between documents (HTML4/xhtml etc.). In this case it's not possible to add namespaces to queries so that they apply to the whole corpus, but by using default namespaces you can resolve the problem with minimal hacking.

indianscout · 2016-04-08T05:58:31Z

Same issue here. I agree with @moss. The following sould work, but it does not:

> var xmldom = require('xmldom');
> var xpath = require('xpath')
> var parser = new xmldom.DOMParser();
> var doc = parser.parseFromString('<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 'text/html');
> xpath.select('//body', doc);
[]
> var doc = parser.parseFromString('<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 'text/xml');
> xpath.select('//body', doc);
[ Element {
    _nsMap: {},
    attributes: { _ownerElement: [Circular] },
    childNodes: {},
    ownerDocument: 
     Document {
...
>

In my opinion xpath should return the body element also on documents with text/html... text/html is a valid mime type (https://developer.mozilla.org/en/docs/Web/API/DOMParser).

Or is there any good reason for xpath to refuse to operate on text/html?

JLRishe · 2016-04-09T10:05:40Z

@indianscout

Or is there any good reason for xpath to refuse to operate on text/html?

This library operates on text/html just fine. It is simply strict with regards to namespaces, and html elements are in the http://www.w3.org/1999/xhtml namespace.

The following works:

var xmldom = require('xmldom');
var xpath = require('xpath')

var parser = new xmldom.DOMParser();
var doc = parser.parseFromString(
              '<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 
              'text/html');

var select = xpath.useNamespaces({ h: 'http://www.w3.org/1999/xhtml' });
var body = select('//h:body', doc);
console.log(body);

This reverts commit dece058.

JLRishe · 2017-11-25T05:28:00Z

I'm going to keep this open a little longer for a few more potential changes.

sdeprez · 2022-02-16T10:48:54Z

👋 hi, just to add some content to that because it also got me confused (and initially I thought it was a bug of xmldom and I opened xmldom/xmldom#377). To summarize the issue:

when you parse doc with type text/html, xmldom (but also the browser) adds a default namespace http://www.w3.org/1999/xhtml
when you query that doc, a simple xpath //input won't work because of namespace

IIUC, 2. won't work because this library is strict with xpath 1.0 specs (which makes sense), so we need to provide the namespace. The reason it's working in browsers, is that browsers follow the the html spec, which adds a modification ("willful violation" as they say) to this spec to specifically handle that case.

I'm not sure how feasible is it to add a boolean param / mode to also support that modified spec here, but at the very least, it'd be nice to document that behaviour, because it's very easy to fall into that trap, and it's easy to get lost because browsers work differently. I can work on a PR if you agree on the analysis.

JLRishe · 2022-02-16T11:13:44Z

@sdeprez I can't find any place where this is documented or mentioned explicitly, but cae87df added an isHtml option which allows you to use this library in "HTML Mode" and not have to explicitly declare the html namespace. It requires using parsed expressions.

const path = xpath.parse('//input');
const nodes = path.select({ node: myHtmlDom, isHtml: true });

You can see some examples of this in this repo's test.js file.

sdeprez · 2022-02-16T14:53:39Z

Thanks! It looks like isHtml is indeed what I need.

On the other hand, I needed to use .evaluate and it took me some time to understand the usage of require('xpath').evaluate(... vs xpath.parse('//input').evaluate(... since they behave differently, I also noticed that typescript typings are not mentioning .parse methods. Would it be possible to add:

a note about this html quirk and isHtml on the main readme
add .evaluate in api docs
add parse and following types (including isHtml) in typescript typings

I can work on a PR if you agree on the principle.

[EDIT]: actually about using xpath.parse('//input').evaluate, I'm not sure about the type of the result. It looks like it can return a list of nodes (for instance with //input), or a XNumber (for instance count(//input)), etc., but there is no resultType that I can check to know what is returned, do you know how could I do it?

JLRishe · 2022-03-08T04:44:03Z

@sdeprez Sorry for not getting back to you sooner.

We actually have an API page that mentions the .evaluate() method, but it takes a bit of digging to get there:

api docs -> link under xpath.parse -> "documentation page" link

Maybe it would be better to have XPathEvaluator on the Parsed Expressions page to link to the API doc for XPath Evaluator, and provide a bit more explanatory text and examples on the Parsed Expressions page. I'm open to suggestions you have to make the information more discoverable.

The XPathEvaluator page also doesn't include information about the isHtml option, so it would be good to add that.

I'd be happy to look at a PR if you can submit one. Thanks!

[EDIT]: actually about using xpath.parse('//input').evaluate, I'm not sure about the type of the result. It looks like it can return a list of nodes (for instance with //input), or a XNumber (for instance count(//input)), etc., but there is no resultType that I can check to know what is returned, do you know how could I do it?

As indicated on the page linked above, it follows the same behavior as xpath.select. If you would like to select into a specific type, you can use one of the other evaluateNNN() methods mentioned on the same page.

moll mentioned this issue Feb 18, 2015

Compatible query selector library inikulin/parse5#42

Closed

blahah mentioned this issue Apr 10, 2015

fix xpath selector namespace handling for HTML DOMs ContentMine/thresher#17

Open

JLRishe mentioned this issue Sep 2, 2016

does not work with mime type text/html #56

Closed

JLRishe closed this as completed in dece058 Nov 25, 2017

JLRishe added a commit that referenced this issue Nov 25, 2017

Revert "Add .allowAnyNamespaceForNoPrefix option. Fixes #27"

cc56179

This reverts commit dece058.

JLRishe added a commit that referenced this issue Nov 25, 2017

Add allowAnyNamespaceForNoPrefix option. Fixes #27

6394ffd

JLRishe reopened this Nov 25, 2017

JLRishe closed this as completed in cae87df Dec 4, 2017

JLRishe mentioned this issue Jan 12, 2018

Make allowAnyNamespaceForNoPrefix=true the default #77

Closed

sdeprez mentioned this issue Feb 16, 2022

Parsing with type text/html leads to unwanted default namespace xmldom/xmldom#377

Closed

karfau mentioned this issue Feb 21, 2022

chore: Bump xmldom to 0.8.0 bbyars/mountebank#660

Merged

JLRishe mentioned this issue Jun 20, 2022

An example of using it with regular HTML? #115

Closed

marvinruder mentioned this issue Oct 26, 2023

No type definition available for xpath.parse(…) #133

Open

chris48s mentioned this issue Oct 11, 2024

[DynamicXml] parse doc as html if served with text/html content type badges/shields#10607

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No matches when searching by tagName in XMLDOM's text/html DOM #27

No matches when searching by tagName in XMLDOM's text/html DOM #27

moll commented Feb 17, 2015

moll commented Feb 17, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 19, 2015

blahah commented May 6, 2015

indianscout commented Apr 8, 2016

JLRishe commented Apr 9, 2016

JLRishe commented Nov 25, 2017

sdeprez commented Feb 16, 2022 •

edited

Loading

JLRishe commented Feb 16, 2022

sdeprez commented Feb 16, 2022 •

edited

Loading

JLRishe commented Mar 8, 2022 •

edited

Loading

No matches when searching by tagName in XMLDOM's text/html DOM #27

No matches when searching by tagName in XMLDOM's text/html DOM #27

Comments

moll commented Feb 17, 2015

moll commented Feb 17, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 18, 2015

moll commented Feb 18, 2015

JLRishe commented Feb 19, 2015

blahah commented May 6, 2015

indianscout commented Apr 8, 2016

JLRishe commented Apr 9, 2016

JLRishe commented Nov 25, 2017

sdeprez commented Feb 16, 2022 • edited Loading

JLRishe commented Feb 16, 2022

sdeprez commented Feb 16, 2022 • edited Loading

JLRishe commented Mar 8, 2022 • edited Loading

sdeprez commented Feb 16, 2022 •

edited

Loading

sdeprez commented Feb 16, 2022 •

edited

Loading

JLRishe commented Mar 8, 2022 •

edited

Loading