Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No matches when searching by tagName in XMLDOM's text/html DOM #27

Closed
moll opened this issue Feb 17, 2015 · 16 comments
Closed

No matches when searching by tagName in XMLDOM's text/html DOM #27

moll opened this issue Feb 17, 2015 · 16 comments

Comments

@moll
Copy link

moll commented Feb 17, 2015

Hey,

Giving Xpath a spin with XMLDOM. I can't seem to get it to match any elements by tag name when the DOM's MIME type is set to text/html. Is it supposed to work?

var Xpath = require("xpath")
var DomParser = require("xmldom").DOMParser

var html = "<html><head><title>Harry Potter</title></head></html>"
var doc = new DomParser().parseFromString(html, "text/html")
console.log(Xpath.select("//title", doc))

The above returns an empty array. Add a random attribute to an element, try to match by that instead (//*[@foo=123]) and it works.

Thanks!

@moll
Copy link
Author

moll commented Feb 17, 2015

I've tracked this down to the difference between namespaceURI. DOM elements all have it set to http://www.w3.org/1999/xhtml, but the Xpath query doesn't. Xpath also doesn't apply the default namespace to query tag names.

Would you mind changing Xpath so it would? Prefixing every tag in the query with the XHTML namespace is rather cumbersome. Thanks!

@JLRishe
Copy link
Collaborator

JLRishe commented Feb 18, 2015

@moll What is it exactly that you are suggesting be changed? If the nodes have a namespace, then named node tests for them must use a namespace prefix:

http://www.w3.org/TR/xpath/#node-tests

A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. _This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded)._

@moll
Copy link
Author

moll commented Feb 18, 2015

Perhaps a way to set a default namespace for unqualified names. The useNamespaces feature is already there. It's just not available for queries.

Try the example above. It's rather cumbersome to prefix every tag name with either the full namespace URI or an alias -> //div/h3/span turns into //html:div/html:h3/html:span in the best case.

@JLRishe
Copy link
Collaborator

JLRishe commented Feb 18, 2015

As someone who uses XPath a lot, I don't see it as all that cumbersome to use prefixes most of the time, but you definitely have a valid point in that the original API had a useNamespaces option.

(Incidentally, the best case is //h:div/h:h3/h:span; you should not be limited to using any particular prefix)

I don't currently have collaborator access to this repo and the owner doesn't seem to be checking pull requests, but I have actually just sent an e-mail to the owner asking whether they could make me a collaborator. If that happens, I would be willing to expose the useNamespaces option via the external wrapper.

@moll
Copy link
Author

moll commented Feb 18, 2015

TheuseNamespaces is already available here, too. It's just that its support for the default namespace isn't enabled for query names. I created a wrapper with a regexp replace to auto-prefix names, but it's all rather silly when all you've got are nodes in the (x)HTML namespace.

@JLRishe
Copy link
Collaborator

JLRishe commented Feb 18, 2015

I see, I confused myself. The XPath evaluator has a "case insensitive mode" (which was never exposed externally), and I had misremembered this as an "ignore namespaces mode".

I would not be in favor of outright supporting non-null default namespaces, because as silly as it may seem, that would violate the XPath 1.0 spec. Most professional XPath implementations (.NET, most XSLT engines, etc.) do not allow using a default namespace because the spec is quite clear on the point that XPath 1.0 has no such capacity.

@moll
Copy link
Author

moll commented Feb 18, 2015

What's the default namespace feature for then? Querying HTML DOMs with Xpath here is simply too cumbersome out of the box. Why not make it more convenient is what I'm proposing. That's what computers are for — to handle repetitive tasks for us. ;-)

@JLRishe
Copy link
Collaborator

JLRishe commented Feb 19, 2015

What's the default namespace feature for then?

Which feature are you referring to? I see that the Utilities.resolveQName method has a useDefault parameter, but it is always passed in as false. Perhaps the original developer was using this at some point and then realized that doing so was incorrect.

Querying HTML DOMs with Xpath here is simply too cumbersome out of the box.

I don't really see what's so terribly cumbersome about adding an extra two characters to each step of an XPath. People who work with XML do this all the time.

Why not make it more convenient is what I'm proposing.

Because, as I already said, it would violate the XPath 1.0 spec. I think spec compliance is a worthwhile thing, especially with a library like this. It could also certainly have unintended consequences for someone using the library and expecting it to behave according to the spec.

That's what computers are for — to handle repetitive tasks for us.

Yes, but not at the expense of spec compliance.

@blahah
Copy link

blahah commented May 6, 2015

I don't really see what's so terribly cumbersome about adding an extra two characters to each step of an XPath. People who work with XML do this all the time.

It's only practical to add namespaces to queries when working with tightly controlled corpora. An example scenario is when you have a set of XPath selectors that you want to apply to a large set of HTML documents. Those documents may or may not be explicitly namespaced, and namespaces may differ between documents (HTML4/xhtml etc.). In this case it's not possible to add namespaces to queries so that they apply to the whole corpus, but by using default namespaces you can resolve the problem with minimal hacking.

@indianscout
Copy link

Same issue here. I agree with @moss. The following sould work, but it does not:

> var xmldom = require('xmldom');
> var xpath = require('xpath')
> var parser = new xmldom.DOMParser();
> var doc = parser.parseFromString('<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 'text/html');
> xpath.select('//body', doc);
[]
> var doc = parser.parseFromString('<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 'text/xml');
> xpath.select('//body', doc);
[ Element {
    _nsMap: {},
    attributes: { _ownerElement: [Circular] },
    childNodes: {},
    ownerDocument: 
     Document {
...
> 

In my opinion xpath should return the body element also on documents with text/html... text/html is a valid mime type (https://developer.mozilla.org/en/docs/Web/API/DOMParser).

Or is there any good reason for xpath to refuse to operate on text/html?

@JLRishe
Copy link
Collaborator

JLRishe commented Apr 9, 2016

@indianscout

Or is there any good reason for xpath to refuse to operate on text/html?

This library operates on text/html just fine. It is simply strict with regards to namespaces, and html elements are in the http://www.w3.org/1999/xhtml namespace.

The following works:

var xmldom = require('xmldom');
var xpath = require('xpath')

var parser = new xmldom.DOMParser();
var doc = parser.parseFromString(
              '<html lang=\"en\"><head><meta charset=\"utf-8\" /></head><body></body></html>', 
              'text/html');

var select = xpath.useNamespaces({ h: 'http://www.w3.org/1999/xhtml' });
var body = select('//h:body', doc);
console.log(body);

@JLRishe
Copy link
Collaborator

JLRishe commented Nov 25, 2017

I'm going to keep this open a little longer for a few more potential changes.

@sdeprez
Copy link

sdeprez commented Feb 16, 2022

👋 hi, just to add some content to that because it also got me confused (and initially I thought it was a bug of xmldom and I opened xmldom/xmldom#377). To summarize the issue:

  1. when you parse doc with type text/html, xmldom (but also the browser) adds a default namespace http://www.w3.org/1999/xhtml
  2. when you query that doc, a simple xpath //input won't work because of namespace

IIUC, 2. won't work because this library is strict with xpath 1.0 specs (which makes sense), so we need to provide the namespace. The reason it's working in browsers, is that browsers follow the the html spec, which adds a modification ("willful violation" as they say) to this spec to specifically handle that case.

I'm not sure how feasible is it to add a boolean param / mode to also support that modified spec here, but at the very least, it'd be nice to document that behaviour, because it's very easy to fall into that trap, and it's easy to get lost because browsers work differently. I can work on a PR if you agree on the analysis.

@JLRishe
Copy link
Collaborator

JLRishe commented Feb 16, 2022

@sdeprez I can't find any place where this is documented or mentioned explicitly, but cae87df added an isHtml option which allows you to use this library in "HTML Mode" and not have to explicitly declare the html namespace. It requires using parsed expressions.

const path = xpath.parse('//input');
const nodes = path.select({ node: myHtmlDom, isHtml: true });

You can see some examples of this in this repo's test.js file.

@sdeprez
Copy link

sdeprez commented Feb 16, 2022

Thanks! It looks like isHtml is indeed what I need.

On the other hand, I needed to use .evaluate and it took me some time to understand the usage of require('xpath').evaluate(... vs xpath.parse('//input').evaluate(... since they behave differently, I also noticed that typescript typings are not mentioning .parse methods. Would it be possible to add:

  • a note about this html quirk and isHtml on the main readme
  • add .evaluate in api docs
  • add parse and following types (including isHtml) in typescript typings

I can work on a PR if you agree on the principle.

[EDIT]: actually about using xpath.parse('//input').evaluate, I'm not sure about the type of the result. It looks like it can return a list of nodes (for instance with //input), or a XNumber (for instance count(//input)), etc., but there is no resultType that I can check to know what is returned, do you know how could I do it?

@JLRishe
Copy link
Collaborator

JLRishe commented Mar 8, 2022

@sdeprez Sorry for not getting back to you sooner.

We actually have an API page that mentions the .evaluate() method, but it takes a bit of digging to get there:

api docs -> link under xpath.parse -> "documentation page" link

Maybe it would be better to have XPathEvaluator on the Parsed Expressions page to link to the API doc for XPath Evaluator, and provide a bit more explanatory text and examples on the Parsed Expressions page. I'm open to suggestions you have to make the information more discoverable.

The XPathEvaluator page also doesn't include information about the isHtml option, so it would be good to add that.

I'd be happy to look at a PR if you can submit one. Thanks!

[EDIT]: actually about using xpath.parse('//input').evaluate, I'm not sure about the type of the result. It looks like it can return a list of nodes (for instance with //input), or a XNumber (for instance count(//input)), etc., but there is no resultType that I can check to know what is returned, do you know how could I do it?

As indicated on the page linked above, it follows the same behavior as xpath.select. If you would like to select into a specific type, you can use one of the other evaluateNNN() methods mentioned on the same page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants