Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected result while using .find(">") on a string containing <html> tag #780

Closed
Delgan opened this issue Nov 14, 2015 · 6 comments
Closed

Comments

@Delgan
Copy link
Contributor

Delgan commented Nov 14, 2015

Hello.

I had not had the opportunity to say it so far, but I wanted first of all to thank you for the wonderful tool that is Cheerio.js. You have done an amazing job, I can not express how much your work is valuable.
Really, thank you.

I recently encountered an unexpected behavior of Cheerio.js which I suspect to be a bug.

I opened a question on StackOverflow but I decided to come here and report this as an issue.

I try to parse HTML:

<html>
    <head></head>
    <body>
        <div>
            <table>
                <tr>
                    <td>
                        <a href="/link_1.php">Link 1</a>
                    </td>
                    <td>
                        <a href="/link_2.php">Link 2</a>
                        <a href="/link_3.php">Link 3</a>
                    </td>
                    <td>
                        <a href="/link_4.php">Link 4</a>
                        <a href="/link_5.php">Link 5</a>
                    </td>
                </tr>
            </table>
        </div>
    </body>
</html>

Using this code:

var cheerio = require("cheerio");
var $ = cheerio.load(html);
var page = $.root();

var tr = page.find("tr");

console.log(tr.find("> :nth-child(2) a").length);

What happens is that this code returns 5 whereas the second child of the <tr> tag contains only 2 links.

A test case can be found here.
I also tried to remove $.root() and to use $.parseHTML() but the issue is still the same.

As you can see here, jQuery works as expected.
Maybe is it related to a limitation of selecting direct children using querySelector()?

What is strange is that removing <html> tag seems to work, it returns 2.
Moreover, tr.find("> :nth-child(2)").find("a") works just fine too.

I really do not see what is wrong with my first code, so I opened this issue hoping you can help me and confirm whether or not it is a bug.

@Delgan Delgan changed the title Unexpected result while using .find(">") on a string containing <html> tag Unexpected result while using .find(">") on a string containing <html> tag Nov 14, 2015
@fb55
Copy link
Member

fb55 commented Nov 15, 2015

That's due to the shipped version of css-select not having proper support for contexts. The current version supports this, but has a severe bug left.

@Delgan
Copy link
Contributor Author

Delgan commented Dec 22, 2015

Hey @fb55, I am sorry to bother you again with this closed issue but I have a problem: the update of css-select do not seems to fix it. 😨

What I did is that I browserified (just to being able to share a jsFiddle) the last committed version of Cheerio, but with 3 differentes release of css-select:

v1.0.0 is before the fix, 1.2.0 is the last release and I added the 1.1.0 because I wanted to be sure that it was not my changes which caused a regression.

Now, back to the example I provided in jsFiddle:

We can see that using v1.0.0 returns 5 as stated before. But the two last versions returns 0 instead of 2.

So, as this issue seemed very close to this one, I made the tests.

The first case returns 2, which is wrong. The two others versions return 1 which is the correct result.
But if we look closer, we can see that the returned element is not the expected one.
$('foo').find('> bar') should return the immediate children of <foo>, but instead, it returns the deepest one which as no other descendant. So the fix to css-select did not seem to work when used into Cheerio.

Finally, I installed css-select module only and I tried the same tests by using CSSselect(), without involving Cheerio. As expected, it worked just fine.

I looked into Cheerio code and I saw this:

exports.find = function(selectorOrHaystack) {
  var elems = _.reduce(this, function(memo, elem) {
    return memo.concat(_.filter(elem.children, isTag));
  }, []);

  (...)

  return this._make(select(selectorOrHaystack, elems, this.options));
};

What happens is that elems contains the immediate children of the element to which is called .find(). For example in my case, it is an array containg the three <td> elements.
From what I understand, if these elements are passed to select() (which is CSSselect()) with the selector "> " then it will not work because each <td> will be considered as the context while it should be <tr>, and as there is no <td> immediate children of a <td>, it will return nothing.

So, did I misunderstand something or is this a issue which should be fixed?
You wrote that this was fixed in the css-select repo and indeed it is. But I think my mistake could be that I overinterpreted your words. In fact, it is not enough to just update the version of css-select, there are other changes to the code of Cheerio that should be implemented in order to obtain the expected behavior, is it?
Do you (or others members of CheerioJS) know which these changes are, or do you need someone to look at it?

Thank you for your time.

@fb55
Copy link
Member

fb55 commented Dec 22, 2015

@Delgan You're right. We need to pass the current elements to select as the context to make this work. Want to send a PR?

@fb55 fb55 reopened this Dec 22, 2015
@Delgan
Copy link
Contributor Author

Delgan commented Dec 23, 2015

@fb55 I do not know, I do not have as much knowledge about Cheerio than you, I do not want to break everything. 😅

It seems that root elements are special and still need .children. So I ended up with something like that:

var elems = _.reduce(this, function(memo, elem) {
  return memo.concat(_.filter(elem.type === 'root' ? elem.children : [elem], isTag))
}, [])

It looks that it solves the issue and passes all the tests, but is it the way you want this issue to be fixed?

@fb55
Copy link
Member

fb55 commented Dec 23, 2015

This way, the selected elements can be part of the result of .find, which should not be the case (and it's surprising that there is no test case for it).

One solution would be to pass {__proto__: this.options, context: this.toArray()} as the options to select.

@Delgan
Copy link
Contributor Author

Delgan commented Dec 23, 2015

@fb55 My bad, there is such a test case, I missed the file while running the tests, sorry. And indeed, it fails. But your workaround works just fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants