
Add scrapeIt.fromStream function #83

Closed
ComFreek opened this issue Dec 23, 2017 · 7 comments

Comments

@ComFreek
Contributor

ComFreek commented Dec 23, 2017

What do you think of a fromStream function which feeds the HTML parser chunk by chunk instead of allocating one big string in memory?

I have already implemented it and am using it in one of my projects: master...ComFreek:master.

I also have a unit test at hand; it requires an upgrade of the tester package, as already mentioned in IonicaBizau/tester#15.

t.it("scrape from stream", () => {
    return t.expect(new Promise(resolve => {
        http.get(URL, response => {
            resolve(scrapeIt.fromStream(response, {
                title: "h1.title"
              , description: ".description"
              , date: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }
            }))
        })
    })).resolves.toEqual({
        title: "Title"
      , description: "Lorem ipsum"
      , date: new Date("1988-01-01")
    })
})

Open points:

@IonicaBizau
Owner

Nice work!

However:

> This function feeds the employed DOM parser chunk by chunk and therefore avoids
> fully buffering the HTML in memory.

When passing the dom variable to scrapeHTML, doesn't that mean it was stored in memory already?

That's pretty much what tinyreq does: https://github.com/IonicaBizau/tinyreq/blob/a7b01faf2b725953d353a6dc3543279469a4c941/lib/index.js#L87-L92

I'm not sure how this helps, since the HTML will be passed as a variable anyways.

@ComFreek
Contributor Author

ComFreek commented Dec 23, 2017

My code only stores the DOM tree in memory, not the HTML string. I think the DOM is necessary for CSS-like selectors in scrape-it anyway.*

*) Actually, this exercise seems quite interesting: "Given a set of CSS selectors, is there a program that extracts the results during sequential HTML parsing?" Or: "Can we pre-compile a set of selectors so that SAX-like parsing suffices to extract all matches?"

@IonicaBizau
Owner

> My code only stores the DOM tree in memory, not the HTML string. I think the DOM is necessary for CSS-like selectors in scrape-it anyway.*

That sounds good. But, is that tree compatible with Cheerio? It's still unclear to me if dom is in the end stringified again and then processed by cheerio.

@ComFreek
Contributor Author

| Description | Source |
| --- | --- |
| We call `Cheerio.load(dom)`. | `Cheerio.load` function |
| It calls the function `parse(dom, someOptions)`. | line with call |
| `parse(content, options)` now calls `evaluate(content, options)`. | line with call |
| `evaluate(content, options)` will just pass through `content` if a DOM tree has already been provided. | an `if` condition catching DOM trees |

In the course of creating that table myself, I realized that Cheerio does indeed add its <root> tag (line 3 of the table). So the warning in my .fromStream docblock is wrong and unnecessary.

The only remaining thing to be wary of is that Cheerio.load's public API only states that it accepts a string, not a DOM. I have now opened an issue for that: cheeriojs/cheerio#1126

@IonicaBizau
Owner

@ComFreek Sounds good! 👍 Should we wait for the stream function to be implemented in Cheerio?

@ComFreek
Contributor Author

ComFreek commented Jan 14, 2018

According to this comment, my code should actually use parse5 instead of the older htmlparser2, which has apparently only been retained in Cheerio for backwards compatibility.

I think we could adopt a scrapeIt.fromStream function (rewritten with parse5) and open a PR at Cheerio at the same time. (I don't think I'll get to it this week; feel free to go ahead.)

Maybe at this point it should also be considered how scrapeIt handles HTML fragments. Should there be a difference in the API call, e.g. scrapeIt(<URL to full doc>) and scrapeIt.fromFragment(<URL to HTML part only>)? The module parse5 does draw a line there: inikulin/parse5#227. I am not sure how it would parse HTML containing neither <html> nor <body> in its 'normal' mode.

@IonicaBizau
Owner

I am going to close this, but contributions are welcome via PRs! 🚀
