
Add scrapeIt.fromStream function #83

Closed
ComFreek opened this issue Dec 23, 2017 · 7 comments

Comments

@ComFreek
Contributor

ComFreek commented Dec 23, 2017

What do you think of a fromStream function which feeds the HTML parser chunk by chunk instead of allocating one big string in memory?

I have already implemented it and am using it in one of my projects: master...ComFreek:master.

I also have a unit test at hand; it requires an upgrade of the tester package, as already mentioned in IonicaBizau/tester#15.

t.it("scrape from stream", () => {
    return t.expect(new Promise(resolve => {
        http.get(URL, response => {
            resolve(scrapeIt.fromStream(response, {
                title: "h1.title"
              , description: ".description"
              , date: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }
            }))
        })
    })).resolves.toEqual({
        title: "Title"
      , description: "Lorem ipsum"
      , date: new Date("1988-01-01")
    })
})

Open points:

@IonicaBizau
Owner

Nice work!

However:

> This function feeds the employed DOM parser chunk by chunk and therefore avoids
> fully buffering the HTML in memory.

When passing the dom variable to scrapeHTML, doesn't that mean it was stored in memory already?

That's pretty much what tinyreq does: https://github.com/IonicaBizau/tinyreq/blob/a7b01faf2b725953d353a6dc3543279469a4c941/lib/index.js#L87-L92

I'm not sure how this helps, since the HTML will be passed as a variable anyways.

@ComFreek
Contributor Author

ComFreek commented Dec 23, 2017

My code only stores the DOM tree in memory, not the HTML string. I think the DOM is necessary for CSS-like selectors in scrape-it anyway.*

*) Actually, this exercise seems quite interesting: "Given a set of CSS selectors, is there a program that extracts the results during sequential HTML parsing?" Or: "Can we pre-compile a set of selectors so that SAX-like parsing suffices to extract all matches?"

@IonicaBizau
Owner

> My code only stores the DOM tree in memory, not the HTML string. I think the DOM is necessary for CSS-like selectors in scrape-it anyway.*

That sounds good. But, is that tree compatible with Cheerio? It's still unclear to me if dom is in the end stringified again and then processed by cheerio.

@ComFreek
Contributor Author

| Description | Source |
| --- | --- |
| We call `Cheerio.load(dom)`. | `Cheerio.load` function |
| It calls the function `parse(dom, someOptions)`. | line with call |
| `parse(content, options)` now calls `evaluate(content, options)`. | line with call |
| `evaluate(content, options)` will just pass through `content` if a DOM tree has already been provided. | an `if` condition catching DOM trees |

In the course of creating that table myself, I realized that Cheerio does indeed add its <root> tag (line 3 of the table). So the warning in my .fromStream docblock is wrong and unnecessary.

The only remaining thing to be wary of is that Cheerio.load's public API only states that it accepts a string, not a DOM. I have now opened an issue for that: cheeriojs/cheerio#1126

@IonicaBizau
Owner

@ComFreek Sounds good! 👍 Should we wait for the stream function to be implemented in Cheerio?

@ComFreek
Contributor Author

ComFreek commented Jan 14, 2018

According to this comment, my code should actually use parse5 instead of the older htmlparser2, which has apparently only been retained in Cheerio for backwards compatibility.

I think we could adopt a scrapeIt.fromStream function (rewritten with parse5) and open a PR at Cheerio at the same time. (I don't think I'll get to it this week; feel free to go ahead.)

Maybe at this point it should also be considered how scrapeIt handles HTML fragments. Should there be a difference in the API call, e.g. scrapeIt(<URL to full doc>) and scrapeIt.fromFragment(<URL to HTML part only>)? The module parse5 does draw a line there: inikulin/parse5#227. I am not sure how it would parse HTML containing neither <html> nor <body> in its 'normal' mode.

@IonicaBizau
Owner

I am going to close this, but contributions are welcome via PRs! 🚀
