
Parse from stream #618

Closed
LinusU opened this issue Dec 30, 2014 · 6 comments
Comments

@LinusU

LinusU commented Dec 30, 2014

Something I often do with Cheerio is load a page from the internet and extract some information from it. Currently I have to do something like this:

var http = require('http');
var cheerio = require('cheerio');

function loadCheerio(options, cb) {
  var req = http.request(options);

  req.on('error', cb);
  req.on('response', function (res) {
    var chunks = [];

    res.on('error', cb);
    res.on('data', chunks.push.bind(chunks));
    res.on('end', function () {
      // Buffer the whole response body, then hand it to Cheerio as one string
      var data = Buffer.concat(chunks);
      var str = data.toString();

      cb(null, cheerio.load(str));
    });
  });

  req.end();
}

This isn't very good because it buffers the entire HTML in memory before parsing it, which means the entire DOM and the entire HTML string are in memory at the same time.

The parser that Cheerio uses already supports streaming input. With that, only one chunk of HTML would need to be held at a time: each chunk could be parsed and then discarded, freeing memory right away. It would also let parsing start as soon as the first chunk of HTML arrives, so parsing completes sooner.

I would like to be able to do something like this:

function loadCheerio(options, cb) {
  var req = http.request(options);

  req.on('error', cb);
  req.on('response', function (res) {
    var parser = res.pipe(cheerio.load());

    parser.on('error', cb);
    parser.on('ready', function ($) { cb(null, $); });
  });

  req.end();
}

Maybe even something more convenient like: cheerio.loadStream(res, cb);
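
As a point of reference, here is a minimal sketch of how htmlparser2 (the parser Cheerio uses) can already be fed chunk by chunk; the loadCheerioStream name is made up for illustration, and handing the resulting DOM straight to cheerio.load() is an assumption that may vary by Cheerio version:

var htmlparser2 = require('htmlparser2');
var cheerio = require('cheerio');

function loadCheerioStream(res, cb) {
  // DomHandler collects nodes as the parser emits them and calls back when parsing ends
  var handler = new htmlparser2.DomHandler(function (err, dom) {
    if (err) return cb(err);
    cb(null, cheerio.load(dom)); // assumption: load() accepts a parsed DOM
  });

  var parser = new htmlparser2.Parser(handler);

  res.on('error', cb);
  res.on('data', function (chunk) { parser.write(chunk.toString()); });
  res.on('end', function () { parser.end(); });
}

Each chunk is handed to the parser as soon as it arrives, so parsing overlaps with the download instead of waiting for the full body.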

@Soviut

Soviut commented Dec 31, 2014

+1

@fb55
Member

fb55 commented Dec 31, 2014

We already discussed this in #442. As you said, the parser already supports streaming, and it would even be possible (using promises) to do querying while data is received. However, due to a bug in V8's garbage collector (slices force the original string to be preserved), the entire document would still be present in memory, as well as the parsed DOM structure, leaving improved querying and parsing performance (due to earlier execution) as the only advantage.

When it comes to querying, selectors would need to be analysed to figure out at which point they can be evaluated, and promises add some additional overhead, possibly diminishing the advantage. The implementation would also be quite challenging.

What could definitely be done is to add a .stream method, as I suggested in #99, which simply returns a parser instance. If someone wants to send a pull request, I'll be the first to support it. The implementation suggested in that issue is fine (though a shorter method name would be nice), but it would need at least one simple test case.
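
For illustration only, a rough sketch (not Cheerio's actual API) of what such a method could look like, built on htmlparser2's WritableStream; the stream name and callback signature are assumptions, passing a parsed DOM to cheerio.load() may need adjusting per version, and depending on the htmlparser2 version WritableStream may live in a separate entry point:

var htmlparser2 = require('htmlparser2');
var cheerio = require('cheerio');

// Returns a writable stream; pipe HTML into it and receive a loaded document via cb.
function stream(cb) {
  var handler = new htmlparser2.DomHandler(function (err, dom) {
    if (err) return cb(err);
    cb(null, cheerio.load(dom)); // assumption: load() accepts a parsed DOM
  });

  return new htmlparser2.WritableStream(handler);
}

// Usage: res.pipe(stream(function (err, $) { /* query $ here */ }));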

@LinusU
Author

LinusU commented Jan 1, 2015

I don't think that we would be affected by the V8 bug since the strings aren't slices of a larger string. The chunks are given to us from the OS and have no relation to each other.

Personally I don't see a use case for querying before the entire DOM is ready. #99 is exactly what I want, though. Hopefully I can find some time to take a stab at that as well.

Being able to do cheerio.load('http://google.com/', function (err, $) { ... }); would be really cool. It could easily be implemented with https://github.com/feross/simple-get, which, as of a few hours ago, supports POST and all the other methods. It would let us keep the same API that the built-in http module provides while still being super easy to use.
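
For illustration, a hedged sketch of how that overload could sit on top of simple-get (the loadUrl name is made up; note that get.concat buffers the whole body, so this shows the API shape rather than the streaming parse discussed above):

var get = require('simple-get');
var cheerio = require('cheerio');

function loadUrl(url, cb) {
  // get.concat collects the response body into a single Buffer
  get.concat(url, function (err, res, data) {
    if (err) return cb(err);
    cb(null, cheerio.load(data.toString()));
  });
}

// loadUrl('http://google.com/', function (err, $) { /* ... */ });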

@LinusU
Author

LinusU commented Jan 1, 2015

See #619 .createWriteStream(options)

@LinusU
Author

LinusU commented Jan 1, 2015

See #620 .load(url, cb)

fb55 closed this as completed Feb 22, 2015
@ComFreek

ComFreek commented Dec 6, 2017

I would like to dig this issue up again.
Personally, I still miss the ability to load a document directly from a stream with Cheerio. I know this is possible with htmlparser2, as implemented in the unmerged PR #619: https://github.com/LinusU/cheerio/blob/6f7da8f1815cebd2307273d56258bb1ccaa77376/lib/parse.js#L87

I agree with the conclusions of #620, though: Actually creating a readable stream from a URI should not be Cheerio's responsibility.
