
Parse from stream #618

Closed
LinusU opened this issue Dec 30, 2014 · 6 comments
Comments

@LinusU

LinusU commented Dec 30, 2014

Something I often do with Cheerio is load a page from the internet and extract some information from it. Currently I have to do something like this:

var http = require('http');
var cheerio = require('cheerio');

function loadCheerio(options, cb) {
  var req = http.request(options);

  req.on('error', cb);
  req.on('response', function (res) {
    var chunks = [];

    res.on('error', cb);
    res.on('data', chunks.push.bind(chunks));
    res.on('end', function () {
      // Buffer the whole response body, then hand it to Cheerio as one string
      var data = Buffer.concat(chunks);
      var str = data.toString();

      cb(null, cheerio.load(str));
    });
  });

  req.end();
}

This isn't very good because it buffers the entire HTML in memory before parsing it, which means the entire DOM and the entire HTML string are in memory at the same time.

The parser that Cheerio uses already supports streaming input. With that, only one chunk of HTML would need to be held at a time: each chunk could be parsed and then discarded, freeing memory right away. It would also let parsing start as soon as the first chunk of HTML arrives, so parsing completes sooner.

I would like to be able to do something like this:

function loadCheerio(options, cb) {
  var req = http.request(options);

  req.on('error', cb);
  req.on('response', function (res) {
    var parser = res.pipe(cheerio.load());

    parser.on('error', cb);
    parser.on('ready', function ($) { cb(null, $); });
  });

  req.end();
}

Maybe even something more convenient like: cheerio.loadStream(res, cb);
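
As a point of reference, here is a minimal sketch of how htmlparser2 (the parser Cheerio uses) can already be fed chunk by chunk; the loadCheerioStream name is made up for illustration, and handing the resulting DOM straight to cheerio.load() is an assumption that may vary by Cheerio version:

var htmlparser2 = require('htmlparser2');
var cheerio = require('cheerio');

function loadCheerioStream(res, cb) {
  // DomHandler collects nodes as the parser emits them and calls back when parsing ends
  var handler = new htmlparser2.DomHandler(function (err, dom) {
    if (err) return cb(err);
    cb(null, cheerio.load(dom)); // assumption: load() accepts a parsed DOM
  });

  var parser = new htmlparser2.Parser(handler);

  res.on('error', cb);
  res.on('data', function (chunk) { parser.write(chunk.toString()); });
  res.on('end', function () { parser.end(); });
}

Each chunk is handed to the parser as soon as it arrives, so parsing overlaps with the download instead of waiting for the full body.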

@Soviut

Soviut commented Dec 31, 2014

+1

@fb55
Member

fb55 commented Dec 31, 2014

We already discussed this in #442. As you said, the parser already supports streaming, and it would even be possible (using promises) to do querying while data is received. However, due to a bug in V8's garbage collector (slices force the original string to be preserved), the entire document would still be present in memory, as well as the parsed DOM structure, leaving improved querying and parsing performance (due to earlier execution) as the only advantage.

When it comes to querying, selectors would need to be analysed to figure out at which point they can be evaluated, and promises add some additional overhead, possibly diminishing the advantage. The implementation would also be quite challenging.

What could definitely be done is to add a .stream method, as I suggested in #99, which simply returns a parser instance. If someone wants to send a pull request, I'll be the first to support it. The implementation suggested in that issue is fine (though a shorter method name would be nice), but it would need at least one simple test case.
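
For illustration only, a rough sketch (not Cheerio's actual API) of what such a method could look like, built on htmlparser2's WritableStream; the stream name and callback signature are assumptions, passing a parsed DOM to cheerio.load() may need adjusting per version, and depending on the htmlparser2 version WritableStream may live in a separate entry point:

var htmlparser2 = require('htmlparser2');
var cheerio = require('cheerio');

// Returns a writable stream; pipe HTML into it and receive a loaded document via cb.
function stream(cb) {
  var handler = new htmlparser2.DomHandler(function (err, dom) {
    if (err) return cb(err);
    cb(null, cheerio.load(dom)); // assumption: load() accepts a parsed DOM
  });

  return new htmlparser2.WritableStream(handler);
}

// Usage: res.pipe(stream(function (err, $) { /* query $ here */ }));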

@LinusU
Author

LinusU commented Jan 1, 2015

I don't think that we would be affected by the V8 bug since the strings aren't slices of a larger string. The chunks are given to us from the OS and have no relation to each other.

Personally I don't see a use case for querying before the entire DOM is ready. #99 is exactly what I want, though. Hopefully I can find some time to take a stab at that as well.

Being able to do cheerio.load('http://google.com/', function (err, $) { ... }); would be really cool. It could easily be implemented with https://github.com/feross/simple-get, which, as of a few hours ago, supports POST and all the other methods. It would let us keep the same API that the built-in http module provides while still being super easy to use.
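
For illustration, a hedged sketch of how that overload could sit on top of simple-get (the loadUrl name is made up; note that get.concat buffers the whole body, so this shows the API shape rather than the streaming parse discussed above):

var get = require('simple-get');
var cheerio = require('cheerio');

function loadUrl(url, cb) {
  // get.concat collects the response body into a single Buffer
  get.concat(url, function (err, res, data) {
    if (err) return cb(err);
    cb(null, cheerio.load(data.toString()));
  });
}

// loadUrl('http://google.com/', function (err, $) { /* ... */ });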

@LinusU
Author

LinusU commented Jan 1, 2015

See #619 .createWriteStream(options)

@LinusU
Author

LinusU commented Jan 1, 2015

See #620 .load(url, cb)

fb55 closed this as completed Feb 22, 2015
@ComFreek

ComFreek commented Dec 6, 2017

I would like to dig this issue up again.
Personally, I still miss the ability to load a document directly from a stream with Cheerio. I know this is possible with htmlparser2, as implemented in the unmerged PR #619: https://github.com/LinusU/cheerio/blob/6f7da8f1815cebd2307273d56258bb1ccaa77376/lib/parse.js#L87

I agree with the conclusions of #620, though: Actually creating a readable stream from a URI should not be Cheerio's responsibility.
