.stream(cb) method #99

Closed
fb55 opened this issue Sep 9, 2012 · 9 comments

@fb55
Member

fb55 commented Sep 9, 2012

Just as an idea: the parser could do much more if it actually received a stream of data. This would allow the DOM to be built while IO is happening, which would speed up initial loading (and more could be done inside DomHandler).

There is already a WritableStream.js file shipped with htmlparser2 (it's accessible via require("htmlparser2").WritableStream) that pretty much solves all problems. The implementation of the cheerio method could look like this:

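var htmlparser2 = require("htmlparser2");
cheerio.createWritableStream = function(cb, options){
  // DomHandler invokes its callback as (err, dom) once parsing has finished
  var handler = new htmlparser2.DomHandler(function(err, dom){
    cb(err, err ? null : cheerio(dom));
  }, options);
  return new htmlparser2.WritableStream(handler, options);
};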
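
As a rough illustration of the idea above (do the parsing work while chunks are still arriving, then hand over the finished result once), here is a minimal sketch that uses only Node's built-in stream module. TagCollector and its tag-name extraction are hypothetical stand-ins for htmlparser2's parser and DomHandler, not their actual implementation, and the createWritableStream function below only mirrors the proposed name; none of it is cheerio API.

```javascript
const { Writable } = require("stream");

// Hypothetical stand-in for a real HTML parser: it only collects opening tag
// names, but it does so incrementally, chunk by chunk.
class TagCollector {
  constructor() {
    this.pending = ""; // unfinished tag text carried over from the previous chunk
    this.tags = [];
  }
  write(chunk) {
    const text = this.pending + chunk;
    // Text after a trailing "<" with no closing ">" may be an incomplete tag.
    const cut = text.lastIndexOf("<");
    const complete = cut > text.lastIndexOf(">") ? text.slice(0, cut) : text;
    this.pending = text.slice(complete.length);
    for (const match of complete.matchAll(/<([a-zA-Z][\w-]*)/g)) {
      this.tags.push(match[1].toLowerCase());
    }
  }
  end() {
    return this.tags;
  }
}

// Wraps the collector in a Writable so that any readable stream can be piped in.
// The callback fires once, after the last chunk has been processed.
function createWritableStream(cb) {
  const collector = new TagCollector();
  return new Writable({
    decodeStrings: false,
    write(chunk, encoding, done) {
      collector.write(chunk.toString());
      done();
    },
    final(done) {
      cb(null, collector.end());
      done();
    },
  });
}

// Usage: the tag "<li>" is split across two chunks on purpose.
const stream = createWritableStream((err, tags) => console.log(tags));
stream.write("<ul><li>a</li><l");
stream.end("i>b</li></ul>");
```

The point of the sketch is only the shape of the interface: each chunk is processed as it arrives, so little work is left by the time the download finishes.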
@matthewmueller
Member

Cool, I like this idea - I'm just not sure how useful this would be. It would produce unexpected results if I tried to run a $('li') on a partially streamed file.

@fb55
Member Author

fb55 commented Sep 13, 2012

Well, as far as I know, most people are currently using e.g. request to download a file, only to open it directly with cheerio. Having a streaming method would allow a much nicer and speedier creation of the DOM, and a much more node-y interface.

2012/9/12 Matt Mueller notifications@github.com

Cool, I like this idea - I'm just not sure how useful this would be. It
would produce unexpected results if I tried to run a $('li') on a
partially streamed file.



@matthewmueller
Member

Right, but how would you actually run queries on a half-parsed DOM?

The only use case I can see is when you're looking for something specific, e.g. $("title").text(): as soon as it gets parsed, you could return it and stop. That would require some major rework of the library, though, and for something like that it might be better to just use node-htmlparser2 directly.
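
To make the "return it and stop" use case concrete, here is a minimal sketch under simplifying assumptions: findTitle is a hypothetical helper, the chunks are plain strings, and a regular expression stands in for a real parser. It is not how cheerio or htmlparser2 work internally.

```javascript
// Scan chunks in arrival order and stop as soon as a complete <title> is seen.
// Re-running the regex on the whole buffer is wasteful; a real streaming parser
// would keep state between chunks instead.
function findTitle(chunks) {
  let buffered = "";
  let consumed = 0;
  for (const chunk of chunks) {
    buffered += chunk;
    consumed += 1;
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(buffered);
    if (match) {
      // Early exit: the remaining chunks are never looked at.
      return { title: match[1].trim(), consumed };
    }
  }
  return { title: null, consumed };
}

const chunks = [
  "<html><head><ti",
  "tle>Example</title></head>",
  "<body>rest of the page</body></html>",
];
const result = findTitle(chunks);
console.log(result.title, result.consumed);
```

Here the title is split across the first two chunks, so the function returns after consuming two of the three chunks and never touches the body.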

@fb55
Member Author

fb55 commented Sep 23, 2012

You misunderstood me: The idea was to parse data while the user is still waiting for the next chunk to arrive. This way, the DOM will be available immediately after the download of the page is complete.

Running queries isn't hard, though: I solved it yesterday with fb55/node-cornet :)

@matthewmueller
Member

I've been thinking about this more and more lately. It would be awesome to match selectors against elements as they come through. Right now I'm thinking the API could be:

var $ = cheerio.stream('http://google.com');
$.on('.logo', function($) {
  console.log($.html());
});

@fb55 do you think this is feasible?

@davidchambers
Contributor

Well, as far as I know, most people are currently using e.g. request to download a file, only to open it directly with cheerio.

Irrespective of the streaming functionality, it would be great if cheerio provided a way to create a "DOM" from a URL. As @fb55 stated, this is no doubt a very common use case.

@matthewmueller
Member

Looking back at my example, I think adding URL-fetching functionality is a bit leaky (do we then support headers, which request methods, etc.?).

It would be nice to add a streaming interface though, as @fb55 did with cornet. Perhaps more along the lines of:

var $ = cheerio.stream();
minreq.get("http://github.com/fb55").pipe($)
$.on(...)

@fb55
Member Author

fb55 commented Jun 9, 2013

@matthewmueller First of all, on is probably the last name that method should have :) (edit: how about find?)

Secondly, cheerio would have to wait until the entire DOM is present, as it calls the method with an array of results (cornet only passes a single element at a time). That would stop people from getting confused, with the benefit of the pauses between IO being used for actual work.

Finally, the implementation of this should be pretty straightforward, probably about as complex as cornet (which has 30 LOC).
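
The difference between the two delivery styles described above (one element at a time versus a single call with an array of results once the whole DOM is present) can be sketched as follows. StreamingMatcher, its onElement and onDone hooks, and the plain-object "elements" are all hypothetical and exist only for illustration; this is neither cornet's nor cheerio's API.

```javascript
// Hypothetical matcher: elements are pushed one by one, the way a streaming
// parser would emit them, and both delivery styles are supported.
class StreamingMatcher {
  constructor(predicate) {
    this.predicate = predicate;
    this.matches = [];
    this.onElement = null; // per-element style: called once per match
    this.onDone = null;    // batch style: called once with all matches
  }
  push(element) {
    if (!this.predicate(element)) return;
    this.matches.push(element);
    if (this.onElement) this.onElement(element);
  }
  end() {
    if (this.onDone) this.onDone(this.matches.slice());
  }
}

const seen = [];
let batch = null;
const matcher = new StreamingMatcher((el) => el.name === "li");
matcher.onElement = (el) => seen.push(el.text);
matcher.onDone = (all) => { batch = all.map((el) => el.text); };

const elements = [
  { name: "ul", text: "" },
  { name: "li", text: "a" },
  { name: "li", text: "b" },
];
elements.forEach((el) => matcher.push(el));

// The per-element handler has already run twice; the batch handler has not run yet.
const batchBeforeEnd = batch;
matcher.end();
```

In the batch style, matching can still happen while data is arriving, but user code only ever sees the complete result set, which avoids the confusion of querying a half-parsed document.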

fb55 added the Feature label Apr 8, 2014
fb55 mentioned this issue Dec 31, 2014
fb55 changed the title from ".createWritableStream(cb) method" to ".stream(cb) method" May 3, 2022
@fb55
Member Author

fb55 commented May 11, 2022

Closing in favour of #2051.

fb55 closed this as completed May 11, 2022
3 participants