Guarantee a Cheerio.load(dom) overload #1126

Closed
ComFreek opened this issue Dec 25, 2017 · 8 comments

Comments

@ComFreek

ComFreek commented Dec 25, 2017

Since there is no built-in stream-reading method in Cheerio (see the discussion), I have built my own:

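const cheerio = require('cheerio');
// Assumption: `htmlparser` is the htmlparser2 package, which exports Parser and DomHandler.
const htmlparser = require('htmlparser2');

// Parses an HTML stream and resolves with a Cheerio instance built from the resulting DOM.
function fromStream(stream) {
	return new Promise((resolve, reject) => {
		// DomHandler invokes the callback once parsing has finished (or failed).
		const parser = new htmlparser.Parser(new htmlparser.DomHandler((err, dom) => {
			if (err) {
				reject(err);
			} else {
				resolve(cheerio.load(dom)); // <-- Not public API!
			}
		}));

		// Reject on errors from either the source stream or the parser.
		stream.on('error', reject)
			.pipe(parser)
			.on('error', reject);
	});
}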

Even though the call cheerio.load(dom) works*, it actually does not conform to Cheerio's public API, which states that load only accepts a string (cf. README, code).

Could the public API be extended to include a Cheerio.load(dom) overload, where dom is a DOM tree compatible with the output produced by htmlparser.DomHandler?

*) see IonicaBizau/scrape-it#83 (comment).
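
For comparison, one way to stay within the documented string-only API is to buffer the whole stream first and pass the resulting string to load. The following is a minimal sketch using only Node's built-in stream module; `streamToString` is a hypothetical helper name, not something Cheerio provides, and the approach holds the entire document in memory.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: buffers a readable stream into a single UTF-8 string.
function streamToString(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    // Collect raw bytes and decode once at the end, so that multi-byte
    // characters split across chunk boundaries are decoded correctly.
    stream.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
  });
}

// Demo with an in-memory stream that delivers the markup in two chunks.
const demo = streamToString(Readable.from(['<ul><li>a</li>', '<li>b</li></ul>']));
demo.then((html) => console.log(html));
```

With a file, the same helper could be used as `streamToString(fs.createReadStream(path)).then((html) => cheerio.load(html))`, assuming `fs` and `cheerio` have been required.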

@fb55
Member

fb55 commented Dec 25, 2017

For HTML, this would need to use http://inikulin.github.io/parse5/classes/parserstream.html; otherwise, happy to add this as an additional method (.stream or something)!

@ComFreek
Author

Great to hear!

I think .stream should also support streaming HTML fragments. This is something parse5 seems to be missing at the moment in ParserStream (cf. parseFragment), see inikulin/parse5#227.

PS: I've just realized that I constantly referred to the "old" master branch in my previous comment. Maybe it would be a good idea to directly link from NPM to the v1.0.0 branch or to mention it in master's README.

@coryarmbrecht

Glad to see there's development here! I just hit this snag while converting my synchronous Node script to streams. @ComFreek, I have looked at your nested links, but they are beyond my knowledge.

Is fragment support a requirement for using Cheerio selectors on a stream, like $('a.new-link').each? I guess it comes down to how chunks are separated, and it makes sense that you need to wait for certain tags (large containers) to be closed.

If I wanted to start going in your direction and try to get Cheerio to work with streams (I was thinking of a through stream), where should I start? It sounds like without fragment support, I can't just do something like:

const links = [];
const readStream = fs.createReadStream(htmlFile);
const chunks = [];

// Listen for data
readStream.on('data', chunk => {
    //chunks.push(chunk)
    // Note: $ is never loaded with the chunk here, so this cannot work as written.
    $('a.new-link').each(function(i, elem) {
        links[i] = elem;
    });
});

@ComFreek
Author

ComFreek commented Feb 8, 2018

@coryarmbrecht The streams I mentioned above and (afaik) parse5's ParserStream only address one problem: without a streaming approach, you would have to hold all of the HTML in memory. Why would you need to store all the HTML in memory if you were feeding it into the parser chunk by chunk anyway?

What you are describing is called SAX parsing in the case of XML, for example. A quick search turned up sax-js, but I have no idea how up to date it is.

@coryarmbrecht

coryarmbrecht commented Feb 9, 2018

@ComFreek, OK, I think I figured out my disconnect. I was thinking that if a single chunk has an opening tag <span> but not the closing tag </span>, then Cheerio (or another DOM selector library) can't read it properly, and you would need to keep that element in memory until it closes with </span>. Only once it finally closes could you use a selector function. I assumed the entire containing element had to finish streaming before I could retrieve its children. Silly me.

But I guess all you really need is the opening tag; the closing tag is just a sign of where to stop. I was thinking of the chunks as needing to be complete objects in order to parse correctly, rather than realizing that I just need the opening tag.
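
To illustrate the point about opening tags and chunk boundaries, here is a minimal sketch of SAX-style scanning in plain Node.js. `StartTagScanner` is a hypothetical class written for this example; it is not part of Cheerio, htmlparser2, parse5, or sax-js, and it is not a real HTML parser: it ignores comments, scripts, and `>` characters inside attribute values. The only state it keeps between chunks is a tag that has not been closed by `>` yet.

```javascript
// Hypothetical helper for illustration only; not a spec-compliant HTML parser.
class StartTagScanner {
  constructor(onStartTag) {
    this.onStartTag = onStartTag; // called with the raw text between '<' and '>'
    this.buffer = '';             // unfinished tag carried over between chunks
  }

  write(chunk) {
    this.buffer += chunk;
    let open;
    while ((open = this.buffer.indexOf('<')) !== -1) {
      const close = this.buffer.indexOf('>', open);
      if (close === -1) {
        // The tag is split across chunks: keep only the unfinished part.
        this.buffer = this.buffer.slice(open);
        return;
      }
      const tag = this.buffer.slice(open + 1, close);
      this.buffer = this.buffer.slice(close + 1);
      // Report start tags only; closing tags begin with '/'.
      if (/^[a-zA-Z]/.test(tag)) this.onStartTag(tag);
    }
    // No '<' left, so the remaining plain text can be dropped.
    this.buffer = '';
  }
}

// Collect the href of every <a class="new-link"> start tag.
const links = [];
const scanner = new StartTagScanner((tag) => {
  const name = tag.split(/\s/)[0].toLowerCase();
  const cls = /\bclass="([^"]*)"/.exec(tag);
  const href = /\bhref="([^"]*)"/.exec(tag);
  if (name === 'a' && cls && cls[1].split(/\s+/).includes('new-link') && href) {
    links.push(href[1]);
  }
});

// The first <a> tag is deliberately split across two chunks.
const chunks = ['<p><a class="new-li', 'nk" href="/one">1</a>', '<a href="/two">2</a></p>'];
for (const chunk of chunks) scanner.write(chunk);
console.log(links);
```

In a real script, `scanner.write(chunk)` would be called from a stream's 'data' handler, and a maintained SAX-style parser would replace the hand-written scanner.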

@ch-lukas

ch-lukas commented Mar 4, 2018

There is also the parse5.SAXParser option. Should we try to create a streaming solution based on it? Is anyone up for it?

@Sytten

Sytten commented Nov 23, 2018

Is there any news on that @fb55? I think the community could use an official API for streams.

@fb55
Member

fb55 commented Dec 22, 2020

This overload is now properly documented, with an example in the README using it. Let's keep the streams discussion to #99.

@fb55 fb55 closed this as completed Dec 22, 2020