Parse from stream #618
Comments
+1 |
We already discussed this in #442. As you said, the parser already supports streaming, and it would even be possible (using promises) to do querying while data is received. However, due to a bug in V8's garbage collector (slices force the original string to be preserved), the entire document would still be present in memory, as well as the parsed DOM structure, leaving improved querying and parsing performance (due to earlier execution) as the only advantage. When it comes to querying, selectors would need to be analysed, to figure out at which point they can be evaluated, and promises add some additional overhead, possibly diminishing the advantage. Also, the implementation is quite challenging. What could definitely be done is to add a |
I don't think that we would be affected by the V8 bug, since the strings aren't slices of a larger string. The chunks are given to us by the OS and have no relation to each other. Personally, I don't see a use case for querying before the entire DOM is ready. #99 is exactly what I want, though. Hopefully I can get some time to take a stab at that as well. Being able to do |
See #619 |
See #620 |
I would like to dig this issue up again. I agree with the conclusions of #620, though: Actually creating a readable stream from a URI should not be Cheerio's responsibility. |
Something I often do with Cheerio is loading a page from the internet to extract some information from. Currently I have to do something like this:
This isn't very good because it buffers the entire HTML in memory before parsing it, so the entire DOM and the entire source string are held in memory at the same time.
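A minimal sketch of the buffering pattern described above, using only Node's built-in streams. The `bufferAll` helper and the in-memory `res` stream are hypothetical stand-ins for a real HTTP response; with Cheerio installed, the callback is where `cheerio.load(html)` would be called.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper (not part of Cheerio): collect every chunk,
// then hand the complete document to `cb` as one string.
function bufferAll(stream, cb) {
  const chunks = [];
  stream.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
  stream.on('error', cb);
  stream.on('end', () => {
    // Only at this point is the whole document available for parsing.
    cb(null, Buffer.concat(chunks).toString('utf8'));
  });
}

// Stand-in for an HTTP response: a readable stream emitting two HTML chunks.
const res = Readable.from(['<ul><li>a</li>', '<li>b</li></ul>']);

bufferAll(res, (err, html) => {
  if (err) throw err;
  // Assumption: with Cheerio available, `const $ = cheerio.load(html)` goes here.
  console.log('buffered %d characters before parsing', html.length);
});
```

The full string stays alive until parsing has finished, which is the memory overlap this issue is about.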
The parser that Cheerio uses supports streaming input. If used, this would only allocate one chunk of HTML at a time, parse it, and throw it away, freeing up memory right away. It would also allow parsing to start as soon as the first chunk of HTML is available, so parsing would complete sooner.
I would like to be able to do something like this.
Maybe even something more convenient like:
cheerio.loadStream(res, cb);
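To make the idea concrete, here is a rough sketch of how a `loadStream`-style helper could forward chunks to a streaming parser. It is not Cheerio's API: this `loadStream` takes an extra `parser` argument, and `TagCounter` is a toy stand-in so the example runs without dependencies. The only assumption about the parser is that it exposes `write(chunk)` and `end()`, which is the shape htmlparser2's `Parser` has.

```javascript
const { Readable } = require('node:stream');
const { StringDecoder } = require('node:string_decoder');

// Hypothetical helper: feed each chunk to `parser` as soon as it arrives,
// so no chunk has to be kept after it has been parsed.
function loadStream(stream, parser, cb) {
  // The decoder keeps multi-byte characters intact across chunk boundaries.
  const decoder = new StringDecoder('utf8');
  stream.on('data', (chunk) => parser.write(decoder.write(chunk)));
  stream.on('error', cb);
  stream.on('end', () => {
    const rest = decoder.end();
    if (rest) parser.write(rest);
    parser.end();
    cb(null, parser);
  });
}

// Toy stand-in for a streaming HTML parser: counts opening tags and keeps
// only the unparsed tail of the input between writes.
class TagCounter {
  constructor() {
    this.tail = '';
    this.count = 0;
  }

  write(chunk) {
    const text = this.tail + chunk;
    const lastOpen = text.lastIndexOf('<');
    const lastClose = text.lastIndexOf('>');
    // Hold back a tag that is still incomplete at the end of this chunk.
    const cut = lastOpen > lastClose ? lastOpen : text.length;
    const tags = text.slice(0, cut).match(/<[a-zA-Z][^>]*>/g) || [];
    this.count += tags.length;
    this.tail = text.slice(cut);
  }

  end() {
    this.tail = '';
  }
}

// Stand-in for an HTTP response whose chunks split tags in awkward places.
const res = Readable.from(['<ul><l', 'i>a</li><li>b</', 'li></ul>']);

loadStream(res, new TagCounter(), (err, parser) => {
  if (err) throw err;
  console.log('opening tags seen:', parser.count);
});
```

A real implementation would presumably create the parser internally and pass the resulting DOM to the callback, which would keep the two-argument `loadStream(res, cb)` shape proposed above.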