-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it possible to use parse5 as optional(?) parser #863
Comments
@inikulin Would you mind sending a PR? As I haven't worked on htmlparser2 for a long while, it is probably better to drop it entirely. |
Sure. Unfortunately, I'm quite busy at the moment, however I believe I'll be able to get my hands on it somewhere at the end of the month. |
Hi, what is the status on this? It would be huge improvement. |
@dypsilon Currently I'm extremely busy. As soon as I'm done with my job duties (I hope at the mid-August) I will start working on this one. Meanwhile, PRs are welcome if someone else would like to start work on this. |
If htmlparser2 is removed, what will be used to parse XML? sax? |
You will still be able to modify htmlparser2 dom stuctures, so you might have to parse xml in a separate step (one additional line). |
@fb55 but which tool will/should be used in that separate step? |
htmlparser2 directly :)
|
Ok, so then cheerio will lose the ability to parse XML out-of-the-box. Perhaps an XML-only fork of htmlparser2 is on the distant horizon? |
parse5 will parse xml as well. Its what whacko is using |
@luanmuniz XML is case-sensitive, HTML is not. |
@stevenvachon https://github.com/inikulin/parse5 @fb55 i would like todo this. I'm sing |
@luanmuniz Just because it parses it, does not mean that it does so accurately nor reliably. It will perform HTML operations on XML, which definitely could result in errors. |
@stevenvachon I use whacko, that is a fork of cheerio using parse5. I manipulate a few XML and it works fine. But anyway, cheerio is for HTML. XML is secondary. The project intend to imitate the jQuery api for html, not be a xml parser. I will try to make it work with xml as well. I will keep that in mind when doing this. |
@luanmuniz That's awesome! I guess using parse5 in document mode for For now, I'd like to keep using htmlparser2 if xmlMode is enabled (and follow jsdom afterwards with using sax.js for xml). A second step would be to also adopt parse5's serializer. I would like to preserve the option to output ASCII-only strings (which is the current behavior), not sure if that's supported by the serializer or needs another step. Finally, and way out of scope for this, cheerio should adopt a better DOM representation with names from the DOM spec. parse5's own DOM is substantially better here, but we would need to add some layer for backwards compatibility (at least until a 1.0.0 release). You can create a PR early on and ping me for help and feedback. |
Hi guys. I was thinking a lot about how this should be implemented and I would like to share my concerns.
The problem with full-featured HTML parsing is that it's not round-tripable: once document parsed you will get perfectly valid DOM (like in browsers), but if you will try to serialize it back you may end up with broken markup (e.g. whatwg/html#1280). The only way to avoid such behavior is to have perfectly valid HTML as an input. That's why I believe full featured algorithm will not fit well for HTML reprocessing scenario. The best thing that can be done here is to build LexicalTreeParser built on top of For scraping full-featured parsing will fit better, since more likely you expect to deal with DOM that produced by the browser when it opens a page. The question is how to introduce this two parsing modes in Regarding XML: it can't be parsed with |
Hi there! So I actually ended up needing to branch off from posthtml due to fundamental differences in visions for the future, and posthtml is not using the work I did on the parser at all. I think they are planning on building their own parser from scratch now for some reason. You can find the work I did with parse5 here - https://github.com/reshape/parser. It's working quite well, thank you for such a great project and for your assistance, @inikulin! I used the SAXParser instead of parse5 directly because I needed a bit more forgiving of a parser in order to be able to accommodate some common use-cases such as layouts and partials that are implemented on top of reshape. It was pretty straightforward working with the SAXParser, took less than a day to build it out, and it has been working reliably since then. I also did some work with htmlparser2, so I'll speak quickly to differences I noticed between them. Parse5's SAXParser has a remarkably similar API and could easily produce an output tree that is the same as what htmlparser2 produces. The main difference I have run into is that it's a bit more granular in the way it breaks down doctype and attributes containing In addition, I have been able to parse large trees of custom tags, and SVG structures without issue using the SAXParser. I am not super knowledgeable about XML and the finer details of its spec, but I would estimate that it would be able to parse most basic XML structures without a problem. Hope this helps, and let me know if there's any other way I can help out! |
@inikulin Very interesting problem. I feel like we could get away with ignoring that issue for the common case and offering a way to fall back to htmlparser2 otherwise. I don't see a clear advantage of using parse5 without the tree builder stuff, but that's just personal preference. At the same time, I would like to adopt parse5's serializer asap. Would it perhaps be possible to split it into a separate module that cheerio can switch to right away? |
The advantage is to have proper tokenization. Even if you switch to SAXParser-based parser without full featured html parsing it should leverage existing parsing experience (+ you will have parse5 location info, including attributes location) Regarding serializer: I'm not sure if it will make much sense if you going to switch to parse5 anyway. There is some shared code beetwen parser and serializer, so I'd like to keep them in one package |
@inikulin That makes a lot of sense. How do you feel about adding an option to parse5's default DOM implementation that allows adding a custom |
@fb55 I believe you can accomplish this task by extending one of existing tree adapters: you can create new adapter via |
Any updates on this? It seems like the blatant memory leaks from the existing parser are being given very little priority even though they can be extremely inconvenient. |
We need some bike shedding on #863 (comment) I would like to have some feedback on this topic before I start implementing anything. |
IMHO full parsing is the way to go, with a fallback in case that does not work for someone (whether that's another mode, htmlparser2 or something else should not matter – especially for XML this should be useful). Perhaps we should always default to fragment parsing for now and switch on a major version change, but otherwise this should work the same way as Whacko did.
|
Is anything happening here? It is pretty tiring explaining to people that cheerio is problematic because of htmlparser2 and for most of my use cases, I cannot use cheerio. |
@stevenvachon Yeah, PR is issued and approved. It will be landed as RC at first and then in cheerio 1.0.0 |
How far along until the RC? :) |
Getting a lot of heap out of memory with parse5 |
Today I'm deprecating whacko, so, if cheerio maintainers don't mind, I'll send those who is interested in this feature here.
The text was updated successfully, but these errors were encountered: