LexicalTreeParser #132
Comments
Seems bad... very un-parse5 like
@domenic Well, the content-modifying proxy scenario really bothers me. I guess we can go with a separate package named
I guess maybe the serialization algorithm is what needs fixing for the content-modifying proxy scenario?
@inikulin Is it possible to do this via a custom external adapter or, if not, to expose APIs that would allow it? (Ideally, I'd like to fix the serialization in the spec, but apparently that's unlikely.)
@RReverser Yes, I have an idea how we can do this.
I think this is actually a reasonable approach for parsing things you know have been serialized. Maybe it can raise a fatal error if it hits an unexpected end tag? Also, you still need to handle at least the same special cases as the HTML serializer does: void elements, title/textarea/style/script, plaintext (for this one you also need to change the serializer to just stop serializing before If you want to make form association survive, you could, after the first parse, check each form control that is not a descendant of a If you want to make the document mode survive (quirks mode, almost quirks or no-quirks), that also needs something for cases like
Hmm maybe shouldn't try to fix
We have ParserFeedbackSimulator, which can handle all these cases. My idea was to run the tokenizer plus the feedback simulator and just maintain a simple open elements stack (an end tag closes all elements in the stack up to the element with this tag name; void elements are popped automatically).
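The open elements stack described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Token` shape, `LexicalElement`, and `buildLexicalTree` are hypothetical names invented for this sketch and are not parse5 APIs. It also assumes the tokens already come from a tokenizer driven by something like ParserFeedbackSimulator, so raw-text elements such as script and style are not handled here, and a stray end tag is silently ignored rather than treated as a fatal error.

```typescript
// Hypothetical token shapes for illustration only; NOT parse5's token types.
type Token =
  | { type: "startTag"; tagName: string }
  | { type: "endTag"; tagName: string }
  | { type: "text"; text: string };

// Hypothetical lexical tree node: an element with children, or a text string.
interface LexicalElement {
  tagName: string;
  children: LexicalNode[];
}
type LexicalNode = LexicalElement | string;

// Void elements from the HTML Standard: they never have an end tag.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

function buildLexicalTree(tokens: Token[]): LexicalElement {
  const root: LexicalElement = { tagName: "#root", children: [] };
  // The open elements stack; the root always stays at index 0.
  const stack: LexicalElement[] = [root];

  for (const token of tokens) {
    const current = stack[stack.length - 1];

    if (token.type === "text") {
      current.children.push(token.text);
    } else if (token.type === "startTag") {
      const element: LexicalElement = { tagName: token.tagName, children: [] };
      current.children.push(element);
      // Void elements are "popped automatically": they are never pushed.
      if (!VOID_ELEMENTS.has(token.tagName)) {
        stack.push(element);
      }
    } else {
      // End tag: find the nearest open element with this tag name and
      // close everything above it, including the element itself.
      let i = stack.length - 1;
      while (i > 0 && stack[i].tagName !== token.tagName) {
        i--;
      }
      if (i > 0) {
        stack.length = i;
      }
      // i === 0 means a stray end tag; this sketch ignores it (assumption).
    }
  }

  return root;
}
```

For the token sequence of `<p><div>x</div></p>`, this sketch keeps the div nested inside the p, as written in the source, whereas a spec-compliant tree builder would close the p before opening the div.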
Hi there! Coming over from #144 to add my support for this. I'm working on posthtml, which is essentially a plugin-based transform system for HTML, much like (you guessed it) postcss is for CSS. I really like parse5: it's extremely thorough and stable, and its line location info is a huge help for plugin error messages and source maps. Right now this is a blocking issue for me, as I'm working full time on getting to a stable release. I'm more than happy to help with the implementation if that would make this happen faster, if someone could give me a little walkthrough of the codebase!
I’m also in need of the lexical tree. Plus, I want it to patch that tree with automatically inserted* elements, optionally. I’m not sure if it’s possible to unlink the two, but if it is, I don’t see how it’s “un-parse5 like” to do that if the core API can stay the same? * What’s the proper term here?
Well, an HTML parser is something that follows the HTML Standard and produces a DOM tree. I guess you might be looking for something like an HTML lexer or tokenizer, although I don't know what kind of object that would produce (not a DOM), and there's no standard governing its behavior. That's why it's fairly un-parse5-like to attempt to add such features to parse5, which is an HTML parser library.
I'll return to this topic later with a better solution. Meanwhile, for such scenarios I suggest using https://github.com/reshape/parser
What?
A parser that preserves the lexical nesting order of nodes in the source document. Put simply, we will just run simple nesting logic on top of the SAXParser machinery.
Why?
For some use scenarios (e.g. a content-modifying proxy) it's important to preserve page semantics and layout, which may break during reparsing (see the examples in whatwg/html#1280). Other use cases that come to mind: code folding in editors, element attribute instrumentation, and text data extraction. Also, Cheerio is looking to replace htmlparser2 with parse5, so I guess it would be nice to have an optional "forgiving parsing" (I know... 🐐) mode.