LexicalTreeParser #132

Closed
inikulin opened this issue May 19, 2016 · 12 comments

@inikulin
Owner

inikulin commented May 19, 2016

What?

A parser that preserves the lexical nesting order of nodes in the source document. Put simply, we would just run simple nesting logic on top of the SAXParser machinery.

Why?

For some use cases (e.g. a content-modifying proxy) it's important to preserve page semantics and layout, which may break during reparsing (see the examples in whatwg/html#1280). Other use cases that come to mind: code folding in editors, instrumentation of element attributes, and text data extraction. Cheerio is also willing to replace htmlparser2 with parse5, so I guess it would be nice to have an optional "forgiving parsing" mode (I know... 🐐).

@domenic

domenic commented May 19, 2016

Seems bad... very un-parse5 like

@inikulin
Owner Author

@domenic Well, the content-modifying proxy scenario really bothers me. I guess we can go with a separate package named forgive-me-whatwg, which will use SAXParser + a tree adapter + black magic under the hood. But maybe someone has a better idea?

@domenic

domenic commented May 19, 2016

I guess maybe the serialization algorithm is what needs fixing for the content-modifying proxy scenario?

@RReverser
Collaborator

@inikulin Is it possible to do this via a custom external adapter or, if not, to expose APIs that would allow an adapter to do it? (Ideally, I'd like to fix the serialization in the spec, but apparently that's unlikely.)

@inikulin
Owner Author

@RReverser Yes, I have an idea of how we can do this.

@zcorpan

zcorpan commented May 20, 2016

I think this is actually a reasonable approach for parsing things you know have been serialized. Maybe it can raise a fatal error if it hits an unexpected end tag? Also, you still need to handle at least the same special cases as the HTML serializer does: void elements, title/textarea/style/script, plaintext (for this one you also need to change the serializer to just stop serializing before </plaintext>), template, etc. But you also need to deal with foreign content. Consider <title><x></x></title><svg><title><x></x></title></svg>: the HTML title has a text node child, while the SVG title has an element x child.

If you want to make form association survive, you could, after the first parse, check each form control that is not a descendant of a form element and doesn't have a form attribute already; if it has a form owner, set a form attribute on it that points to the form element (and set an id on that form if it doesn't have one already).
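The form-association fix-up described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `El` node shape, the `formOwner` field (standing in for whatever the first, spec-compliant parse recorded as the form owner), the `FORM_CONTROLS` list, and the generated `form-owner-N` ids are all hypothetical and are not part of parse5.

```typescript
// Hypothetical minimal element shape; not a parse5 tree adapter type.
interface El {
  tagName: string;
  attrs: Map<string, string>;
  parent: El | null;
  children: El[];
  formOwner: El | null; // assumed to be recorded by the first parse
}

// Assumed list of form control tag names, for illustration only.
const FORM_CONTROLS = new Set([
  "button", "fieldset", "input", "object", "output", "select", "textarea",
]);

// Small helper to build test trees by hand.
function el(tagName: string, parent: El | null = null): El {
  const node: El = { tagName, attrs: new Map(), parent, children: [], formOwner: null };
  if (parent) parent.children.push(node);
  return node;
}

function hasFormAncestor(node: El): boolean {
  for (let p = node.parent; p !== null; p = p.parent) {
    if (p.tagName === "form") return true;
  }
  return false;
}

// For each control that has a form owner but is neither inside a form
// nor already carrying a form attribute, point it at its owner by id.
function pinFormOwners(root: El): void {
  let nextId = 0;
  const visit = (node: El): void => {
    const owner = node.formOwner;
    if (
      FORM_CONTROLS.has(node.tagName) &&
      owner !== null &&
      !node.attrs.has("form") &&
      !hasFormAncestor(node)
    ) {
      let id = owner.attrs.get("id");
      if (!id) {
        // Generated id; a real implementation would need to avoid collisions.
        id = `form-owner-${nextId++}`;
        owner.attrs.set("id", id);
      }
      node.attrs.set("form", id);
    }
    node.children.forEach(visit);
  };
  visit(root);
}
```

Running this once after the first parse and before serialization would let a second parse recover the same association from the explicit form attribute, at the cost of adding attributes that were not in the source.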

If you want to make the document mode survive (quirks mode, almost quirks or no-quirks), that also needs something for cases like <!doctype html />.

@zcorpan

zcorpan commented May 20, 2016

Hmm, maybe we shouldn't try to fix plaintext. I think it's unfixable for the foster parenting case.

<table><tr><td>foo</td></tr>
<plaintext>bar

@inikulin
Owner Author

inikulin commented May 20, 2016

Also you still need to handle the same special cases as the HTML serializer does, at least - void elements, title/textarea/style/script, plaintext (for this one you need to change the serializer also to just stop serializing before </plaintext>), template, etc. But you also need to deal with foreign content, consider <title><x></x></title><svg><title><x></x></title></svg> - the HTML title has a text node child, the SVG title has an element x child.

We have ParserFeedbackSimulator, which can handle all these cases. My idea was to run the tokenizer + feedback simulator and just maintain a simple open elements stack (an end tag closes all elements in the stack up to the element with this tag name, and void elements are popped automatically).
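A minimal sketch of that open elements stack might look like the code below. It is not parse5 code: the `Token` and `LexicalNode` shapes, the `VOID_ELEMENTS` list, and the `buildLexicalTree` name are assumptions made for illustration, and the tokenizer and feedback simulator are replaced by a plain array of tokens. Ignoring an unmatched end tag is also an assumption; raising a fatal error, as suggested earlier in the thread, would be an equally valid choice.

```typescript
// Illustrative token and node shapes; these are not parse5 types.
type Token =
  | { type: "start"; tagName: string }
  | { type: "end"; tagName: string }
  | { type: "text"; text: string };

interface LexicalNode {
  name: string; // tag name, "#text", or "#root"
  text?: string;
  children: LexicalNode[];
}

// Void elements as commonly listed for HTML; assumed here for the sketch.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

function buildLexicalTree(tokens: Token[]): LexicalNode {
  const root: LexicalNode = { name: "#root", children: [] };
  const stack: LexicalNode[] = [root]; // open elements, root at the bottom

  for (const token of tokens) {
    const parent = stack[stack.length - 1];

    if (token.type === "text") {
      parent.children.push({ name: "#text", text: token.text, children: [] });
    } else if (token.type === "start") {
      const node: LexicalNode = { name: token.tagName, children: [] };
      parent.children.push(node);
      // Void elements are never left open, so they are not pushed.
      if (!VOID_ELEMENTS.has(token.tagName)) stack.push(node);
    } else {
      // End tag: find the nearest open element with the same name.
      let i = stack.length - 1;
      while (i > 0 && stack[i].name !== token.tagName) i--;
      // i === 0 means no match; the stray end tag is ignored (assumption).
      // Otherwise close the match and everything opened after it.
      if (i > 0) stack.length = i;
    }
  }
  return root;
}
```

For the token stream of `<div><p>a<br><span>b</div>c`, this yields a `div` containing a `p` (which holds the text, the `br`, and the unclosed `span`), followed by a sibling text node `c`, mirroring the nesting as written in the source rather than the tree the HTML Standard would produce.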

@jescalan

Hi there! Coming over from #144 to add my support for this. I'm working on posthtml, which is essentially a plugin-based transform system for HTML, much like (you guessed it) postcss is for CSS. I really like parse5: it's extremely thorough and stable, and its line location info is a huge help for plugin error messages and source maps.
But it needs just a little more flexibility in order to be viable for plugin authors and users.

Right now this is a blocking issue for me; I'm working full time trying to get to a stable release. I'm more than happy to help out with the implementation if that would make this happen faster, provided someone could give me a little walkthrough of the codebase!

@wooorm
Collaborator

wooorm commented Sep 4, 2016

I’m also in need of the lexical tree. Plus, I want it to patch that tree with automatically inserted* elements, optionally.

I’m not sure if it’s possible to unlink the two, but if it is, I don’t see how it’s “un-parse5 like” to do that if the core API can stay the same?

* What’s the proper term here?

@domenic

domenic commented Sep 4, 2016

Well, an HTML parser is something that follows the HTML Standard and produces a DOM tree. I guess you might be looking for something like an HTML lexer or tokenizer, although I don't know what kind of object that would produce (not a DOM), and there's no standard governing its behavior. That's why it's fairly un-parse5-like to attempt to add such features to parse5, which is an HTML parser library.

@inikulin
Owner Author

I'll return to this topic later with a better solution. Meanwhile, for such scenarios I suggest using https://github.com/reshape/parser

6 participants