Fragment / Document Question #144

jescalan · 2016-07-28T15:22:05Z

Hi there! Thanks so much for making this fantastic library first of all 💖

So I have a use case where I am wrapping it around a library that essentially allows partials/includes, so you could do something like this:

<!doctype html>
<html>
  <include src='./head.html'>
  <body>
    <p>hello world!</p>
  </body>
</html>

And let's say for example that head.html was:

<head>
  <title>Example Page</title>
  <!-- other meta info -->
</head>

I'm trying to figure out how I can get parse5 to be able to handle this situation. Using the html fragment parse appears to remove head, body, and html tags, but using a normal document parse adds in a bunch of extra tags (doctype, head, body) that are not really necessary for this situation (although I do understand why they are added).

Is it possible to use parse5 for a task like this? Is there some type of parse mode that won't alter the tags, or a way for me to get the fragment parse not to strip tags? Also is there documentation anywhere on which tags are stripped by fragment parse mode, and/or added by full document mode?


inikulin · 2016-07-28T15:46:47Z

Hi!
Thank you for the kind words, really appreciate it!

> Is it possible to use parse5 for a task like this? Is there some type of parse mode that won't alter the tags, or a way for me to get the fragment parse not to strip tags?

We're discussing such a thing in #132. It's still debatable, but you can upvote the feature and add your scenario there as an argument.

> Also is there documentation anywhere on which tags are stripped by fragment parse mode, and/or added by full document mode?

As far as I remember, this is not documented explicitly. Regarding fragment parsing: if you parse without a context element, a <template> context will be used. In that case <html>, <head> and <body> will be stripped.

The following tags will always be implicitly added by full document parsing mode:

<html> if missing
<head> if missing
<body> if missing
<p> if </p> occurs without an open tag
<colgroup> if <col> is added directly to <table>
<tbody> if <td>, <th> or <tr> is added directly to <table>
<tr> if <td> or <th> is added directly to <table>

I hope I didn't forget anything.

jescalan · 2016-07-28T16:07:34Z

@inikulin perfect, thank you for the quick and thorough response! Just pitched in at the linked issue. Would be happy to help out as well if someone is willing to hand-hold me a bit at the beginning, just bc this is a large and unfamiliar code base.

Also about what you were saying with the context, would it be possible to work around the issue in the meantime by providing a different context explicitly, maybe something like an <html> element? I feel like probably not, but worth a shot!

inikulin · 2016-07-28T16:16:47Z

> would it be possible to work around the issue in the meantime by providing a different context explicitly, maybe something like an <html> element?

Yeah, you can pass <html> as context, but <body> and <head> will be generated implicitly anyway if they are missing.

> perfect, thank you for the quick and thorough response! Just pitched in at the linked issue. Would be happy to help out as well if someone is willing to hand-hold me a bit at the beginning, just bc this is a large and unfamiliar code base.

I need some time to figure out how it will actually be done (most likely it will be a separate package on top of parse5, but we need to expose some API first). Unfortunately, I'm extremely busy right now and have stepped away from parse5 development for some time. I hope I'll be back in late August, but meanwhile maybe @RReverser could help?

jescalan · 2016-08-02T19:40:59Z

Hi, just following up quickly, @RReverser would you be able to help a little? This issue is time sensitive for me, but I am willing to put time into helping 😁

inikulin · 2016-08-02T19:54:51Z

@jescalan I'll try to release a new parse5 version on Thursday, which includes some great updates to our SAXParser made by @RReverser, and we will try to build a basic solution on top of it.

jescalan · 2016-08-02T19:56:01Z

@inikulin would be amazing. even if it's just a patch for now that's ok 😁

I'm looking through the code now and there's really quite a lot to navigate. I'll keep trying though!

jescalan · 2016-08-02T20:08:06Z

Wait I'm messing with the SAXParser right now, is there any reason that I wouldn't be able to build what I'm after here out of the SAXParser without any additional updates? It seems like it handles all tags already...

inikulin · 2016-08-02T20:54:53Z

@jescalan Yeah, the new release will not bring any API changes, but it contains some important fixes for the SAXParser. Anyway, you can already start prototyping. The idea is quite simple: maintain your own open element stack. On the SAXParser's startTag event, create an element using a tree adapter and, if it's not in the list of void elements, push it onto the stack. Append nodes and elements to the top element on the stack (add a document or documentFragment to the stack before you start parsing). Once you encounter an end tag, pop elements up to the matching element or until only the document is left on the stack. To work with the tree you can use one of the provided tree adapters; you can find their API description in the docs.
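
For readers following along, the stack approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it does not call parse5 at all. The `buildTree` function, the event objects (`{ type, name, attrs, value }`) and the plain-object nodes are hypothetical stand-ins for the SAXParser events and for a tree adapter. The void element list is written out by hand.

```javascript
// Void elements never have children or an end tag, so they are appended
// to the tree but never pushed onto the open element stack.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr'
]);

// Hypothetical event shape (NOT parse5's API):
//   { type: 'startTag', name, attrs } | { type: 'endTag', name } | { type: 'text', value }
// Plain objects stand in for nodes that a tree adapter would normally create.
function buildTree(events) {
  const root = { type: 'fragment', children: [] };
  const stack = [root]; // the root sits at the bottom and is never popped

  for (const ev of events) {
    const top = stack[stack.length - 1];
    if (ev.type === 'startTag') {
      const el = { type: 'element', name: ev.name, attrs: ev.attrs || [], children: [] };
      top.children.push(el);
      if (!VOID_ELEMENTS.has(ev.name)) stack.push(el);
    } else if (ev.type === 'endTag') {
      // Pop up to and including the matching element,
      // or stop once only the root is left on the stack.
      while (stack.length > 1) {
        if (stack.pop().name === ev.name) break;
      }
    } else if (ev.type === 'text') {
      top.children.push({ type: 'text', value: ev.value });
    }
  }
  return root;
}

// Usage: <head> is kept as written and nothing is added implicitly.
const tree = buildTree([
  { type: 'startTag', name: 'head' },
  { type: 'startTag', name: 'meta' },
  { type: 'startTag', name: 'title' },
  { type: 'text', value: 'Example Page' },
  { type: 'endTag', name: 'title' },
  { type: 'endTag', name: 'head' },
  { type: 'text', value: 'plain text' }
]);
console.log(JSON.stringify(tree, null, 2));
```

In a real prototype the three branches would presumably live in the SAXParser's startTag, endTag and text handlers. Note that `<meta>` without a closing slash is still treated as childless here because the check is made against the void element list rather than a self-closing flag.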

jescalan · 2016-08-02T21:14:21Z

@inikulin Great, I have this mostly built out and it's working pretty well 🎉 Will post here when it's entirely finished. I'm running into an issue with self-closing tags though. It seems like it will only detect them if using the closing slash like <br />. If it's missing the closing slash (which is still valid html), it doesn't mark the tag as self-closing. Is this a bug, or am I missing something?

EDIT: It's only doing this when I don't have a doctype set in the same fragment. This is still an issue for me though, as it's possible that I'll need to parse a fragment which doesn't explicitly contain a doctype. Is there a way to set the doctype manually? I don't see one in the docs...

inikulin · 2016-08-02T21:29:49Z

@jescalan It shouldn't be related to the doctype. Regarding self-closing tags: https://github.com/inikulin/parse5/wiki/Documentation#q-im-parsing-img-srcfoo-with-the-saxparser-and-i-expect-the-selfclosing-flag-to-be-true-for-the-img-tag-but-its-not-is-there-something-wrong-with-the-parser - you need to check against the list of void elements, as I mentioned in the comment above.

jescalan · 2016-08-02T21:33:59Z

@inikulin ah, i didn't really get what you meant with the void elements at first, now it makes a lot more sense. thanks for clearing that up!

jescalan · 2016-08-02T21:54:41Z

@inikulin Ok working well with the void tags 😄 One more question -- I have a test in here to see if it will parse plain text that's not inside any tags, and I'm not getting anything back from the parser on this one. Does the SAXParser parse plain text, or does it need to be contained inside a tag?

inikulin · 2016-08-02T22:03:18Z

Hmm, it should parse plain text as is: https://tonicdev.com/57a10ab6594ef21300a7a1ad/57a11834d2ab3913009ee831

jescalan · 2016-08-02T22:15:26Z

Working now with your method of pushing a string into the stream. I was using a different way that was not, for some reason 👍

stevenvachon · 2016-08-04T15:17:38Z

@inikulin will SAXParser be on par with the regular parser in terms of "back-checking" DOM corrections?

<html>
<body>

<div><body class="addition"></body></div>

</body>
</html>

inikulin · 2016-08-04T15:20:11Z

@stevenvachon It will not perform any tree structure correction, that's the point of this whole thread.

jescalan · 2016-08-30T20:33:10Z

Just as a wrap-up, did end up getting this working in the end, thanks to the brilliant @inikulin's help. Result can be seen here: https://github.com/reshape/parser 🎉

thisconnect · 2016-09-02T10:26:21Z

@jescalan have you considered link rel=import instead of a custom include element?

http://webcomponents.org/articles/introduction-to-html-imports/

jescalan · 2016-09-02T14:04:52Z

@thisconnect absolutely, but you need to be using http/2 (preferably with server push) in order for that to make sense, and not everyone has fully made that transition yet. As soon as http/2 with push becomes more standard, the include element will probably be used much less often, if ever.

inikulin · 2016-09-02T14:17:05Z

@thisconnect @jescalan Guys, I have a feeling that this discussion doesn't belong in the parse5 repo. Could you please choose another medium to continue your conversation, so as not to spam those who are watching this repo?

thisconnect · 2016-09-02T15:32:59Z

Sure sorry

inikulin added the question label Jul 28, 2016

jescalan mentioned this issue Jul 28, 2016

LexicalTreeParser #132

Closed

jescalan mentioned this issue Aug 2, 2016

v1.0 Proposed Changes posthtml/posthtml#159

Closed

inikulin mentioned this issue Aug 26, 2016

Make it possible to use parse5 as optional(?) parser cheeriojs/cheerio#863

Closed

inikulin closed this as completed Aug 30, 2016

inikulin mentioned this issue Aug 31, 2016

White space between nodes in #document, html #150

Closed
